VARIATIONAL BAYES INFERENCE OF ISING MODELS AND THEIR APPLICATIONS By Minwoo Kim A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy 2022

ABSTRACT

VARIATIONAL BAYES INFERENCE OF ISING MODELS AND THEIR APPLICATIONS By Minwoo Kim

Ising models, which originated in statistical physics, have been widely used in modeling spatial data and computer vision problems. However, statistical inference for this model and its application in many practical fields remain challenging due to the intractable nature of the normalizing constant in the likelihood. This dissertation consists of two main themes: (1) parameter estimation for the Ising model and (2) structured variable selection based on the Ising model, both using variational Bayes (VB).

In Chapter 1, we review the background, research questions, and development of the Ising model, variational Bayes, and other statistical concepts. An Ising model deals with a binary random vector in which each component depends on its neighbors. Various versions of the Ising model exist, depending on the parameterization and the neighborhood structure.

In Chapter 2, for the two-parameter Ising model, we describe a novel VB procedure for parameter estimation that is computationally efficient and accurate compared to existing methods. The traditional pseudo maximum likelihood estimate (PMLE) provides accurate results only when the number of neighbors is small. A Bayesian approach based on Markov chain Monte Carlo (MCMC) performs better even with a large number of neighbors, but its computational cost in terms of time is substantial. Accordingly, we propose a VB method with two variational families, a mean-field (MF) Gaussian family and a bivariate normal (BN) family. Extensive simulation studies validate the efficacy of both families. With our VB methods, computing times decrease remarkably without deterioration in accuracy, and in some scenarios the estimates are considerably more accurate. In addition, we demonstrate theoretical properties of the proposed VB method under the MF family. The main theoretical contribution of our work lies in establishing the consistency of the variational posterior for the Ising model with the true likelihood replaced by the pseudo-likelihood. Under certain conditions, we first derive the rates at which the true posterior based on the pseudo-likelihood concentrates around the εn-shrinking neighborhoods of the true parameters. With a suitable bound on the Kullback-Leibler distance between the true and the variational posterior, we next establish the rate of contraction for the variational posterior and demonstrate that the variational posterior also concentrates around εn-shrinking neighborhoods of the true parameter.

In Chapter 3, we propose a Bayesian variable selection technique for a regression setup in which the regression coefficients hold structural dependency. We employ spike and slab priors on the regression coefficients as follows: (i) in order to capture the intrinsic structure, we first consider an Ising prior on latent binary variables; if a latent variable equals one, the corresponding regression coefficient is active, otherwise it is inactive; (ii) employing the spike and slab prior, we put Gaussian priors (slab) on the active coefficients, while the inactive coefficients are zero with probability one (spike).

Copyright by MINWOO KIM 2022

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisors Dr.
Tapabrata Maiti and Dr. Shrijita Bhattacharya for their guidance toward my researches and my life in US. They have always helped me make great progress with their insightful suggestions. I would also like to extend my appreciation to my dissertation committee members, Dr. Yimin Xiao and Dr. Jiliang Tang. Their feedback and comments are really beneficial for my research. I am also grateful to my previous advisor Dr. Chaeyoung Lim and my future advisor Dr. Marc G. Genton for giving me wonderful opportunities and encouraging me all the time. I made a lot of friends and I have never felt lonely because of them. I really appreciate my friends. Last but not least, I would like to express my sincere thanks to my family for their support and concerns. During my five years at Michigan State University, I have learned a lot from the courses and seminars and I am able to apply the knowledge I learned to my research. Thanks to all professors and friends and family, I am incredibly confident in my academic success. I am very much looking forward to my life as a statistician. v TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Ising model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Pseudo-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Variational Bayes (VB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Black box variational inference (BBVI) . . . . . . . . . . . . . . . . . 6 1.3 Adaptive learning rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Variable selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5 Posterior consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 CHAPTER 2 A VARIATIONAL BAYES ALGORITHM AND POSTERIOR CON- SISTENCY FOR TWO-PARAMETER ISING MODEL ESTIMATION 12 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 VB algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 VB algorithm with MF family . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 VB algorithm with BN family . . . . . . . . . . . . . . . . . . . . . . 16 2.3 PMLE and MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 Generating observed data . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.2 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.3 Image reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.6 Extension to multi threshold parameters . . . . . . . . . . . . . . . . . . . . 26 2.7 Posterior consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.7.1 Sketch of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.8 Preliminary notations and Lemmas . . . . . . . . . . . . . . . . . . . . . . . 
33 2.9 Taylor expansion for log-likelihood . . . . . . . . . . . . . . . . . . . . . . . 34 2.10 Technical details of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.11 Technical details of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.12 Technical details of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.13 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.14 Proof of Corollary 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.14.1 Proof of Relation (2.53) . . . . . . . . . . . . . . . . . . . . . . . . . 52 vi CHAPTER 3 BAYESIAN VARIABLE SELECTION IN A STRUCTURED RE- GRESSION MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.1 Model and methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.1.1 ELBO optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.3 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3.1 Li and Zhang (2010)’s Gibbs sampling scheme . . . . . . . . . . . . . 62 3.3.2 Hyper parameter selection . . . . . . . . . . . . . . . . . . . . . . . . 63 3.3.3 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.4 Theoretical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.4.1 True posterior consistency with true Ising prior . . . . . . . . . . . . 73 3.4.2 True posterior consistency with pseudo Ising prior . . . . . . . . . . . 83 3.4.3 Bounded KL divergence . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.4.4 Variational posterior consistency . . . . . . . . . . . . . . . . . . . . 89 CHAPTER 4 DISCUSSION AND FUTURE RESEARCH . . . . . . . . . . . . . . 92 4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.2 Directions for future research . . . . . . . . . . . . . . . . . . . . . . . . . . 93 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 vii LIST OF TABLES Table 2.1: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 21 Table 2.2: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 22 Table 2.3: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 23 Table 2.4: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 24 Table 2.5: The estimated parameters with standard errors (SE) in parentheses and time costs for the features gender and school. . . . . . . . . . . . . . . . . 24 Table 3.1: Examples of hyper parameter choices . . . . . . . . . . . . . . . . . . . . 63 Table 3.2: Exact normalizing constants with varying a and b. . . . . . . . . . . . . . 91 viii LIST OF FIGURES Figure 1.1: An undirected graph with three nodes . . . . . . . . . . . . . . . . . . . 
3 Figure 2.1: Left: ELBO convergence with two variational families (BN and MF) and S = 20. Right: ELBO convergence with two variational families (BN and MF) and S = 200. Blue lines represent BN family and orange lines represent MF family in each plot. . . . . . . . . . . . . . . . . . . . 20 Figure 2.2: Original images (left) and the estimated images (right). . . . . . . . . . . 25 Figure 2.3: Visualization of Facebook network data where the size of circle repre- sents the degree of the node. . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 2.4: Density plots for the estimated parameters (left: β, right: B) from VB with BN family (red), VB with MF family (green), and MCMC (blue) for the features gender and school. . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.1: Example of the structured regression coefficients. White pixels denote the corresponding βi ’s are zeros and darker pixels denote the corre- sponding βi ’s have larger values. . . . . . . . . . . . . . . . . . . . . . . . 55 Figure 3.2: γ on a circle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Figure 3.3: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the co- variates are independent. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Figure 3.4: ROC curves for the three variable selection methods with the hyper pa- rameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are independent. . . . . . . . . . . . . . . . . . . . . 65 Figure 3.5: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the co- variates are correlated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Figure 3.6: ROC curves for the three variable selection methods with the hyper pa- rameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are correlated. . . . . . . . . . . . . . . . . . . . . . 66 Figure 3.7: γ on a lattice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Figure 3.8: Signal areas in an image . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 ix Figure 3.9: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are independent. . . . . . . . . . . . . . 68 Figure 3.10: ROC curves in scenario 2 for the two VB methods with the hyper param- eter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are independent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 3.11: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are correlated. . . . . . . . . . . . . . . 70 Figure 3.12: ROC curves in scenario 2 for the two VB methods with the hyper param- eter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are correlated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 x LIST OF ALGORITHMS Algorithm 1.1: Adam learning rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Algorithm 2.1: Black box variational inference (BBVI) . . . . . . . . . . . . . . . . . 
15 xi CHAPTER 1 INTRODUCTION In this Chapter, we briefly introduce basic concepts without complete details related to our work, Ising model and its pseudo-likelihood (Section 1.1), variational Bayes (Section 1.2), Adam learning rates (Section 1.3), variable selection (Section 1.4), and posterior consistency (Section 1.5). In addition, previous studies that have produced significant results in those fields are introduced. 1.1 Ising model A popular way of modeling a binary vector x = (x1 , . . . , xn )⊤ , xi ∈ {−1, 1}, in which elements are pairwise dependent is to take advantage of Ising model named after the physicist Ernst Ising Ising (1924) which has been used in a wide range of applications including spatial data analysis and computer vision. Many different versions of Ising model have emerged in the literature. Among them, the probability mass function of a general Ising model is of the form: n ! 1 X P (n) (X = x) = exp x⊤ Kn x + Bi xi , Zn (Kn , B1 , . . . , Bn ) i=1 where Kn ∈ Rn×n and (B1 , . . . , Bn ) ∈ Rn are model parameters and Zn (Kn , B1 , . . . , Bn ) is the normalizing constant that makes the sum of the probability mass function over all possible combinations of x equal to 1: n ! X X Zn (Kn , B1 , . . . , Bn ) = exp x⊤ Kn x + Bi xi . x∈{−1,1}n i=1 The general Ising model can be reduced to two-parameter Ising model, assuming that all nonzero entries of Kn take the same value and Bi = B for all i, which has an inverse temperature parameter β > 0 (also known as interaction parameter) and a magnetization parameter B ̸= 0 (also known as threshold parameter). With a specified symmetric coupling 1 matrix An ∈ Rn×n , a probability mass function of two-parameter Ising model is: n ! (n) 1 β X Pβ,B (X = x) = exp x⊤ A n x + B xi , (1.1) Zn (β, B) 2 i=1 In the two-parameter Ising model, β characterizes the strength of interactions among xi ’s and B represents external influence on x. In the first place, Ising model has been introduced for the relations between atom spins with the domain {−1, 1}n (Brush, 1967). While we work with the domain {−1, 1}n , in many current applications, Ising model has been defined with another domain {0, 1}n . One can read Haslbeck et al. (2021) for more details on two different domains. An Ising model is usually represented by an undirected graph. Consider an undirected graph which has n vertices (nodes) xi , i = 1, . . . , n. Each vertex of the graph takes a value either -1 or 1, i.e., xi ∈ {−1, 1}, and let E = {(i, j) | i ∼ j, 1 ≤ i, j ≤ n} represent the set of edges in the graph where i ∼ j denotes that the vertices i and j are connected. Then, one common choice of the coupling matrix An is a scaled adjacency matrix of the underlying graph whose all diagonal elements are zeros and the other elements are non-negative: Definition 1 (Scaled adjacency matrix). A scaled adjacency matrix for a graph Gn with n vertices is defined as:  n    2|Gn | if (i, j) ∈ E An (i, j) := ,  0  otherwise. where |Gn | denotes the number of edges in the graph Gn . For a simple example, consider three nodes (x1 , x2 , x3 ) and two edges between x1 and x2 , and between x2 and x3 as in Figure 1.1. Then, n = 3, |Gn | = 2, and the scaled adjacency matrix is:    0 0.75 0    An =   0.75 0 0.75 .    0 0.75 0 2 Figure 1.1: An undirected graph with three nodes 1.1.1 Pseudo-likelihood One of the largest challenges in using Ising model is the unknown normalizing constant Zn (β, B) in the likelihood (1.1): n ! 
X β ⊤ X Zn (β, B) = exp x An x + B xi 2 x∈{−1,1}n i=1 One can notice that the exact calculation of the normalizing constant involves sum of 2n terms, which is available only for small n. Due to the intractable nature of the normalizing constant, standard statistical methodologies based on the true likelihood are infeasible. One way to approximate the normalizing constant is importance sampling, see Geyer (1994); Gelman and Meng (1998); Molkaraie (2014). Another approach to handling the normalizing constant is to use a pseudo-likelihood. The conditional probability of xi is easily calculated because xi is binary: (n)  eβmi (x)+B Pβ,B ̸ i = βmi (x)+B Xi = 1|Xj , j = , e + e−βmi (x)−B 3 where mi (x) = An (i, j)xj . The pseudo-likelihood of Ising model corresponding to the Pn j=1 true likelihood in (1.1), is defined as the product of one dimensional conditional distributions: Yn (n) Pβ,B (Xi = xi | Xj , j ̸= i) i=1 n ! X = 2−n exp (βxi mi (x) + Bxi − log cosh(βmi (x) + B)) . (1.2) i=1 Fauske (2009) is an example on empirical study using pseudo-likelihood and Ghosal et al. (2020) provides theoretical justification on use of pseudo-likelihood. For v, w ∈ [−1, 1]n , defining a function g as: n X 1 + vi 1 + wi 1 − vi 1 − wi g(v, w) = log + log , i=1 2 2 2 2 we point out that the pseudo-likelihood can be written as: Yn (n) Pβ,B (Xi = xi | Xj , j ̸= i) = eg(x,b(x)) , (1.3) i=1 where b(x) = (b1 (x; θ), · · · , bn (x; θ))⊤ and bi (x) = E(Xi | Xj , j ̸= i) = tanh(βmi (x) + B), i = 1, . . . , n. 1.2 Variational Bayes (VB) Let θ be the set of parameters of interest with a prior distribution p(θ). In a Bayesian inference, a typical objective is to obtain posterior distribution given data D. We can derive the exact posterior distribution π(θ | D) using Bayes’ theorem: p(θ)p(D | θ) π(θ | D) = R , (1.4) θ p(θ)p(D | θ)dθ where p(D | θ) is a likelihood given parameter θ. The exact posterior in (1.4), however, is not typically available except for a few well-known examples. To get an approximated posterior, many statisticians have widely used sampling based Markov chain Monte Carlo (MCMC) methods but it is hardly scalable to high-dimensional cases. Beyond sampling methods, 4 variational Bayes (VB) also called variational inference or variational approximation (Jordan et al., 1999) has been popularized as an efficient alternative to MCMC. VB recasts the sampling problem as an optimization problem minimizing Kullback-Leibler (KL) divergence between a surrogate distribution (called a variational distribution) and the true posterior distribution (Blei et al., 2017). Definition 2 (Kullback-Leibler (KL) divergence). For two probability measures P1 and P2 over a set Ω, the KL divergence between P1 and P2 is defined as: KL (P1 , P2 ) = Ep1 (log p1 − log p2 ) (1.5) Z   p1 (ω) = log dP1 (1.6) Ω p2 (ω) where p1 and p2 are corresponding densities to P1 and P2 respectively. As the first step in building a VB algorithm, we need to define a family of distributions (called variational family) denoted by Q which contains candidates of the best approximation to the true posterior (1.4): Q = {q(θ; ν) : q is a probability density function that can be easily handled.}, where ν is a set of parameters (called variational parameters) that characterize variational distributions. For instance, if Q is a Gaussian family, then ν includes mean (µ) and stan- dard deviation (σ) of a Gaussian distribution. 
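Before developing the variational machinery further, the following minimal NumPy sketch (illustrative only; the helper names are ours) makes Definition 1 and the pseudo-likelihood (1.2) of Section 1.1 concrete: it builds the scaled adjacency matrix for the three-node graph of Figure 1.1 and evaluates the log pseudo-likelihood at a given (β, B).

```python
import numpy as np

def scaled_adjacency(edges, n):
    """Scaled adjacency matrix A_n(i, j) = n / (2|G_n|) for (i, j) in E (Definition 1)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return n / (2.0 * len(edges)) * A

def log_pseudo_likelihood(x, A, beta, B):
    """Log of the pseudo-likelihood (1.2) for the two-parameter Ising model."""
    m = A @ x  # m_i(x) = sum_j A_n(i, j) x_j
    return np.sum(beta * x * m + B * x - np.log(np.cosh(beta * m + B))) - len(x) * np.log(2.0)

# Three-node chain of Figure 1.1: edges (1,2) and (2,3), zero-indexed here.
A3 = scaled_adjacency([(0, 1), (1, 2)], n=3)   # off-diagonal entries equal 3/(2*2) = 0.75
x = np.array([1.0, -1.0, 1.0])
print(A3)
print(log_pseudo_likelihood(x, A3, beta=0.5, B=0.1))
```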
After an appropriate variational family is chosen, VB seeks the best surrogate function (called variational posterior) by minimizing KL divergence with the true posterior: q ∗ = arg min KL (Q, Π (· | D)) . (1.7) q∈Q Observe that: KL (Q, Π (· | D)) = Eq (log q(θ) − log π (θ | D)) = −Eq (log p (θ, D) − log q(θ)) + log m (D) , (1.8) 5 where m (D) is the marginal distribution of data which does not depend on θ. So, we can find the minimizer of KL divergence by maximizing Eq (log p (θ, D) − log q(θ)) which is called evidence lower bound (ELBO). 1.2.1 Black box variational inference (BBVI) Black box variational inference (BBVI) is an stochastic gradient optimization technique suggested by Ranganath et al. (2014) to maximize ELBO using unbiased gradients. Consider the ELBO as a function of variational parameters ν: L(ν) = Eq (log p (θ, D) − log q(θ)) . (1.9) In each BBVI iteration, a variational parameter ν ∈ ν is updated in the direction that the objective function L(ν) increases as follows: ν (t+1) ← ν (t) + ηt ∇ν L(ν), (1.10) where ν (t) is the variational parameter at t-th iteration and ηt ’s, t = 1, 2, . . . , are learning rates which satisfy Robbin-Monro conditions (Robbins and Monro, 1951): X∞ X∞ ηt = ∞ and ηt2 < ∞. t=1 t=1 A closed form of the gradient ∇ν L(ν) is not always available. Ranganath et al. (2014) proposed an unbiased Monte Carlo estimate. Observe that: Z ∇ν L(ν) = ∇ν q(θ) (log p (θ, D) − log q(θ)) dθ Z Z = ∇ν q(θ) (log p (θ, D) − log q(θ)) dθ − q(θ)∇ν log q(θ)dθ Z = ∇ν q(θ) (log p (θ, D) − log q(θ)) dθ Z = q(θ)∇ν log q(θ) (log p (θ, D) − log q(θ)) dθ = Eq (∇ν log q(θ) (log p (θ, D) − log q(θ))) (1.11) 6 The third equality is the fact that the expectation of a score function is zero. From the last expectation form in (1.11), we can induce a Monte Carlo estimate as follow: S 1X \ ∇ν L(ν) = ∇ν log q(θs ) (log p (θs , D) − log q(θs )) , (1.12) S s=1 where θs is a draw from the current q(θ; ν). Also, we define an empirical ELBO as the Monte Carlo estimate: S 1X [ L(ν) = (log p (θs , D) − log q(θs )) , (1.13) S s=1 Replacing ∇ν L(ν) in (1.10) with the unbiased estimate in (1.12), even though the closed forms of ELBO and its gradients are not exactly computable, we can update variational parameters until the empirical ELBO (1.13) converges. 1.3 Adaptive learning rates To optimize the objective function (1.9), appropriate learning rates ηt in (1.10) can vary widely between variational parameters. Instead of a single learning rate, one can use an adaptive learning rate method. Kingma and Ba (2014) proposed an algorithm for first-order gradient-based optimization based on adaptive estimates of lower-order moments named Adam. Adam computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients. Algorithm 1.1 outlines the Adam algorithm. Adam algorithm updates exponential moving averages of the gradient (mt ) and the squared gradient (ut ) where α1 , α2 ∈ [0, 1) control the exponential decay rates of these mov- ing averages. Kingma and Ba (2014) suggested good default settings α0 = 0.001, α1 = 0.9, α0 = 0.999, and ϵ = 10−8 . We use Adam learning rates in Chapter 3 for our variable selection algorithm. 7 Algorithm 1.1 Adam learning rates. Initialize: α0 : Stepsize Initialize: α1 , α2 ∈ [0, 1): Exponential decay rates for the moment estimates Initialize: The hyper parameters m0 ← 0 and u0 ← 0 1: while ELBO increases  do 2: Draw θ ∼ q θ; ν , s = 1, . . . 
, S; (s) (t) 3: Get ∇\ ν L(ν) based on the  S samplepoints; 4: mt ← α1 mt−1 + (1 − α1 ) · ∇\ ν L(ν) ;  2 5: ut ← α2 ut−1 + (1 − α2 ) · ∇\ ν L(ν) ; 6: m̂t ← mt /(1 − α1t ), where α1t denotes α1 to the power of t; 7: ût ← ut /(1 − α2t ), where α2t denotes α2 to the power of t;  √  8: Update the parameter of interest: ν ← ν (t) (t−1) + α0 · m̂t / û + ϵ ; 9: end while Output: Optimal variational parameters ν ∗ 1.4 Variable selection Statistical methodologies are well-established if the set of variables to consider is fixed and small. However, in high-dimensional setup in which the number of covariates (p) are much larger than the number of observations (n), classical models do not perform well, which leads statisticians to develop various variable selection methods. Variable selection in statistics means selecting among many variables which to include in a statistical model. From a total list of variables, significant variables are selected by removing irrelevant or redundant variables. In this section, we introduce a Bayesian variable selection method based on a spike and slab prior which was first suggested by Mitchell and Beauchamp (1988). In a linear regression model, consider a sparse vector of the regression coefficients β = (β1 , . . . , βp )⊤ ∈ Rpn : y = Xβ + e, (1.14) where p > n, y ∈ Rn , X ∈ Rn×p , and e ∈ Rn . We assume that only a few number of βi is nonzero and define an activation set of nonzero coefficients: A := {βi : βi ̸= 0}. (1.15) 8 To select explanatory variables corresponding to the nonzero coefficients, it is desirable that a tall and narrow function around zero is assigned to βi ’s in Ac . We call the tall and narrow function a "Spike" distribution. Whereas, for βi ’s in A, we use a flatter and diffused function called a "Slab" distribution. Plus, given that it is not known a priori which covariates should be included in a model, we introduce a latent binary vector γ = (γ1 , . . . , γp )⊤ ∈ {−1, 1}pn . If γi = −1, a spike distribution is used as a prior of βi . If γi = 1, for a prior of βi , a slab distribution is used as follows: p Y p(β | γ) = p(βi | γi ), i=1 1 − γi 1 + γi βi | γi ∼ f1 (βi ) + f2 (βi ), 2 2 where f1 and f2 denote a spike distribution and a slab distribution respectively. One simple choice of a prior distribution of γ is independent Bernoulli: p Y (1+γi )/2 p(γ) = ϕi (1 − ϕi )(1−γi )/2 , (1.16) i=1 where ϕi is the probability that γi = 1. Various versions of spike and slab priors have emerged in the past literature. Mitchell and Beauchamp (1988) used a mixture of a point mass at zero (spike) and a diffuse uniform distribution (slab). With a point mass spike distribution, many previous studies used a Gaussian slab distribution which include but not limited to Li and Zhang (2010), Andersen et al. (2014), Xi et al. (2016) and Andersen et al. (2017). Another group of previous researches which includes but not limited to Johnstone and Silverman (2004), Castillo and Roquain (2020), and Ray and Szabó (2021) used a Laplace (double exponential) slab with a point mass spike. Two continuous distributions have been also considered as spike and slab distributions. George and McCulloch (1993) used a mixture Gaussian prior as follows: 1 − γi 1 + γi βi | γi ∼ N (0, τi2 ) + N (0, c2i τi2 ), 2 2 where ci is set to be large. Park et al. (2022) used similar mixture Gaussian priors with applications to educational data. 
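As a small illustration of the spike-and-slab construction (using the independent Bernoulli prior (1.16) on γ and the continuous Gaussian mixture of George and McCulloch (1993); the hyperparameter values below are arbitrary placeholders), one can draw (γ, β) as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_spike_slab(p, phi=0.1, tau=0.01, c=100.0):
    """Draw gamma_i in {-1, 1} from (1.16) and beta_i | gamma_i from the
    Gaussian spike/slab mixture of George and McCulloch (1993)."""
    gamma = np.where(rng.random(p) < phi, 1, -1)   # gamma_i = 1 with probability phi
    sd = np.where(gamma == 1, c * tau, tau)        # slab sd (c * tau) vs. spike sd (tau)
    beta = rng.normal(0.0, sd)
    return gamma, beta

gamma, beta = draw_spike_slab(p=20)
print("active coordinates:", np.where(gamma == 1)[0])
```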
In a series of studies Ročková and George (2016), Ročková 9 (2018), and Ročková and George (2018), they have developed spike and slab LASSO priors using two Laplace distributions as follows: p     Y 1 − γi λ0 1 + γi λ1 p(β | γ) = exp (−λ0 |βi |) + exp (−λ1 |βi |) , i=1 2 2 2 2 where λ0 is a large scale parameter and λ1 is a smaller scale parameter. Under the spike and slab LASSO priors, Gan et al. (2019) proposed an approach for precision matrix estima- tion called BAGUS, short for “Bayesian regularization for Graphical models with Unequal Shrinkage”. We can simply use the independent Bernoulli prior distributions on γ in (1.16) assum- ing that there is no inter-dependence between covariates or between regression coefficients. Dependent data, however, are now routinely analyzed. Some previous researches have used combination of an Ising model and a spike and slab prior to facilitate the catch of depen- dence. Li and Zhang (2010) considered Ising prior on γ and suggested a MCMC method with known structure among the covariates. Li et al. (2015) proposed a joint Ising and DiriChlet process for grouping and selecting significant voxels in functional magnetic reso- nance imaging (fMRI) data. In Chapter 3, using a spike and slab prior with Ising model on γ, we will describe a VB algorithm for simultaneously selecting significant explanatory variables and estimating the regression coefficients when there exists structural dependence among the regression coefficients. 1.5 Posterior consistency Establishing posterior consistency with contraction rates of a statistical method has been a fundamental research topic for a Bayesian to provide theoretical justification. In an esti- mation problem, the basic idea of posterior consistency is that the posterior distribution is concentrated around the true parameter. Many previous studies have constructed a theoret- ical basis for Bayesian methodology by dealing with posterior consistency. For example, in a high-dimensional linear regression setup, the posterior consistency of the regression coeffi- cients with popular shrinkage priors under mild conditions was proved. More specifically, in 10 the high-dimensional linear model, y = Xβ + e, posterior distribution satisfies: π (β : ||β − β 0 || > ϵ | y) → 0, where ϵ is any positive constant and β 0 are the true regression coefficients (See Armagan et al. (2013) for detailed statements and proof). As other examples, Sriram et al. (2013) provided a justification to use of (misspecified) asymmetric Laplace density for the response in Bayesian quantile regression by proving posterior consistency. More recently, Ghosh et al. (2018) considered a VAR model with two priors for the coefficient matrix and showed pos- terior consistency. Cao et al. (2019), for a covariance estimation and selection problem, established strong graph selection consistency and posterior convergence rates for estima- tion using Gaussian directed acyclic graph model. For VB, there are a few theoretical results. Wang and Blei (2019) established frequentist consistency and asymptotic normality of VB methods by proving a variational Bernstein–von Mises theorem. Bhattacharya and Maiti (2021) established the mean-field variational posterior consistency for a feed-forward artificial neural network model. 
In our two-parameter Ising model estimation problem, we say the posterior is consistent if   (n) π θ ∈ Nnc | X (n) → 0 in P0 probability, (n) where Nn is a shrinking neighborhood of the true parameter (β0 , B0 ), P0 is the distribution induced by the true likelihood (1.1) with (β0 , B0 ), and X (n) is a sample vector from the Ising model with (β0 , B0 ). Analogous to the true posterior, we say the variational posterior is consistent if (n) q ∗ (θ ∈ Nnc ) → 0 in P0 probability. In Chapter 3, we derive variational posterior consistency with contraction rates obtained under the pseudo-likelihood and Gaussian mean-field variational family. 11 CHAPTER 2 A VARIATIONAL BAYES ALGORITHM AND POSTERIOR CONSISTENCY FOR TWO-PARAMETER ISING MODEL ESTIMATION In this chapter, we describe a VB algorithm for Ising model estimation under pseudo- likelihood along with numerical studies for assessing performances. 2.1 Introduction Estimation of Ising model parameters has received considerable attention in statistics and computer science literature. The existing literature can be broadly divided into two groups. Some literature assume that i.i.d. (independently and identically distributed) copies of data are available for inference; see Anandkumar et al. (2012), Bresler (2015), Lokhov et al. (2018), Ravikumar et al. (2010), and Xue et al. (2012). Another category of literature assumes that only one sample is observable; see Bhattacharya et al. (2018), Chatterjee et al. (2007), Comets (1992), Comets and Gidas (1991), Ghosal et al. (2020), Gidas (1988), and Guyon and Künsch (1992). In this dissertation, using variational Bayes (VB), we provide a new Bayesian methodology for model parameter estimation when one observes the data only once. Under the assumption of only one observation, Comets and Gidas (1991) showed that the MLE of β > 0 for Curie-Weiss model is consistent if B ̸= 0 is known, and vice versa. They also proved that the joint MLE does not exist when neither β nor B is given. In this regard, Ghosal et al. (2020) addressed joint estimation of (β, B) using pseudo-likelihood and showed that the pseudo-likelihood estimator is consistent under some conditions on coupling matrix An . We also assume only one observation of x and provide a variational Bayes algorithm for model parameter estimation with its posterior consistency. One of the main challenges in the Bayesian estimation of Ising models lies in the in- tractable nature of the normalizing constant in the likelihood. Following the works of Ghosal et al. (2020), Bhattacharya et al. (2018) and Okabayashi et al. (2011), we replace the true 12 likelihood of the Ising model by a pseudo-likelihood and we establish that the posterior based on the pseudo-likelihood is consistent for a suitable choice of the prior distribution. Further, we use variational Bayes (VB) approach which has recently become a popular and compu- tationally powerful alternative to MCMC. In order to approximate the unknown posterior distribution using VB, we propose a Gaussian mean field family and general bivariate nor- mal family with transformation of the parameters to (log β, B). For implementation of VB, we consider a black box variational inference (BBVI), Ranganath et al. (2014). In BBVI, we need to be able to evaluate the likelihood to compute the gradient estimates, but the existence of an unknown normalizing constant in likelihood of Ising model prevents us using BBVI directly. So, as mentioned above, we use pseudo-likelihood as in Ghosal et al. (2020). 
Replacing the true likelihood of Ising model with pseudo-likelihood, we are able to compute all the quantities needed for implementing BBVI. Our VB algorithm based on optimiza- tion is computationally more powerful than the sampling based MCMC methods (Møller et al., 2006). Also, use of PyTorch’s automatic differentiation enables us to further reduce computational costs. 2.2 VB algorithm Let θ = (β, B) be the parameter set in a two-parameter Ising model. To develop vari- ational Bayes algorithm, we consider the following independent prior distribution p(θ) = pβ (β)pB (B), with pβ (β) as a log-normal prior for β and pB (B) as a normal prior for B as follows: 1 (log β)2 1 B2 pβ (β) = √ e− 2 , pB (B) = √ e− 2 . (2.1) β 2π 2π The assumption of log-normal prior on β is to ensure the positivity of β. Let L(θ) be the pseudo-likelihood function given by (1.2), then the above prior structure leads to the following posterior distribution R R π(θ, X (n) )dθ L(θ)p(θ)dθ Π(A | X (n) )= A = RA , (2.2) m (X (n) ) L(θ)p(θ)dθ 13 for any set A ⊆ Θ where Θ denotes the parameter space of θ. Note, π(θ, X (n) ) is the joint density of θ and the data X (n) and m X (n) = L(θ)p(θ)dθ is the marginal density of X (n)  R which is free from the parameter set θ. Next, we provide a variational approximation to the posterior distribution (2.2) considering two choices of the variational family in order to obtain approximated posterior distribution (variational posterior). One candidate of our variational family, for the virtue of simplicity, is a mean-field (MF) Gaussian family as follows:   Q MF 2 = q(θ) | q(θ) = qβ (β)qB (B), log β ∼ N (µ1 , σ1 ), B ∼ N (µ2 , σ2 ) . 2 (2.3) The above variational family is the same as a lognormal distribution on β and normal distribution on B. Also, we point out that β and B are independent in QMF and each q(θ) ∈ QMF is governed by its own parameter set, ν MF = (µ1 , µ2 , σ12 , σ22 )⊤ . ν MF denotes the set of variational parameters which will be updated to find the optimal variational distribution closest to the true posterior (2.2). In addition, we suggest a bivariate normal (BN) family to exploit the interdependence among the parameters (β, B) as follows: ( ) Q BN = q(θ) | q(θ) = q(β, B), (log β, B) ∼ M V N (µ, Σ) , (2.4)     µ1  σ11 σ12  where µ =   and Σ =  . µ2 σ12 σ22 QMF can also be represented as (independent) bivariate normal family. The variational parameters of BN family are ν BN = (µ1 , µ2 , σ11 , σ22 , σ12 )⊤ . Once a variational family is se- lected, one can find the variational posterior by maxiimizing the ELBO between a variational distribution q ∈ Q and the true posterior (2.2). Recall the updates in (1.10) and the Monte Carlo estimates of the gradients in (1.12): ν (t+1) ← ν (t) + ηt ∇\ν L(ν), S \ 1X ∇ν L(ν) = ∇ν log q(θs ) (log p (θs , D) − log q(θs )) . S s=1 14 Starting with initial values ν (0) , we update the variational parameters in the direction of increasing ELBO using BBVI. The summary of BBVI algorithm is shown in Algorithm 2.1. Algorithm 2.1 Black box variational inference (BBVI) Initialize: p(θ), q(θ; ν (0) ) and learning rate sequence ηt . 1: while ELBO increases  do 2: Draw θ ∼ q θ; ν , s = 1, . . . , S; (s) (t) 3: Get ∇\ ν L(ν) based on the S sample points; 4: Update ν (t+1) ← ν (t) + ηt ∇\ν L(ν); 5: end while Output: Optimal variational parameters ν ∗ In the next subsection, we discuss more details of implementing BBVI with the mean-field family (2.3). 
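For concreteness, and ahead of the detailed gradient formulas in the next two subsections, the sketch below implements one pass of Algorithm 2.1 under the MF family (2.3) with the priors (2.1), using the score-function gradient estimate (1.12) together with the pseudo-likelihood. It is a simplified NumPy stand-in for the PyTorch implementation mentioned in the text: only µ1 and µ2 are updated, with a plain learning rate in place of Adam, and additive constants are dropped from the log densities.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pseudo_lik(x, A, beta, B):
    """Log pseudo-likelihood (1.2)."""
    m = A @ x
    return np.sum(beta * x * m + B * x - np.log(np.cosh(beta * m + B))) - len(x) * np.log(2.0)

def log_prior(beta, B):
    """Log of the independent log-normal / normal prior (2.1), constants dropped."""
    return -np.log(beta) - 0.5 * np.log(beta) ** 2 - 0.5 * B ** 2

def log_q(z, mu, sigma):
    """Log density of the MF family (2.3) at theta = (exp(z[0]), z[1]), constants dropped."""
    return -0.5 * np.sum((z - mu) ** 2 / sigma ** 2) - np.sum(np.log(sigma)) - z[0]

def bbvi_step_mu(x, A, mu, sigma, eta=0.01, S=20):
    """One BBVI update (1.10) of (mu1, mu2); sigma1 and sigma2 are held fixed for brevity."""
    grad = np.zeros(2)
    for _ in range(S):
        z = rng.normal(mu, sigma)                    # z = (log beta, B) drawn from q
        beta, B = np.exp(z[0]), z[1]
        weight = log_pseudo_lik(x, A, beta, B) + log_prior(beta, B) - log_q(z, mu, sigma)
        grad += (z - mu) / sigma ** 2 * weight       # score-function estimate (1.12)
    return mu + eta * grad / S
```

A complete implementation would also update σ1 and σ2 through the softplus reparametrization described in the next subsection and stop once the empirical ELBO (1.13) stabilizes.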
2.2.1 VB algorithm with MF family In the mean-field family, we have four variational parameters (µ1 , µ2 , σ1 , σ2 ). For each vari- ational parameter ν, we should compute ∇ν log q(θ) to evaluate the Monte Carlo gradients. Although one can simply use PyTorch’s automatic differentiation without manual calcula- tion, it is worth manually calculating the gradients to fully understand a BBVI algorithm. First, for µi , i = 1, 2, the gradients are: log β − µ1 ∇µ1 log q(θ; ν) = , σ12 B − µ2 ∇µ2 log q(θ; ν) = . σ22 For σi , i = 1, 2, we need to be more careful because σi must be always positive. During the updates, it may occur that σi takes a negative value. In order to preclude this issue, ′ we consider a reparametrization σi = log 1 + eσi and update the quantity σi′ , as a free 15 parameter, instead of updating σi . Then, ′ eσ1 (log β − µ1 )2   1 ∇σ1′ log q(θ; ν) = ′ − , 1 + eσ1 σ13 σ1 ′ eσ2 (B − µ2 )2   1 ∇σ2′ log q(θ; ν) = ′ − . 1 + eσ2 σ23 σ2 In the next subsection, we discuss more details with the bivariate normal family (2.4). 2.2.2 VB algorithm with BN family In addition to the positivity condition on some variational parameters, we should control postive definiteness of the covariance matirx Σ in (2.4). To guarantee the positive defi- niteness, we use Cholesky decomposition such that Σ = LLT , where L = ll11 0 . We  12 l22 update the elements of the lower triangular matrix (l11 , l12 , l22 ) in place of directly updating (σ11 , σ12 , σ22 ) to avoid the cases of negative definite Σ. Computation of ∇ν log q(θ; ν) with BN family is as follows: 2  2  1 (l22 + l12 )(log β − µ1 ) l12 (B − µ2 ) ∇µ1 log q(θ; ν) = 2 − , l22 l11 l11   1 l12 (log β − µ1 ) ∇µ2 log q(θ; ν) = 2 B − µ2 − , l22 l11 ∇l11 ′ log q(θ; ν) ′ el11 2 )(log β − µ1 )2 l12 (log β − µ1 )(B − µ2 )   2  1 1 (l22 + l12 = ′ − + 2 3 − 2 , 1 + el11 l11 l22 l11 l11 ∇l22 ′ log q(θ; ν) ′ el22  2 2 )(log β − µ1 )2 2l12 (log β − µ1 )(B − µ2 )  1 (l22 + l12 = ′ 2 2 − l22 (1 + el22 ) l22 l11 l11  2  2 ! (B − µ2 )) log β − µ1 + − −1 , l22 l11 1 (log β − µ1 )(B − µ2 ) l12 (log β − µ1 )2   ∇l12 log q(θ; ν) = 2 − 2 , l22 l11 l11 ′  where lii = log 1 + elii for i = 1, 2. 16 2.3 PMLE and MCMC We compare our VB algorithm with two other methods, a PMLE method (Ghosal et al., 2020) and a MCMC based method (Møller et al., 2006). In thid section, we briefly introduce the two competitors. PMLE: Let h(β, B) denote the pseudo-likelihood in (1.2). Ghosal et al. (2020) used grid search to find pseudo maximum likelihood estimate (PMLE) for Ising parameters which simultaneously satisfies ∂ ∂β log h(β, B) = 0 and ∂ ∂B log h(β, B) = 0. n ∂ X log h(β, B) = mi (x) (xi − tanh (βmi (x) + B)) = 0, ∂β i=1 n ∂ X log h(β, B) = (xi − tanh (βmi (x) + B)) = 0. ∂B i=1 We create a grid such that the search space for β contains all points from 0.1 to 2 in increments of 0.01 and the search space for B increases from -1 to 1 by 0.01. MCMC: Let p(θ) be a prior distribution of θ and p(x | θ) be the true likelihood in (1.1). The closed form of the true posterior (1.4) is not available because the integral in the denominator is not analytically tractable. Instead, one can use a sampling method. Consider a Metropolis-Hastings ratio given by: p(θ′ )fθ′ (x)u(θ | θ′ ) Zn (θ) M H(θ′ | θ) = × , (2.5) p(θ)fθ (x)u(θ′ | θ) Zn (θ′ ) where u(θ′ | θ) is the proposal density and fθ (x) is an unnormalized density of Ising model, i.e, p(θ | x) = Zn (θ)−1 fθ (x). 
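The grid search described above is straightforward to code; the sketch below (our own helper, not the authors' implementation) maximizes log h(β, B) over the stated grid, which locates the same point as solving the two score equations up to the grid resolution.

```python
import numpy as np

def pmle_grid(x, A):
    """Pseudo maximum likelihood estimate by grid search, with the grid of the text:
    beta from 0.10 to 2.00 and B from -1.00 to 1.00, both in steps of 0.01."""
    m = A @ x
    beta_grid = np.arange(0.10, 2.00 + 1e-9, 0.01)
    B_grid = np.arange(-1.00, 1.00 + 1e-9, 0.01)
    vals = np.array([[np.sum(b * x * m + B * x - np.log(np.cosh(b * m + B)))
                      for B in B_grid] for b in beta_grid])
    i, j = np.unravel_index(np.argmax(vals), vals.shape)
    return beta_grid[i], B_grid[j]
```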
Obtaining the Metropolis-Hastings ratio, however, is still lim- ited because we cannot compute the normalizing constant in p(x | θ) even with a moderate amount of n. In order to remove the ratio of normalizing constants in (2.5), Møller et al. (2006) proposed an alternative approach using an auxiliary variable z with conditional dis- tribution g(z | θ, x). Targeting π(θ, z | x) ∝ p(θ)g(z | θ, x) Zn1(θ) fθ (x) instead of π(θ | x) and taking the proposal density for z ′ to be an Ising likelihood depending on θ′ , that is, z′ ∼ 1 f ′ (z ′ ), Zn (θ′ ) θ we can cancel the normalizing constants in the Metropolis-Hastings ratio 17 (2.5) as follows: g(z ′ | θ′ , x)p(θ′ )fθ′ (x)fθ (z)u(θ | θ′ ) M H(θ′ , z ′ | θ, z) = . (2.6) g(z | θ, x)p(θ)fθ (x)fθ′ (z ′ )u(θ′ | θ) As suggested by Møller et al. (2006), the conditional density g(z | θ, x) is approximated by fθ̃ (z)/Zn (θ̃) which does not depend on θ, where θ̃ is PMLE. Also, using independent log-normal and normal distributions so that u(θ | θ′ )/u(θ′ | θ) = 1, the ratio (2.6) further reduces to: fθ̃ (z ′ )fθ′ (x)fθ (z) M H(θ′ , z ′ | θ, z) = 1 (θ ∈ Θ) . (2.7) fθ̃ (z)fθ (x)fθ′ (z ′ ) We accept the proposal (θ′ , z ′ ) as a new state with probability min{1, M H(θ′ , z ′ | θ, z)}. Although Møller et al. (2006)’s approach intelligently controls the normalizing constant, the algorithm is still computationally expensive because it involves sampling from an Ising likelihood at each Metropolis-Hastings iteration. 2.4 Numerical experiments 2.4.1 Generating observed data For numerical experiments, we need a coupling matrix An and an observed vector xobserved from (1.1). First, for generating a random d-regular graph and its scaled adjacency matrix, we use a python package NetworkX. Using the scaled adjacency matrix as our coupling matrix An , we facilitate Metropolis-Hastings algorithm to generate an observed vector xobserved with true parameters (β0 , B0 ) as follows: 0. Define H(x) = xi and start with a random binary vector x = β0 ⊤ Pn 2 x An x + B0 i=1 (x1 , . . . , xn )⊤ . 1. Randomly choose a spin xi , i ∈ {1, . . . , n}. 2. Flip the chosen spin, i.e. xi = −xi , and calculate ∆H = H(xnew ) − H(xold ) due to this flip. 18 3. The probability that we accept xnew is:  if ∆H > 0,  1,  P(accept xnew ) = exp (∆H) , otherwise.   4. If rejected, put the spin back, i.e. xi = −xi . 5. Go to 1 until the maximum number of iterations (L) is reached. 6. After L = 1, 000, 000 iterations, the last result is a sample xobserved we use. One can read Izenman (2021) for more details of sampling from Ising model. 2.4.2 Performance Comparison We compare the performance of the parameter estimation methods for two-parameter Ising model (1.1) under various combinations of (d, n) and (β0 , B0 ). Using the given coupling matrix An for each scenario, we repeat the following steps R times: • Generate an observed vector x from (1.1) with true parameters (β0 , B0 ). • Using the proposed BBVI algorithm with MF family or BN family, obtain the optimal variational parameters ν ∗ . • We get the estimates θ̂ = (log β̂, B̂)⊤ = (µ∗1 , µ∗2 )⊤ . We use S = 20 or S = 200 as the Monte Carlo sample size in (1.12). Figure 2.1 describes ELBO convergences for the two different sample sizes with MF family and BN family. The figure indicates that the ELBO converges well with a moderate choice of S. Further, for more stable convergence, one might choose higher S and BN family over MF family. 
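The data-generation procedure of Section 2.4.1 translates directly into code; the following sketch follows steps 0–6 with our own helper names, computing ΔH locally rather than re-evaluating H (valid because the scaled adjacency matrix is symmetric with zero diagonal). A smaller L than the 1,000,000 sweeps used in the experiments may be preferred for quick checks.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_ising(A, beta0, B0, L=1_000_000):
    """Single-spin-flip Metropolis sampler for (1.1), following steps 0-6 of Section 2.4.1."""
    n = A.shape[0]
    x = rng.choice([-1.0, 1.0], size=n)                  # step 0: random starting configuration
    for _ in range(L):
        i = rng.integers(n)                              # step 1: pick a spin at random
        # step 2: Delta H for flipping x_i (A symmetric, zero diagonal)
        dH = -2.0 * x[i] * (beta0 * (A[i] @ x) + B0)
        if dH > 0 or rng.random() < np.exp(dH):          # step 3: accept with prob min(1, e^dH)
            x[i] = -x[i]
        # step 4: otherwise the spin is left unchanged
    return x                                             # step 6: final state is x_observed
```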
We use Mean squared error (MSE) as the measurement for assessing the performances with R = 50 pairs of estimates (β̂1 , B̂1 ), . . . , (β̂R , B̂R ): R   1X  2 2 M SE = β̂r − β0 + B̂r − B0 . (2.8) R r=1 19 Figure 2.1: Left: ELBO convergence with two variational families (BN and MF) and S = 20. Right: ELBO convergence with two variational families (BN and MF) and S = 200. Blue lines represent BN family and orange lines represent MF family in each plot. For each pair of (β0 , B0 ) we take d = 10, 50. The two numbers in each cell of the tables, Table 2.1, 2.2, 2.3, and 2.4, represent MSE values or convergence time when n = 100 and n = 500 respectively. First, we consider a small value of β0 = 0.2 with B0 = ±0.2, ±0.5. In these cases, as shown in Table 2.1 and 2.2, PMLE is the fastest but less accurate especially for d = 50. MCMC achieves smaller MSEs but it has the highest runtimes. Our VB methods notably reduce the runtimes without compromising accuracy. Second, for higher interaction parameter β0 = 0.7 with B0 = ±0.2, ±0.5, our VB algo- rithms are more accurate than the others (See Table 2.3 and 2.4). The numerical studies validate the superiority of our proposed VB methods. For more practical applications, we used our algorithm to regenerate an image in the next subsection. 2.4.3 Image reconstruction Ising model can be used for constructing an image in computer vision field. In particular, the Bayesian procedure facilitate the reconstruction easily by using the posterior predictive distribution Halim (2007). Consider an image in which each pixel represents either −1(white) or 1(black). For choice of coupling matrix An , we use twelve-nearest neighbor structure and construct corresponding scaled adjacency matrix Hurn et al. (2003). Then, we generate such images following the steps in the subsection 2.4.1 with a true parameter pair (β0 , B0 ) and use 20 Table 2.1: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.2, 0.2) (0.2, −0.2) Convergence graph (d) samples (S) time (sec) 10 PMLE1 - 0.119 / 0.049 0.091 / 0.022 3.2 / 3.6 MCMC2 - 0.098 / 0.194 0.097 / 0.146 165.2 / 675.8 MF3 family 20 0.059 / 0.018 0.057 / 0.014 6.3 / 10.3 200 0.037 / 0.027 0.032 / 0.013 10.7 / 15.9 BN4 family 20 0.064 / 0.021 0.061 / 0.018 7.0 / 12.3 200 0.041 / 0.026 0.035 / 0.009 12.3 / 17.8 50 PMLE - 0.369 / 0.105 0.299 / 0.170 3.1 / 3.6 MCMC - 0.165 / 0.084 0.144 / 0.110 168.1 / 678.0 MF family 20 0.072 / 0.027 0.073 / 0.030 6.4 / 10.1 200 0.070 / 0.023 0.070 / 0.027 10.8 / 15.9 BN family 20 0.081 / 0.045 0.082 / 0.049 7.1 / 12.1 200 0.073 / 0.058 0.067 / 0.067 12.2 / 17.5 1 PMLE, pseudo maximum likelihood estimate (Ghosal et al., 2020); 2 MCMC, markov chain monte carlo (Møller et al., 2006); 3 MF, mean-field; 4 BN, bivariate normal it as our given data xobserved . With the generated image xobserved and coupling matrix An , we obtain (β̂, B̂) after implementing the parameter estimation procedure based on BN family. The estimates (β̂, B̂) are used for data regeneration following the steps in the subsection 2.4.1 again. In Figure 2.2, we plot two original images in the left column. The first original image was generated with β0 = 1.2, B0 = 0.2 (left top) and we use β0 = 1.2, B0 = −0.2 for the second one (left bottom). 
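For the image experiments, the coupling matrix is the scaled adjacency matrix of a nearest-neighbour pixel lattice; the dissertation uses a twelve-nearest-neighbour structure following Hurn et al. (2003), while the sketch below uses the simpler four-nearest-neighbour lattice to keep the construction short (the n/(2|Gn|) scaling of Definition 1 is the same).

```python
import numpy as np

def lattice_scaled_adjacency(rows, cols):
    """Scaled adjacency matrix (Definition 1) for a rows x cols pixel lattice with
    4-nearest-neighbour edges (the experiments in the text use 12 neighbours)."""
    n = rows * cols
    A = np.zeros((n, n))
    idx = lambda r, c: r * cols + c
    for r in range(rows):
        for c in range(cols):
            for dr, dc in [(0, 1), (1, 0)]:          # right and down neighbours, each edge once
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    A[idx(r, c), idx(rr, cc)] = A[idx(rr, cc), idx(r, c)] = 1.0
    n_edges = int(A.sum() / 2)
    return (n / (2.0 * n_edges)) * A

A_img = lattice_scaled_adjacency(32, 32)             # coupling matrix for a 32 x 32 image
```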
Also, in the right column, there are two corresponding images regenerated with (β̂ = 1.071, B̂ = 0.357) (right top) and (β̂ = 0.982, B̂ = −0.326) (right bottom) respectively. It seems that, using two-parameter Ising model and our VB method, we can reconstruct the overall tendency of black and white images fairly well. For more precise pixel-by-pixel reconstruction, one can utilize multiple threshold parameters, B = (B1 , . . . , Bn )⊤ (See the section 2.6). 21 Table 2.2: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.2, 0.5) (0.2, −0.5) Convergence graph (d) samples (S) time (sec) 10 PMLE - 0.270 / 0.053 0.126 / 0.068 3.1 / 3.5 MCMC - 0.228 / 0.275 0.174 / 0.255 164.9 / 673.5 MF family 20 0.075 / 0.017 0.075 / 0.017 6.5 / 10.2 200 0.071 / 0.015 0.070 / 0.014 10.8 / 16.1 BN family 20 0.090 / 0.039 0.091 / 0.038 7.1 / 12.0 200 0.088 / 0.033 0.087 / 0.032 12.3 / 17.1 50 PMLE - 0.792 / 0.197 0.653 / 0.251 3.2 / 3.5 MCMC - 0.407 / 0.153 0.380 / 0.169 166.1 / 675.1 MF family 20 0.088 / 0.023 0.089 / 0.024 6.3 / 10.2 200 0.082 / 0.019 0.080 / 0.021 10.4 / 16.4 BN family 20 0.090 / 0.065 0.112 / 0.072 7.2 / 12.1 200 0.110 / 0.081 0.102 / 0.096 12.2 / 17.2 2.5 Real data analysis In two-parameter Ising model, higher value of β implies stronger interactions between con- nected nodes and the threshold parameter B controls the model size (number of 1s), where the model size is greater for B > 0, smaller for B < 0. In this section, applying our meth- ods to a real data set, we obtain the estimated interaction parameter and the threshold parameter (β̂, B̂). 2.5.1 Data description Stanford Network Analysis Project (SNAP) provides a Facebook network data set (Leskovec and Krevl, 2014) available at http://snap.stanford.edu/data/ego-Facebook.html. The Facebook network consists of 4,039 nodes and 88,234 edges. Each node represents a Facebook user and there is an edge between two nodes if corresponding users are friends. The data set also contains user features such as birthday, school, gender and location. The features are fully anonymized. For instance, while the original data may include a feature “location 22 Table 2.3: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.7, 0.2) (0.7, −0.2) Convergence graph (d) samples (S) time (sec) 10 PMLE - 0.439 / 0.074 0.393 / 0.083 3.1 / 3.6 MCMC - 0.114 / 0.080 0.111 / 0.064 170.3 / 678.1 MF family 20 0.109 / 0.135 0.095 / 0.144 6.3 / 10.0 200 0.116 / 0.140 0.100 / 0.148 10.5 / 16.0 BN family 20 0.085 / 0.074 0.080 / 0.081 6.9 / 11.9 200 0.087 / 0.080 0.080 / 0.088 12.3 / 17.0 50 PMLE - 0.718 / 0.333 0.730 / 0.386 3.2 / 3.8 MCMC - 0.170 / 0.107 0.172 / 0.120 171.0 / 673.6 MF family 20 0.093 / 0.185 0.093 / 0.178 6.2 / 10.2 200 0.101 / 0.191 0.103 / 0.184 10.4 / 16.1 BN family 20 0.068 / 0.107 0.071 / 0.086 7.0 / 12.2 200 0.068 / 0.117 0.075 / 0.110 12.5 / 17.3 = Michigan", the anonymized data would simply contain“location = anonymized location A". Thus, using the anonymized data, we can determine whether two users stay in the same location, but we do not know where. Among the 4, 039 users, we select only users who disclose gender information to create a sub-graph such that there are 3, 948 nodes and 84, 716 edges in the sub-graph. 
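A hedged sketch of the preprocessing just described: the combined edge list can be read with NetworkX (we assume the SNAP file name facebook_combined.txt; the anonymized feature files, and hence the restriction to the 3,948 users who disclose gender, are omitted here).

```python
import networkx as nx
import numpy as np

# Combined ego-network edge list from the SNAP page (file name assumed);
# nodes are integer ids 0, ..., 4038.
G = nx.read_edgelist("facebook_combined.txt", nodetype=int)
print(G.number_of_nodes(), G.number_of_edges())      # 4039 nodes, 88234 edges

degrees = np.array([d for _, d in G.degree()])
print(degrees.max(), degrees.min(), degrees.mean())

# Scaled adjacency matrix of Definition 1 (dense for simplicity; in practice the
# graph would first be restricted to the users who disclose gender).
n, n_edges = G.number_of_nodes(), G.number_of_edges()
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes())) * n / (2.0 * n_edges)
```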
Later we used the gender information for binary observations. Each node has different number of neighbors (degree). The maximum degree of the sub-graph is 1, 024, and the minimum is 1, with an average degree 42.92 (Figure 2.3 shows the nodes and edges in the reduced network). 2.5.2 Parameter estimation We utilize the selected users (n = 3, 948) as a real data set to apply our VB algorithm with the features school and gender as observed binary vectors. For the school feature, we encode 1 if a user (node) belongs to an anonymized school A, otherwise −1. For the gender feature, we encode a group by 1 and the other group by −1. The model sizes are 114 and 2, 417 for school and gender respectively. Note that, no matter which feature is used, the 23 Table 2.4: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.7, 0.5) (0.7, −0.5) Convergence graph (d) samples (S) time (sec) 10 PMLE - 0.603 / 0.228 0.788 / 0.233 3.2 / 3.7 MCMC - 0.143 / 0.113 0.178 / 0.121 170.0 / 678.5 MF family 20 0.083 / 0.184 0.078 / 0.193 6.3 / 10.1 200 0.103 / 0.188 0.096 / 0.199 10.4 / 16.2 BN family 20 0.049 / 0.074 0.047 / 0.079 7.8 / 12.3 200 0.059 / 0.077 0.056 / 0.083 12.5 / 17.5 50 PMLE - 0.893 / 0.638 0.804 / 0.781 3.1 / 3.8 MCMC - 0.144 / 0.154 0.126 / 0.255 171.9 / 677.1 MF family 20 0.080 / 0.219 0.071 / 0.209 6.2 / 10.0 200 0.095 / 0.225 0.091 / 0.219 10.6 / 16.0 BN family 20 0.044 / 0.080 0.040 / 0.074 7.7 / 12.1 200 0.051 / 0.085 0.048 / 0.079 12.4 / 17.3 connectivity among nodes does not change. One can expect that the interaction parameter β is higher when we use the school feature because people from the same school are more likely to be Facebook friends with each other. Also, we expect the threshold parameter B will be negative for school and positive for gender because of the model sizes. Table 2.5 summarizes the estimated parameters with standard errors (SE) (for VB and MCMC) and runtimes. Table 2.5: The estimated parameters with standard errors (SE) in parentheses and time costs for the features gender and school. Feature Method Monte Carlo β̂ B̂ Convergence samples (S) (SE) (SE) time (sec) Gender PMLE - 0.250(-) 0.180(-) 5.9 MCMC - 0.112(0.066) 0.210(0.031) 27109.2 MF family 200 0.132(0.070) 0.189(0.027) 173.3 BN family 200 0.106(0.089) 0.248(0.033) 271.2 School PMLE - 0.260(-) -1.560(-) 6.0 MCMC - 0.203(0.087) -1.610(0.093) 27059.3 MF family 200 0.252(0.061) -1.445(0.049) 173.4 BN family 200 0.299(0.092) -1.602(0.103) 265.3 24 Figure 2.2: Original images (left) and the estimated images (right). SEs of MCMC in Table 2.5 are calculated based on 10,000 draws after the burn-in period of 10,000 iterations. For our VB methods, to calculate SEs, we sample the same amount of β and B (10,000 draws) from the optimal variational distributions and calculate sample standard deviations. Figure 2.4 indicate density plots of the draws from BN family, MF family, and MCMC for the features gender and school. Computational gain is very clear from the figures in Table 2.5. While estimated parame- ters are comparable for all the methods, the MCMC implementation takes about fifty times more time compared to VB to achieve similar level of accuracy. The PMLE approach does not produce SE and thus limited for statistical inference. 25 Figure 2.3: Visualization of Facebook network data where the size of circle represents the degree of the node. 
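The VB standard errors in Table 2.5 are sample standard deviations of 10,000 draws from the optimal variational distribution; a minimal sketch for the MF family (the variational parameter values below are placeholders, not the fitted ones) is:

```python
import numpy as np

rng = np.random.default_rng(3)

def vb_standard_errors(mu1, sigma1, mu2, sigma2, n_draws=10_000):
    """Sample (beta, B) from the fitted MF variational posterior
    (log beta ~ N(mu1, sigma1^2), B ~ N(mu2, sigma2^2)) and report
    sample standard deviations, as done for Table 2.5."""
    beta_draws = np.exp(rng.normal(mu1, sigma1, n_draws))
    B_draws = rng.normal(mu2, sigma2, n_draws)
    return beta_draws.std(ddof=1), B_draws.std(ddof=1)

# Placeholder variational parameters, for illustration only.
print(vb_standard_errors(mu1=np.log(0.13), sigma1=0.5, mu2=0.19, sigma2=0.14))
```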
2.6 Extension to multi threshold parameters Beyond the two-parameter Ising model which is a main material of this paper, we can extend the parameter estimation procedure using VB to more general Ising model. Allowing multi- threshold parameters, that is, B = (B1 , . . . , Bn ), the likelihood is: n ! (n) 1 β ⊤ X Pβ,B (X = x) = exp x An x + Bi xi . (2.9) Zn (β, B) 2 i=1 Note, there are n + 1 unknown parameters in the likelihood (2.9) and the corresponding pseudo-likelihood is: Yn (n) Pβ,B (Xi = xi | Xj , j ̸= i) i=1 n ! X = 2−n exp (βxi mi (x) + Bi xi − log cosh(βmi (x) + Bi )) . (2.10) i=1 We introduce a VB algorithm for estimating the parameters θ := (log β, B1 , . . . , Bn ) given An and x. As a natural extension, consider the following multivariate Gaussian variational 26 Figure 2.4: Density plots for the estimated parameters (left: β, right: B) from VB with BN family (red), VB with MF family (green), and MCMC (blue) for the features gender and school. family: ( ) QM G = q(θ) | q(θ) = q(β, B), (log β, B) ∼ M V N (µ, Σ) , (2.11) where µ ∈ Rn+1 and Σ ∈ R(n+1)×(n+1) . Without any assumption on Σ, updating all the vari- ational parameters and finding variational posterior require quite demanding computaional costs because the total number of updated parameters is (n + 1) + (n + 1)(n + 2)/2. If we as- sume Σ is a diagonal matrix, that is, the variational family (2.11) is mean-field, the number of parameters to be updated reduces to 2(n+1). Under the pseudo-likelihood (2.10) and mul- tivariate Gaussian family (2.11) with a diagonal covariance matrix Σ = diag(σ12 , . . . , σn2 ), we can easily compute the gradients (1.12) for updating the variational parameters and develop 27 a VB algorithm for multi-threshod parameter Ising model. 2.7 Posterior consistency The main theoretical contribution of this work lies in establishing the consistency of the variational posterior for the Ising model with the true likelihood replaced by the pseudo- likelihood. In this direction, we first establish the rates at which the true posterior based on the pseudo-likelihood concentrates around the εn - shrinking neighborhoods of the true parameters. With a suitable bound on the Kulback-Leibler distance between the true pos- terior (under pseudo-likelihood) and the variational posterior, we next establish the rate of contraction for the variational posterior and demonstrate that the variational posterior also concentrates around εn -shrinking neighborhoods of the true parameter. These results have been derived under three set of assumptions on the coupling matrix An . Indeed, we demon- strate that the variational posterior consistency holds for the same set of assumptions on An as those needed for the convergence of the maximum likelihood estimates based on the pseudo-likelihood. One of the main caveats in establishing the posterior contraction rates under the pseudo-likelihood structure is in ensuring that the concentration of the variational (n) (n) posterior occurs in P0 probability where P0 is the distribution induced by the true like- (n) lihood and not the pseudo-likelihood. Indeed, we could show that in P0 probability, the contraction of variational posterior happens at the rate 1 − 1/Mn in contrast to the faster rate 1 − exp(−Cnε2n ), C > 0 for the true posterior. As a final theoretical contribution, we establish that the variational Bayes estimator convergences to the true parameters at the rate 1/εn where εn can be chosen n−δ , 0 < δ < 1/2 provided the An matrix satisfies certain regularity assumptions. 
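Relative to the two-parameter case, the only change in the pseudo-likelihood is that the scalar B becomes a vector; a minimal sketch of (2.10) is given below. Under the diagonal-covariance (mean-field) version of (2.11), a BBVI implementation then updates 2(n + 1) variational parameters.

```python
import numpy as np

def log_pseudo_lik_multi(x, A, beta, B):
    """Log pseudo-likelihood (2.10) with node-specific thresholds B = (B_1, ..., B_n).
    With all B_i equal, this reduces to the two-parameter pseudo-likelihood (1.2)."""
    m = A @ x
    return np.sum(beta * x * m + B * x - np.log(np.cosh(beta * m + B))) - len(x) * np.log(2.0)
```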
2.7.1 Sketch of Proof

In this subsection, we state the main theorem and provide a sketch of the proof of consistency of the variational posterior (1.7). In this direction, we establish the variational posterior contraction rates, which quantify how well the posterior distribution of β and B under the variational approximation concentrates around the true values β0 and B0. Towards the proof, we make the following assumptions.

Assumption 1 (Bounded row sums of An). The row sums of An are bounded above,

    \max_{i \in [n]} \sum_{j=1}^{n} A_n(i, j) \le \gamma,

for a constant γ independent of n.

Assumption 1 is the same as (1.2) in Ghosal et al. (2020). As a consequence of Assumption 1, it can be shown that |m_i(x)| ≤ γ, i = 1, . . . , n.

Assumption 2 (Mean field assumption on An). Let ϵn → 0 and nϵ_n^2 → ∞ be such that

    (i) \ \sum_{i=1}^{n}\sum_{j=1}^{n} A_n(i, j) = O(n\epsilon_n^2),  \qquad  (ii) \ \sum_{i=1}^{n}\sum_{j=1}^{n} A_n(i, j)^2 = o(n\epsilon_n^2).

Assumption 2-(i) is the same as condition (1.4) in Ghosal et al. (2020) on An with ϵn = 1. Assumption 2-(ii) is the same as (1.6) in Ghosal et al. (2020) with ϵn = 1. For more details on the mean field assumption, we refer to Definition 1.3 in Basak and Mukherjee (2017).

Assumption 3 (Bounded variance of An). Let \bar{A}_n = (1/n) \sum_{i=1}^{n}\sum_{j=1}^{n} A_n(i, j). Then

    \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} A_n(i, j) - \bar{A}_n \Big)^2 > 0.

Finally, Assumption 3 corresponds to (1.7) in Ghosal et al. (2020). The validity of Assumption 3 ensures that T_n(x) = (1/n) \sum_{i=1}^{n} (m_i(x) - \bar{m}(x))^2 is bounded below and above in probability, an essential requirement towards the proof of the contraction rates of the variational posterior.

Let θ = (β, B) be the model parameter and θ0 = (β0, B0) the true parameter from which the data are generated. Let L(θ) and L(θ0) denote the pseudo-likelihood as in (1.2) under the model parameters and the true parameters, respectively. Further, let L0 denote the true probability mass function from which X^{(n)} is generated; thus, L0 is as in (1.1) with θ = θ0. We shall use the notations E_0^{(n)} and P_0^{(n)} to denote expectation and probability mass function with respect to L0.

We next present the main theorem, which establishes the contraction rate for the variational posterior. Following the proof, we establish the contraction rate of the variational Bayes estimator as a corollary. We shall use the term "with dominating probability" to mean that, under P_0^{(n)}, the probability of the event goes to 1 as n → ∞.

Theorem 1 (Posterior Contraction). Let U_{\varepsilon_n} = \{\theta : \|\theta - \theta_0\|_2 \le \varepsilon_n\} be a neighborhood of the true parameters. Suppose ϵn satisfies Assumption 2. Then, in P_0^{(n)} probability,

    Q^*(U_{\varepsilon_n}^c) \to 0, \quad n \to \infty,

where \varepsilon_n = \epsilon_n \sqrt{M_n \log n} for any slowly increasing sequence M_n \to \infty satisfying \varepsilon_n \to 0.

The above result establishes that the posterior distribution of β and B concentrates around the true values β0 and B0 at a rate slightly larger than ϵn. The proof of the theorem rests on the following lemmas, whose proofs are deferred to Sections 2.10, 2.11, and 2.12.

Lemma 1. There exists a constant C0 > 0 such that, for any ϵn → 0 with nϵ_n^2 → ∞,

    P_0^{(n)}\Big( \int_{U_{\epsilon_n}^c} \log\frac{L(\theta)}{L(\theta_0)}\, p(\theta)\, d\theta \le -C_0\, n\epsilon_n^2 \Big) \to 1, \quad n \to \infty.

Lemma 2. Let ϵn be the sequence satisfying Assumption 2. Then, for any C > 0,

    P_0^{(n)}\Big( \int \log\frac{L(\theta)}{L(\theta_0)}\, p(\theta)\, d\theta \le C\, n\epsilon_n^2 \log n \Big) \to 1.

Lemma 3. Let ϵn be the sequence satisfying Assumption 2. Then, for some Q ∈ Q_{MF} and any C > 0,

    P_0^{(n)}\Big( \int \log\frac{L(\theta_0)}{L(\theta)}\, q(\theta)\, d\theta \le C\, n\epsilon_n^2 \log n \Big) \to 1.

Lemma 1 and Lemma 2 taken together suffice to establish the posterior consistency of the true posterior based on the pseudo-likelihood L(θ) as in (2.2).
Lemma 3 on the other hand 30 is the additional condition which needs to hold to ensure the consistency of the variational posterior. We next state an important result which relates the variational posterior to the true posterior. Formula for KL divergence: By Corollary 4.15 in Boucheron et al. (2013), Z Z  f KL(P1 , P2 ) = sup f dP1 − log e dP2 . f Using the above formula in the context of variational distributions, we get Z Z ∗ ∗ f dQ ≤ KL(Q , Π(| X (n) )) + log ef dΠ(| X (n) ). (2.12) The above relation serves as an important tool towards the proof of Theorem 1. Next, we provide a brief sketch of the proof. Further details on the proof have been deferred to Section 2.13. Sketch of proof of Theorem 1: Let f = (C0 /2)nε2n 1[θ ∈ Uεcn ], then 2 (C0 /2)nε2n Q∗ (Uεcn ) ≤ KL(Q∗ , Π(| X (n) )) + log(e(C0 /2)nεn Π(Uεcn | X (n) ) + Π(Uεn | X (n) )) 2 2 2 =⇒ Q∗ (Uεcn ) ≤ 2 KL(Q∗ , Π(| X (n) )) + 2 log(1 + e(C0 /2)nεn Π(Uεcn | X (n) )). C0 nεn C0 nεn By Lemma 2 and 3, it can be established with dominating probability for any C > 0, as n→∞ KL Q∗ , Π(| X (n) ) ≤ Cnϵ2n log n.  By Lemma 1 and 2, it can be established with dominating probability, as n → ∞ 2 Π(Uεcn | X (n) ) ≤ e−C1 nεn , (2.13) for any C1 > C0 /2. Therefore, with dominating probability 2C 2  −(C1 −C0 /2)nε2n  Q∗ (Uεcn ) ≤ + log 1 + e C0 Mn C0 nε2n 2 2C e−(C1 −C0 /2)nεn ∼ + → 0. (2.14) C0 Mn C0 nε2n This completes the proof. 31 Note that (2.13) gives the statement for the contraction of the true posterior. Similarly the contraction rate for the variational posterior follows as a consequence of (2.14). An important difference to note is that Q∗ (Uεcn ) goes to 0 at the rate 1/Mn in contrast to the faster rate e−C1 nεn for the true posterior. 2 Note, Theorem 1 gives the contraction rate of the variational posterior. However, the convergence of the of variational Bayes estimator to the true values of β0 and B0 is not immediate. The following corollary gives the convergence rate for the variational Bayes estimate as long as Assumptions 1, 2 and 3 hold. Corollary 1 (Variational Bayes Estimator Convergence). Let εn be as in Theorem 1, then (n) in P0 probability, 1 EQ∗ (∥θ − θ0 ∥2 ) → 0, as n → ∞. εn Next, we provide a brief sketch of the proof. Further details of the proof have been deferred to Section 2.14. Sketch of proof of Corollary 1: Let f = (C2 /2)nεn ∥θ − θ0 ∥2 , then Z (C2 /2)nεn ∥θ − θ0 ∥2 dQ∗ (θ) Z  ∗ (n) C2 nεn ∥θ−θ0 ∥2 /2 (n) ≤ KL(Q , Π(| X )) + log e dΠ(θ | X ) . By Lemma 2 and 3, it can be established with dominating probability, for any C > 0 KL(Q∗ , Π(| X (n) )) ≤ Cnϵ2n log n. By Lemma 1, and 2, it can be established with dominating probability, for some C2 > 0 Z 1 2 e(C2 /2)nεn ∥θ−θ0 ∥2 dΠ(θ | X (n) ) ≤ 2 eCnϵn log n . (2.15) (C2 /2)nεn Therefore, with dominating probability 2 log(C2 /2) 2εn log(nε2n ) Z 2Cεn 2Cεn ∥θ − θ0 ∥2 dQ∗ (θ) ≤ − − 2 + ≤ εn o(1). C2 Mn C2 nεn C2 nεn C2 Mn This completes the proof. 32 (2.15) follows as a consequence of convergence of the true posterior. An important thing to note that if εn can be made arbitrarily close to n−δ for 0 < δ < 1/2, it guarantees close √ to n convergence. 2.8 Preliminary notations and Lemmas Let θ = (β, B). Define Wn := (W1n , W2n ) where Xn W1n (θ | x) = mi (x) (xi − tanh(βmi (x) + B)) , i=1 n (2.16) X W2n (θ | x) = (xi − tanh(βmi (x) + B)) . 
i=1 Also, define P Pn  n 2 i=1 mi (x) Si (θ | x) i=1 mi (x)Si (θ | x) Hn (θ | x) :=  , (2.17)  Pn Pn i=1 mi (x)Si (θ | x) i=1 Si (θ | x) P Pn  n 3 3 2 3 m (x) (h (θ | x) − h (θ | x)) m i (x) (h i (θ | x) − h (θ | x))  i=1 i i i i=1 i R1n (θ | x) :=  ,  Pn 2 3 P n 3 i=1 mi (x) (hi (θ | x) − hi (θ | x)) i=1 mi (x) (hi (θ | x) − hi (θ | x)) (2.18) and P  n 2 3 Pn 3 m (x) (h (θ | x) − h (θ | x)) m i (x) (h i (θ | x) − h (θ | x))  i=1 i i i i=1 i R2n (θ | x) :=  ,  Pn 3 P n 3 i=1 mi (x) (hi (θ | x) − hi (θ | x)) i=1 (hi (θ | x) − hi (θ | x)) (2.19) where Si (θ | x) = sech2 (βmi (x) + B) and hi (θ | x) = tanh (βmi (x) + B). Lemma 4. Let W1n and W2n be as in (2.16), then 1 (n) 1 (n) E0 (W1n (θ0 | x))2 < ∞ E0 (W2n (θ0 | x))2 < ∞ n n Proof. See the lemma 2.1 in Ghosal et al. (2020). 33 Lemma 5. Let p1 and p2 be any two density functions. Then,   p1 2 EP1 log ≤ KL(P1 , P2 ) + p2 e Proof. See the lemma 4 in Lee (2000). 1 Pn 1 Pn 2 Lemma 6. Let Tn (x) = n i=1 mi (x) − n i=1 mi (x) . Suppose Assumptions 1, 2 and 3 hold, then Tn (x) = Op (1), 1/Tn (x) = Op (1) Proof. See the theorem 1.4 in Ghosal et al. (2020). 2.9 Taylor expansion for log-likelihood Lemma 7. Consider the term (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) where Hn is the same as in (2.17). Then, for some C1 , C2 > 0, we have (n) C1 n∥θ − θ0 ∥22 ≤ (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) ≤ C2 n∥θ − θ0 ∥22 → 1, n → ∞  P0 Proof. For some M1 , M2 > 0, let A1n = {x : Tn (x) ≤ M1 }, A2n = {x : Tn (x) ≥ M2 }, and (n) An = A1n ∩ A2n , then P0 (An ) → 1. This is because by Lemma 6, there exists M1 and M2 such that (n) (n) P0 (Tn > M1 ) = P0 (1/Tn < 1/M1 ) → 0 (n) (n) P0 (Tn < M2 ) = P0 (1/Tn > 1/M2 ) → 0 The remaining part of the proof works with only x ∈ An . Let eH 1 ≥ e2 be the eigenvalues H of Hn (θ0 | x). The trace of Hn (θ0 | x), is Xn tr (Hn (θ0 | x)) = eH H sech2 (β0 mi (x) + B0 ) m2i (x) + 1 ≤ n(1 + γ 2 )  1 + e2 = i=1 where we used |mi (x)| ≤ γ based on Assumption 1. Note (2.7) in Ghosal et al. (2020) gives a lower bound of eH 2 : sech4 (β0 γ + |B0 |) eH2 ≥ nTn (x), (2.20) 1 + γ2 34 where Tn (x) is as in the Lemma 6. By spectral decomposition of Hn (θ0 | x), (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) ≤ eH (β − β0 )2 + (B − B0 )2  1 ≤ n(1 + γ 2 ) − eH (β − β0 )2 + (B − B0 )2  2 sech4 (β0 γ + |B0 |)   2 (β − β0 )2 + (B − B0 )2  ≤ n (1 + γ ) − 2 Tn (x) 1+γ (2.21) Also, (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) ≥ eH (β − β0 )2 + (B − B0 )2  2 sech4 (β0 γ + |B0 |) nTn (x) (β − β0 )2 + (B − B0 )2 (2.22)  ≥ 2 1+γ Since M2 ≤ Tn (x) ≤ M1 for every x ∈ An , the proof follows. Lemma 8. For R1n and R2n as in (2.18) and (2.19) respectively, let 3Rn (θ̃, θ − θ0 | x) (2.23) = (β − β0 )(θ − θ0 )⊤ R1n (θ̃ | x)(θ − θ0 ) + (B − B0 )(θ − θ0 )⊤ R2n (θ̃ | x)(θ − θ0 ) (2.24) where Rn = Rn (θ̃, θ − θ0 | x) and θ̃ = θ0 + c(θ − θ0 ) 0 < c < 1. Then, as n → ∞ for some C1 , C2 > 0 we have (n) P0 (M1 n∆∗ ≤ Rn ≤ M2 n∆∗ ) → 1, where ∆∗ = ((β − β0 )γ + (B − B0 ))∥θ0 − θ∥22 . Proof. For some M1 , M2 > 0, let A1n = {x : Tn (x) ≤ M1 }, A2n = {x : Tn (x) ≥ M2 }. (n) Let An = A1n ∩ A2n , then P0 (An ) → 1. This is because by Lemma 6, there exists M1 , M2 > 0 such that (n) (n) P0 (Tn > M1 ) = P0 (1/Tn < 1/M1 ) → 0 (n) (n) P0 (Tn < M2 ) = P0 (1/Tn > 1/M2 ) → 0 35 The remaining part of the proof works with only x ∈ An . The determinant of R1n (θ̃ | x) is: det(R1n (θ̃ | x)) n 1X mi (x)mj (x) hi (x) − h3i (x) hj (x) − h3j (x) (mi (x) − mj (x))2   = 2 i,j=1 √ where hi (x) = tanh(β̃mi (x) + B̃). Since tanh(·) − tanh3 (·) has maximum value 0.38 at 3/3 and mi (x) ≤ γ by Assumption 1. 
Therefore, n n 1 XX |det(R1n (θ̃ | x))| ≤ γ 2 (0.38)2 (mi (x) − mj (x))2 = γ 2 (0.38)2 n2 Tn (x). 2 i=1 j=1 The trace of R1n (θ̃ | x) is: X n mi (x)(hi (x) − h3i (x)) m2i (x) + 1 ≤ n0.38γ(1 + γ 2 ).  tr(R1n (θ̃ | x)) = i=1 Let eR 1 1n ≥ e2R1n be eigenvalues of R1n (θ̃ | x). eR 1n R1n 1 e2 det(R1n (θ̃ | x)) γ 2 (0.38)2 n2 Tn (x) 0.38γ eR 2 1n ≥ R1n R1n = ≥− 2 =− nTn (x). e1 + e2 tr(R1n (θ̃ | x)) n0.38γ(1 + γ ) 1 + γ2 Therefore, 0.38γ (θ − θ0 )⊤ R1n (θ̃ | x)(θ − θ0 ) ≥ eR 2 2 ||θ − θ0 ||2 ≥ − 1n nTn (x)||θ − θ0 ||22 (2.25) 1 + γ2 and (θ − θ0 )⊤ R1n (θ̃ | x)(θ − θ0 ) ≤ eR 2 R1n 1 ||θ − θ0 ||2 = (tr(R1n (θ̃ | x)) − e2 ))||θ − θ0 ||2 1n 2   Tn (x) ≤ 0.38γn (1 + γ ) + 2 ||θ − θ0 ||22 . (2.26) 1 + γ2 With the same argument, we can get: 0.38 (θ − θ0 )⊤ R2n (θ̃ | x)(θ − θ0 ) ≥ − nTn (x)||θ − θ0 ||22 , (2.27) 1 + γ2   Tn (x) t (θ − θ0 ) R2n (θ̃ | x)(θ − θ0 ) ≤ 0.38n (1 + γ ) + 2 ||θ − θ0 ||22 (2.28) 1 + γ2 Using (2.25), (2.26), (2.27) and (2.28) and noting M2 ≤ Tn (x) ≤ M1 for every x ∈ An , the proof follows. 36 Lemma 9. Let q(θ) ∈ QM F with µ1 = log β0 , µ2 = B0 , and σ12 = σ22 = 1/n, then 1 KL(Q, P ) → 0, n → ∞ nϵ2n log n Proof. Using the same notation in (2.3), the KL divergence is: KL(Q, P ) = Eqβ (β)qB (B) (log qβ (β) + log qB (B) − log pβ (β) − log pB (B)) = KL (Qβ , Pβ ) + KL (QB , PB )   1 1 1 = (log β0 ) + + B0 + − 2 + log n = o(nϵ2n log n), 2 2 since nϵ2n → ∞ 2 n n 2.10 Technical details of Lemma 1 Proof of Lemma 1. Let Vϵn = {|β − β0 | < ϵn , |B − B0 | < ϵn }. Then Vϵn ⊆ U√2ϵn which implies U√c 2ϵn ⊆ Vϵcn which further implies Z Z L(θ) L(θ) log p(θ)dθ ≤ log p(θ)dθ (2.29) c U√ L(θ0 ) Vϵcn L(θ0 ) 2ϵn We shall now establish for some C0 > 0 Z ! (n) L(θ) P0 log p(θ)dθ ≤ −C0 nϵ2n → 1, n → ∞, Vϵcn L(θ0 ) which in lieu of (2.29) completes the proof. Define A1n = {x : W1n (θ0 | x)2 + W1n (θ0 | x)2 ≤ n2/3 } and A2n = {x : Tn (x) ≥ M } for some M > 0. Define An = A1n ∩ A2n . (n) Here P0 (An ) → 1. This because by Markov’s inequality and Lemma 4, (n) W1n (θ0 | x)2 + W2n (θ0 | x)2 > n2/3 ε  P0 1 (n) ≤ 4/3 E0 W1n (θ0 | x)2 + W2n (θ0 | x)2 → 0.  n (n) (n) and by Lemma 6 P0 (Tn < M ) = P0 (1/Tn > 1/M ) → 0. 37 We shall show for x ∈ An , L(θ)/L(θ0 ) ≤ e−C0 nϵn , ∀ θ ∈ Vϵcn which implies ∀ x ∈ An , 2 Z Z log (L(θ)/L(θ0 ))p(θ)dθ = log (L(θ)/L(θ0 ))p(θ)dθ Vϵcn Vϵcn Z ! −C0 nϵ2n ≤ log e p(θ)dθ ≤ −C0 nϵ2n , (2.30) Vϵcn (n) since p(Vϵcn ) ≤ 1. This completes the proof since P0 (An ) → 1 as n → ∞. Next, note that Vϵcn is given by the union of the following terms V1n = {(β, B) : β − β0 ≥ ϵn , B ≥ B0 }, V2n = {(β, B) : β − β0 ≥ ϵn , B < B0 } V3n = {(β, B) : β − β0 < −ϵn , B ≥ B0 }, V4n = {(β, B) : β − β0 < −ϵn , B < B0 } V5n = {(β, B) : β ≥ β0 , B − B0 ≥ ϵn }, V6n = {(β, B) : β < β0 , B − B0 ≥ ϵn } V7n = {(β, B) : β ≥ β0 , B − B0 < −ϵn }, V8n = {(β, B) : β < β0 , B − B0 < −ϵn } We shall now show for x ∈ An and θ ∈ V1n , L(θ)/L(θ0 ) ≤ e−C0 nϵn . The proof of other parts 2 follow similarly. (a) Let θ = (β, B) and θ0′ = (β0 + ϵ, B0 ), where β ≥ β0 + ϵ and B ≥ B0 . Also, define θt = θ0′ + t(θ − θ0′ ) where 0 < t < 1. Consider a function g: g(t) = f (θt ) = log L(θt ) − log L(θ0′ ) − ∆n (θ0′ )⊤ (θt − θ0′ ), where ∆n (θ) = (∇β log L(θ), ∇B log L(θ))⊤ . Note that g(t) is a function of t. We want to show g(t) ≤ g(0) provided t > 0. We shall instead show g ′ (t) ≤ 0. By Taylor expansion, g ′ (t) = g ′ (0) + g ′′ (t̃)t. for some t̃ ∈ [0, t]. Here, g ′ (0) = 0 and g ′′ (t̃) = −(θ − θ0′ )⊤ Hn (θt̃ | x)(θ − θ0′ ) ≤ 0 where Hn as in (2.17) is a positive definite matrix (by (2.22) in 7 and Tn (x) ≥ 0). 
Since g(t) is decreasing for 0 < t < 1, thus g(1) ≤ g(0) =⇒ f (θ) ≤ f (θ0′ ) 38 (b) Similarly, let θ = (β, B) θ0′′ = (β0 , B0 + ϵ), where β ≥ β0 and B ≥ B0 + ϵ. Define θt = θ0′ + t(θ − θ0′′ ) where 0 < t < 1. h(t) = f (θt ) = log L(θt ) − log L(θ0′′ ) − ∆n (θ0′′ )⊤ (θt − θ0′′ ). With similar argument in (a), we conclude that h(1) ≤ h(0) =⇒ f (θ) ≤ f (θ0′′ ). Therefore, sup (log L(θ) − log L(θ0 )) θ∈V1n ≤ sup (log L(θ) − log L(θ0 )) + sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B≥B0 } {β>β0 +ϵ,B≥B0 } ≤ sup (log L(θ) − log L(θ0 )) + (log L(θ0′ ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B≥B0 } ≤ sup (log L(θ) − log L(θ0 )) + sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} {β−β0 ∈[ϵn ,ϵ],B>B0 +ϵ} + log L(θ0′ ) − log L(θ0 )) ≤ sup (log L(θ) − log L(θ0 )) + sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} {β≥β0 ,B>B0 +ϵ} + log L(θ0′ ) − log L(θ0 )) ≤ sup (log L(θ) − log L(θ0 )) + log L(θ0′′ ) − log L(θ0 ) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} + log L(θ0′ ) − log L(θ0 ) ≤ sup 3(log L(θ) − log L(θ0 )) ≤ −C0 nϵ2n (2.31) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} where the second inequality follows from (a) and fifth inequality follows from (b) above. Finally for the last inequality, consider Taylor expansion for log L(θ) upto the second order 1 log L(θ) − log L(θ0 ) = Wn (θ0 | x)⊤ (θ − θ0 ) − (θ − θ0 )⊤ Hn (θ̃ | x)(θ − θ0 ) 2 where θ̃ = θ0 + c(θ − θ0 ), 0 < c < 1 and Wn and Hn are as defined in (2.16) and (2.17) respectively. By Cauchy Schwarz inequality, |Wn (θ0 | x)⊤ (θ − θ0 ))| ≤ ( W1n (θ0 | x)2 + W2n (θ0 | x)2 ∥θ − θ0 ∥22 + 1)  ≤ n2/3 ∥θ − θ0 ∥22 + 1 39 for every x ∈ An . Further 1 log L(θ) − log L(θ0 ) ≤ n2/3 ∥θ − θ0 ∥22 + 1 − (θ − θ0 )⊤ Hn (θ̃ | x)(θ − θ0 ) 2 sech4 (β̃γ + |B̃|) ≤ n2/3 ∥θ − θ0 ∥22 + 1 − nTn (x)∥θ − θ0 ∥22 1 + γ2 ! 1 sech 4 (β̃γ + |B̃|) n ≤ n2/3 + − ∥θ − θ0 ∥22 ||θ − θ0 ||22 1 + γ2 M where the second inequality is a consequence of the lower bound (2.20) and the third in- equality holds since x ∈ An . Taking sup over the set {β − β0 ∈ [ϵn , ϵ], B − B0 ∈ [0, ϵ]} on both sides, sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} sech4 ((β0 + ϵ)γ + (B0 + ϵ)) n   2/3 1 ≤ sup n + 2 − 2 ∥θ − θ0 ∥22 {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} ϵn 1 + γ M ≤ −C0 nϵ2n (2.32) for some C0 > 0 as n → ∞ since n2/3 and 1/ϵ2n = o(n). This completes the proof. 2.11 Technical details of Lemma 2 Lemma 10. Let L0 and L(θ0 ) represent the true likelihood (1.1) and the pseudo-likelihood (1.2) with the true parameters θ0 , respectively. Then, 1 (n) E (log L0 − log L(θ0 )) → 0, n → ∞. nϵ2n 0 Proof. efθ0 (x) efθ0 (x) L0 = P fθ0 (x) = x∈{−1,1}n e Zn (θ0 ) where fθ0 (x) = (β0 /2)x⊤ An x + B0 x⊤ 1. Define b(x; θ) = (b1 (x; θ), · · · , bn (x; θ)) where bi (x; θ) = E(Xi | Xj , j ̸= i) = tanh(βmi (x) + B). Then L(θ) = eg(x,b(x;θ)) where the function g for v, w ∈ [−1, 1]n is defined as n X 1 + vi 1 + wi 1 − vi 1 − wi g(v, w) = log + log i=1 2 2 2 2 40 Also, define I(v) = g(v, v). Now, observe that (n) (n) E0 (log L0 − log L(θ0 )) = E0 (fθ0 (x) − g(x, b(x; θ0 )) − log Zn (θ0 )) (n) (n) = E0 (fθ0 (x) − fθ0 (b(x; θ0 )) + E0 (fθ0 (b(x; θ0 )) − I(b(x; θ0 ))) (n) + E0 (I(b(x; θ0 )) − g(x, b(x; θ0 ))) − log Zn (θ0 ) (n) (n) ≤ (E0 (fθ0 (x) − fθ0 (b(x; θ0 )))2 )1/2 + (E0 (I(b(x; θ0 )) − g(x, b(x; θ0 )))2 )1/2 (n) + E0 (fθ0 (b(x; θ0 )) − I(b(x; θ0 ))) − log Zn (θ0 ), (2.33) where the last step is due to Hölder’s inequality. 
Under Assumption 2, mimicking the proof of Lemmas 3.2 and 3.3 in Basak and Mukherjee (2017) with n replaced by nϵ2n , we get (n) (E0 (fθ0 (x) − fθ0 (b(x; θ0 )))2 )1/2 = o(nϵ2n ) (2.34) (n) (E0 (I(b(x; θ0 )) − g(x, b(x; θ0 )))2 )1/2 = o(nϵ2n ) (2.35) Also for rn = supv∈[−1,1]n (fθ0 (v) − I(v)), we have (n) E0 (fθ0 (b(x, θ0 )) − I(b(x, θ0 ))) ≤ rn By Theorem 1.6 in Chatterjee and Dembo (2016) with the fact ∂ 2 fθ0 /∂x2i = 0, i = 1, · · · , n, we have − log Zn (θ0 ) ≤ −rn . Therefore, (n) E0 (fθ0 (b(x; θ0 )) − I(b(x; θ0 )) − log Zn (θ0 )) ≤ 0. (2.36) Using (2.34), (2.35) and (2.36) in (2.33) completes the proof. Lemma 11. Note that L(θ) is not a valid density function. So, we consider L̃(θ) = P P L(θ)/Jn (θ) where Jn (θ) = x∈{−1,1}n L(θ) such that x∈{−1,1}n L̃(θ) = 1. Then for ev- ery θ, p  √  Jn (θ) ≤ βϵn n(1 + γ 2 )/2 + o(nϵ2n ) log 3 2 − log ϵn . 41 √ Proof. Let Nn (ϵn ) := {i ∈ [n] : |λi (An )| > ϵn / 2} and with the mean field condition in the Assumption 2, it is easy to note that n |Nn (ϵn )| 2 X 2 X ≤ 2 λi (An )2 = 2 An (i, j)2 → 0, n → ∞ (2.37) n nϵn nϵn i,j=1 i∈[n] Set kn = |Nn (ϵn )| and let Dn,0 (ϵn ) be a ϵn n/2 net of the set {f ∈ Rkn : fi2 ≤ n} of size p P √ at most (3 2/ϵn )kn . The existence of such a net is standard (see for example Lemma 2.6 in Milman and Schechtman (1986)). Let {p1 , · · · , pn } be the eigen vectors of An . Then setting    X  Dn,1 (ϵn ) := ci λi (An )pi , c ∈ Dn,0 (ϵn )   i∈Nn (ϵn ) We claim Dn,1 (ϵn ) is ϵn n(1 + γ 2 )/2 of the set {An x : x ∈ {−1, 1}n }. Indeed any x ∈ p {−1, 1}n can be written as ni=1 fi pi where ni=1 fi2 = xi = n. In particular, it means P P P 2 i∈Nn (ϵn ) fi ≤ n, which implies there exists a c ∈ Dn,0 (ϵn ) such that ||c − f || ≤ ϵn P p n/2. Let i∈Nn (ϵn ) ci λi (An )pi ∈ Dn,1 (ϵn ), then P X X X ||An x − ci λi (An )pi ||22 = (ci − fi )2 λi (An )2 + λi (An )2 fi2 i∈Nn (ϵn ) i∈Nn (ϵn ) i∈N / n (ϵn ) γ 2 nϵ2n nϵ2n ≤ + 2 2 where the last inequality is a consequence of maxi∈[n] |λi (An )| ≤ maxi∈[n] Pn j=1 |An (i, j)| ≤ γ and the definition of the set Nn (ϵn ). In particular for any x ∈ {−1, 1}n , there exists at least one p ∈ Dn,1 (ϵn ) such that ||p − m(x)|| ≤ ϵn n(1 + γ 2 )/2. For any p ∈ Dn,1 (ϵn ), let p n p o P(p) := x ∈ {−1, 1}n : ||p − m(x)|| ≤ ϵn n(1 + γ 2 )/2 . Therefore, X X X eg(x,b(x;θ)) = eg(x,b(x;θ)) x∈{−1,1}n p∈Dn,1 (ϵn ) x∈P(p) 42 Setting u(p) := tanh(βp + B) if ||p − m(x)|| ≤ ϵn n(1 + γ 2 )/2, then we have p Xn p |g(x, b(x; θ)) − g(x, u(p))| ≤ 2β |mi (x) − pi | ≤ 2βϵn n(1 + γ 2 )/2 i=1 Finally, X √ X X n(1+γ 2 )/2 eg(x,b(x;θ)) ≤ e2βϵn eg(x,u(p)) x∈{−1,1}n p∈Dn,1 (ϵn ) x∈P(p) √ X X √ n(1+γ 2 )/2 n(1+γ 2 )/2 ≤ e2βϵn eg(x,u(p)) = e2βϵn |Dn,1 (ϵn )| p∈Dn,1 (ϵn ) x∈{−1,1}n where the last equality follows since eg(x,u) = 1 for any u ∈ [−1, 1]n . Therefore, P x∈{−1,1}n p log Jn (θ) ≤ βϵn n(1 + γ 2 )/2 + log |Dn,1 (ϵn )| Since |Dn,1 (ϵn )| = |Dn,0 (ϵn )|, therefore √ log |Dn,1 (ϵn )| ≤ |Nn (ϵn )|(log 3 2 − log ϵn ) The proof follows since |Nn (ϵn )| = o(nϵ2n ). Lemma 12. Define Vϵn := {θ : |β − β0 | < ϵn , |B − B0 | < ϵn }. Then, Vϵn ⊆ Kϵn , for n sufficiently large (n) where Kϵn := {θ : E0 (log(L(θ0 )/L(θ))) < 3nϵ2n }. Proof. For any θ ∈ Vϵn , using the decomposition in (2.43), we get   (n) (n) E0 (log L(θ0 ) − log L(θ)) = E0 − 1 − 2 + 3 − 4 ≤ 3nϵ2n where the last inequality is justified next. 
For some M > 0 using Lemma 4, we get 1/2 √  (n) (n) 1 (n) −E0 ( 1 ) = (β0 − β)E0 (W1n (θ0 |x) ≤ n|β0 − β| E0 (W1n (θ0 |x))2 n √ ≤ M nϵn 43 1/2 √  (n) (n) 1 (n) −E0 ( 2 ) = (B0 − B)E0 (W2n (θ0 |x) ≤ n|B0 − B| E0 (W2n (θ0 |x))2 n √ ≤ M nϵn By relation (2.21), we get sech4 (β0 γ + |B0 |) (n)   (n) E0 ( 3 ) ≤ n||θ − θ0 ||22 2 (1 + γ ) − E0 (Tn (x)) ≤ 2(1 + γ 2 )nϵ2n 1 + γ2 By relation (2.44), we get (n) 0.38n (n) 0.38γ 2 (1 + γ) 3 −E0 ( 4 ) ≤ E (T n (x)) ||θ − θ || 0 2 2 (|β − β 0 |γ + |B − B 0 |) ≤ nϵn 3(1 + γ 2 ) 0 3(1 + γ 2 ) Lemma 13. With prior distribution p(θ) as in (2.1), we have Z p(θ)dθ ≥ Cϵ2n , for some C > 0 Vϵn Proof. By mean value theorem with β ⋆ ∈ [β0 − ϵn , β0 + ϵn ] and B ⋆ ∈ [B0 − ϵn , B0 + ϵn ], Z Z β0 +ϵ Z B0 +ϵ 1 (log β)2 − 2 1 B2 p(θ)dθ = √ e dβ √ e− 2 dB Vϵn β0 −ϵ β 2π B0 −ϵ 2π 2ϵn − (log β⋆ )2 2ϵn − (B⋆ )2 = √ e 2 √ e 2 β ⋆ 2π 2π   1 ⋆ ⋆ 2 ⋆ 2  = exp −(log π − log 2 − 2 log ϵn ) − 2 log β + (log β ) + (B ) 2   1 ≥ exp −(log π − log 2 − 2 log ϵn ) − (2u1 + ũ1 + u2 ) 2 ≥ Ce2 log ϵn = Cϵ2n where the above result follow since ϵn → 0 implies u1 ≤ max(log(β0 + 1), log(β0 + 1)), ũ1 ≤ max((log(β0 − 1))2 , (log(β0 + 1))2 ), and u2 = max((B0 − 1)2 , (B0 + 1)2 ). Proof of Lemma 2. Let L∗ = L(θ)p(θ)dθ, Jn∗ = L∗ . Then, R P x∈{−1,1}n X Z Jn∗ = L(θ)p(θ)dθ. x∈{−1,1}n 44 Since L(θ)p(θ) > 0, Tonelli’s theorem allows for interchange of the order of summation and integral. Using Lemma 11 and − log ϵn = O(log n), we get Z X Z Jn∗ = L(θ)p(θ)dθ = Jn (θ)p(θ)dθ x∈{−1,1}n p √ = ϵn n(1 + γ 2 )/2EP (β) + o(nϵ2n )(log 3 2 − log ϵn ) √ (2.38) p = ϵn ne(1 + γ 2 )/2 + o(nϵ2n )(log 3 2 − log ϵn ) = o(nϵ2n log n) Also, by Lemma 11 and − log ϵn = O(log n), √ (2.39) p log Jn (θ0 ) = β0 ϵn n(1 + γ 2 )/2 + o(nϵ2n )(log 3 2 − log ϵn ) = o(nϵ2n log n)  Z  P0n log (L(θ)/L(θ0 ))p(θ)dθ > Cnϵ2n log n (2.40)  Z  1 (n) ≤ E log (L(θ)/L(θ0 ))p(θ)dθ Cnϵ2n log n 0 1 (n) = 2 E0 (|log (L∗ /L(θ0 ))|) Cnϵn log n Jn∗   1 ∗ 4 ≤ KL(L0 , L̃ ) + KL(L0 , L̃(θ0 )) + log + Cnϵ2n log n Jn (θ0 ) e 2  (n) (n) ≤ 2 E0 (log L0 − log L(θ0 )) + E0 (log L(θ0 ) − log L∗ ) Cnϵn log n 4 + 2(log Jn∗ + log Jn (θ0 )) + (2.41) e where the second last step follows from Lemma 5. 45 Then, using the set Kϵn in Lemma 12, we get (n) E0 (log L(θ0 ) − log L∗ )) Z (n) = E0 (log L(θ0 ) − log L(θ)p(θ)dθ) Z ! (n) ≤ E0 log(L(θ0 )/ L(θ)p(θ)dθ) Kϵ2 n !! p(Kϵ2n ) Z (n) ≤ E0 log L(θ0 ) − log L(θ)p(θ)dθ)) p(Kϵ2n ) Kϵ2 n   (n) (n) ≤ E0 (log L(θ0 )) − log(p(Kϵ2n ) + E0 Ep|Kϵ2 (− log L(θ)) n Z ! (n) ≤ − log(p(Kϵ2n )) + E0 log L(θ0 ) − log L(θ)p|Kϵ2n (θ)dθ Jensen’s Inequality Kϵ2 n Z (n) = − log(p(Kϵ2n )) + E0 (log(L(θ0 ) − L(θ))p|Kϵ2n (θ)dθ Kϵ2 n ≤ −2 log(C ′ ϵ2n ) + 3nϵ2n = o(nϵ2n log n) (2.42) where the last line follows from Lemma 12 and Lemma 13. The final order is because − log ϵn = O(log n) and nϵ2n → ∞ and log n → ∞. The proof follows by using relations (2.42), (2.39) and (2.38) in (2.40). 2.12 Technical details of Lemma 3 Lemma 14. Let q(θ) ∈ QM F with µ1 = log β0 , µ2 = B0 , and σ12 = σ22 = 1/n, then Z 1 (n) E0 (log L(θ0 ) − log L(θ))q(θ)dθ ≲ 0, n → ∞ nϵ2n Proof. Using the Taylor expansion of log L(θ) around θ = θ0 , we get log L(θ0 ) − log L(θ) = log L(θ0 ) − log L(θ0 ) − (β − β0 )W1n (θ0 |x) − W2n (θ0 |x)(B − B0 ) | {z } | {z } 1 2 1 + (θ − θ0 )⊤ Hn (θ0 |x)(θ − θ0 ) − Rn (θ̃, θ − θ0 |x) (2.43) |2 {z } | {z } 3 4 46 W1n , W2n is as in (2.16), Hn is as in (2.17) and Rn (θ̃, θ − θ0 |x) is defined in (2.23). 
Therefore, Z (n) E0 (log L(θ0 ) − log L(θ)) q(θ)dθ Z   Z   (n) (n) = − E0 1 q(θ)dθ − E0 2 q(θ)dθ Z   Z   (n) (n) + E0 3 q(θ)dθ − E0 4 q(θ)dθ, Z Z 1 (n)   1 (n) − 2 E0 1 q(θ)dθ = 2 (β0 − β)E0 (W1n (θ0 |x)) q(θ)dθ nϵn nϵn 1/2 √ Z  1 1 (n) 2 ≤ |β0 − β| n E (W1n (θ0 |x)) q(β)dβ, nϵ2n n 0 √ Z M n ≤ |β − β0 |qβ (β)dβ nϵ2n √ Z M n ≤ ( (β − β0 )2 qβ (β)dβ)1/2 nϵ2n √ M n 2 log β0 2/n 1/2n 1/2 M elog β0 = (e (e − 2e + 1)) ∼ →0 nϵ2n nϵ2n where the second inequality above above line holds by Hölder inequality and third inequality holds by Lemma 4, for some constant M . Finally the last convergence to 0 is nϵ2n → ∞. Similarly, Z Z 1 (n)   1 (n) − 2 E0 2 q(θ)dθ = 2 (B0 − B)E0 (W2n (θ0 |x)) q(θ)dθ nϵn nϵn √  1/2 Z n 1 (n) 2 ≤ 2 E (W2n (θ0 |x)) |B − B0 |qB (B)dB nϵn n 0 √ Z M n M 2 ( (B − B0 )2 q(B)dB)1/2 ∼ 2 → 0 nϵn nϵn Using the upper bound (2.21) and nϵ2n → ∞, we get Z 1 (n)   E0 3 q(θ)dθ nϵ2n sech4 (β0 γ + |B0 |) (n)  Z 1 2 (β − β0 )2 + (B − B0 )2 q(θ)dθ  ≤ 2 (1 + γ ) − 2 E0 (Tn (x)) 2ϵn 1+γ (1 + γ 2 ) 2 log β0 2/n 1/2n   ≤ e e − 2e + 1 + 1/n 2ϵ2n 2 log β0 (e + 1)(1 + γ 2 ) ∼ →0 2nϵ2n 47 where the second inequality holds since Tn (x) = (1/n) Pn i=1 (mi (x) − (1/n)mi (x))2 ≥ 0. For the remainder term, using relations (2.25) and (2.27) in relation (2.23), we get (n)   0.38n (n) 4 (β − β0 )2 + (B − B0 )2 {(β − β0 )γ + (B − B0 )}  −E0 ≤ E 0 (Tn (x)) 3(1 + γ 2 ) (2.44) Further, Z (β − β0 )2 + (B − B0 )2 {(β − β0 )γ + (B − B0 )} q(θ)dθ  n Z − elog β0 )2 + (B − B0 )2 (elog β − elog β0 )γ + (B − B0 ) q(θ)dθ  log β  =n (e | {z } 5 Z 5 =n e3 log β − 3e2 log β+log β0 + 3elog β+2 log β0 − e3 log β0 q(β)dβ (2.45)  Z B 3 − 3B0 B 2 + 3B02 B − B03 q(B)dB (2.46)  +n Z e2 log β − 2elog β+log β0 + e2 log β0 (B − B0 ) q(β)q(B)dβdB (2.47)  +n Z (B − B0 )2 elog β − elog β0 q(β)q(B)dβdB (2.48)  + nγ (2.45) = ne3 log β0 e9/2n − 3e2/n + 3e1/2n − e0 ∼ 0  (2.46) = n B03 + 3B0 /n − 3B0 (B02 + 1/n) + 3B03 − B03 = 0,  (2.47) = n B0 e2 log β0 e2/n − 2e1/2n + e0 − B0 e2 log β0 e2/n − 2e1/2n + e0 = 0,   γ γ 0 (2.48) = nelog β0 e1/2n − e = γelog β0 (e1/2n − e0 ) ∼ 0 n n Z 1 (n)   − 2 E0 4 q(θ)dθ ≲ 0 nϵn since Tn (x) ≤ γ 2 (since by Assumption 1, mi (x) ≤ γ) and nϵ2n → ∞. 48 Poof of Lemma 3. With the q as in Lemma 14, using Markov’s inequality, Z  n P0 q(θ) log(L(θ0 )/L(θ))dθ > Cnϵn log n 2 (2.49) Z 1 (n) ≤ E q(θ) log(L(θ0 )/L(θ))dθ Cnϵ2n log n 0 Z  1 (n) ≤ E q(θ) |log(L(θ0 )/L(θ))| dθ Cnϵ2n log n 0 Z 1 (n) ≤ q(θ)E0 (|log(L(θ0 )/L(θ))|) dθ Fubini’s theorem Cnϵ2n log n Z    1 (n) L0 L(θ0 ) = q(θ)E0 log dθ Cnϵ2n log n L(θ) L0 Z !! 1 (n) L 0 L̃(θ0 ) J (θ n 0 ) = q(θ)E0 log dθ Cnϵ2n log n L̃(θ) 0 L J n (θ) Z   !  ! 1 (n) L 0 L̃(θ 0 ) J (θ n 0 ) = q(θ)E0 log + log + log dθ Cnϵ2n log n L̃(θ) L 0 J n (θ) Z   !  ! 1 (n) L 0 L̃(θ 0 ) J (θ n 0 ) ≤ q(θ)E0 log + log + log dθ Cnϵ2n log n L̃(θ) L 0 Jn (θ) Z     1     Jn (θ0 ) 4 ≤ q(θ) KL L0 , L̃(θ) + KL L0 , L̃(θ0 ) + log + dθ (2.50) Cnϵ2n log n Jn (θ) e Therefore, Z  n 2 P0 q(θ) log(L(θ0 )/L(θ))dθ > Cnϵn log n  Z  1 (n) (n) = 2E0 (log L0 − log L(θ0 )) + q(θ)E0 (log L(θ0 ) − log L(θ))dθ Cnϵ2n log n  Z  1 4 + 2 log Jn (θ0 ) + 2 q(θ) log Jn (θ)dθ + →0 (2.51) Cnϵ2n log n e where the inequality in second last step is due to Lemma 5. The last convergence to 0 is explained next. 
By Lemma 10 and Lemma 14 respectively, we get (n) E0 (log L0 − log L(θ0 )) = o(nϵ2n ) Z (n) q(θ)E0 (log L(θ0 ) − log L(θ))dθ ≤ o(nϵ2n ) By (2.39), log Jn (θ0 ) = o(nϵ2n log n) and by Lemma 11 and − log ϵn = O(log n) Z p Z √ q(θ) log Jn (θ)dθ ≤ ϵn n(1 + γ )/2 βq(β)dβ + o(nϵ2n )(log 3 2 − log ϵn ) = o(nϵ2n ) 2 where we use EQ (β) = exp(log β0 + 1/n) → β0 49 2.13 Proof of Theorem 1 (n) In this section, with dominating probability term is used to imply that under P0 , the probability of the event goes to 1 as n → ∞. Z Z (n) KL(Q, Π(|X )) = q(θ) log q(θ)dθ − q(θ) log π(θ|X (n) )dθ Z Z L(θ)p(θ) = q(θ) log q(θ)dθ − q(θ) log R dθ L(θ)p(θ)dθ Z Z = KL(Q, P ) − log(L(θ)/L(θ0 ))q(θ)dθ + log (L(θ)/L(θ0 ))p(θ)dθ Z Z = KL(Q, P ) + log(L(θ0 )/L(θ))q(θ)dθ + log (L(θ)/L(θ0 ))p(θ)dθ (2.52) By Lemma 9, KL(Q, P ) = o(nϵ2n log n) ≤ (C/3)nϵ2n log n. By Lemma 3, with dominating probability Z log(L(θ0 )/L(θ))q(θ)dθ ≤ (C/3)nϵ2n log n for any C > 0. By Lemma 2, with dominating probability Z log (L(θ)/L(θ0 ))p(θ)dθ ≤ (C/3)nϵ2n log n Therefore, with dominating probability, for any C > 0, KL(Q, Π(|X (n) )) ≤ Cnϵ2n Further, R R Uεcn L(θ)p(θ)dθ Uc (L(θ)/L(θ0 ))p(θ)dθ Π(Uεcn |X (n) ) = R = Rεn L(θ)p(θ)dθ (L(θ)/L(θ0 ))p(θ)dθ By Lemma 1, with dominating probability, for any C > 0, as n → ∞ Z (L(θ)/L(θ0 ))p(θ)dθ ≤ exp(−C0 nε2n ) Uεcn By Lemma 2, with dominating probability Z (L(θ)/L(θ0 ))p(θ)dθ ≥ exp(−Cnϵ2n log n) 50 Therefore, with dominating probability Π(Uεcn |X (n) ) ≤ exp(−C0 nε2n (1 − C/Mn )) ≤ exp(−C1 nε2n ) for any C0 > C1 /2. This is because for n sufficiently large 1 − C/Mn > 1/2. This completes the proof. 2.14 Proof of Corollary 1 By Lemma 1, with dominating probability, there exists C0 (r) > 0 such that as n → ∞, Z (L(θ)/L(θ0 ))p(θ)dθ ≤ exp(−C0 (r)r2 nε2n ) Urεc n Let us assume, C0 (r) ≥ C2 /r for all r > 0 for some constant C2 > 0 (2.53) Numerical evidence for validity of this assumption been provided in the section 2.14.1. How- ever, the explicit theoretical derivation is technically involved and has been avoided in this thesis. By Lemma 2, with dominating probability Z (L(θ)/L(θ0 ))p(θ)dθ ≥ exp(−Cnϵ2n log n) Note, that R R c Urε L(θ)p(θ)dθ c Urε (L(θ)/L(θ0 ))p(θ)dθ c Π(Urε n |X (n) ) = R n = R n L(θ)p(θ)dθ (L(θ)/L(θ0 ))p(θ)dθ Therefore, with dominating probability c Π(Urε n |X (n) ) ≤ exp(−C2 rnε2n ) exp(Cnϵ2n log n) Following steps of proof of proposition 11 on page 2,111 in Van Der Vaart and Van Zanten (2011), Z Z ∞ (C2 /2)nεn ||θ−θ0 ||2 2 e dΠ(Uεcn |X (n) ) = e(C2 /2)nεn r Π(||θ − θ0 ||2 ≥ rεn |X (n) )dr 0 51 Therefore, Z e(C2 /2)nεn ||θ−θ0 ||2 dΠ(Uεcn |X (n) ) Z ∞ 2 = exp(Cnϵn log n) exp((C2 /2)rnε2n ) exp(−C2 rnε2n )dr Z0 ∞ 2 = exp(Cnϵ2n log n) exp(−(C2 /2)rnε2n )dr = 2 exp(Cnϵ2n log n) 0 C 2 nε n This completes the proof. 2.14.1 Proof of Relation (2.53) Lemma 15. With Urϵn same as in Lemma 1, we have for some C2 > 0, Z ! (n) L(θ) P0 sup p(θ)dθ ≤ −C2 rnϵ2n → 1, n → ∞ (2.54) θ∈Urϵnc L(θ 0 ) Proof. Following the proof of Lemma 1, we relate Urϵn to Vrϵn = {|β − β0 | < rϵn , |B − B0 | < rϵn }. Note, Vrϵn can be split into sets V1n , · · · , V8n as in the proof of Lemma 1. We study the behavior of sup (log L(θ) − log L(θ0 )) θ∈V1n where V1n = {β ≥ β0 + rϵn , B ≥ B0 }. The proof for other cases of V2n , · · · , V8n follows similarly. The proof of (2.54) will then follow as a consequence of (2.30). 
By a simple modification of the steps for the relation (2.31) used in the proof of Lemma 1, it can be shown that sup (log L(θ) − log L(θ0 )) ≤ O(log L((β0 + rϵn , B0 )) − log L(β0 , B0 )) θ∈V1n Next, we numerically demonstrate that log L((β0 + rϵn , B0 )) − log L(β0 , B0 ) ≤ −Crnϵ2n , n → ∞ | {z } | {z } LHS RHS 52 where C = 0.01. Note, for computing LHS in the above relation, x is generated using relation (1.1) with β = β0 and B = B0 . For varying values of r ∈ (0, 1000), we consider two ratios ρ1 = Proportion of r values where LHS > RHS ρ2 = Proportion of r values where LHS > RHS provided LHS < 0 The reason for considering the two proportion ρ1 and ρ2 is because RHS is a negative quantity and the upper bound is justified only when the LHS is als a negative quantity. The tables below give the values of ρ1 and ρ2 for varying values of n, d and ϵn . It is evident that both ρ1 and ρ2 approach zero as sample size increases. The fact that ρ1 approaches 0 as n → ∞ is also immediate from the proof of Lemma 1 which shows that LHS becomes negative as n → ∞,see relation (2.32). ρ1 for d = 5 (ρ2 for d = 5) ϵn β B n = 200 n = 1000 n = 5000 n−0.4 0.2 0.2 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.2 0.002(0.002) 0.0(0.0) 0.0(0.0) 0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.5 0.092(0.0) 0.0(0.0) 0.092(0.0) 0.5 0.2 0.092(0.0) 0.0(0.0) 0.0(0.0) -0.2 0.0(0.0) 0.0(0.0) 0.0(0.0) 0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) n−0.2 0.2 0.2 0.091(0.0) 0.040(0.009) 0.0(0.0) -0.2 0.0(0.0) 0.014(0.008) 0.0(0.0) 0.5 0.0(0.0) 0.056(0.008) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) 0.5 0.2 0.060(0.009) 0.0(0.0) 0.0(0.0) -0.2 0.036(0.009) 0.0(0.0 0.0(0.0) 0.5 0.066(0.009) 0.018(0.006) 0.014(0.007) -0.5 0.091(0.010) 0.021(0.007) 0.016(0.007) 53 ρ1 for d = 30 (ρ2 for d = 30) ϵn β B n = 200 n = 1000 n = 5000 n−0.4 0.2 0.2 0.0(0.0) 0.062(0.0) 0.0(0.0) -0.2 0.026(0.017) 0.092(0.0) 0.0(0.0) 0.5 0.056(0.008) 0.023(0.009) 0.0(0.0) -0.5 0.0(0.0) 0.025(0.008) 0.0(0.0) 0.5 0.2 0.091(0.0) 0.081(0.014) 0.091(0.0) -0.2 0.067(0.014) 0.0(0.0) 0.093(0.0) 0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) n−0.2 0.2 0.2 0.028(0.028) 0.0(0.0) 0.0(0.0) -0.2 0.091(0.0) 0.023(0.023) 0.0(0.0) 0.5 0.0(0.0) 0.021(0.008) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) 0.5 0.2 0.039(0.015) 0.0(0.0) 0.004(0.004) -0.2 0.044(0.013) 0.042(0.012) 0.0(0.0) 0.5 0.0(0.0) 0.046(0.009) 0.0(0.0) -0.5 0.015(0.007) 0.0(0.0) 0.0(0.0) 54 CHAPTER 3 BAYESIAN VARIABLE SELECTION IN A STRUCTURED REGRESSION MODEL In the history of statistics, estimation of β = (β1 , . . . , βp )⊤ in a regression model, y = Xβ+e, has been one of the most popular research topics, where, X ∈ Rn×p is a given design matrix, y ∈ Rn is a response vector, and e ∼ M V N (0, I) is an error term. We consider a high- dimensional problem n < p, with sparse regression coefficients in which some βi ’s are non- zero and others are exactly zero. In this sparse high-dimensional setting, many previous literatures have considered Bayesian approaches to variable selection (Ray and Szabó, 2021; Ročková and George, 2018; Martin et al., 2017; Ročková and George, 2014; Carbonetto and Stephens, 2012). In addition to the sparsity on β, we consider a structurally dependent feature space. Dependent feature vector commonly occurs in genetics, neuroimaging, and image analysis. For example, suppose that an image is given in which each pixel has a value. There are signal areas in the image such that the pixels in the areas have non-zero values and the others pixels are all zeros. 
The shape of the signal areas varies such as rectangles, circles, or letters (See Figure 3.1). The image can be represented by a matrix and corresponding β is the vectorization of the matrix. In such structured regression models, Figure 3.1: Example of the structured regression coefficients. White pixels denote the cor- responding βi ’s are zeros and darker pixels denote the corresponding βi ’s have larger values. problems may arise with the classic variable selection methods. In order to use the structural information and select signal features, Li and Zhang (2010) employed latent binary variables 55 γ = (γ1 , . . . , γp )⊤ ∈ {0, 1}p with Ising prior on it and a spike and slab prior on β for approximated the posterior selection probabilities by adapting Gibbs sampling which involves a matrix inversion. Due to the computational cost of the matrix inversion, Li and Zhang (2010)’s method are not scalable when p or the degree of the underlying graph is large. Chang et al. (2018) proposed a Bayesian shrinkage approach via EM algorithm which is scalable to high dimensional settings. We suggest a variational Bayes method (Jordan et al., 1999) to simultaneously select signal coefficients and estimate the magnitudes of the signals where a predetermined coupling matrix is considered to incorporate the network structure of features. The coupling matrix in our method does not necessarily represent the true connection between features, which allows us to utilize a common structure of a coupling matrix such as k-nearest neighbor in image analysis. 3.1 Model and methodology Throughout this chapter, we assume that pn is the number of covariates which depends on the number of observations n. In a linear regression model, let β ∈ Rpn and σ 2 > 0 denote the regression coefficients and the residual variance respectively. Since we consider high-dimensional setup, we assume that pn > n and pn → ∞ as n → ∞. Then, the linear regression model is written as: y = Xβ + e, (3.1) where y = (y1 , . . . , yn )⊤ is an observed response vector, X ∈ Rn×pn is a given design matrix, and e = (e1 , . . . , en )⊤ ∼ M V N (0, σ 2 In ). Since estimating σ 2 is not of our in- terest, with out loss of generality, we assume σ 2 = 1. Our Bayesian variable selection method is easily extendable to the case of unknown σ 2 with an appropriate prior distribu- tion. Now, there are pn unknown parameters in the regression model. In addition to the high-dimensional regression setup, we further assume structured covariates for which it is desirable to capture the intrinsic dependence among covariates. Introducing binary latent variables γ = (γ1 , . . . , γpn )⊤ ∈ {−1, 1}pn , we perform Bayesian variable selection using a 56 spike and slab method as follows: pn ! 1 X p(γ) = exp (b0 γi mi (γ) + a0 γi − log cosh(b0 mi (γ) + a0 )) , 2pn i=1 pn (3.2) Y p(β | γ) = p(βi | γi ), i=1     1 − γi 1 + γi 1(βi = 0) + N 0, τ 2 ,  p(βi | γi ) ∼ 2 2 where m(γ) := (m1 (γ), . . . , mpn (γ))⊤ = Jn γ, Jn ∈ Rpn ×pn is a coupling matrix, and N (0, τ 2 ) denotes a Gaussian distribution with probability density function as follows: β2   1 p(βi | γi = 1) = √ exp − i2 . 2πτ 2 2τ One can notice that p(γ) is a pseudo-likelihood of Ising model. To approximate the unknown posterior distributions, we define a variational family as follows: pn ! 1 X q(γ) = pn exp (bγi mi (γ) + ai γi − log cosh(bmi (γ) + ai )) , 2 i=1 pn (3.3) Y q(β | γ) = q(βi | γi ), i=1     1 − γi 1 + γi 1(βi = 0) + N µi , σi2 .  
q(βi | γi ) ∼ 2 2 Note that we have 3pn + 1 variational parameters needed to be updated. We point out that using the multiple threshold parameters a = (a1 , . . . , apn )⊤ , the posterior inclusion probability, that is, the probability that γi = 1 given y, can be different over i, which is desirable for variable selection. 57 3.1.1 ELBO optimization Let ν = (a⊤ , b, µ⊤ , σ ⊤ ) denote the set of variational parameters, where a = (a1 , . . . , apn )⊤ , µ = (µ1 , . . . , µpn )⊤ , and σ = (σ1 , . . . , σpn )⊤ . Then, the negative ELBO is: L(ν) = Eq(β,γ) [log q(β, γ) − log p(β, γ, y)] = Eq(β,γ) [− log p (y | β)] + Eq(γ) [log q(γ) − log p(γ)] | {z } | {z } 1 2 (3.4) + Eq(β,γ) [log q(β | γ) − log p(β | γ)] . | {z } 3 The first term is:   1 ⊤ ⊤ 1 = Eq(β,γ) ⊤ ⊤  β X Xβ − 2y Xβ + y y + C 2   !2   n pn pn 1 X X X 1 ⊤  X·i⊤ yβi  +  = Eq(β,γ)   Xki βi − 2 y y +C 2 k=1 i=1 i=1 2 " n p pn n X # n 1 XX 2 2 X ⊤ X = Eq(β,γ) Xki βi − 2 X·i yβi + Xki Xkl βi βl + C, 2 k=1 i=1 i=1 k=1 i̸=l where X·i is i-th column of X. We compute the closed form of the above expectation: " n p pn n X # n XX X X Eq(β,γ) Xki2 2 βi − 2 X·i⊤ yβi + Xki Xkl βi βl k=1 i=1 i=1 k=1 i̸=l n pn pn XX X = Xki2 ϕi Eβi ∼N (µi ,σi2 ) [βi2 ] −2 X·i⊤ yϕi Eβi ∼N (µi ,σi2 ) [βi ] k=1 i=1 i=1 X n X + Xki Xkl ϕi ϕl Eβi ∼N (µi ,σi2 ) [βi ]Eβl ∼N (µl ,σl2 ) [βl ] k=1 i̸=l n X pn pn n X X X X = Xki2 ϕi (µ2i + σi2 ) −2 X·i⊤ yϕi µi + Xki Xkl ϕi ϕl µi µl , k=1 i=1 i=1 k=1 i̸=l where ϕi = eai +bmi (γ) / eai +bmi (γ) + e−ai −bmi (γ) is the marginal probability q(γi = 1). Sim-  ilarly, let ϕi0 = ea0 +b0 mi (γ) / ea0 +b0 mi (γ) + e−a0 −b0 mi (γ) denote the marginal probability in-  duced from the prior p(γ). The closed form of the second term in the negative ELBO (3.4) 58 is: pn " # X 2 = Eq(γ) (log ϕi − log ϕ0i ) i=1 p X = ϕi (log ϕi − log ϕ0i ) + (1 − ϕi ) (log(1 − ϕi ) − log(1 − ϕ0i )) . i=1 The last one is: pn " # X 3 = Eq(β,γ) (log q(βi | γi ) − log p(βi | γi )) i=1 pn X ϕi KL N (µi , σi2 ), N (0, τ 2 )  = i=1 pn X ϕi  µ2 + σ 2  i i = − 1 + 2 log τ − 2 log σi . i=1 2 τ2 The last result of −( 1 + 2 + 3 ) provides the closed form of the ELBO. Let ϕ := √ (ϕ1 , . . . , ϕpn )⊤ and let ϕ0.5 := ( ϕ1 , . . . , ϕpn )⊤ . Then, in a matrix notation, we write: p n X pn p n X X X X 2· 1 = Xki2 ϕi (µ2i + σi2 ) − 2 X·i⊤ yϕi µi + Xki Xkl ϕi ϕl µi µl + C k=1 i=1 i=1 k=1 i̸=l ⊤ ⊤ = ϕ0.5 ◦ µ X ⊤ X ◦ Ipn ϕ0.5 ◦ µ + ϕ0.5 ◦ σ X ⊤ X ◦ Ipn ϕ0.5 ◦ σ     − 2y ⊤ X(ϕ ◦ µ) + (ϕ ◦ µ)⊤ X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ) + C, (3.5)  where ◦ denote the element-wise product and 1pn ×pn ∈ Rpn ×pn is a matrix whose elements are all ones. Define another vector ϕ0 := (ϕ01 , . . . , ϕ0pn )⊤ and let log(·) mean element-wise logarithm when the input is a vector of a matrix. Then, p X 2 = ϕi (log ϕi − log ϕ0i ) + (1 − ϕi ) (log(1 − ϕi ) − log(1 − ϕ0i )) i=1 = ϕ⊤ (log ϕ − log ϕ0 ) + (1 − ϕ)⊤ (log (1 − ϕ) − log (1 − ϕ0 )) , and pn ϕi µ2i + σi2 X   3 = − 1 + 2 log τ − 2 log σi i=1 2 τ2 = 0.5ϕ⊤ τ −2 (µ ◦ µ + σ ◦ σ) + (2 log τ − 1) · 1 − 2 log σ .  59 With the closed form of the ELBO, the gradients of the ELBO with respect to variational parameters are also easily computable in closed forms. First, we compute the gradients ∇ai ϕi and ∇b ϕi : 2e2(ai +bmi (γ)) ∇ai ϕi = 2, (e2(ai +bmi (γ)) + 1) 2mi (γ)e2(ai +bmi (γ)) ∇b ϕi = 2. (e2(ai +bmi (γ)) + 1) ⊤ Observe that ∇ai ϕj = 0 if i ̸= j. We define ∇a ϕ := ∇a1 ϕ1 , . . . , ∇apn ϕpn and ∇b ϕ := (∇b ϕ1 , . . . , ∇b ϕpn )⊤ . Then, by chain rule, we can compute the closed forms of all the gradients needed. 
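As a concrete illustration of this chain-rule step, a minimal Python sketch of φ and the gradients ∇a φ and ∇b φ is given below. It is not the implementation used for the experiments in this chapter, and the variable names (a, b, J, gamma) are our own placeholders; it simply evaluates the marginal probabilities φ_i and the two gradient formulas stated above for a given Gibbs draw γ.

```python
import numpy as np

# Minimal sketch of phi_i = q(gamma_i = 1) under the pseudo-Ising variational family (3.3),
# together with the gradients d phi_i / d a_i and d phi_i / d b used in the chain rule.
def phi_and_grads(a, b, J, gamma):
    m = J @ gamma                         # m_i(gamma) = sum_j J(i, j) gamma_j
    s = a + b * m
    phi = 1.0 / (1.0 + np.exp(-2.0 * s))  # = e^{2s} / (e^{2s} + 1)
    dphi_da = 2.0 * phi * (1.0 - phi)     # equals 2 e^{2s} / (e^{2s} + 1)^2
    dphi_db = m * dphi_da                 # extra factor m_i(gamma) for the b-gradient
    return phi, dphi_da, dphi_db

# Toy usage: p = 6 covariates on a fully connected coupling graph, one Gibbs draw gamma.
p = 6
J = (np.ones((p, p)) - np.eye(p)) / (p - 1)
gamma = np.array([1.0, 1.0, -1.0, 1.0, -1.0, -1.0])
phi, g_a, g_b = phi_and_grads(np.zeros(p), 1.0 / p, J, gamma)
```

These per-coordinate quantities are then combined, via the chain rule, with the closed forms of the three ELBO terms to produce the gradients listed next.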
Gradients with respect to a: 2 · ∇a 1 = X ⊤ X ◦ Ip (µ ◦ µ ◦ ∇a ϕ) + X ⊤ X ◦ Ip (σ ◦ σ ◦ ∇a ϕ)   − 2 X ⊤ y ◦ µ ◦ ∇a ϕ + 2 X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ) ◦ (µ ◦ ∇a ϕ) ,    ∇a 2 = ∇a ϕ ◦ (log ϕ − log ϕ0 ) − ∇a ϕ (log (1 − ϕ) − log (1 − ϕ0 )) , ∇a 3 = 0.5∇a ϕ ◦ τ −1 (µ ◦ µ + σ ◦ σ) + (2 log τ − 1) · 1 − 2 log σ .  Gradients with respect to b: 0.5∇b 1 ⊤ ⊤ = ∇b ϕ0.5 ◦ µ X ⊤ X ◦ Ipn ∇b ϕ0.5 ◦ µ + ∇b ϕ0.5 ◦ σ X ⊤ X ◦ Ipn ∇b ϕ0.5 ◦ σ     − 2y ⊤ X(∇b ϕ ◦ µ) + (∇b ϕ ◦ µ)⊤ X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ)  + (ϕ ◦ µ)⊤ X ⊤ X ◦ (1pn ×pn − Ipn ) (∇b ϕ ◦ µ),  ∇b 2 = ∇b ϕ⊤ (log ϕ − log ϕ0 ) − ∇b ϕ⊤ (log (1 − ϕ) − log (1 − ϕ0 )) , ∇b 3 = 0.5∇b ϕ⊤ τ −1 (µ ◦ µ + σ ◦ σ) + (2 log τ − 1) · 1 − 2 log σ .  60 Gradients with respect to µ: 0.5∇µ 1 = X ⊤ X ◦ Ip (µ ◦ ϕ) − 2(X ⊤ y) ◦ ϕ  + 2 X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ) ◦ ϕ,   ∇µ 3 = τ −2 ϕ ◦ µ. Gradients with respect to σ: 0.5∇σ 1 = X ⊤ X ◦ Ip (σ ◦ ϕ) ,  ∇µ 3 = τ −2 ϕ ◦ µ. 3.2 Implementation details Starting with initial variational parameters ν (0) , the first step for implementing the VB algorithm is to draw a γ from the current q(γ; b(t) , a(t) ) using Gibbs sampling. Note that we do not need to draw samples from q(β | γ). Given a γ, we can obtain the closed forms of ELBO and all the gradients ∇\ ν L(ν) described in the subsection 3.1.1. Then, we can update the current variational parameters as follows: ν (t+1) ← ν (t) + ηt (ν) · ∇\ ν L(ν). Note that, we allow that the learning rates ηt (ν) depend on variational parameters ν ∈ ν. Since we have 3pn + 1 variational parameters, a single learning rate does not guarantee the convergence of all the variational parameters. Therefore, we use adaptive learning rates such as Adam described in Algorithm 1.1. After the optimal variational parameters ν ∗ are achieved, we can empirically compute the marginal probability that γi = 1, i = 1, . . . , pn based on the samples: γ 1 , . . . , γ S ∼ q(γ; b∗ , a∗ ), 61 where γ s = (γs,1 , . . . , γs,pn )⊤ , s = 1 . . . , S. Then the marginal probability of γi is: 1 (γs,i = 1) PS ∗ q (γi = 1) = s=1 . S Based on the empirical marginal probabilities, we select the i-th feature if: q ∗ (γi = 1) > T, where T is a threshold (it could be a fixed number such as 0.5, or it could be the M -th largest marginal). For estimation of β, we use the (variational) posterior mode µ∗ . 3.3 Numerical results In this section, we numerically investigate our variable selection algorithm equipped with two different adaptive learning rates, Adam (Kingma and Ba, 2014) and RMSprop which is an unpublished algorithm first proposed in the Coursera course. “Neural Network for Machine Learning” by Geoff Hinton. Also, we compare our method with a MCMC based method (Li and Zhang, 2010). Li and Zhang (2010) approached the structural regression problem through the Bayesian variable selection framework, where the covariates lie on an undirected graph and formulate an Ising prior on the model space for incorporating structural information. Li and Zhang (2010) adopt the Gibbs sampling algorithms, first suggested by George and McCulloch (1993). 3.3.1 Li and Zhang (2010)’s Gibbs sampling scheme For the formulation of Li and Zhang (2010)’s Gibbs sampling method, we define γ (−i) = (γ1 , . . . , γi−1 , γi+1 , . . . , γp )⊤ ; let I(−i) be the set of indices {γj = 1 : j ̸= i}; I(i) = I(−i) ∪ {i}; p(i) = |I(i) | and p(−i) = |I(−i) |. With an Ising prior on γ ∈ {0, 1}p , p ! 
1 b0 ⊤ X p(γ) = exp γ Jγ + a0 γi , Z(a0 , b0 ) 2 i=1 the posterior distribution of γ given the data can be decomposed by Bayes formula as follows:  P γi = 1 | γ (−i) (3.6)  P γi = 1 | γ (−i) , y =  −1 , P γi = 1 | γ (−i) + BF i | γ (−i) · P γi = −1 | γ (−i) 62  P(y|γi =1,γ ) where BF i | γ (−i) = P y|γ =0,γ (−i) is the Bayes factor, and ( i (−i) )   P γi a0 +b0 j∈I γj  e (−i) P γi | γ (−i) = P a0 +b0 j∈I γj . 1+e (−i) The Bayes factor can be explicitly computed under the spike and slab prior on β: 1/2 y ⊤ y − y ⊤ X −1 ⊤ !n/2  |K (−i) | I (−i) K X (−i) I(−i) y BF i | γ (−i) = τ −1  −1 ⊤ , |K(i) | ⊤ ⊤ y y − y XI(i) K(i) XI(i) y where K(i) = XI⊤(i) XI(i) + τ −2 Ip(i) and K(−i) = XI⊤(−i) XI(−i) + τ −2 Ip(−i) . Based on Gibbs samples from the posterior probabilities in (3.6), Li and Zhang (2010) calculated the marginal posterior probabilities to find signal variables. 3.3.2 Hyper parameter selection To implement the algorithm with the prior distributions (3.2), three hyper parameters are needed to be selected. We follow Li and Zhang (2010) for the hyper parameter choices. Li and Zhang (2010) considered reparametrizations for the hyper parameters (a0 , b0 ) as follows: a0 = log(r/w02 ) and b0 = log(w1 · w0 ), where w0 = rw1 + 1 − r. Li and Zhang (2010) used fixed r = 0.03 and various w1 . Table 3.1 shows examples of hyper parameter choices. Note that w1 = 1 corresponds to the independent Bernoulli prior. w1 a0 b0 1 -3.5066 0 5 -3.7332 1.7228 9 -3.9368 2.4123 Table 3.1: Examples of hyper parameter choices 3.3.3 ROC curve In this subsection, we compute ROC curves to see how performance changes as hyper pa- rameters change in two scenarios. 63 Scenario 1: For the first scenario, we assume the γi s are arranged on a circle clockwise. Then, each γi has two neighbors, γi−1 and γi+1 , where γ0 = γp and γp+1 = γ1 , where n = 100 and p = 2000. Figure 3.2 shows the connectivity among γ. For the true regression coefficients, we use:  0.3, if i ∈ A and i is odd,        βi =  0.6, if i ∈ A and i is even,    0, otherwise,   where A = {i : i ∈ [245, 260] ∪ [745, 760] ∪ [1245, 1260] ∪ [1745, 1760]}. Figure 3.2: γ on a circle For design matrix X ∈ Rn×p , we first assume each component follows independent stan- dard Gaussian distribution, i.e., Xij ∼ N (0, 1). Figure 3.3 demonstrates the ROC curves corresponding to the hyper parameters w1 = 1 and w1 = 3. Figure 3.4 contains the results for higher w1 with independent X. 64 Figure 3.3: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the covariates are independent. Figure 3.4: ROC curves for the three variable selection methods with the hyper parameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are independent. For the second type of X ∈ Rn×p , we consider correlated X. In the blocks [241, 265], [741, 765], [1241, 1265], and [1741, 1765], we let corr (X·i , X·j ) = 0.75−0.03|i−j|, where X·i is i-th column of X. Also, as a noise, we let X be correlated as corr (X·i , X·j ) = 0.4−0.02|i−j|, in four blocks which do not include a signal, [41, 60], [941, 960], [1041, 1060], and [1941, 1960]. 65 With the hyper parameters w1 = 1, w1 = 3, w1 = 5, w1 = 7, and w1 = 9, the corresponding ROC curves are demonstrated Figure 3.5 and Figure 3.6. Figure 3.5: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the covariates are correlated. 
Figure 3.6: ROC curves for the three variable selection methods with the hyper parameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are correlated. Scenario 2: For the second scenario, we assume the γi s are arranged on a lattice such that each γi is connected with less then or equal to four neighbors. The four γi s at the corner 66 of the lattice have two neighbors each, the γi s at the boundary of the lattice have three neighbors, and the others have four neighbors. The connectivity is shown in Figure 3.7. Figure 3.7: γ on a lattice. For the true regression coefficients, we consider a vectorization of an image in which there are three signal areas. Each pixel in the signal areas takes a nonzero value, βi ∈ {0.3, 0.6}, and all the other pixels are zeros. Figure 3.8 shows the signal areas in the image. The black pixels represent larger value of nonzero βi s (0.6), the grey pixels represent smaller value of nonzero βi s (0.3), and the white pixels represent zero βi s. We use the vecotrization of the image (matrix) as the vector of true regression coefficients in scenario 2. 67 Figure 3.8: Signal areas in an image For design matrix X ∈ Rn×p in this scenario, we also consider two types of X. We first use independent standard Gaussian distribution as in the scenario 1. Figure 3.9 demonstrates the ROC curves corresponding to the hyper parameters w1 = 1, w1 = 3, and w1 = 5. Figure 3.9: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are independent. 68 When w1 is higher than 5, Li and Zhang (2010)’s approach does not work because of a phase transition. A phase transition can occur when the space of γ is two-dimensional (lattice). One can see Stanley (1971) for more details of phase transition in Ising model. Figure 3.10 contains the results for higher w1 = 7 and w1 = 9 when X is independent. Figure 3.10: ROC curves in scenario 2 for the two VB methods with the hyper parameter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are independent. For the second type of design matrix in scenario 2, we consider the same structure of γ, that is, corr (X·i , X·j ) > 0 if γi and γj are connected:  0.3 if γi and γj are connected,   corr (X·i , X·j ) = 0, otherwise.   Using the correlated X, the ROC curves with w1 = 1, w1 = 3, and w1 = 5 are shown in Figure 3.11 and the ROC curves for w1 = 7 and w1 = 9 are shown in Figure 3.12. 69 Figure 3.11: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are correlated. Figure 3.12: ROC curves in scenario 2 for the two VB methods with the hyper parameter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are correlated. 3.4 Theoretical results In this section, we describe theoretical results. To establish the selection consistency of our variational Bayes method, we first observe the true posterior with the true Ising prior on γ 70 as follows: p ! 1 b0 ⊤ X p(γ) = exp γ Jn γ + a0 γi , Zn (a0 , b0 ) 2 i=1 where Jn ∈ Rpn ×pn is a given coupling matrix. We define an activation set Aγ = {i : γi = 1} and, given γ = g, let β g be the vector of non-zero βi ’s and Xg be the corresponding design matrix of size n × |β g |. Defining |g| := |β g | = |Ag |, the joint posterior is: π(β, γ = g | y) ∝ p(γ = g)p(β | γ = g)p(y | β g ) pn !   
X b0 ⊤ 1 ⊤  ∝ exp a0 gi + g Jn g exp − y − Xg β g y − Xg β g i=1 2 2 Y Y × p(βi | γi = 1) p(βi | γi = −1) i∈Ag i∈Acg pn !   X b0 ⊤ 1 ⊤  ∝ exp a0 gi + g Jn g exp − y − Xg β g y − Xg β g i=1 2 2  Y  |βg | 1 ⊤ × 2πτ 2 − 2 exp − 2 β g β g 1 (βi = 0) 2τ i∈Acg pn (3.7) ! b0 ⊤ − |β2g | Y 1 (βi = 0) X ∝ exp a0 gi + g Jn g 2πτ 2 i=1 2 i∈Acg    1 ⊤  1 ⊤ × exp − y − Xg β g y − Xg β g + 2 β g β g 2 τ pn ! b0 − |β2g | Y 1 (βi = 0) X ∝ exp a0 gi + g ⊤ Jn g 2πτ 2 i=1 2 i∈Acg pn ! b0 ⊤ − |β2g | Y 1 (βi = 0) X ∝ exp a0 gi + g Jn g 2πτ 2 i=1 2 i∈Agc     1   ⊤ 1 ⊤ × exp − β g − β̃ g X g Xg + 2 I β g − β̃ g 2 τ      1 ⊤ ⊤ 1 × exp − y y − β̃ g Xg Xg + 2 I β̃ g , 2 τ 71 −1 where β̃ g = Xg⊤ Xg + τ 1 2 I Xg⊤ y. To get the marginal posterior of γ = g, we integrate β g out first: π(β −g , γ = g | y) Z = π(β, γ = g | y)dβ g pn   ! 1 bn ⊤ 1 (βi = 0) X Y ∝ exp − 2 Rg exp a0 gi + g Jg 2σ i=1 2 i∈Agc Z |β |     g 1   ⊤ 1 2 2 − 2 ⊤  × 2πσ τ exp − 2 β g − β̃ g X g Xg + 2 I β g − β̃ g dβ g 2σ τ pn   ! 1 1 b 1 (βi = 0) , 1 X n Y ∝ |Xg⊤ Xg + 2 I|− 2 exp − 2 Rg exp a0 gi + g ⊤ Jg τ 2σ i=1 2 i∈Ac g  −1  where Rg = y ⊤ I − Xg Xg⊤ Xg + 1 τ2 I Xg⊤ y. Now, we integrate β −g out: π(γ = g | y) Z = π(β −g , γ = g | y)dβ −g pn   !Z 1 1 b 1 (βi = 0) dβ−g 1 X 0 Y ∝ |Xg⊤ Xg + 2 I|− 2 exp − 2 Rg exp a0 gi + g ⊤ Jg τ 2σ i=1 2 i∈Ag c pn   ! 1 1 1 X b0 = |Xg⊤ Xg + 2 I|− 2 exp − 2 Rg exp a0 gi + g ⊤ Jg . τ 2σ i=1 2 −1/2 Let Qg = Xg⊤ Xg + 1 τ2 I . Then, the posterior ratio is: P R(g, t) π(γ = g | y) = π(γ = t | y)   pn pn ! (3.8) Qg 1 1 X b0 ⊤ X b0 ⊤ = exp − 2 Rg + 2 Rt exp a0 gi + g Jg − a0 ti − t Jt , Qt 2σ 2σ i=1 2 i=1 2 |{z} | {z }| a {z } b c where t denotes the true model. Our first result is described in the following subsection. 72 3.4.1 True posterior consistency with true Ising prior We let λnmax denote the maximum eigenvalue of the Gram matrix X ⊤ X/n, and for ν > 0, as like Narisetty and He (2014), we define: ! n Xg⊤ Xg mn (ν) = pn ∧ and λnmin (ν) := inf ϕ# min , (3.9) (2 + ν) log pn |g|≤mn (ν) n where ϕ# min (A) denotes the minimum nonzero eigenvalue of a matrix A. To establish the selection consistency of the true posterior under the true Ising prior, we need following regularity conditions (Yang and Shen, 2017): Condition 1. (On dimension pn ). pn = endn for some dn → 0 as n → ∞, that is, log pn = o(n). Condition 2. (On prior parameters). nτ 2 ∼ (n ∨ p2+3δ n ), for some δ > 0, a0 ∼ −ndn , and 1 b0 ∼ kmax pn , where kmax is the maximum row sum of Jn , that is, pn X kmax = max Jn (i, j). i∈{1,...,pn } j=1 Condition 3. (On true model). y | X ∼ Nn (Xt β t + Xtc β tc , σ 2 I), where the size of the true model |t| is fixed. Besides, |t|/2 < rt < |t|, where rt is the rank of Xt . For any fixed K > 0, define ∆n (K) := inf ||(I − Pg )Xt β t ||22 , {g:|g| 1 + 8/δ such that ∆n (K) > γn := 5σ 2 |t|(1 + √ δ) log( n ∨ pn ). Condition 5. (Regularity of the design). For some ν < δ, κ < (K − 1)δ/2, n ∨ p2+2δ   −κ λnmax ∼ O(1) and λnmin ⪰ ∨ pn . nτ 2 73 Theorem 2. Under the Conditions 1 - 5, the posterior probability of the true model goes to 1, that is, X P R(g, t) → 0, g̸=t as n goes to infinity. Narisetty and He (2014) separated the model space into four disjoint parts: M1 = {g : rg > mn }, M2 = {g : g ⊃ t, rg ≤ mn }, M3 = {g : g ̸⊃ t, K|t| < rg ≤ mn }, M4 = {g : g ̸⊃ t, rg ≤ K|t|}. To prove the Theorem 2, we show P R(g, t) → 0 in each subspace. P g Lemma 16. For a universal constant c′ > 0, Qg −(rg∗ −|t|)/2 n −|t|/2 ≤ c′ nτ 2 λnmin (λmin ) Qt Proof. 
Lemma 16. For a universal constant $c' > 0$,
\[
\frac{Q_g}{Q_t} \le c' \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t|)/2}\big(\lambda^n_{\min}\big)^{-|t|/2},
\]
where $r_g^* := r_g \wedge m_n$.

Proof. From Lemma 11.1 in Narisetty and He (2014), we have
\[
Q_g = \Big|I + \frac{1}{\tau^2}X_g^\top X_g\Big|^{-1/2} = \big|I + \tau^2 X_g X_g^\top\big|^{-1/2}.
\]
Note,
\[
a = \frac{Q_g}{Q_t} = \frac{Q_g}{Q_{g\wedge t}}\cdot\frac{Q_{g\wedge t}}{Q_t}.
\]
For the first factor,
\[
\begin{aligned}
\frac{Q_g}{Q_{g\wedge t}}
&= \big|I + \tau^2 X_g X_g^\top\big|^{-1/2}\big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big|^{1/2}\\
&\le \big(1 + \tau^2\lambda^n_{g,\min}\big)^{-r_g/2}\big(1 + \tau^2\lambda^n_{g\wedge t,\max}\big)^{r_{t\wedge g}/2}\\
&\le \big(\tau^2\lambda^n_{g,\min}\big)^{-r_g/2}\big(1 + n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g}/2}\\
&\le \big(n\tau^2\lambda^n_{\min}\big)^{-r_g/2}\big(1 + n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g}/2}\\
&\simeq \big(n\tau^2\lambda^n_{\min}\big)^{-r_g/2}\big(n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g}/2} \quad\text{for sufficiently large } n\\
&= \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g - r_{t\wedge g})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g}/2}\\
&\le \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_{t\wedge g})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g}/2}\\
&\le \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_t)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g}/2}\\
&\le C\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_t)/2}\big(\lambda^n_{\min}\big)^{-r_{t\wedge g}/2}.
\end{aligned}
\]
For the second factor,
\[
\begin{aligned}
\frac{Q_{g\wedge t}}{Q_t}
&= \big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big|^{-1/2}\big|I + \tau^2 X_t X_t^\top\big|^{1/2}\\
&= \big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top + \tau^2 X_{g^c\wedge t}X_{g^c\wedge t}^\top\big|^{1/2}\big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big|^{-1/2}\\
&= \big|\big(I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big)^{-1}\big(I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top + \tau^2 X_{g^c\wedge t}X_{g^c\wedge t}^\top\big)\big|^{1/2}\\
&= \big|I + \tau^2 X_{g^c\wedge t}^\top\big(I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big)^{-1}X_{g^c\wedge t}\big|^{1/2}\\
&\le \big|I + \tau^2 X_{g^c\wedge t}^\top X_{g^c\wedge t}\big|^{1/2}
= \big|I + \tau^2 X_{g^c\wedge t}X_{g^c\wedge t}^\top\big|^{1/2}\\
&\le \big(1 + \tau^2\lambda^n_{g^c\wedge t,\max}\big)^{r_{t\wedge g^c}/2}
\le \big(1 + n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g^c}/2}\\
&\simeq \big(n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g^c}/2}\quad\text{for sufficiently large } n\\
&= \big(n\tau^2\lambda^n_{\min}\big)^{r_{t\wedge g^c}/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g^c}/2}
\le C\big(n\tau^2\lambda^n_{\min}\big)^{r_{t\wedge g^c}/2}\big(\lambda^n_{\min}\big)^{-r_{t\wedge g^c}/2},
\end{aligned}
\]
using $\lambda^n_{\max} \sim O(1)$. From the above two inequalities,
\[
\begin{aligned}
a = \frac{Q_g}{Q_t}
&\le c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_{t\wedge g} - r_{t\wedge g^c})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{(r_{t\wedge g} + r_{t\wedge g^c})/2}\\
&\le c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t\wedge g| - |t\wedge g^c|)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{(|t\wedge g| + |t\wedge g^c|)/2}\\
&= c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t|)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}
\le c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t|)/2}\big(\lambda^n_{\min}\big)^{-|t|/2}.
\end{aligned}
\]

Using the same argument as in Lemma A.1 of Narisetty and He (2014), we can show that
\[
R_g = y^\top\Big(I - X_g\Big(X_g^\top X_g + \frac{1}{\tau^2}I\Big)^{-1}X_g^\top\Big)y
= y^\top\big(I + \tau^2 X_g X_g^\top\big)^{-1}y \ge 0, \quad \forall g.
\tag{3.10}
\]

1. Models in $M_1$. From (3.10) and the supplement to Narisetty and He (2014),
\[
P\big(R_t - R_g > n(1+2s)\sigma^2\big) \le P\big(R_t > n(1+2s)\sigma^2\big) \le 2e^{-c'n}, \quad\text{uniformly for all } g.
\tag{3.11}
\]
Observe that on $M_1$, $r_g^* = m_n \ge n/\log(p_n^{2+\nu}) \ge n/\log(p_n^{2+\delta})$. Therefore, on the high-probability event $\{R_t - R_g \le n(1+2s)\sigma^2\}$, by (3.11) and the regularity conditions,
\[
\begin{aligned}
\sum_{g\in M_1} PR(g,t)
&\preceq \sum_{g\in M_1}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^*-|t|)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t)\\
&\preceq \sum_{g\in M_1} p_n^{-(1+\delta)(m_n - |t|)}\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{by Conditions 2 and 5}\\
&= \sum_{g\in M_1}\exp\big(-nd_n(1+\delta)(m_n - |t|)\big)\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{by Condition 1}\\
&\preceq \sum_{g\in M_1}\exp\Big(-\frac{n(1+\delta)}{2+\delta}\Big)\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{because } m_n \ge n/\log(p_n^{2+\delta})\\
&\preceq \sum_{g\in M_1}\exp\Big(-\frac{n(1+\delta)}{2+\delta}\Big)p_n^{\kappa|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{by Condition 5}\\
&\preceq \exp\Big(-\frac{n(1+\delta)}{2+\delta} + \frac{n(1+2s)}{2}\Big)p_n^{\kappa|t|/2}\exp\Big(-a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)\sum_{g\in M_1}\exp\Big(a_0\sum_{i=1}^{p_n}g_i + \frac{b_0}{2}g^\top J_n g\Big).
\end{aligned}
\]
Since the conditions on $a_0$ and $b_0$ bound the summation term, we have
\[
\sum_{g\in M_1} PR(g,t)
\preceq \exp\Big(\frac{nd_n\kappa|t|}{2}\Big)\exp\Big(-\frac{n(1+\delta)}{2+\delta} + \frac{n(1+2s)}{2}\Big)
\preceq \exp\Big(n\Big(\frac{1+2s}{2} - \frac{1+\delta}{2+\delta}\Big)\Big) \to 0,
\]
by Condition 1, when $(1+2s)/2 - (1+\delta)/(2+\delta) < 0$, i.e., $s < \delta/(2(2+\delta))$.

2. Models in $M_2$. Consider $0 < s \le \delta/8$, and define the events
\[
A(g) := \{R_t - R_g > 2\sigma^2(1+2s)(r_g - r_t)\log p_n\}, \qquad U(d) := \cup_{\{g:\, r_g = d\}} A(g).
\]
Recall that $P_g = X_g(X_g^\top X_g)^{-1}X_g^\top$. Since $R_g \ge R_g^* := y^\top(I - P_g)y$, we have
\[
\begin{aligned}
P[U(d)]
&= P\big[\cup_{\{g:r_g=d\}}\{R_t - R_g > 2\sigma^2(1+2s)(r_g - r_t)\log p_n\}\big]\\
&\le P\big[\cup_{\{g:r_g=d\}}\{R_t - R_g^* > 2\sigma^2(1+2s)(r_g - r_t)\log p_n\}\big]\\
&\le P\big[\cup_{\{g:r_g=d\}}\{R_t^* - R_g^* > \sigma^2(2+3s)(r_g - r_t)\log p_n\}\big]
 + P\big[R_t - R_t^* > s\sigma^2(d - r_t)\log p_n\big]\\
&\le c'p_n^{-(1+s)(d-r_t)}p_n^{(d-r_t)} + \exp(-c'n\log p_n)
\le 2c'p_n^{-s(d-r_t)}.
\end{aligned}
\]
Next, we consider the union of all such events $U(d)$, that is,
\[
P\big[\cup_{\{d>r_t\}}U(d)\big] \le \frac{2c'}{p_n^s - 1} \to 0.
\]
Observe that on $M_2$ we have $r_g^* = r_g$, $|t\wedge g| = |t|$, and $|t\wedge g^c| = 0$. On the high-probability event $\cap_{\{d>r_t\}}U(d)^c$,
\[
\begin{aligned}
\sum_{k\in M_2} PR(k,t)
&\preceq \sum_{k\in M_2}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_k^*-r_{t\wedge k})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(-\frac{1}{2\sigma^2}(R_k - R_t)\Big)\, c(k,t)\\
&\preceq \sum_{k\in M_2}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_k - r_t)/2}p_n^{(1+2s)(r_k - r_t)}\big(\lambda^n_{\min}\big)^{-|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_2}\big(p_n^{2+2\delta}\vee n\big)^{-(r_k - r_t)/2}p_n^{(1+2s)(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_2}\big(p_n^{-1-\delta}\wedge n^{-1/2}\big)^{(r_k - r_t)}p_n^{(1+2s)(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_2}\Big(p_n^{-\delta+2s}\wedge\frac{p_n^{1+2s}}{\sqrt{n}}\Big)^{(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \Big(p_n^{-3\delta/4}\wedge\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)p_n^{\kappa|t|/2}\exp\Big(-a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)\sum_{k\in M_2}\exp\Big(a_0\sum_{i=1}^{p_n}k_i + \frac{b_0}{2}k^\top J_n k\Big)\\
&\preceq p_n^{\kappa|t|/2 - 3\delta/4}\wedge\frac{p_n^{\kappa|t|/2 + 1 + \delta/4}}{\sqrt{n}} \to 0, \quad\text{for some } \delta.
\end{aligned}
\]

3. Models in $M_3$. We define
\[
B(k) := \{R_t - R_k > 2\sigma^2(1+2s)(r_k - r_t)\log p_n\}.
\]
Note that
\[
R_k = y^\top\big(I + \tau^2 X_k X_k^\top\big)^{-1}y, \qquad
R_{k\vee t} = y^\top\big(I + \tau^2 X_k X_k^\top + \tau^2 X_{k^c\wedge t}X_{k^c\wedge t}^\top\big)^{-1}y.
\]
Let $A = I + \tau^2 X_k X_k^\top$. Then
\[
\big(I + \tau^2 X_k X_k^\top + \tau^2 X_{k^c\wedge t}X_{k^c\wedge t}^\top\big)^{-1}
= \big(A + \tau^2 X_{k^c\wedge t}X_{k^c\wedge t}^\top\big)^{-1}
= A^{-1} - \tau^2 A^{-1}X_{k^c\wedge t}\big(I + \tau^2 X_{k^c\wedge t}^\top A^{-1}X_{k^c\wedge t}\big)^{-1}X_{k^c\wedge t}^\top A^{-1}.
\]
Therefore, $R_{k\vee t} \le R_k$. Let $V(d) := \cup_{\{k:\, r_k = d,\, k\in M_3\}} B(k)$. From Narisetty and He (2014),
\[
P[V(d)] \le P\big[\cup_{\{k:r_k=d,\,k\in M_3\}}\{R_t - R_{k\vee t} > 2\sigma^2(1+2s)(r_k - r_t)\log p_n\}\big] \le c'p_n^{-w'd}.
\]
Then
\[
P\big[\cup_{\{d>K|t|\}}V(d)\big] \le p_n^{-w'K|t|} \to 0, \quad\text{as } n \to \infty.
\]
Restricting our attention to the high-probability event $\cap_{\{d>r_t\}}V(d)^c$,
\[
\begin{aligned}
\sum_{k\in M_3} PR(k,t)
&\preceq \sum_{k\in M_3}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_k - r_{t\wedge k})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(-\frac{1}{2\sigma^2}(R_k - R_t)\Big)\, c(k,t)\\
&\preceq \sum_{k\in M_3}\big(p_n^{1+\delta}\vee\sqrt{n}\big)^{-(r_k - r_t)}\big(\lambda^n_{\min}\big)^{-|t|/2}p_n^{(1+2s)(r_k - r_t)}\, c(k,t)\\
&\preceq \sum_{k\in M_3}\big(p_n^{\delta-2s}\vee\sqrt{n}\,p_n^{-1-2s}\big)^{-(r_k - r_t)}\big(\lambda^n_{\min}\big)^{-|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_3}\Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(K-1)r_t + 1}p_n^{\kappa|t|/2}\exp\Big(-a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)\sum_{k\in M_3}\exp\Big(a_0\sum_{i=1}^{p_n}k_i + \frac{b_0}{2}k^\top J_n k\Big)\\
&\preceq \Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(K-1)|t|/2}p_n^{\delta(K-1)|t|/4}\exp(-a_0|t|)\\
&\sim \Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(K-1)|t|/2}\exp(nd_n|t|)\\
&\sim \exp\Big(-nd_n\frac{3\delta(K-1)|t|}{8}\Big)\exp(nd_n|t|)
= \exp\Big(-nd_n|t|\Big(\frac{3\delta(K-1)}{8} - 1\Big)\Big) \to 0.
\end{aligned}
\]

4. Models in $M_4$. If $c \in (0,1)$,
\[
P\big[\cup_{\{k\in M_4\}}\{R_k - R_t < \Delta_n(1-c)\}\big]
\le P\big[\cup_{\{k\in M_4\}}\{R_k - R_{k\vee t} < \Delta_n(1-c)\}\big]
\le 2\exp(-c'\Delta_n) \to 0.
\]
Observe that
\[
\exp\Big(a_0\sum_{i=1}^{p_n}k_i + \frac{b_0}{2}k^\top J_n k - a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)
= \exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^* - a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)
\exp\big(-b_0 k^\top J_n \mathbf{1} + b_0 t^\top J_n \mathbf{1}\big),
\]
where $k^* = k + \mathbf{1}$ and $t^* = t + \mathbf{1}$. By restricting to the high-probability event $C_n := \{R_k - R_t \ge \Delta_n(1-c),\ \forall k\in M_4\}$,
\[
\begin{aligned}
\sum_{k\in M_4} PR(k,t)
&\preceq \sum_{k\in M_4}\big(n\tau^2\lambda^n_{\min}\big)^{|t|/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(-\frac{1}{2\sigma^2}(R_k - R_t)\Big)\\
&\qquad\times
\exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^* - a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)
\exp\big(-b_0 k^\top J_n \mathbf{1} + b_0 t^\top J_n \mathbf{1}\big)\\
&\preceq \sum_{k\in M_4}\big(n\tau^2\lambda^n_{\min}\big)^{|t|/2}\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\big(-\Delta_n(1-c)/2\sigma^2\big)
\exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^* - a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)\\
&\preceq \big(p_n^{2+3\delta}\vee n\big)^{|t|/2}p_n^{\delta|t|/2}\exp\big(-\Delta_n(1-c)/2\sigma^2\big)
\exp\Big(-a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)\sum_{k\in M_4}\exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^*\Big)\\
&\sim \exp\Big(-\frac{1}{2\sigma^2}\Big(\Delta_n(1-c) - \sigma^2|t|\log\big(p_n^{2+3\delta}\vee n\big) - \sigma^2|t|\delta\log p_n + 2\sigma^2 a_0\sum_{i=1}^{p_n}t_i^*\Big)\Big)\\
&\sim \exp\Big(-\frac{1}{2\sigma^2}\Big(\Delta_n(1-c) - \sigma^2|t|\log\big(p_n^{2+3\delta}\vee n\big) - \sigma^2|t|(4+\delta)\log p_n\Big)\Big)\\
&\preceq \exp\Big(-\frac{1}{2\sigma^2}\big(\Delta_n(1-c) - \sigma^2|t|(6+4\delta)\log p_n\big)\Big)\\
&\preceq \exp\Big(-\frac{1}{2\sigma^2}\big(\Delta_n(1-c) - \sigma^2|t|(4+4\delta)\log p_n\big)\Big)\\
&\sim \exp\Big(-\frac{1}{2\sigma^2}\big(\Delta_n(1-c) - w'\gamma_n\big)\Big) \to 0,
\end{aligned}
\]
where $w' \in (0,1)$ and $c < 1 - w'$. Combining the four cases completes the proof of Theorem 2.

3.4.2 True posterior consistency with pseudo-Ising prior

In our variable selection algorithm, we used a pseudo-Ising prior on $\gamma$ instead of the true Ising prior. Replacing the true Ising prior with the pseudo-Ising prior we used, the posterior ratio becomes
\[
\widetilde{PR}(g,t) = \frac{\tilde\pi(\gamma = g\mid X, y, \sigma^2)}{\tilde\pi(\gamma = t\mid X, y, \sigma^2)}
= \underbrace{\frac{Q_g}{Q_t}}_{a}\;\underbrace{\exp\Big(-\frac{R_g}{2\sigma^2} + \frac{R_t}{2\sigma^2}\Big)}_{b}
\tag{3.12}
\]
\[
\times\exp\Big(a_0\mathbf{1}^\top g + b_0\, g^\top J_n g - \mathbf{1}^\top\log\cosh(b_0 J_n g + a_0\mathbf{1})
- a_0\mathbf{1}^\top t - b_0\, t^\top J_n t + \mathbf{1}^\top\log\cosh(b_0 J_n t + a_0\mathbf{1})\Big).
\tag{3.13, 3.14}
\]
Theorem 3 indicates that the posterior ratio above goes to zero for all $g \ne t$:

Theorem 3. Under Conditions 1–5, the posterior probability of the true model under the pseudo-Ising prior goes to 1, that is,
\[
\sum_{g\ne t}\widetilde{PR}(g,t) \to 0,
\]
as $n$ goes to infinity.

Note that the terms $a$ and $b$ in (3.12) are the same as in (3.8). Therefore, we only need to be able to control the last exponential term. Observe that
\[
\mathbf{1}^\top\log\cosh(b_0 J_n t + a_0\mathbf{1}) - \mathbf{1}^\top\log\cosh(b_0 J_n g + a_0\mathbf{1}) \to 0.
\]
The convergence above is due to Condition 2, and it allows us to use the same arguments as in Theorem 2 to prove Theorem 3.

3.4.3 Bounded KL divergence

In this subsection, we provide an upper bound on the KL divergence between $\tilde q(\beta,\gamma)$ and $\tilde\pi(\beta,\gamma\mid y)$ with appropriate choices of the variational parameters, where $\tilde q(\beta,\gamma)$ is a variational distribution with a pseudo-Ising distribution on $\gamma$ and $\tilde\pi(\beta,\gamma\mid y)$ is the true posterior with the pseudo-Ising prior on $\gamma$:
\[
\mathrm{KL}\big(\tilde q(\beta,\gamma)\,\|\,\tilde\pi(\beta,\gamma\mid y)\big)
= \mathrm{KL}\big(\tilde q(\gamma)\,\|\,\tilde p(\gamma)\big)
+ E_{\tilde q(\beta,\gamma)}\big[\log q(\beta\mid\gamma) - \log p(\beta\mid\gamma)\big]
- E_{\tilde q(\beta,\gamma)}\Big[\log\frac{p(y\mid\beta)}{p(y\mid\beta_t)}\Big] + C.
\tag{3.15}
\]

Theorem 4. There exists a variational distribution in the variational family (3.3) which satisfies
\[
\mathrm{KL}\big(\tilde q(\beta,\gamma)\,\|\,\tilde\pi(\beta,\gamma\mid y)\big) = o(n).
\]

To prove Theorem 4, observe the first term in (3.15):
\[
\begin{aligned}
\mathrm{KL}\big(\tilde q(\gamma)\,\|\,\tilde p(\gamma)\big)
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\big(\log\tilde q(\gamma) - \log\tilde p(\gamma)\big)\\
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i=1}^{p_n}(a_i - a_0)\gamma_i\Big)
+ \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i=1}^{p_n}\big(\log\cosh(b_0 m_i(\gamma) + a_0) - \log\cosh(b\, m_i(\gamma) + a_i)\big)\Big)\\
&\qquad + \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\big(b\,\gamma^\top J_n\gamma - b_0\,\gamma^\top J_n\gamma\big).
\end{aligned}
\tag{3.16}
\]
We consider the variational parameters below:
\[
b = \frac{1}{p_n}
\quad\text{and}\quad
a_i = \begin{cases} nd_n & \text{if } i\in A,\\ -nd_n & \text{if } i\in A^c.\end{cases}
\tag{3.17}
\]

Lemma 17. For any $g\ne t$, with the choices of the variational parameters in (3.17), $\tilde q(\gamma = g)\to 0$.

Proof. Since $b = \frac{1}{p_n}$, $b\, g^\top J_n g$ is bounded and $b\, m_i(g)$ decreases to zero at the rate $\frac{1}{p_n}$. Therefore,
\[
\begin{aligned}
\tilde q(\gamma = g)
&= 2^{-p_n}\exp\Big(b\, g^\top J_n g + \sum_{i=1}^{p_n}\big(a_i g_i - \log\cosh(b\, m_i(g) + a_i)\big)\Big)\\
&= 2^{-p_n}C\exp\Big(\sum_{i=1}^{p_n}\big(a_i g_i - \log\cosh(b\, m_i(g) + a_i)\big)\Big)\\
&\simeq 2^{-p_n}C\exp\Big(\sum_{i=1}^{p_n}\big(a_i g_i - \log\cosh(a_i)\big)\Big)\\
&= 2^{-p_n}C\exp\Big(\sum_{i\in A}a_i g_i + \sum_{i\in A^c}a_i g_i - \sum_{i=1}^{p_n}\log\cosh(nd_n)\Big)\\
&= 2^{-p_n}C\exp\Big(nd_n\sum_{i\in A}g_i - nd_n\sum_{i\in A^c}g_i - p_n\log\cosh(nd_n)\Big).
\end{aligned}
\tag{3.18}
\]
Observe that $nd_n\sum_{i\in A}g_i - nd_n\sum_{i\in A^c}g_i = nd_n(|B| - |B^c|)$, where $B = \{i : g_i = t_i\}$.
When $nd_n(|B| - |B^c|) < 0$, it is easy to show that (3.18) $\to 0$. Provided $g \ne t$, the upper bound of $|B| - |B^c|$ is $p_n - 2$, so that
\[
\begin{aligned}
2^{-p_n}C\exp\Big(nd_n\sum_{i\in A}g_i - nd_n\sum_{i\in A^c}g_i - p_n\log\cosh(nd_n)\Big)
&\le 2^{-p_n}C\exp\big(nd_n(p_n - 2) - p_n\log\cosh(nd_n)\big)\\
&= 2^{-p_n}C\exp\big(p_n(nd_n - \log\cosh(nd_n)) - 2nd_n\big)\\
&= C\Big(\frac{e^{nd_n - \log\cosh(nd_n)}}{2}\Big)^{p_n}e^{-2nd_n}
\simeq C\Big(\frac{e^{\log 2}}{2}\Big)^{p_n}e^{-2nd_n} \to 0.
\end{aligned}
\]
To derive the convergence above, we use the fact that $u - \log\cosh(u) \simeq \log 2$ for large $u$.

Using Lemma 17,
\[
\begin{aligned}
\sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i=1}^{p_n}(a_i - a_0)\gamma_i\Big)
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i\in A}(a_i - a_0)\gamma_i + \sum_{i\in A^c}(a_i - a_0)\gamma_i\Big)\\
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i\in A}(2nd_n)\gamma_i + \sum_{i\in A^c}0\cdot\gamma_i\Big)\\
&\le (2nd_n|A|)\Big(\tilde q(t) + \sum_{\gamma\ne t}\tilde q(\gamma)\Big) = o(n).
\end{aligned}
\]
For the second term in (3.16), note that
\[
\sum_{i=1}^{p_n}\big(\log\cosh(b_0 m_i(\gamma) + a_0) - \log\cosh(b\, m_i(\gamma) + a_i)\big)
\sim \sum_{i=1}^{p_n}\log\frac{\exp\big(\frac{1}{p_n} + a_0\big) + \exp\big(\frac{1}{p_n} - a_0\big)}{\exp\big(\frac{1}{p_n} + a_i\big) + \exp\big(\frac{1}{p_n} - a_i\big)}.
\]
Combining the results above, we have
\[
\mathrm{KL}\big(\tilde q(\gamma)\,\|\,\tilde p(\gamma)\big) = o(n).
\]
Next, to control the other terms, we construct the variational parameters as follows:
\[
\sigma_i^2 = \begin{cases} 0 & \text{if } g_i = -1,\\ 1/p_n & \text{if } g_i = 1,\end{cases}
\tag{3.19}
\]
and
\[
\mu_i = \begin{cases} 0 & \text{if } g_i = -1,\\ \beta_{t,i} & \text{if } g_i = 1 \text{ and } t_i = 1,\\ 1/(n^{1/2}p_n) & \text{if } g_i = 1 \text{ and } t_i = -1.\end{cases}
\tag{3.20}
\]
Observe the second term in (3.15):
\[
\begin{aligned}
E_{\tilde q(\beta,\gamma)}\big[\log q(\beta\mid\gamma) - \log p(\beta\mid\gamma)\big]
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\,E_{q(\beta\mid\gamma)}\big[\log q(\beta\mid\gamma) - \log p(\beta\mid\gamma)\big]\\
&= \frac{1}{2}\sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)\big(\mathrm{tr}(\Sigma_g) - |g| + \mu_g^\top\mu_g - \log|\Sigma_g|\big),
\end{aligned}
\tag{3.21}
\]
where $\mu_g$ is the vector of the $\mu_i$ in (3.20), $\sigma_g$ is the vector of the $\sigma_i$ in (3.19), and $\Sigma_g = \mathrm{diag}(\sigma_g^2)$. With the variational parameters in (3.19) and (3.20), we can bound (3.21) by $o(n)$ using Lemma 17.

For the third term in (3.15), let $L_0 \sim N(X\beta_t, \sigma^2 I)$ and $L_\beta \sim N(X\beta, \sigma^2 I)$. Then
\[
\mathrm{KL}(L_0\,\|\,L_\beta) = (\beta - \beta_t)^\top X^\top X(\beta - \beta_t),
\]
and
\[
\begin{aligned}
E_{\tilde q(\beta,\gamma)}\big[\mathrm{KL}(L_0\,\|\,L_\beta)\big]
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\int(\beta - \beta_t)^\top X^\top X(\beta - \beta_t)\,\tilde q(\beta,\gamma)\,d\beta
= \sum_{\gamma\in\{-1,1\}^{p_n}}\int\|X(\beta - \beta_t)\|_2^2\,\tilde q(\beta,\gamma)\,d\beta\\
&= \sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)\,E_{\tilde q(\beta\mid\gamma = g)}\big[(\beta - \beta_t)^\top X^\top X(\beta - \beta_t)\big]\\
&= \sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)\Big[(\mu_g - \beta_t)^\top X^\top X(\mu_g - \beta_t) + \mathrm{tr}\big(X^\top X\Sigma_g\big)\Big].
\end{aligned}
\]
Note,
\[
\sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)(\mu_g - \beta_t)^\top X^\top X(\mu_g - \beta_t)
= \tilde q(\gamma = t)(\mu_t - \beta_t)^\top X^\top X(\mu_t - \beta_t)
+ \sum_{g\ne t}\tilde q(\gamma = g)(\mu_g - \beta_t)^\top X^\top X(\mu_g - \beta_t).
\]
Let $w := \mu_g - \beta_t$. Then
\[
X(\mu_g - \beta_t) = Xw = \sum_{i=1}^{p_n}w_i x_i,
\]
where $x_i$ is the $i$-th column of $X$, and
\[
\sum_{i=1}^{p_n}w_i x_i
= \sum_{\{i:\, g_i = -1\}}w_i x_i + \sum_{\{i:\, g_i = 1,\ t_i = 1\}}w_i x_i + \sum_{\{i:\, g_i = 1,\ t_i = -1\}}w_i x_i
= C + \sum_{\{i:\, g_i = 1,\ t_i = -1\}}w_i x_i.
\]
The order of the above summation is $o(n)$ due to (3.20).

3.4.4 Variational posterior consistency

In this subsection, we investigate a couple of ingredients for establishing variational posterior consistency, which is our ultimate goal.

Theorem 5. Let $q^*$ be the variational posterior obtained under the prior (3.2) and the variational family (3.3). Then, for any $g\ne t$, we have $q^*(\gamma = g)\to 0$ as $n$ goes to infinity.

First, we consider
\[
Z^{\tilde q} := Z_n^{\tilde q}(a,b) = \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma),
\qquad
Z^{\tilde p} := Z_n^{\tilde p}(a_0,b_0) = \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde p(\gamma).
\]
$Z^{\tilde q}$ and $Z^{\tilde p}$ are the normalizing constants of $\tilde q(\gamma)$ and $\tilde p(\gamma)$, respectively. We define two valid probability mass functions $q_Z(\gamma) = \frac{1}{Z_n^{\tilde q}}\tilde q(\gamma)$ and $p_Z(\gamma) = \frac{1}{Z_n^{\tilde p}}\tilde p(\gamma)$. Then the posterior distribution based on $p_Z(\gamma)$ is
\[
\pi_Z(\gamma, \beta\mid y)
= \frac{(Z^{\tilde p})^{-1}\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)}{\sum_{\gamma}\int_\beta(Z^{\tilde p})^{-1}\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)\,d\beta}
= \frac{\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)}{\sum_{\gamma}\int_\beta\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)\,d\beta}
= \tilde\pi(\gamma, \beta\mid y).
\]
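Since $Z^{\tilde q}$ and $Z^{\tilde p}$ are finite sums over $\{-1,1\}^{p_n}$, they can be computed exactly for small $p_n$, which is how values of the kind reported later in Table 3.2 can be checked. The sketch below (Python with NumPy; an illustration, not the code used for the dissertation) enumerates $\tilde q(\gamma)$ in the pseudo-likelihood form displayed in the proof of Lemma 17. The choice $m_i(\gamma) = (J_n\gamma)_i$ (the neighbour sum) and the particular 2×2-lattice coupling matrix J used in the example are assumptions made here; the exact $J_n$ behind Table 3.2 is not restated in this subsection, so the printed value need not match the table.

```python
import itertools
import numpy as np

def log_cosh(x):
    # log cosh(x), computed stably as logaddexp(x, -x) - log 2
    return np.logaddexp(x, -x) - np.log(2.0)

def pseudo_ising_log_mass(gamma, a, b, J):
    """log q~(gamma) in the form displayed in the proof of Lemma 17:
    q~(gamma) = 2^{-p} exp( b gamma^T J gamma
                            + sum_i [ a_i gamma_i - log cosh(b m_i(gamma) + a_i) ] ),
    with the neighbour sum m_i(gamma) = (J gamma)_i (an assumption made here)."""
    p = len(gamma)
    m = J @ gamma
    return (-p * np.log(2.0) + b * (gamma @ J @ gamma)
            + np.sum(a * gamma - log_cosh(b * m + a)))

def normalizing_constant(a, b, J):
    """Z^{q~}(a, b) = sum of q~(gamma) over gamma in {-1, +1}^p, by exact enumeration."""
    p = J.shape[0]
    return sum(np.exp(pseudo_ising_log_mass(np.array(conf), a, b, J))
               for conf in itertools.product([-1.0, 1.0], repeat=p))

# Example: p_n = 4 sites on a 2 x 2 lattice with nearest-neighbour couplings.
J = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
print(normalizing_constant(a=0.2 * np.ones(4), b=0.01, J=J))  # expected at or slightly above 1
```

For weak couplings b, the enumerated constant sits close to 1 from above, in line with Proposition 1 below.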
From the relation $\pi_Z(\gamma,\beta\mid y) = \tilde\pi(\gamma,\beta\mid y)$ above, we have
\[
\begin{aligned}
\mathrm{KL}\big(q_Z(\gamma,\beta)\,\|\,\pi_Z(\gamma,\beta\mid y)\big)
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\int_\beta q_Z(\gamma)q(\beta\mid\gamma)\big(\log q_Z(\gamma) + \log q(\beta\mid\gamma) - \log\pi_Z(\gamma,\beta\mid y)\big)\,d\beta\\
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\frac{\tilde q(\gamma)}{Z^{\tilde q}}\int_\beta q(\beta\mid\gamma)\big(\log\tilde q(\gamma) - \log Z^{\tilde q} + \log q(\beta\mid\gamma) - \log\tilde\pi(\gamma,\beta\mid y)\big)\,d\beta\\
&= \frac{\mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big)}{Z^{\tilde q}} - \log Z^{\tilde q}.
\end{aligned}
\]
Next, using Corollary 4.15 in Boucheron et al. (2013), we have
\[
\mathrm{KL}\big(q_Z(\gamma,\beta)\,\|\,\pi_Z(\gamma,\beta\mid y)\big)
\ge \int f\,dQ_Z - \log\int e^f\,d\Pi_Z(\cdot\mid y)
= \frac{1}{Z^{\tilde q}}\int f\,d\tilde Q - \log\int e^f\,d\tilde\Pi(\cdot\mid y),
\]
where $f$ is any function. Reorganizing the relation above, we have
\[
\int f\,d\tilde Q
\le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big)
- Z^{\tilde q}\log Z^{\tilde q}
+ Z^{\tilde q}\log\int e^f\,d\tilde\Pi(\cdot\mid y).
\tag{3.22}
\]
To use the relation (3.22), we need to show that $Z^{\tilde q}$ is bounded below by one.

Proposition 1. Let $\tilde q(\gamma)$ be the pseudo-likelihood of an Ising model characterized by parameters $a = (a_1,\dots,a_{p_n})$ and $b$. Then
\[
\sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma) \ge 1,
\]
for any $a$ and $b$.

We provide numerical evidence for Proposition 1 (see Table 3.2).

Parameters        a = 0.2 · 1, b = 0.01    a = −0.2 · 1, b = 0.01
pn = 4                1.0004                   1.0004
pn = 9                1.0011                   1.0011
pn = 16               1.0022                   1.0022

                  a = 1, b = 0.1           a = −1, b = 0.1
pn = 4                1.006                    1.006
pn = 9                1.014                    1.014
pn = 16               1.025                    1.025

                  a = 0.2 · 1, b = 0.5     a = −0.2 · 1, b = 0.5
pn = 4                1.616                    1.616
pn = 9                2.879                    2.879
pn = 16               6.941                    6.941

                  a = 0.5 · 1, b = 0.5     a = −0.3 · 1, b = 0.5
pn = 4                1.417                    1.562
pn = 9                2.048                    2.635
pn = 16               3.603                    5.879

Table 3.2: Exact normalizing constants $Z^{\tilde q}(a,b)$ with varying $a$ and $b$.

With $f = 1$ in (3.22), we get
\[
Z^{\tilde q}\log Z^{\tilde q} \le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big).
\tag{3.23}
\]
With $f = n\mathbf{1}(\gamma\ne t)$ in (3.22) and (3.23), we complete the proof of Theorem 5 under Proposition 1:
\[
\begin{aligned}
\int n\mathbf{1}(\gamma\ne t)\,d\tilde Q
&\le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big) - Z^{\tilde q}\log Z^{\tilde q} + Z^{\tilde q}\log\int e^{n\mathbf{1}(\gamma\ne t)}\,d\tilde\Pi(\cdot\mid y)\\
&\le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big) + Z^{\tilde q}\log\int e^{n\mathbf{1}(\gamma\ne t)}\,d\tilde\Pi(\cdot\mid y)\\
&\le o(n) + Z^{\tilde q}\log\big(1 + e^n\,\tilde\pi(\gamma\ne t\mid y)\big),
\end{aligned}
\]
which implies $\tilde Q(\gamma\ne t)\to 0$.

CHAPTER 4
DISCUSSION AND FUTURE RESEARCH

4.1 Conclusion

Modeling discrete data is a basic problem in statistics and machine learning. Discrete data are rarely independent, and a fundamental modeling task is to model the dependencies among variables. A graphical model is a flexible tool for modeling such dependent discrete data. In this dissertation, we focused on a well-known graphical model, the Ising model. We provided a procedure for Ising model parameter estimation using variational Bayes methods. In order to tackle the issue of the intractable normalizing constant, we employed a pseudo-likelihood and placed it wherever the true likelihood was needed. We suggested two variational family choices and developed variational Bayes algorithms for each family of distributions under the pseudo-likelihood. In a variety of numerical studies, we compared our VB methods with two existing methods, PMLE and an MCMC-based method. Notably, the simulation results demonstrated the superiority of our VB methods in terms of accuracy and computational cost. In addition, we found that a two-parameter Ising model is suitable for characterizing network data. Using the Facebook example, we applied the estimation procedures to characterize the overall strength of interaction and the external influence of the network. This thesis also provides a theoretical justification of the VB algorithm under the mean-field family. We showed that the variational posterior is consistent as the data size increases, under three mild conditions on the coupling matrix $A_n$.
Specifically, we established that the variational posterior concentrates around shrinking neighborhoods of the true parameter, and we derived the rate of contraction for the variational posterior using a suitable bound on the Kullback-Leibler divergence between the variational and the true posterior.

In addition to Ising model parameter estimation, we developed a variable selection technique for a high-dimensional setup in which the feature space is structurally dependent. We captured the structural dependencies using a pseudo-Ising prior on latent binary variables and a pseudo-Ising variational distribution with multiple threshold parameters in the variational family, which enables us to perform variable selection. Through numerical experiments and some theoretical results, we validated the efficacy of the variable selection VB algorithm.

4.2 Directions for future research

Multiple observations from an Ising model: While we assumed that only one observation of x is available for inference on the two-parameter Ising model, another group of previous studies assumed that multiple observations of x are available. With i.i.d. copies of x, we would be able to develop a procedure to estimate the structure of the underlying graph, i.e., to estimate all the edges in the graph. Moreover, multiple observations would enable inference on multi-parameter Ising models beyond a single interaction parameter and a single threshold parameter.

Multivariate version of an Ising model: The Potts model is a versatile graphical model for discrete data that naturally extends the Ising model. Consider an undirected graph. Without restricting to binary variables, each node of the graph represents a categorical variable such as blood type, hair color, education level, etc. Variational Bayes would be a useful tool for inference on a Potts model and its applications.

Multiresponse regression: If the response variable in a regression model is multivariate, we need to account for the additional dependencies among the response variables. A researcher could use a graphical model to capture different types of dependencies in various ways. Also, to approximate more complex posterior distributions, a variational Bayes method would certainly help.

BIBLIOGRAPHY

Anandkumar, A., Tan, V. Y., Huang, F., Willsky, A. S., et al. (2012). High-dimensional structure estimation in ising models: Local separation criterion. The Annals of Statistics, 40(3):1346–1375.

Andersen, M. R., Vehtari, A., Winther, O., and Hansen, L. K. (2017). Bayesian inference for spatio-temporal spike-and-slab priors. The Journal of Machine Learning Research, 18(1):5076–5133.

Andersen, M. R., Winther, O., and Hansen, L. K. (2014). Bayesian inference for structured spike and slab priors. Advances in Neural Information Processing Systems, 27.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U., and Strawn, N. (2013). Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4):1011–1018.

Basak, A. and Mukherjee, S. (2017). Universality of the mean-field for the potts model. Probability Theory and Related Fields, 168(3):557–600.

Bhattacharya, B. B., Mukherjee, S., et al. (2018). Inference in ising models. Bernoulli, 24(1):493–525.

Bhattacharya, S. and Maiti, T. (2021). Statistical foundation of variational bayes neural networks. Neural Networks, 137:151–173.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians.
Journal of the American statistical Association, 112(518):859–877. Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymp- totic theory of independence. Oxford university press. Bresler, G. (2015). Efficiently learning ising models on arbitrary graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 771–782. Brush, S. G. (1967). History of the lenz-ising model. Reviews of modern physics, 39(4):883. Cao, X., Khare, K., and Ghosh, M. (2019). Posterior graph selection and estimation consis- tency for high-dimensional bayesian dag models. The Annals of Statistics, 47(1):319–348. Carbonetto, P. and Stephens, M. (2012). Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73–108. 95 Castillo, I. and Roquain, É. (2020). On spike and slab empirical bayes multiple testing. The Annals of Statistics, 48(5):2548–2574. Chang, C., Kundu, S., and Long, Q. (2018). Scalable bayesian variable selection for struc- tured high-dimensional data. Biometrics, 74(4):1372–1382. Chatterjee, S. and Dembo, A. (2016). Nonlinear large deviations. Advances in Mathematics, 299:396–450. Chatterjee, S. et al. (2007). Estimation in spin glasses: A first step. The Annals of Statistics, 35(5):1931–1946. Comets, F. (1992). On consistency of a class of estimators for exponential families of markov random fields on the lattice. The Annals of Statistics, pages 455–468. Comets, F. and Gidas, B. (1991). Asymptotics of maximum likelihood estimators for the curie-weiss model. The Annals of Statistics, pages 557–578. Fauske, J. (2009). An empirical study of the maximum pseudo-likelihood for discrete markov random fields. Master’s thesis, Institutt for matematiske fag. Gan, L., Narisetty, N. N., and Liang, F. (2019). Bayesian regularization for graphical models with unequal shrinkage. Journal of the American Statistical Association, 114(527):1218– 1231. Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical science, pages 163–185. George, E. I. and McCulloch, R. E. (1993). Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881–889. Geyer, C. J. (1994). Estimating normalizing constants and reweighting mixtures. Ghosal, P., Mukherjee, S., et al. (2020). Joint estimation of parameters in ising model. Annals of Statistics, 48(2):785–810. Ghosh, S., Khare, K., and Michailidis, G. (2018). High-dimensional posterior consistency in bayesian vector autoregressive models. Journal of the American Statistical Association. Gidas, B. (1988). Consistency of maximum likelihood and pseudo-likelihood estimators for gibbs distributions. In Stochastic differential systems, stochastic control theory and applications, pages 129–145. Springer. Guyon, X. and Künsch, H. R. (1992). Asymptotic comparison of estimators in the ising model. In Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, pages 96 177–198. Springer. Halim, S. (2007). Modified ising model for generating binary images. Jurnal Informatika, 8(2):115–118. Haslbeck, J. M., Epskamp, S., Marsman, M., and Waldorp, L. J. (2021). Interpreting the ising model: The input matters. Multivariate behavioral research, 56(2):303–313. Hurn, M. A., Husby, O. K., and Rue, H. (2003). A tutorial on image analysis. Spatial statistics and computational methods, pages 87–141. Ising, E. (1924). 
Beitrag zur theorie des ferro-und paramagnetismus. PhD thesis, Grefe & Tiedemann. Izenman, A. J. (2021). Sampling algorithms for discrete markov random fields and related graphical models. Journal of the American Statistical Association, pages 1–22. Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37(2):183–233. Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Lee, H. K. (2000). Consistency of posterior distributions for neural networks. Neural Net- works, 13(6):629–642. Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data. Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American statistical asso- ciation, 105(491):1202–1214. Li, F., Zhang, T., Wang, Q., Gonzalez, M. Z., Maresh, E. L., Coan, J. A., et al. (2015). Spatial bayesian variable selection and grouping for high-dimensional scalar-on-image regression. The Annals of Applied Statistics, 9(2):687–713. Lokhov, A. Y., Vuffray, M., Misra, S., and Chertkov, M. (2018). Optimal structure and parameter learning of ising models. Science advances, 4(3):e1700791. Martin, R., Mess, R., and Walker, S. G. (2017). Empirical bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, 23(3):1822–1847. 97 Milman, V. and Schechtman, G. (1986). Asymptotic theory of finite-dimensional normed spaces, lecture notes in mathematics 1200. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the american statistical association, 83(404):1023–1032. Molkaraie, M. (2014). An importance sampling algorithm for the ising model with strong couplings. arXiv preprint arXiv:1404.5666. Møller, J., Pettitt, A. N., Reeves, R., and Berthelsen, K. K. (2006). An efficient markov chain monte carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451–458. Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817. Okabayashi, S., Johnson, L., and Geyer, C. J. (2011). Extending pseudo-likelihood for potts models. Statistica Sinica, pages 331–347. Park, J., Jin, I. H., and Schweinberger, M. (2022). Bayesian model selection for high- dimensional ising models, with applications to educational data. Computational Statistics & Data Analysis, 165:107325. Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR. Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010). High-dimensional ising model selection using ℓ1 -regularized logistic regression. The Annals of Statistics, 38(3):1287–1319. Ray, K. and Szabó, B. (2021). Variational bayes for high-dimensional linear regression with sparse priors. Journal of the American Statistical Association, pages 1–12. Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407. Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. 
The Annals of Statistics, 46(1):401–437. Ročková, V. and George, E. I. (2014). Emvs: The em approach to bayesian variable selection. Journal of the American Statistical Association, 109(506):828–846. Ročková, V. and George, E. I. (2016). Fast bayesian factor analysis via automatic rotations to sparsity. Journal of the American Statistical Association, 111(516):1608–1622. Ročková, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American 98 Statistical Association, 113(521):431–444. Sriram, K., Ramamoorthi, R., Ghosh, P., et al. (2013). Posterior consistency of bayesian quantile regression based on the misspecified asymmetric laplace density. Bayesian Anal- ysis, 8(2):479–504. Stanley, H. E. (1971). Phase transitions and critical phenomena, volume 7. Clarendon Press, Oxford. Van Der Vaart, A. and Van Zanten, H. (2011). Information rates of nonparametric gaussian process methods. Journal of Machine Learning Research, 12(6). Wang, Y. and Blei, D. M. (2019). Frequentist consistency of variational bayes. Journal of the American Statistical Association, 114(527):1147–1161. Xi, R., Li, Y., and Hu, Y. (2016). Bayesian quantile regression based on the empirical likelihood with spike and slab priors. Bayesian Analysis, 11(3):821–855. Xue, L., Zou, H., Cai, T., et al. (2012). Nonconcave penalized composite conditional likeli- hood estimation of sparse ising models. The Annals of Statistics, 40(3):1403–1429. Yang, K. and Shen, X. (2017). On the selection consistency of bayesian structured variable selection. Stat, 6(1):131–144. 99