VARIATIONAL BAYES INFERENCE OF ISING MODELS AND THEIR APPLICATIONS By Minwoo Kim A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy 2022

ABSTRACT

VARIATIONAL BAYES INFERENCE OF ISING MODELS AND THEIR APPLICATIONS By Minwoo Kim

Ising models, which originated in statistical physics, have been widely used in modeling spatial data and computer vision problems. However, statistical inference for this model and its application in many practical fields remain challenging due to the intractable nature of the normalizing constant in the likelihood. This dissertation consists of two main themes: (1) parameter estimation for the Ising model and (2) structured variable selection based on the Ising model, both using variational Bayes (VB).

In Chapter 1, we review the background, research questions, and development of the Ising model, variational Bayes, and other statistical concepts. An Ising model deals with a binary random vector in which each component depends on its neighbors. Various versions of the Ising model exist, depending on the parameterization and the neighborhood structure.

In Chapter 2, for the two-parameter Ising model, we describe a novel VB procedure for parameter estimation that is computationally efficient and accurate compared to existing methods. The traditional pseudo maximum likelihood estimate (PMLE) provides accurate results only when the number of neighbors is small. A Bayesian approach based on Markov chain Monte Carlo (MCMC) performs better even with a large number of neighbors, but its computational cost in terms of time is substantial. Accordingly, we propose a VB method with two variational families, a mean-field (MF) Gaussian family and a bivariate normal (BN) family. Extensive simulation studies validate the efficacy of both families. With our VB methods, computing times decrease remarkably without deterioration in accuracy, and in some scenarios the estimates are considerably more accurate. In addition, we demonstrate theoretical properties of the proposed VB method under the MF family. The main theoretical contribution of our work lies in establishing the consistency of the variational posterior for the Ising model with the true likelihood replaced by the pseudo-likelihood. Under certain conditions, we first derive the rates at which the true posterior based on the pseudo-likelihood concentrates around the εn-shrinking neighborhoods of the true parameters. With a suitable bound on the Kullback-Leibler distance between the true and the variational posterior, we next establish the rate of contraction for the variational posterior and demonstrate that the variational posterior also concentrates around εn-shrinking neighborhoods of the true parameter.

In Chapter 3, we propose a Bayesian variable selection technique for a regression setup in which the regression coefficients hold structural dependency. We employ spike and slab priors on the regression coefficients as follows: (i) in order to capture the intrinsic structure, we first consider an Ising prior on latent binary variables; if a latent variable equals one, the corresponding regression coefficient is active, otherwise it is inactive; (ii) employing the spike and slab prior, we put Gaussian priors (slab) on the active coefficients, while the inactive coefficients are zero with probability one (spike).

Copyright by MINWOO KIM 2022

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisors Dr.
Tapabrata Maiti and Dr. Shrijita Bhattacharya for their guidance toward my researches and my life in US. They have always helped me make great progress with their insightful suggestions. I would also like to extend my appreciation to my dissertation committee members, Dr. Yimin Xiao and Dr. Jiliang Tang. Their feedback and comments are really beneficial for my research. I am also grateful to my previous advisor Dr. Chaeyoung Lim and my future advisor Dr. Marc G. Genton for giving me wonderful opportunities and encouraging me all the time. I made a lot of friends and I have never felt lonely because of them. I really appreciate my friends. Last but not least, I would like to express my sincere thanks to my family for their support and concerns. During my five years at Michigan State University, I have learned a lot from the courses and seminars and I am able to apply the knowledge I learned to my research. Thanks to all professors and friends and family, I am incredibly confident in my academic success. I am very much looking forward to my life as a statistician. v TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Ising model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Pseudo-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Variational Bayes (VB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Black box variational inference (BBVI) . . . . . . . . . . . . . . . . . 6 1.3 Adaptive learning rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Variable selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5 Posterior consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 CHAPTER 2 A VARIATIONAL BAYES ALGORITHM AND POSTERIOR CON- SISTENCY FOR TWO-PARAMETER ISING MODEL ESTIMATION 12 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 VB algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 VB algorithm with MF family . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 VB algorithm with BN family . . . . . . . . . . . . . . . . . . . . . . 16 2.3 PMLE and MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 Generating observed data . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.2 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.3 Image reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.6 Extension to multi threshold parameters . . . . . . . . . . . . . . . . . . . . 26 2.7 Posterior consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.7.1 Sketch of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.8 Preliminary notations and Lemmas . . . . . . . . . . . . . . . . . . . . . . . 
33 2.9 Taylor expansion for log-likelihood . . . . . . . . . . . . . . . . . . . . . . . 34 2.10 Technical details of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.11 Technical details of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.12 Technical details of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.13 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.14 Proof of Corollary 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.14.1 Proof of Relation (2.53) . . . . . . . . . . . . . . . . . . . . . . . . . 52 vi CHAPTER 3 BAYESIAN VARIABLE SELECTION IN A STRUCTURED RE- GRESSION MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.1 Model and methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.1.1 ELBO optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.3 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3.1 Li and Zhang (2010)’s Gibbs sampling scheme . . . . . . . . . . . . . 62 3.3.2 Hyper parameter selection . . . . . . . . . . . . . . . . . . . . . . . . 63 3.3.3 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.4 Theoretical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.4.1 True posterior consistency with true Ising prior . . . . . . . . . . . . 73 3.4.2 True posterior consistency with pseudo Ising prior . . . . . . . . . . . 83 3.4.3 Bounded KL divergence . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.4.4 Variational posterior consistency . . . . . . . . . . . . . . . . . . . . 89 CHAPTER 4 DISCUSSION AND FUTURE RESEARCH . . . . . . . . . . . . . . 92 4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.2 Directions for future research . . . . . . . . . . . . . . . . . . . . . . . . . . 93 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 vii LIST OF TABLES Table 2.1: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 21 Table 2.2: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 22 Table 2.3: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 23 Table 2.4: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). . . . . . . . . . . . . . . . . . . . . . . . . 24 Table 2.5: The estimated parameters with standard errors (SE) in parentheses and time costs for the features gender and school. . . . . . . . . . . . . . . . . 24 Table 3.1: Examples of hyper parameter choices . . . . . . . . . . . . . . . . . . . . 63 Table 3.2: Exact normalizing constants with varying a and b. . . . . . . . . . . . . . 91 viii LIST OF FIGURES Figure 1.1: An undirected graph with three nodes . . . . . . . . . . . . . . . . . . . 
3 Figure 2.1: Left: ELBO convergence with two variational families (BN and MF) and S = 20. Right: ELBO convergence with two variational families (BN and MF) and S = 200. Blue lines represent BN family and orange lines represent MF family in each plot. . . . . . . . . . . . . . . . . . . . 20 Figure 2.2: Original images (left) and the estimated images (right). . . . . . . . . . . 25 Figure 2.3: Visualization of Facebook network data where the size of circle repre- sents the degree of the node. . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 2.4: Density plots for the estimated parameters (left: β, right: B) from VB with BN family (red), VB with MF family (green), and MCMC (blue) for the features gender and school. . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.1: Example of the structured regression coefficients. White pixels denote the corresponding βi ’s are zeros and darker pixels denote the corre- sponding βi ’s have larger values. . . . . . . . . . . . . . . . . . . . . . . . 55 Figure 3.2: γ on a circle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Figure 3.3: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the co- variates are independent. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Figure 3.4: ROC curves for the three variable selection methods with the hyper pa- rameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are independent. . . . . . . . . . . . . . . . . . . . . 65 Figure 3.5: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the co- variates are correlated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Figure 3.6: ROC curves for the three variable selection methods with the hyper pa- rameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are correlated. . . . . . . . . . . . . . . . . . . . . . 66 Figure 3.7: γ on a lattice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Figure 3.8: Signal areas in an image . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 ix Figure 3.9: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are independent. . . . . . . . . . . . . . 68 Figure 3.10: ROC curves in scenario 2 for the two VB methods with the hyper param- eter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are independent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 3.11: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are correlated. . . . . . . . . . . . . . . 70 Figure 3.12: ROC curves in scenario 2 for the two VB methods with the hyper param- eter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are correlated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 x LIST OF ALGORITHMS Algorithm 1.1: Adam learning rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Algorithm 2.1: Black box variational inference (BBVI) . . . . . . . . . . . . . . . . . 
15 xi CHAPTER 1 INTRODUCTION In this Chapter, we briefly introduce basic concepts without complete details related to our work, Ising model and its pseudo-likelihood (Section 1.1), variational Bayes (Section 1.2), Adam learning rates (Section 1.3), variable selection (Section 1.4), and posterior consistency (Section 1.5). In addition, previous studies that have produced significant results in those fields are introduced. 1.1 Ising model A popular way of modeling a binary vector x = (x1 , . . . , xn )⊤ , xi ∈ {−1, 1}, in which elements are pairwise dependent is to take advantage of Ising model named after the physicist Ernst Ising Ising (1924) which has been used in a wide range of applications including spatial data analysis and computer vision. Many different versions of Ising model have emerged in the literature. Among them, the probability mass function of a general Ising model is of the form: n ! 1 X P (n) (X = x) = exp x⊤ Kn x + Bi xi , Zn (Kn , B1 , . . . , Bn ) i=1 where Kn ∈ Rn×n and (B1 , . . . , Bn ) ∈ Rn are model parameters and Zn (Kn , B1 , . . . , Bn ) is the normalizing constant that makes the sum of the probability mass function over all possible combinations of x equal to 1: n ! X X Zn (Kn , B1 , . . . , Bn ) = exp x⊤ Kn x + Bi xi . x∈{−1,1}n i=1 The general Ising model can be reduced to two-parameter Ising model, assuming that all nonzero entries of Kn take the same value and Bi = B for all i, which has an inverse temperature parameter β > 0 (also known as interaction parameter) and a magnetization parameter B ̸= 0 (also known as threshold parameter). With a specified symmetric coupling 1 matrix An ∈ Rn×n , a probability mass function of two-parameter Ising model is: n ! (n) 1 β X Pβ,B (X = x) = exp x⊤ A n x + B xi , (1.1) Zn (β, B) 2 i=1 In the two-parameter Ising model, β characterizes the strength of interactions among xi ’s and B represents external influence on x. In the first place, Ising model has been introduced for the relations between atom spins with the domain {−1, 1}n (Brush, 1967). While we work with the domain {−1, 1}n , in many current applications, Ising model has been defined with another domain {0, 1}n . One can read Haslbeck et al. (2021) for more details on two different domains. An Ising model is usually represented by an undirected graph. Consider an undirected graph which has n vertices (nodes) xi , i = 1, . . . , n. Each vertex of the graph takes a value either -1 or 1, i.e., xi ∈ {−1, 1}, and let E = {(i, j) | i ∼ j, 1 ≤ i, j ≤ n} represent the set of edges in the graph where i ∼ j denotes that the vertices i and j are connected. Then, one common choice of the coupling matrix An is a scaled adjacency matrix of the underlying graph whose all diagonal elements are zeros and the other elements are non-negative: Definition 1 (Scaled adjacency matrix). A scaled adjacency matrix for a graph Gn with n vertices is defined as:  n    2|Gn | if (i, j) ∈ E An (i, j) := ,  0  otherwise. where |Gn | denotes the number of edges in the graph Gn . For a simple example, consider three nodes (x1 , x2 , x3 ) and two edges between x1 and x2 , and between x2 and x3 as in Figure 1.1. Then, n = 3, |Gn | = 2, and the scaled adjacency matrix is:    0 0.75 0    An =   0.75 0 0.75 .    0 0.75 0 2 Figure 1.1: An undirected graph with three nodes 1.1.1 Pseudo-likelihood One of the largest challenges in using Ising model is the unknown normalizing constant Zn (β, B) in the likelihood (1.1): n ! 
X β ⊤ X Zn (β, B) = exp x An x + B xi 2 x∈{−1,1}n i=1 One can notice that the exact calculation of the normalizing constant involves sum of 2n terms, which is available only for small n. Due to the intractable nature of the normalizing constant, standard statistical methodologies based on the true likelihood are infeasible. One way to approximate the normalizing constant is importance sampling, see Geyer (1994); Gelman and Meng (1998); Molkaraie (2014). Another approach to handling the normalizing constant is to use a pseudo-likelihood. The conditional probability of xi is easily calculated because xi is binary: (n)  eβmi (x)+B Pβ,B ̸ i = βmi (x)+B Xi = 1|Xj , j = , e + e−βmi (x)−B 3 where mi (x) = An (i, j)xj . The pseudo-likelihood of Ising model corresponding to the Pn j=1 true likelihood in (1.1), is defined as the product of one dimensional conditional distributions: Yn (n) Pβ,B (Xi = xi | Xj , j ̸= i) i=1 n ! X = 2−n exp (βxi mi (x) + Bxi − log cosh(βmi (x) + B)) . (1.2) i=1 Fauske (2009) is an example on empirical study using pseudo-likelihood and Ghosal et al. (2020) provides theoretical justification on use of pseudo-likelihood. For v, w ∈ [−1, 1]n , defining a function g as: n X 1 + vi 1 + wi 1 − vi 1 − wi g(v, w) = log + log , i=1 2 2 2 2 we point out that the pseudo-likelihood can be written as: Yn (n) Pβ,B (Xi = xi | Xj , j ̸= i) = eg(x,b(x)) , (1.3) i=1 where b(x) = (b1 (x; θ), · · · , bn (x; θ))⊤ and bi (x) = E(Xi | Xj , j ̸= i) = tanh(βmi (x) + B), i = 1, . . . , n. 1.2 Variational Bayes (VB) Let θ be the set of parameters of interest with a prior distribution p(θ). In a Bayesian inference, a typical objective is to obtain posterior distribution given data D. We can derive the exact posterior distribution π(θ | D) using Bayes’ theorem: p(θ)p(D | θ) π(θ | D) = R , (1.4) θ p(θ)p(D | θ)dθ where p(D | θ) is a likelihood given parameter θ. The exact posterior in (1.4), however, is not typically available except for a few well-known examples. To get an approximated posterior, many statisticians have widely used sampling based Markov chain Monte Carlo (MCMC) methods but it is hardly scalable to high-dimensional cases. Beyond sampling methods, 4 variational Bayes (VB) also called variational inference or variational approximation (Jordan et al., 1999) has been popularized as an efficient alternative to MCMC. VB recasts the sampling problem as an optimization problem minimizing Kullback-Leibler (KL) divergence between a surrogate distribution (called a variational distribution) and the true posterior distribution (Blei et al., 2017). Definition 2 (Kullback-Leibler (KL) divergence). For two probability measures P1 and P2 over a set Ω, the KL divergence between P1 and P2 is defined as: KL (P1 , P2 ) = Ep1 (log p1 − log p2 ) (1.5) Z   p1 (ω) = log dP1 (1.6) Ω p2 (ω) where p1 and p2 are corresponding densities to P1 and P2 respectively. As the first step in building a VB algorithm, we need to define a family of distributions (called variational family) denoted by Q which contains candidates of the best approximation to the true posterior (1.4): Q = {q(θ; ν) : q is a probability density function that can be easily handled.}, where ν is a set of parameters (called variational parameters) that characterize variational distributions. For instance, if Q is a Gaussian family, then ν includes mean (µ) and stan- dard deviation (σ) of a Gaussian distribution. 
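Before developing the variational machinery further, the following minimal NumPy sketch (illustrative only; the helper names are ours) makes Definition 1 and the pseudo-likelihood (1.2) of Section 1.1 concrete: it builds the scaled adjacency matrix for the three-node graph of Figure 1.1 and evaluates the log pseudo-likelihood at a given (β, B).

```python
import numpy as np

def scaled_adjacency(edges, n):
    """Scaled adjacency matrix A_n(i, j) = n / (2|G_n|) for (i, j) in E (Definition 1)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return n / (2.0 * len(edges)) * A

def log_pseudo_likelihood(x, A, beta, B):
    """Log of the pseudo-likelihood (1.2) for the two-parameter Ising model."""
    m = A @ x  # m_i(x) = sum_j A_n(i, j) x_j
    return np.sum(beta * x * m + B * x - np.log(np.cosh(beta * m + B))) - len(x) * np.log(2.0)

# Three-node chain of Figure 1.1: edges (1,2) and (2,3), zero-indexed here.
A3 = scaled_adjacency([(0, 1), (1, 2)], n=3)   # off-diagonal entries equal 3/(2*2) = 0.75
x = np.array([1.0, -1.0, 1.0])
print(A3)
print(log_pseudo_likelihood(x, A3, beta=0.5, B=0.1))
```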
After an appropriate variational family is chosen, VB seeks the best surrogate function (called variational posterior) by minimizing KL divergence with the true posterior: q ∗ = arg min KL (Q, Π (· | D)) . (1.7) q∈Q Observe that: KL (Q, Π (· | D)) = Eq (log q(θ) − log π (θ | D)) = −Eq (log p (θ, D) − log q(θ)) + log m (D) , (1.8) 5 where m (D) is the marginal distribution of data which does not depend on θ. So, we can find the minimizer of KL divergence by maximizing Eq (log p (θ, D) − log q(θ)) which is called evidence lower bound (ELBO). 1.2.1 Black box variational inference (BBVI) Black box variational inference (BBVI) is an stochastic gradient optimization technique suggested by Ranganath et al. (2014) to maximize ELBO using unbiased gradients. Consider the ELBO as a function of variational parameters ν: L(ν) = Eq (log p (θ, D) − log q(θ)) . (1.9) In each BBVI iteration, a variational parameter ν ∈ ν is updated in the direction that the objective function L(ν) increases as follows: ν (t+1) ← ν (t) + ηt ∇ν L(ν), (1.10) where ν (t) is the variational parameter at t-th iteration and ηt ’s, t = 1, 2, . . . , are learning rates which satisfy Robbin-Monro conditions (Robbins and Monro, 1951): X∞ X∞ ηt = ∞ and ηt2 < ∞. t=1 t=1 A closed form of the gradient ∇ν L(ν) is not always available. Ranganath et al. (2014) proposed an unbiased Monte Carlo estimate. Observe that: Z ∇ν L(ν) = ∇ν q(θ) (log p (θ, D) − log q(θ)) dθ Z Z = ∇ν q(θ) (log p (θ, D) − log q(θ)) dθ − q(θ)∇ν log q(θ)dθ Z = ∇ν q(θ) (log p (θ, D) − log q(θ)) dθ Z = q(θ)∇ν log q(θ) (log p (θ, D) − log q(θ)) dθ = Eq (∇ν log q(θ) (log p (θ, D) − log q(θ))) (1.11) 6 The third equality is the fact that the expectation of a score function is zero. From the last expectation form in (1.11), we can induce a Monte Carlo estimate as follow: S 1X \ ∇ν L(ν) = ∇ν log q(θs ) (log p (θs , D) − log q(θs )) , (1.12) S s=1 where θs is a draw from the current q(θ; ν). Also, we define an empirical ELBO as the Monte Carlo estimate: S 1X [ L(ν) = (log p (θs , D) − log q(θs )) , (1.13) S s=1 Replacing ∇ν L(ν) in (1.10) with the unbiased estimate in (1.12), even though the closed forms of ELBO and its gradients are not exactly computable, we can update variational parameters until the empirical ELBO (1.13) converges. 1.3 Adaptive learning rates To optimize the objective function (1.9), appropriate learning rates ηt in (1.10) can vary widely between variational parameters. Instead of a single learning rate, one can use an adaptive learning rate method. Kingma and Ba (2014) proposed an algorithm for first-order gradient-based optimization based on adaptive estimates of lower-order moments named Adam. Adam computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients. Algorithm 1.1 outlines the Adam algorithm. Adam algorithm updates exponential moving averages of the gradient (mt ) and the squared gradient (ut ) where α1 , α2 ∈ [0, 1) control the exponential decay rates of these mov- ing averages. Kingma and Ba (2014) suggested good default settings α0 = 0.001, α1 = 0.9, α0 = 0.999, and ϵ = 10−8 . We use Adam learning rates in Chapter 3 for our variable selection algorithm. 7 Algorithm 1.1 Adam learning rates. Initialize: α0 : Stepsize Initialize: α1 , α2 ∈ [0, 1): Exponential decay rates for the moment estimates Initialize: The hyper parameters m0 ← 0 and u0 ← 0 1: while ELBO increases  do 2: Draw θ ∼ q θ; ν , s = 1, . . . 
, S; (s) (t) 3: Get ∇\ ν L(ν) based on the  S samplepoints; 4: mt ← α1 mt−1 + (1 − α1 ) · ∇\ ν L(ν) ;  2 5: ut ← α2 ut−1 + (1 − α2 ) · ∇\ ν L(ν) ; 6: m̂t ← mt /(1 − α1t ), where α1t denotes α1 to the power of t; 7: ût ← ut /(1 − α2t ), where α2t denotes α2 to the power of t;  √  8: Update the parameter of interest: ν ← ν (t) (t−1) + α0 · m̂t / û + ϵ ; 9: end while Output: Optimal variational parameters ν ∗ 1.4 Variable selection Statistical methodologies are well-established if the set of variables to consider is fixed and small. However, in high-dimensional setup in which the number of covariates (p) are much larger than the number of observations (n), classical models do not perform well, which leads statisticians to develop various variable selection methods. Variable selection in statistics means selecting among many variables which to include in a statistical model. From a total list of variables, significant variables are selected by removing irrelevant or redundant variables. In this section, we introduce a Bayesian variable selection method based on a spike and slab prior which was first suggested by Mitchell and Beauchamp (1988). In a linear regression model, consider a sparse vector of the regression coefficients β = (β1 , . . . , βp )⊤ ∈ Rpn : y = Xβ + e, (1.14) where p > n, y ∈ Rn , X ∈ Rn×p , and e ∈ Rn . We assume that only a few number of βi is nonzero and define an activation set of nonzero coefficients: A := {βi : βi ̸= 0}. (1.15) 8 To select explanatory variables corresponding to the nonzero coefficients, it is desirable that a tall and narrow function around zero is assigned to βi ’s in Ac . We call the tall and narrow function a "Spike" distribution. Whereas, for βi ’s in A, we use a flatter and diffused function called a "Slab" distribution. Plus, given that it is not known a priori which covariates should be included in a model, we introduce a latent binary vector γ = (γ1 , . . . , γp )⊤ ∈ {−1, 1}pn . If γi = −1, a spike distribution is used as a prior of βi . If γi = 1, for a prior of βi , a slab distribution is used as follows: p Y p(β | γ) = p(βi | γi ), i=1 1 − γi 1 + γi βi | γi ∼ f1 (βi ) + f2 (βi ), 2 2 where f1 and f2 denote a spike distribution and a slab distribution respectively. One simple choice of a prior distribution of γ is independent Bernoulli: p Y (1+γi )/2 p(γ) = ϕi (1 − ϕi )(1−γi )/2 , (1.16) i=1 where ϕi is the probability that γi = 1. Various versions of spike and slab priors have emerged in the past literature. Mitchell and Beauchamp (1988) used a mixture of a point mass at zero (spike) and a diffuse uniform distribution (slab). With a point mass spike distribution, many previous studies used a Gaussian slab distribution which include but not limited to Li and Zhang (2010), Andersen et al. (2014), Xi et al. (2016) and Andersen et al. (2017). Another group of previous researches which includes but not limited to Johnstone and Silverman (2004), Castillo and Roquain (2020), and Ray and Szabó (2021) used a Laplace (double exponential) slab with a point mass spike. Two continuous distributions have been also considered as spike and slab distributions. George and McCulloch (1993) used a mixture Gaussian prior as follows: 1 − γi 1 + γi βi | γi ∼ N (0, τi2 ) + N (0, c2i τi2 ), 2 2 where ci is set to be large. Park et al. (2022) used similar mixture Gaussian priors with applications to educational data. 
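As a small illustration of the spike-and-slab construction (using the independent Bernoulli prior (1.16) on γ and the continuous Gaussian mixture of George and McCulloch (1993); the hyperparameter values below are arbitrary placeholders), one can draw (γ, β) as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_spike_slab(p, phi=0.1, tau=0.01, c=100.0):
    """Draw gamma_i in {-1, 1} from (1.16) and beta_i | gamma_i from the
    Gaussian spike/slab mixture of George and McCulloch (1993)."""
    gamma = np.where(rng.random(p) < phi, 1, -1)   # gamma_i = 1 with probability phi
    sd = np.where(gamma == 1, c * tau, tau)        # slab sd (c * tau) vs. spike sd (tau)
    beta = rng.normal(0.0, sd)
    return gamma, beta

gamma, beta = draw_spike_slab(p=20)
print("active coordinates:", np.where(gamma == 1)[0])
```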
In a series of studies Ročková and George (2016), Ročková 9 (2018), and Ročková and George (2018), they have developed spike and slab LASSO priors using two Laplace distributions as follows: p     Y 1 − γi λ0 1 + γi λ1 p(β | γ) = exp (−λ0 |βi |) + exp (−λ1 |βi |) , i=1 2 2 2 2 where λ0 is a large scale parameter and λ1 is a smaller scale parameter. Under the spike and slab LASSO priors, Gan et al. (2019) proposed an approach for precision matrix estima- tion called BAGUS, short for “Bayesian regularization for Graphical models with Unequal Shrinkage”. We can simply use the independent Bernoulli prior distributions on γ in (1.16) assum- ing that there is no inter-dependence between covariates or between regression coefficients. Dependent data, however, are now routinely analyzed. Some previous researches have used combination of an Ising model and a spike and slab prior to facilitate the catch of depen- dence. Li and Zhang (2010) considered Ising prior on γ and suggested a MCMC method with known structure among the covariates. Li et al. (2015) proposed a joint Ising and DiriChlet process for grouping and selecting significant voxels in functional magnetic reso- nance imaging (fMRI) data. In Chapter 3, using a spike and slab prior with Ising model on γ, we will describe a VB algorithm for simultaneously selecting significant explanatory variables and estimating the regression coefficients when there exists structural dependence among the regression coefficients. 1.5 Posterior consistency Establishing posterior consistency with contraction rates of a statistical method has been a fundamental research topic for a Bayesian to provide theoretical justification. In an esti- mation problem, the basic idea of posterior consistency is that the posterior distribution is concentrated around the true parameter. Many previous studies have constructed a theoret- ical basis for Bayesian methodology by dealing with posterior consistency. For example, in a high-dimensional linear regression setup, the posterior consistency of the regression coeffi- cients with popular shrinkage priors under mild conditions was proved. More specifically, in 10 the high-dimensional linear model, y = Xβ + e, posterior distribution satisfies: π (β : ||β − β 0 || > ϵ | y) → 0, where ϵ is any positive constant and β 0 are the true regression coefficients (See Armagan et al. (2013) for detailed statements and proof). As other examples, Sriram et al. (2013) provided a justification to use of (misspecified) asymmetric Laplace density for the response in Bayesian quantile regression by proving posterior consistency. More recently, Ghosh et al. (2018) considered a VAR model with two priors for the coefficient matrix and showed pos- terior consistency. Cao et al. (2019), for a covariance estimation and selection problem, established strong graph selection consistency and posterior convergence rates for estima- tion using Gaussian directed acyclic graph model. For VB, there are a few theoretical results. Wang and Blei (2019) established frequentist consistency and asymptotic normality of VB methods by proving a variational Bernstein–von Mises theorem. Bhattacharya and Maiti (2021) established the mean-field variational posterior consistency for a feed-forward artificial neural network model. 
In our two-parameter Ising model estimation problem, we say the posterior is consistent if   (n) π θ ∈ Nnc | X (n) → 0 in P0 probability, (n) where Nn is a shrinking neighborhood of the true parameter (β0 , B0 ), P0 is the distribution induced by the true likelihood (1.1) with (β0 , B0 ), and X (n) is a sample vector from the Ising model with (β0 , B0 ). Analogous to the true posterior, we say the variational posterior is consistent if (n) q ∗ (θ ∈ Nnc ) → 0 in P0 probability. In Chapter 3, we derive variational posterior consistency with contraction rates obtained under the pseudo-likelihood and Gaussian mean-field variational family. 11 CHAPTER 2 A VARIATIONAL BAYES ALGORITHM AND POSTERIOR CONSISTENCY FOR TWO-PARAMETER ISING MODEL ESTIMATION In this chapter, we describe a VB algorithm for Ising model estimation under pseudo- likelihood along with numerical studies for assessing performances. 2.1 Introduction Estimation of Ising model parameters has received considerable attention in statistics and computer science literature. The existing literature can be broadly divided into two groups. Some literature assume that i.i.d. (independently and identically distributed) copies of data are available for inference; see Anandkumar et al. (2012), Bresler (2015), Lokhov et al. (2018), Ravikumar et al. (2010), and Xue et al. (2012). Another category of literature assumes that only one sample is observable; see Bhattacharya et al. (2018), Chatterjee et al. (2007), Comets (1992), Comets and Gidas (1991), Ghosal et al. (2020), Gidas (1988), and Guyon and Künsch (1992). In this dissertation, using variational Bayes (VB), we provide a new Bayesian methodology for model parameter estimation when one observes the data only once. Under the assumption of only one observation, Comets and Gidas (1991) showed that the MLE of β > 0 for Curie-Weiss model is consistent if B ̸= 0 is known, and vice versa. They also proved that the joint MLE does not exist when neither β nor B is given. In this regard, Ghosal et al. (2020) addressed joint estimation of (β, B) using pseudo-likelihood and showed that the pseudo-likelihood estimator is consistent under some conditions on coupling matrix An . We also assume only one observation of x and provide a variational Bayes algorithm for model parameter estimation with its posterior consistency. One of the main challenges in the Bayesian estimation of Ising models lies in the in- tractable nature of the normalizing constant in the likelihood. Following the works of Ghosal et al. (2020), Bhattacharya et al. (2018) and Okabayashi et al. (2011), we replace the true 12 likelihood of the Ising model by a pseudo-likelihood and we establish that the posterior based on the pseudo-likelihood is consistent for a suitable choice of the prior distribution. Further, we use variational Bayes (VB) approach which has recently become a popular and compu- tationally powerful alternative to MCMC. In order to approximate the unknown posterior distribution using VB, we propose a Gaussian mean field family and general bivariate nor- mal family with transformation of the parameters to (log β, B). For implementation of VB, we consider a black box variational inference (BBVI), Ranganath et al. (2014). In BBVI, we need to be able to evaluate the likelihood to compute the gradient estimates, but the existence of an unknown normalizing constant in likelihood of Ising model prevents us using BBVI directly. So, as mentioned above, we use pseudo-likelihood as in Ghosal et al. (2020). 
Replacing the true likelihood of Ising model with pseudo-likelihood, we are able to compute all the quantities needed for implementing BBVI. Our VB algorithm based on optimiza- tion is computationally more powerful than the sampling based MCMC methods (Møller et al., 2006). Also, use of PyTorch’s automatic differentiation enables us to further reduce computational costs. 2.2 VB algorithm Let θ = (β, B) be the parameter set in a two-parameter Ising model. To develop vari- ational Bayes algorithm, we consider the following independent prior distribution p(θ) = pβ (β)pB (B), with pβ (β) as a log-normal prior for β and pB (B) as a normal prior for B as follows: 1 (log β)2 1 B2 pβ (β) = √ e− 2 , pB (B) = √ e− 2 . (2.1) β 2π 2π The assumption of log-normal prior on β is to ensure the positivity of β. Let L(θ) be the pseudo-likelihood function given by (1.2), then the above prior structure leads to the following posterior distribution R R π(θ, X (n) )dθ L(θ)p(θ)dθ Π(A | X (n) )= A = RA , (2.2) m (X (n) ) L(θ)p(θ)dθ 13 for any set A ⊆ Θ where Θ denotes the parameter space of θ. Note, π(θ, X (n) ) is the joint density of θ and the data X (n) and m X (n) = L(θ)p(θ)dθ is the marginal density of X (n)  R which is free from the parameter set θ. Next, we provide a variational approximation to the posterior distribution (2.2) considering two choices of the variational family in order to obtain approximated posterior distribution (variational posterior). One candidate of our variational family, for the virtue of simplicity, is a mean-field (MF) Gaussian family as follows:   Q MF 2 = q(θ) | q(θ) = qβ (β)qB (B), log β ∼ N (µ1 , σ1 ), B ∼ N (µ2 , σ2 ) . 2 (2.3) The above variational family is the same as a lognormal distribution on β and normal distribution on B. Also, we point out that β and B are independent in QMF and each q(θ) ∈ QMF is governed by its own parameter set, ν MF = (µ1 , µ2 , σ12 , σ22 )⊤ . ν MF denotes the set of variational parameters which will be updated to find the optimal variational distribution closest to the true posterior (2.2). In addition, we suggest a bivariate normal (BN) family to exploit the interdependence among the parameters (β, B) as follows: ( ) Q BN = q(θ) | q(θ) = q(β, B), (log β, B) ∼ M V N (µ, Σ) , (2.4)     µ1  σ11 σ12  where µ =   and Σ =  . µ2 σ12 σ22 QMF can also be represented as (independent) bivariate normal family. The variational parameters of BN family are ν BN = (µ1 , µ2 , σ11 , σ22 , σ12 )⊤ . Once a variational family is se- lected, one can find the variational posterior by maxiimizing the ELBO between a variational distribution q ∈ Q and the true posterior (2.2). Recall the updates in (1.10) and the Monte Carlo estimates of the gradients in (1.12): ν (t+1) ← ν (t) + ηt ∇\ν L(ν), S \ 1X ∇ν L(ν) = ∇ν log q(θs ) (log p (θs , D) − log q(θs )) . S s=1 14 Starting with initial values ν (0) , we update the variational parameters in the direction of increasing ELBO using BBVI. The summary of BBVI algorithm is shown in Algorithm 2.1. Algorithm 2.1 Black box variational inference (BBVI) Initialize: p(θ), q(θ; ν (0) ) and learning rate sequence ηt . 1: while ELBO increases  do 2: Draw θ ∼ q θ; ν , s = 1, . . . , S; (s) (t) 3: Get ∇\ ν L(ν) based on the S sample points; 4: Update ν (t+1) ← ν (t) + ηt ∇\ν L(ν); 5: end while Output: Optimal variational parameters ν ∗ In the next subsection, we discuss more details of implementing BBVI with the mean-field family (2.3). 
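For concreteness, and ahead of the detailed gradient formulas in the next two subsections, the sketch below implements one pass of Algorithm 2.1 under the MF family (2.3) with the priors (2.1), using the score-function gradient estimate (1.12) together with the pseudo-likelihood. It is a simplified NumPy stand-in for the PyTorch implementation mentioned in the text: only µ1 and µ2 are updated, with a plain learning rate in place of Adam, and additive constants are dropped from the log densities.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pseudo_lik(x, A, beta, B):
    """Log pseudo-likelihood (1.2)."""
    m = A @ x
    return np.sum(beta * x * m + B * x - np.log(np.cosh(beta * m + B))) - len(x) * np.log(2.0)

def log_prior(beta, B):
    """Log of the independent log-normal / normal prior (2.1), constants dropped."""
    return -np.log(beta) - 0.5 * np.log(beta) ** 2 - 0.5 * B ** 2

def log_q(z, mu, sigma):
    """Log density of the MF family (2.3) at theta = (exp(z[0]), z[1]), constants dropped."""
    return -0.5 * np.sum((z - mu) ** 2 / sigma ** 2) - np.sum(np.log(sigma)) - z[0]

def bbvi_step_mu(x, A, mu, sigma, eta=0.01, S=20):
    """One BBVI update (1.10) of (mu1, mu2); sigma1 and sigma2 are held fixed for brevity."""
    grad = np.zeros(2)
    for _ in range(S):
        z = rng.normal(mu, sigma)                    # z = (log beta, B) drawn from q
        beta, B = np.exp(z[0]), z[1]
        weight = log_pseudo_lik(x, A, beta, B) + log_prior(beta, B) - log_q(z, mu, sigma)
        grad += (z - mu) / sigma ** 2 * weight       # score-function estimate (1.12)
    return mu + eta * grad / S
```

A complete implementation would also update σ1 and σ2 through the softplus reparametrization described in the next subsection and stop once the empirical ELBO (1.13) stabilizes.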
2.2.1 VB algorithm with MF family In the mean-field family, we have four variational parameters (µ1 , µ2 , σ1 , σ2 ). For each vari- ational parameter ν, we should compute ∇ν log q(θ) to evaluate the Monte Carlo gradients. Although one can simply use PyTorch’s automatic differentiation without manual calcula- tion, it is worth manually calculating the gradients to fully understand a BBVI algorithm. First, for µi , i = 1, 2, the gradients are: log β − µ1 ∇µ1 log q(θ; ν) = , σ12 B − µ2 ∇µ2 log q(θ; ν) = . σ22 For σi , i = 1, 2, we need to be more careful because σi must be always positive. During the updates, it may occur that σi takes a negative value. In order to preclude this issue, ′ we consider a reparametrization σi = log 1 + eσi and update the quantity σi′ , as a free 15 parameter, instead of updating σi . Then, ′ eσ1 (log β − µ1 )2   1 ∇σ1′ log q(θ; ν) = ′ − , 1 + eσ1 σ13 σ1 ′ eσ2 (B − µ2 )2   1 ∇σ2′ log q(θ; ν) = ′ − . 1 + eσ2 σ23 σ2 In the next subsection, we discuss more details with the bivariate normal family (2.4). 2.2.2 VB algorithm with BN family In addition to the positivity condition on some variational parameters, we should control postive definiteness of the covariance matirx Σ in (2.4). To guarantee the positive defi- niteness, we use Cholesky decomposition such that Σ = LLT , where L = ll11 0 . We  12 l22 update the elements of the lower triangular matrix (l11 , l12 , l22 ) in place of directly updating (σ11 , σ12 , σ22 ) to avoid the cases of negative definite Σ. Computation of ∇ν log q(θ; ν) with BN family is as follows: 2  2  1 (l22 + l12 )(log β − µ1 ) l12 (B − µ2 ) ∇µ1 log q(θ; ν) = 2 − , l22 l11 l11   1 l12 (log β − µ1 ) ∇µ2 log q(θ; ν) = 2 B − µ2 − , l22 l11 ∇l11 ′ log q(θ; ν) ′ el11 2 )(log β − µ1 )2 l12 (log β − µ1 )(B − µ2 )   2  1 1 (l22 + l12 = ′ − + 2 3 − 2 , 1 + el11 l11 l22 l11 l11 ∇l22 ′ log q(θ; ν) ′ el22  2 2 )(log β − µ1 )2 2l12 (log β − µ1 )(B − µ2 )  1 (l22 + l12 = ′ 2 2 − l22 (1 + el22 ) l22 l11 l11  2  2 ! (B − µ2 )) log β − µ1 + − −1 , l22 l11 1 (log β − µ1 )(B − µ2 ) l12 (log β − µ1 )2   ∇l12 log q(θ; ν) = 2 − 2 , l22 l11 l11 ′  where lii = log 1 + elii for i = 1, 2. 16 2.3 PMLE and MCMC We compare our VB algorithm with two other methods, a PMLE method (Ghosal et al., 2020) and a MCMC based method (Møller et al., 2006). In thid section, we briefly introduce the two competitors. PMLE: Let h(β, B) denote the pseudo-likelihood in (1.2). Ghosal et al. (2020) used grid search to find pseudo maximum likelihood estimate (PMLE) for Ising parameters which simultaneously satisfies ∂ ∂β log h(β, B) = 0 and ∂ ∂B log h(β, B) = 0. n ∂ X log h(β, B) = mi (x) (xi − tanh (βmi (x) + B)) = 0, ∂β i=1 n ∂ X log h(β, B) = (xi − tanh (βmi (x) + B)) = 0. ∂B i=1 We create a grid such that the search space for β contains all points from 0.1 to 2 in increments of 0.01 and the search space for B increases from -1 to 1 by 0.01. MCMC: Let p(θ) be a prior distribution of θ and p(x | θ) be the true likelihood in (1.1). The closed form of the true posterior (1.4) is not available because the integral in the denominator is not analytically tractable. Instead, one can use a sampling method. Consider a Metropolis-Hastings ratio given by: p(θ′ )fθ′ (x)u(θ | θ′ ) Zn (θ) M H(θ′ | θ) = × , (2.5) p(θ)fθ (x)u(θ′ | θ) Zn (θ′ ) where u(θ′ | θ) is the proposal density and fθ (x) is an unnormalized density of Ising model, i.e, p(θ | x) = Zn (θ)−1 fθ (x). 
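The grid search described above is straightforward to code; the sketch below (our own helper, not the authors' implementation) maximizes log h(β, B) over the stated grid, which locates the same point as solving the two score equations up to the grid resolution.

```python
import numpy as np

def pmle_grid(x, A):
    """Pseudo maximum likelihood estimate by grid search, with the grid of the text:
    beta from 0.10 to 2.00 and B from -1.00 to 1.00, both in steps of 0.01."""
    m = A @ x
    beta_grid = np.arange(0.10, 2.00 + 1e-9, 0.01)
    B_grid = np.arange(-1.00, 1.00 + 1e-9, 0.01)
    vals = np.array([[np.sum(b * x * m + B * x - np.log(np.cosh(b * m + B)))
                      for B in B_grid] for b in beta_grid])
    i, j = np.unravel_index(np.argmax(vals), vals.shape)
    return beta_grid[i], B_grid[j]
```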
Obtaining the Metropolis-Hastings ratio, however, is still lim- ited because we cannot compute the normalizing constant in p(x | θ) even with a moderate amount of n. In order to remove the ratio of normalizing constants in (2.5), Møller et al. (2006) proposed an alternative approach using an auxiliary variable z with conditional dis- tribution g(z | θ, x). Targeting π(θ, z | x) ∝ p(θ)g(z | θ, x) Zn1(θ) fθ (x) instead of π(θ | x) and taking the proposal density for z ′ to be an Ising likelihood depending on θ′ , that is, z′ ∼ 1 f ′ (z ′ ), Zn (θ′ ) θ we can cancel the normalizing constants in the Metropolis-Hastings ratio 17 (2.5) as follows: g(z ′ | θ′ , x)p(θ′ )fθ′ (x)fθ (z)u(θ | θ′ ) M H(θ′ , z ′ | θ, z) = . (2.6) g(z | θ, x)p(θ)fθ (x)fθ′ (z ′ )u(θ′ | θ) As suggested by Møller et al. (2006), the conditional density g(z | θ, x) is approximated by fθ̃ (z)/Zn (θ̃) which does not depend on θ, where θ̃ is PMLE. Also, using independent log-normal and normal distributions so that u(θ | θ′ )/u(θ′ | θ) = 1, the ratio (2.6) further reduces to: fθ̃ (z ′ )fθ′ (x)fθ (z) M H(θ′ , z ′ | θ, z) = 1 (θ ∈ Θ) . (2.7) fθ̃ (z)fθ (x)fθ′ (z ′ ) We accept the proposal (θ′ , z ′ ) as a new state with probability min{1, M H(θ′ , z ′ | θ, z)}. Although Møller et al. (2006)’s approach intelligently controls the normalizing constant, the algorithm is still computationally expensive because it involves sampling from an Ising likelihood at each Metropolis-Hastings iteration. 2.4 Numerical experiments 2.4.1 Generating observed data For numerical experiments, we need a coupling matrix An and an observed vector xobserved from (1.1). First, for generating a random d-regular graph and its scaled adjacency matrix, we use a python package NetworkX. Using the scaled adjacency matrix as our coupling matrix An , we facilitate Metropolis-Hastings algorithm to generate an observed vector xobserved with true parameters (β0 , B0 ) as follows: 0. Define H(x) = xi and start with a random binary vector x = β0 ⊤ Pn 2 x An x + B0 i=1 (x1 , . . . , xn )⊤ . 1. Randomly choose a spin xi , i ∈ {1, . . . , n}. 2. Flip the chosen spin, i.e. xi = −xi , and calculate ∆H = H(xnew ) − H(xold ) due to this flip. 18 3. The probability that we accept xnew is:  if ∆H > 0,  1,  P(accept xnew ) = exp (∆H) , otherwise.   4. If rejected, put the spin back, i.e. xi = −xi . 5. Go to 1 until the maximum number of iterations (L) is reached. 6. After L = 1, 000, 000 iterations, the last result is a sample xobserved we use. One can read Izenman (2021) for more details of sampling from Ising model. 2.4.2 Performance Comparison We compare the performance of the parameter estimation methods for two-parameter Ising model (1.1) under various combinations of (d, n) and (β0 , B0 ). Using the given coupling matrix An for each scenario, we repeat the following steps R times: • Generate an observed vector x from (1.1) with true parameters (β0 , B0 ). • Using the proposed BBVI algorithm with MF family or BN family, obtain the optimal variational parameters ν ∗ . • We get the estimates θ̂ = (log β̂, B̂)⊤ = (µ∗1 , µ∗2 )⊤ . We use S = 20 or S = 200 as the Monte Carlo sample size in (1.12). Figure 2.1 describes ELBO convergences for the two different sample sizes with MF family and BN family. The figure indicates that the ELBO converges well with a moderate choice of S. Further, for more stable convergence, one might choose higher S and BN family over MF family. 
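The data-generation procedure of Section 2.4.1 translates directly into code; the following sketch follows steps 0–6 with our own helper names, computing ΔH locally rather than re-evaluating H (valid because the scaled adjacency matrix is symmetric with zero diagonal). A smaller L than the 1,000,000 sweeps used in the experiments may be preferred for quick checks.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_ising(A, beta0, B0, L=1_000_000):
    """Single-spin-flip Metropolis sampler for (1.1), following steps 0-6 of Section 2.4.1."""
    n = A.shape[0]
    x = rng.choice([-1.0, 1.0], size=n)                  # step 0: random starting configuration
    for _ in range(L):
        i = rng.integers(n)                              # step 1: pick a spin at random
        # step 2: Delta H for flipping x_i (A symmetric, zero diagonal)
        dH = -2.0 * x[i] * (beta0 * (A[i] @ x) + B0)
        if dH > 0 or rng.random() < np.exp(dH):          # step 3: accept with prob min(1, e^dH)
            x[i] = -x[i]
        # step 4: otherwise the spin is left unchanged
    return x                                             # step 6: final state is x_observed
```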
We use Mean squared error (MSE) as the measurement for assessing the performances with R = 50 pairs of estimates (β̂1 , B̂1 ), . . . , (β̂R , B̂R ): R   1X  2 2 M SE = β̂r − β0 + B̂r − B0 . (2.8) R r=1 19 Figure 2.1: Left: ELBO convergence with two variational families (BN and MF) and S = 20. Right: ELBO convergence with two variational families (BN and MF) and S = 200. Blue lines represent BN family and orange lines represent MF family in each plot. For each pair of (β0 , B0 ) we take d = 10, 50. The two numbers in each cell of the tables, Table 2.1, 2.2, 2.3, and 2.4, represent MSE values or convergence time when n = 100 and n = 500 respectively. First, we consider a small value of β0 = 0.2 with B0 = ±0.2, ±0.5. In these cases, as shown in Table 2.1 and 2.2, PMLE is the fastest but less accurate especially for d = 50. MCMC achieves smaller MSEs but it has the highest runtimes. Our VB methods notably reduce the runtimes without compromising accuracy. Second, for higher interaction parameter β0 = 0.7 with B0 = ±0.2, ±0.5, our VB algo- rithms are more accurate than the others (See Table 2.3 and 2.4). The numerical studies validate the superiority of our proposed VB methods. For more practical applications, we used our algorithm to regenerate an image in the next subsection. 2.4.3 Image reconstruction Ising model can be used for constructing an image in computer vision field. In particular, the Bayesian procedure facilitate the reconstruction easily by using the posterior predictive distribution Halim (2007). Consider an image in which each pixel represents either −1(white) or 1(black). For choice of coupling matrix An , we use twelve-nearest neighbor structure and construct corresponding scaled adjacency matrix Hurn et al. (2003). Then, we generate such images following the steps in the subsection 2.4.1 with a true parameter pair (β0 , B0 ) and use 20 Table 2.1: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.2, 0.2) (0.2, −0.2) Convergence graph (d) samples (S) time (sec) 10 PMLE1 - 0.119 / 0.049 0.091 / 0.022 3.2 / 3.6 MCMC2 - 0.098 / 0.194 0.097 / 0.146 165.2 / 675.8 MF3 family 20 0.059 / 0.018 0.057 / 0.014 6.3 / 10.3 200 0.037 / 0.027 0.032 / 0.013 10.7 / 15.9 BN4 family 20 0.064 / 0.021 0.061 / 0.018 7.0 / 12.3 200 0.041 / 0.026 0.035 / 0.009 12.3 / 17.8 50 PMLE - 0.369 / 0.105 0.299 / 0.170 3.1 / 3.6 MCMC - 0.165 / 0.084 0.144 / 0.110 168.1 / 678.0 MF family 20 0.072 / 0.027 0.073 / 0.030 6.4 / 10.1 200 0.070 / 0.023 0.070 / 0.027 10.8 / 15.9 BN family 20 0.081 / 0.045 0.082 / 0.049 7.1 / 12.1 200 0.073 / 0.058 0.067 / 0.067 12.2 / 17.5 1 PMLE, pseudo maximum likelihood estimate (Ghosal et al., 2020); 2 MCMC, markov chain monte carlo (Møller et al., 2006); 3 MF, mean-field; 4 BN, bivariate normal it as our given data xobserved . With the generated image xobserved and coupling matrix An , we obtain (β̂, B̂) after implementing the parameter estimation procedure based on BN family. The estimates (β̂, B̂) are used for data regeneration following the steps in the subsection 2.4.1 again. In Figure 2.2, we plot two original images in the left column. The first original image was generated with β0 = 1.2, B0 = 0.2 (left top) and we use β0 = 1.2, B0 = −0.2 for the second one (left bottom). 
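For the image experiments, the coupling matrix is the scaled adjacency matrix of a nearest-neighbour pixel lattice; the dissertation uses a twelve-nearest-neighbour structure following Hurn et al. (2003), while the sketch below uses the simpler four-nearest-neighbour lattice to keep the construction short (the n/(2|Gn|) scaling of Definition 1 is the same).

```python
import numpy as np

def lattice_scaled_adjacency(rows, cols):
    """Scaled adjacency matrix (Definition 1) for a rows x cols pixel lattice with
    4-nearest-neighbour edges (the experiments in the text use 12 neighbours)."""
    n = rows * cols
    A = np.zeros((n, n))
    idx = lambda r, c: r * cols + c
    for r in range(rows):
        for c in range(cols):
            for dr, dc in [(0, 1), (1, 0)]:          # right and down neighbours, each edge once
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    A[idx(r, c), idx(rr, cc)] = A[idx(rr, cc), idx(r, c)] = 1.0
    n_edges = int(A.sum() / 2)
    return (n / (2.0 * n_edges)) * A

A_img = lattice_scaled_adjacency(32, 32)             # coupling matrix for a 32 x 32 image
```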
Also, in the right column, there are two corresponding images regenerated with (β̂ = 1.071, B̂ = 0.357) (right top) and (β̂ = 0.982, B̂ = −0.326) (right bottom) respectively. It seems that, using two-parameter Ising model and our VB method, we can reconstruct the overall tendency of black and white images fairly well. For more precise pixel-by-pixel reconstruction, one can utilize multiple threshold parameters, B = (B1 , . . . , Bn )⊤ (See the section 2.6). 21 Table 2.2: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.2, 0.5) (0.2, −0.5) Convergence graph (d) samples (S) time (sec) 10 PMLE - 0.270 / 0.053 0.126 / 0.068 3.1 / 3.5 MCMC - 0.228 / 0.275 0.174 / 0.255 164.9 / 673.5 MF family 20 0.075 / 0.017 0.075 / 0.017 6.5 / 10.2 200 0.071 / 0.015 0.070 / 0.014 10.8 / 16.1 BN family 20 0.090 / 0.039 0.091 / 0.038 7.1 / 12.0 200 0.088 / 0.033 0.087 / 0.032 12.3 / 17.1 50 PMLE - 0.792 / 0.197 0.653 / 0.251 3.2 / 3.5 MCMC - 0.407 / 0.153 0.380 / 0.169 166.1 / 675.1 MF family 20 0.088 / 0.023 0.089 / 0.024 6.3 / 10.2 200 0.082 / 0.019 0.080 / 0.021 10.4 / 16.4 BN family 20 0.090 / 0.065 0.112 / 0.072 7.2 / 12.1 200 0.110 / 0.081 0.102 / 0.096 12.2 / 17.2 2.5 Real data analysis In two-parameter Ising model, higher value of β implies stronger interactions between con- nected nodes and the threshold parameter B controls the model size (number of 1s), where the model size is greater for B > 0, smaller for B < 0. In this section, applying our meth- ods to a real data set, we obtain the estimated interaction parameter and the threshold parameter (β̂, B̂). 2.5.1 Data description Stanford Network Analysis Project (SNAP) provides a Facebook network data set (Leskovec and Krevl, 2014) available at http://snap.stanford.edu/data/ego-Facebook.html. The Facebook network consists of 4,039 nodes and 88,234 edges. Each node represents a Facebook user and there is an edge between two nodes if corresponding users are friends. The data set also contains user features such as birthday, school, gender and location. The features are fully anonymized. For instance, while the original data may include a feature “location 22 Table 2.3: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.7, 0.2) (0.7, −0.2) Convergence graph (d) samples (S) time (sec) 10 PMLE - 0.439 / 0.074 0.393 / 0.083 3.1 / 3.6 MCMC - 0.114 / 0.080 0.111 / 0.064 170.3 / 678.1 MF family 20 0.109 / 0.135 0.095 / 0.144 6.3 / 10.0 200 0.116 / 0.140 0.100 / 0.148 10.5 / 16.0 BN family 20 0.085 / 0.074 0.080 / 0.081 6.9 / 11.9 200 0.087 / 0.080 0.080 / 0.088 12.3 / 17.0 50 PMLE - 0.718 / 0.333 0.730 / 0.386 3.2 / 3.8 MCMC - 0.170 / 0.107 0.172 / 0.120 171.0 / 673.6 MF family 20 0.093 / 0.185 0.093 / 0.178 6.2 / 10.2 200 0.101 / 0.191 0.103 / 0.184 10.4 / 16.1 BN family 20 0.068 / 0.107 0.071 / 0.086 7.0 / 12.2 200 0.068 / 0.117 0.075 / 0.110 12.5 / 17.3 = Michigan", the anonymized data would simply contain“location = anonymized location A". Thus, using the anonymized data, we can determine whether two users stay in the same location, but we do not know where. Among the 4, 039 users, we select only users who disclose gender information to create a sub-graph such that there are 3, 948 nodes and 84, 716 edges in the sub-graph. 
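A hedged sketch of the preprocessing just described: the combined edge list can be read with NetworkX (we assume the SNAP file name facebook_combined.txt; the anonymized feature files, and hence the restriction to the 3,948 users who disclose gender, are omitted here).

```python
import networkx as nx
import numpy as np

# Combined ego-network edge list from the SNAP page (file name assumed);
# nodes are integer ids 0, ..., 4038.
G = nx.read_edgelist("facebook_combined.txt", nodetype=int)
print(G.number_of_nodes(), G.number_of_edges())      # 4039 nodes, 88234 edges

degrees = np.array([d for _, d in G.degree()])
print(degrees.max(), degrees.min(), degrees.mean())

# Scaled adjacency matrix of Definition 1 (dense for simplicity; in practice the
# graph would first be restricted to the users who disclose gender).
n, n_edges = G.number_of_nodes(), G.number_of_edges()
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes())) * n / (2.0 * n_edges)
```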
Later we used the gender information for binary observations. Each node has different number of neighbors (degree). The maximum degree of the sub-graph is 1, 024, and the minimum is 1, with an average degree 42.92 (Figure 2.3 shows the nodes and edges in the reduced network). 2.5.2 Parameter estimation We utilize the selected users (n = 3, 948) as a real data set to apply our VB algorithm with the features school and gender as observed binary vectors. For the school feature, we encode 1 if a user (node) belongs to an anonymized school A, otherwise −1. For the gender feature, we encode a group by 1 and the other group by −1. The model sizes are 114 and 2, 417 for school and gender respectively. Note that, no matter which feature is used, the 23 Table 2.4: Mean squared errors and computation times for each pair of (β0 , B0 ) when n = 100 (left numbers) and n = 500 (right numbers) given the degree of underlying graph (d). Degree of Method Monte Carlo (0.7, 0.5) (0.7, −0.5) Convergence graph (d) samples (S) time (sec) 10 PMLE - 0.603 / 0.228 0.788 / 0.233 3.2 / 3.7 MCMC - 0.143 / 0.113 0.178 / 0.121 170.0 / 678.5 MF family 20 0.083 / 0.184 0.078 / 0.193 6.3 / 10.1 200 0.103 / 0.188 0.096 / 0.199 10.4 / 16.2 BN family 20 0.049 / 0.074 0.047 / 0.079 7.8 / 12.3 200 0.059 / 0.077 0.056 / 0.083 12.5 / 17.5 50 PMLE - 0.893 / 0.638 0.804 / 0.781 3.1 / 3.8 MCMC - 0.144 / 0.154 0.126 / 0.255 171.9 / 677.1 MF family 20 0.080 / 0.219 0.071 / 0.209 6.2 / 10.0 200 0.095 / 0.225 0.091 / 0.219 10.6 / 16.0 BN family 20 0.044 / 0.080 0.040 / 0.074 7.7 / 12.1 200 0.051 / 0.085 0.048 / 0.079 12.4 / 17.3 connectivity among nodes does not change. One can expect that the interaction parameter β is higher when we use the school feature because people from the same school are more likely to be Facebook friends with each other. Also, we expect the threshold parameter B will be negative for school and positive for gender because of the model sizes. Table 2.5 summarizes the estimated parameters with standard errors (SE) (for VB and MCMC) and runtimes. Table 2.5: The estimated parameters with standard errors (SE) in parentheses and time costs for the features gender and school. Feature Method Monte Carlo β̂ B̂ Convergence samples (S) (SE) (SE) time (sec) Gender PMLE - 0.250(-) 0.180(-) 5.9 MCMC - 0.112(0.066) 0.210(0.031) 27109.2 MF family 200 0.132(0.070) 0.189(0.027) 173.3 BN family 200 0.106(0.089) 0.248(0.033) 271.2 School PMLE - 0.260(-) -1.560(-) 6.0 MCMC - 0.203(0.087) -1.610(0.093) 27059.3 MF family 200 0.252(0.061) -1.445(0.049) 173.4 BN family 200 0.299(0.092) -1.602(0.103) 265.3 24 Figure 2.2: Original images (left) and the estimated images (right). SEs of MCMC in Table 2.5 are calculated based on 10,000 draws after the burn-in period of 10,000 iterations. For our VB methods, to calculate SEs, we sample the same amount of β and B (10,000 draws) from the optimal variational distributions and calculate sample standard deviations. Figure 2.4 indicate density plots of the draws from BN family, MF family, and MCMC for the features gender and school. Computational gain is very clear from the figures in Table 2.5. While estimated parame- ters are comparable for all the methods, the MCMC implementation takes about fifty times more time compared to VB to achieve similar level of accuracy. The PMLE approach does not produce SE and thus limited for statistical inference. 25 Figure 2.3: Visualization of Facebook network data where the size of circle represents the degree of the node. 
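The VB standard errors in Table 2.5 are sample standard deviations of 10,000 draws from the optimal variational distribution; a minimal sketch for the MF family (the variational parameter values below are placeholders, not the fitted ones) is:

```python
import numpy as np

rng = np.random.default_rng(3)

def vb_standard_errors(mu1, sigma1, mu2, sigma2, n_draws=10_000):
    """Sample (beta, B) from the fitted MF variational posterior
    (log beta ~ N(mu1, sigma1^2), B ~ N(mu2, sigma2^2)) and report
    sample standard deviations, as done for Table 2.5."""
    beta_draws = np.exp(rng.normal(mu1, sigma1, n_draws))
    B_draws = rng.normal(mu2, sigma2, n_draws)
    return beta_draws.std(ddof=1), B_draws.std(ddof=1)

# Placeholder variational parameters, for illustration only.
print(vb_standard_errors(mu1=np.log(0.13), sigma1=0.5, mu2=0.19, sigma2=0.14))
```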
2.6 Extension to multi threshold parameters Beyond the two-parameter Ising model which is a main material of this paper, we can extend the parameter estimation procedure using VB to more general Ising model. Allowing multi- threshold parameters, that is, B = (B1 , . . . , Bn ), the likelihood is: n ! (n) 1 β ⊤ X Pβ,B (X = x) = exp x An x + Bi xi . (2.9) Zn (β, B) 2 i=1 Note, there are n + 1 unknown parameters in the likelihood (2.9) and the corresponding pseudo-likelihood is: Yn (n) Pβ,B (Xi = xi | Xj , j ̸= i) i=1 n ! X = 2−n exp (βxi mi (x) + Bi xi − log cosh(βmi (x) + Bi )) . (2.10) i=1 We introduce a VB algorithm for estimating the parameters θ := (log β, B1 , . . . , Bn ) given An and x. As a natural extension, consider the following multivariate Gaussian variational 26 Figure 2.4: Density plots for the estimated parameters (left: β, right: B) from VB with BN family (red), VB with MF family (green), and MCMC (blue) for the features gender and school. family: ( ) QM G = q(θ) | q(θ) = q(β, B), (log β, B) ∼ M V N (µ, Σ) , (2.11) where µ ∈ Rn+1 and Σ ∈ R(n+1)×(n+1) . Without any assumption on Σ, updating all the vari- ational parameters and finding variational posterior require quite demanding computaional costs because the total number of updated parameters is (n + 1) + (n + 1)(n + 2)/2. If we as- sume Σ is a diagonal matrix, that is, the variational family (2.11) is mean-field, the number of parameters to be updated reduces to 2(n+1). Under the pseudo-likelihood (2.10) and mul- tivariate Gaussian family (2.11) with a diagonal covariance matrix Σ = diag(σ12 , . . . , σn2 ), we can easily compute the gradients (1.12) for updating the variational parameters and develop 27 a VB algorithm for multi-threshod parameter Ising model. 2.7 Posterior consistency The main theoretical contribution of this work lies in establishing the consistency of the variational posterior for the Ising model with the true likelihood replaced by the pseudo- likelihood. In this direction, we first establish the rates at which the true posterior based on the pseudo-likelihood concentrates around the εn - shrinking neighborhoods of the true parameters. With a suitable bound on the Kulback-Leibler distance between the true pos- terior (under pseudo-likelihood) and the variational posterior, we next establish the rate of contraction for the variational posterior and demonstrate that the variational posterior also concentrates around εn -shrinking neighborhoods of the true parameter. These results have been derived under three set of assumptions on the coupling matrix An . Indeed, we demon- strate that the variational posterior consistency holds for the same set of assumptions on An as those needed for the convergence of the maximum likelihood estimates based on the pseudo-likelihood. One of the main caveats in establishing the posterior contraction rates under the pseudo-likelihood structure is in ensuring that the concentration of the variational (n) (n) posterior occurs in P0 probability where P0 is the distribution induced by the true like- (n) lihood and not the pseudo-likelihood. Indeed, we could show that in P0 probability, the contraction of variational posterior happens at the rate 1 − 1/Mn in contrast to the faster rate 1 − exp(−Cnε2n ), C > 0 for the true posterior. As a final theoretical contribution, we establish that the variational Bayes estimator convergences to the true parameters at the rate 1/εn where εn can be chosen n−δ , 0 < δ < 1/2 provided the An matrix satisfies certain regularity assumptions. 
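Relative to the two-parameter case, the only change in the pseudo-likelihood is that the scalar B becomes a vector; a minimal sketch of (2.10) is given below. Under the diagonal-covariance (mean-field) version of (2.11), a BBVI implementation then updates 2(n + 1) variational parameters.

```python
import numpy as np

def log_pseudo_lik_multi(x, A, beta, B):
    """Log pseudo-likelihood (2.10) with node-specific thresholds B = (B_1, ..., B_n).
    With all B_i equal, this reduces to the two-parameter pseudo-likelihood (1.2)."""
    m = A @ x
    return np.sum(beta * x * m + B * x - np.log(np.cosh(beta * m + B))) - len(x) * np.log(2.0)
```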
2.7.1 Sketch of Proof

In this subsection, we state the main theorem and provide a sketch of the proof of consistency of the variational posterior (1.7). In this direction, we establish the variational posterior contraction rates, which quantify how well the posterior distribution of β and B under the variational approximation concentrates around the true values β0 and B0. Towards the proof, we make the following assumptions.

Assumption 1 (Bounded row sums of An). The row sums of An are bounded above,

    \max_{i \in [n]} \sum_{j=1}^{n} A_n(i, j) \le \gamma,

for a constant γ independent of n.

Assumption 1 is the same as (1.2) in Ghosal et al. (2020). As a consequence of Assumption 1, it can be shown that |m_i(x)| ≤ γ, i = 1, . . . , n.

Assumption 2 (Mean field assumption on An). Let ϵn → 0 and nϵ_n^2 → ∞ be such that

    (i) \ \sum_{i=1}^{n}\sum_{j=1}^{n} A_n(i, j) = O(n\epsilon_n^2),  \qquad  (ii) \ \sum_{i=1}^{n}\sum_{j=1}^{n} A_n(i, j)^2 = o(n\epsilon_n^2).

Assumption 2-(i) is the same as condition (1.4) in Ghosal et al. (2020) on An with ϵn = 1. Assumption 2-(ii) is the same as (1.6) in Ghosal et al. (2020) with ϵn = 1. For more details on the mean field assumption, we refer to Definition 1.3 in Basak and Mukherjee (2017).

Assumption 3 (Bounded variance of An). Let \bar{A}_n = (1/n) \sum_{i=1}^{n}\sum_{j=1}^{n} A_n(i, j). Then

    \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} A_n(i, j) - \bar{A}_n \Big)^2 > 0.

Finally, Assumption 3 corresponds to (1.7) in Ghosal et al. (2020). The validity of Assumption 3 ensures that T_n(x) = (1/n) \sum_{i=1}^{n} (m_i(x) - \bar{m}(x))^2 is bounded below and above in probability, an essential requirement towards the proof of the contraction rates of the variational posterior.

Let θ = (β, B) be the model parameter and θ0 = (β0, B0) the true parameter from which the data are generated. Let L(θ) and L(θ0) denote the pseudo-likelihood as in (1.2) under the model parameters and the true parameters, respectively. Further, let L0 denote the true probability mass function from which X^{(n)} is generated; thus, L0 is as in (1.1) with θ = θ0. We shall use the notations E_0^{(n)} and P_0^{(n)} to denote expectation and probability mass function with respect to L0.

We next present the main theorem, which establishes the contraction rate for the variational posterior. Following the proof, we establish the contraction rate of the variational Bayes estimator as a corollary. We shall use the term "with dominating probability" to mean that, under P_0^{(n)}, the probability of the event goes to 1 as n → ∞.

Theorem 1 (Posterior Contraction). Let U_{\varepsilon_n} = \{\theta : \|\theta - \theta_0\|_2 \le \varepsilon_n\} be a neighborhood of the true parameters. Suppose ϵn satisfies Assumption 2. Then, in P_0^{(n)} probability,

    Q^*(U_{\varepsilon_n}^c) \to 0, \quad n \to \infty,

where \varepsilon_n = \epsilon_n \sqrt{M_n \log n} for any slowly increasing sequence M_n \to \infty satisfying \varepsilon_n \to 0.

The above result establishes that the posterior distribution of β and B concentrates around the true values β0 and B0 at a rate slightly larger than ϵn. The proof of the theorem rests on the following lemmas, whose proofs are deferred to Sections 2.10, 2.11, and 2.12.

Lemma 1. There exists a constant C0 > 0 such that, for any ϵn → 0 with nϵ_n^2 → ∞,

    P_0^{(n)}\Big( \int_{U_{\epsilon_n}^c} \log\frac{L(\theta)}{L(\theta_0)}\, p(\theta)\, d\theta \le -C_0\, n\epsilon_n^2 \Big) \to 1, \quad n \to \infty.

Lemma 2. Let ϵn be the sequence satisfying Assumption 2. Then, for any C > 0,

    P_0^{(n)}\Big( \int \log\frac{L(\theta)}{L(\theta_0)}\, p(\theta)\, d\theta \le C\, n\epsilon_n^2 \log n \Big) \to 1.

Lemma 3. Let ϵn be the sequence satisfying Assumption 2. Then, for some Q ∈ Q_{MF} and any C > 0,

    P_0^{(n)}\Big( \int \log\frac{L(\theta_0)}{L(\theta)}\, q(\theta)\, d\theta \le C\, n\epsilon_n^2 \log n \Big) \to 1.

Lemma 1 and Lemma 2 taken together suffice to establish the posterior consistency of the true posterior based on the pseudo-likelihood L(θ) as in (2.2).
Lemma 3 on the other hand 30 is the additional condition which needs to hold to ensure the consistency of the variational posterior. We next state an important result which relates the variational posterior to the true posterior. Formula for KL divergence: By Corollary 4.15 in Boucheron et al. (2013), Z Z  f KL(P1 , P2 ) = sup f dP1 − log e dP2 . f Using the above formula in the context of variational distributions, we get Z Z ∗ ∗ f dQ ≤ KL(Q , Π(| X (n) )) + log ef dΠ(| X (n) ). (2.12) The above relation serves as an important tool towards the proof of Theorem 1. Next, we provide a brief sketch of the proof. Further details on the proof have been deferred to Section 2.13. Sketch of proof of Theorem 1: Let f = (C0 /2)nε2n 1[θ ∈ Uεcn ], then 2 (C0 /2)nε2n Q∗ (Uεcn ) ≤ KL(Q∗ , Π(| X (n) )) + log(e(C0 /2)nεn Π(Uεcn | X (n) ) + Π(Uεn | X (n) )) 2 2 2 =⇒ Q∗ (Uεcn ) ≤ 2 KL(Q∗ , Π(| X (n) )) + 2 log(1 + e(C0 /2)nεn Π(Uεcn | X (n) )). C0 nεn C0 nεn By Lemma 2 and 3, it can be established with dominating probability for any C > 0, as n→∞ KL Q∗ , Π(| X (n) ) ≤ Cnϵ2n log n.  By Lemma 1 and 2, it can be established with dominating probability, as n → ∞ 2 Π(Uεcn | X (n) ) ≤ e−C1 nεn , (2.13) for any C1 > C0 /2. Therefore, with dominating probability 2C 2  −(C1 −C0 /2)nε2n  Q∗ (Uεcn ) ≤ + log 1 + e C0 Mn C0 nε2n 2 2C e−(C1 −C0 /2)nεn ∼ + → 0. (2.14) C0 Mn C0 nε2n This completes the proof. 31 Note that (2.13) gives the statement for the contraction of the true posterior. Similarly the contraction rate for the variational posterior follows as a consequence of (2.14). An important difference to note is that Q∗ (Uεcn ) goes to 0 at the rate 1/Mn in contrast to the faster rate e−C1 nεn for the true posterior. 2 Note, Theorem 1 gives the contraction rate of the variational posterior. However, the convergence of the of variational Bayes estimator to the true values of β0 and B0 is not immediate. The following corollary gives the convergence rate for the variational Bayes estimate as long as Assumptions 1, 2 and 3 hold. Corollary 1 (Variational Bayes Estimator Convergence). Let εn be as in Theorem 1, then (n) in P0 probability, 1 EQ∗ (∥θ − θ0 ∥2 ) → 0, as n → ∞. εn Next, we provide a brief sketch of the proof. Further details of the proof have been deferred to Section 2.14. Sketch of proof of Corollary 1: Let f = (C2 /2)nεn ∥θ − θ0 ∥2 , then Z (C2 /2)nεn ∥θ − θ0 ∥2 dQ∗ (θ) Z  ∗ (n) C2 nεn ∥θ−θ0 ∥2 /2 (n) ≤ KL(Q , Π(| X )) + log e dΠ(θ | X ) . By Lemma 2 and 3, it can be established with dominating probability, for any C > 0 KL(Q∗ , Π(| X (n) )) ≤ Cnϵ2n log n. By Lemma 1, and 2, it can be established with dominating probability, for some C2 > 0 Z 1 2 e(C2 /2)nεn ∥θ−θ0 ∥2 dΠ(θ | X (n) ) ≤ 2 eCnϵn log n . (2.15) (C2 /2)nεn Therefore, with dominating probability 2 log(C2 /2) 2εn log(nε2n ) Z 2Cεn 2Cεn ∥θ − θ0 ∥2 dQ∗ (θ) ≤ − − 2 + ≤ εn o(1). C2 Mn C2 nεn C2 nεn C2 Mn This completes the proof. 32 (2.15) follows as a consequence of convergence of the true posterior. An important thing to note that if εn can be made arbitrarily close to n−δ for 0 < δ < 1/2, it guarantees close √ to n convergence. 2.8 Preliminary notations and Lemmas Let θ = (β, B). Define Wn := (W1n , W2n ) where Xn W1n (θ | x) = mi (x) (xi − tanh(βmi (x) + B)) , i=1 n (2.16) X W2n (θ | x) = (xi − tanh(βmi (x) + B)) . 
i=1 Also, define P Pn  n 2 i=1 mi (x) Si (θ | x) i=1 mi (x)Si (θ | x) Hn (θ | x) :=  , (2.17)  Pn Pn i=1 mi (x)Si (θ | x) i=1 Si (θ | x) P Pn  n 3 3 2 3 m (x) (h (θ | x) − h (θ | x)) m i (x) (h i (θ | x) − h (θ | x))  i=1 i i i i=1 i R1n (θ | x) :=  ,  Pn 2 3 P n 3 i=1 mi (x) (hi (θ | x) − hi (θ | x)) i=1 mi (x) (hi (θ | x) − hi (θ | x)) (2.18) and P  n 2 3 Pn 3 m (x) (h (θ | x) − h (θ | x)) m i (x) (h i (θ | x) − h (θ | x))  i=1 i i i i=1 i R2n (θ | x) :=  ,  Pn 3 P n 3 i=1 mi (x) (hi (θ | x) − hi (θ | x)) i=1 (hi (θ | x) − hi (θ | x)) (2.19) where Si (θ | x) = sech2 (βmi (x) + B) and hi (θ | x) = tanh (βmi (x) + B). Lemma 4. Let W1n and W2n be as in (2.16), then 1 (n) 1 (n) E0 (W1n (θ0 | x))2 < ∞ E0 (W2n (θ0 | x))2 < ∞ n n Proof. See the lemma 2.1 in Ghosal et al. (2020). 33 Lemma 5. Let p1 and p2 be any two density functions. Then,   p1 2 EP1 log ≤ KL(P1 , P2 ) + p2 e Proof. See the lemma 4 in Lee (2000). 1 Pn 1 Pn 2 Lemma 6. Let Tn (x) = n i=1 mi (x) − n i=1 mi (x) . Suppose Assumptions 1, 2 and 3 hold, then Tn (x) = Op (1), 1/Tn (x) = Op (1) Proof. See the theorem 1.4 in Ghosal et al. (2020). 2.9 Taylor expansion for log-likelihood Lemma 7. Consider the term (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) where Hn is the same as in (2.17). Then, for some C1 , C2 > 0, we have (n) C1 n∥θ − θ0 ∥22 ≤ (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) ≤ C2 n∥θ − θ0 ∥22 → 1, n → ∞  P0 Proof. For some M1 , M2 > 0, let A1n = {x : Tn (x) ≤ M1 }, A2n = {x : Tn (x) ≥ M2 }, and (n) An = A1n ∩ A2n , then P0 (An ) → 1. This is because by Lemma 6, there exists M1 and M2 such that (n) (n) P0 (Tn > M1 ) = P0 (1/Tn < 1/M1 ) → 0 (n) (n) P0 (Tn < M2 ) = P0 (1/Tn > 1/M2 ) → 0 The remaining part of the proof works with only x ∈ An . Let eH 1 ≥ e2 be the eigenvalues H of Hn (θ0 | x). The trace of Hn (θ0 | x), is Xn tr (Hn (θ0 | x)) = eH H sech2 (β0 mi (x) + B0 ) m2i (x) + 1 ≤ n(1 + γ 2 )  1 + e2 = i=1 where we used |mi (x)| ≤ γ based on Assumption 1. Note (2.7) in Ghosal et al. (2020) gives a lower bound of eH 2 : sech4 (β0 γ + |B0 |) eH2 ≥ nTn (x), (2.20) 1 + γ2 34 where Tn (x) is as in the Lemma 6. By spectral decomposition of Hn (θ0 | x), (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) ≤ eH (β − β0 )2 + (B − B0 )2  1 ≤ n(1 + γ 2 ) − eH (β − β0 )2 + (B − B0 )2  2 sech4 (β0 γ + |B0 |)   2 (β − β0 )2 + (B − B0 )2  ≤ n (1 + γ ) − 2 Tn (x) 1+γ (2.21) Also, (θ − θ0 )⊤ Hn (θ0 | x)(θ − θ0 ) ≥ eH (β − β0 )2 + (B − B0 )2  2 sech4 (β0 γ + |B0 |) nTn (x) (β − β0 )2 + (B − B0 )2 (2.22)  ≥ 2 1+γ Since M2 ≤ Tn (x) ≤ M1 for every x ∈ An , the proof follows. Lemma 8. For R1n and R2n as in (2.18) and (2.19) respectively, let 3Rn (θ̃, θ − θ0 | x) (2.23) = (β − β0 )(θ − θ0 )⊤ R1n (θ̃ | x)(θ − θ0 ) + (B − B0 )(θ − θ0 )⊤ R2n (θ̃ | x)(θ − θ0 ) (2.24) where Rn = Rn (θ̃, θ − θ0 | x) and θ̃ = θ0 + c(θ − θ0 ) 0 < c < 1. Then, as n → ∞ for some C1 , C2 > 0 we have (n) P0 (M1 n∆∗ ≤ Rn ≤ M2 n∆∗ ) → 1, where ∆∗ = ((β − β0 )γ + (B − B0 ))∥θ0 − θ∥22 . Proof. For some M1 , M2 > 0, let A1n = {x : Tn (x) ≤ M1 }, A2n = {x : Tn (x) ≥ M2 }. (n) Let An = A1n ∩ A2n , then P0 (An ) → 1. This is because by Lemma 6, there exists M1 , M2 > 0 such that (n) (n) P0 (Tn > M1 ) = P0 (1/Tn < 1/M1 ) → 0 (n) (n) P0 (Tn < M2 ) = P0 (1/Tn > 1/M2 ) → 0 35 The remaining part of the proof works with only x ∈ An . The determinant of R1n (θ̃ | x) is: det(R1n (θ̃ | x)) n 1X mi (x)mj (x) hi (x) − h3i (x) hj (x) − h3j (x) (mi (x) − mj (x))2   = 2 i,j=1 √ where hi (x) = tanh(β̃mi (x) + B̃). Since tanh(·) − tanh3 (·) has maximum value 0.38 at 3/3 and mi (x) ≤ γ by Assumption 1. 
Therefore, n n 1 XX |det(R1n (θ̃ | x))| ≤ γ 2 (0.38)2 (mi (x) − mj (x))2 = γ 2 (0.38)2 n2 Tn (x). 2 i=1 j=1 The trace of R1n (θ̃ | x) is: X n mi (x)(hi (x) − h3i (x)) m2i (x) + 1 ≤ n0.38γ(1 + γ 2 ).  tr(R1n (θ̃ | x)) = i=1 Let eR 1 1n ≥ e2R1n be eigenvalues of R1n (θ̃ | x). eR 1n R1n 1 e2 det(R1n (θ̃ | x)) γ 2 (0.38)2 n2 Tn (x) 0.38γ eR 2 1n ≥ R1n R1n = ≥− 2 =− nTn (x). e1 + e2 tr(R1n (θ̃ | x)) n0.38γ(1 + γ ) 1 + γ2 Therefore, 0.38γ (θ − θ0 )⊤ R1n (θ̃ | x)(θ − θ0 ) ≥ eR 2 2 ||θ − θ0 ||2 ≥ − 1n nTn (x)||θ − θ0 ||22 (2.25) 1 + γ2 and (θ − θ0 )⊤ R1n (θ̃ | x)(θ − θ0 ) ≤ eR 2 R1n 1 ||θ − θ0 ||2 = (tr(R1n (θ̃ | x)) − e2 ))||θ − θ0 ||2 1n 2   Tn (x) ≤ 0.38γn (1 + γ ) + 2 ||θ − θ0 ||22 . (2.26) 1 + γ2 With the same argument, we can get: 0.38 (θ − θ0 )⊤ R2n (θ̃ | x)(θ − θ0 ) ≥ − nTn (x)||θ − θ0 ||22 , (2.27) 1 + γ2   Tn (x) t (θ − θ0 ) R2n (θ̃ | x)(θ − θ0 ) ≤ 0.38n (1 + γ ) + 2 ||θ − θ0 ||22 (2.28) 1 + γ2 Using (2.25), (2.26), (2.27) and (2.28) and noting M2 ≤ Tn (x) ≤ M1 for every x ∈ An , the proof follows. 36 Lemma 9. Let q(θ) ∈ QM F with µ1 = log β0 , µ2 = B0 , and σ12 = σ22 = 1/n, then 1 KL(Q, P ) → 0, n → ∞ nϵ2n log n Proof. Using the same notation in (2.3), the KL divergence is: KL(Q, P ) = Eqβ (β)qB (B) (log qβ (β) + log qB (B) − log pβ (β) − log pB (B)) = KL (Qβ , Pβ ) + KL (QB , PB )   1 1 1 = (log β0 ) + + B0 + − 2 + log n = o(nϵ2n log n), 2 2 since nϵ2n → ∞ 2 n n 2.10 Technical details of Lemma 1 Proof of Lemma 1. Let Vϵn = {|β − β0 | < ϵn , |B − B0 | < ϵn }. Then Vϵn ⊆ U√2ϵn which implies U√c 2ϵn ⊆ Vϵcn which further implies Z Z L(θ) L(θ) log p(θ)dθ ≤ log p(θ)dθ (2.29) c U√ L(θ0 ) Vϵcn L(θ0 ) 2ϵn We shall now establish for some C0 > 0 Z ! (n) L(θ) P0 log p(θ)dθ ≤ −C0 nϵ2n → 1, n → ∞, Vϵcn L(θ0 ) which in lieu of (2.29) completes the proof. Define A1n = {x : W1n (θ0 | x)2 + W1n (θ0 | x)2 ≤ n2/3 } and A2n = {x : Tn (x) ≥ M } for some M > 0. Define An = A1n ∩ A2n . (n) Here P0 (An ) → 1. This because by Markov’s inequality and Lemma 4, (n) W1n (θ0 | x)2 + W2n (θ0 | x)2 > n2/3 ε  P0 1 (n) ≤ 4/3 E0 W1n (θ0 | x)2 + W2n (θ0 | x)2 → 0.  n (n) (n) and by Lemma 6 P0 (Tn < M ) = P0 (1/Tn > 1/M ) → 0. 37 We shall show for x ∈ An , L(θ)/L(θ0 ) ≤ e−C0 nϵn , ∀ θ ∈ Vϵcn which implies ∀ x ∈ An , 2 Z Z log (L(θ)/L(θ0 ))p(θ)dθ = log (L(θ)/L(θ0 ))p(θ)dθ Vϵcn Vϵcn Z ! −C0 nϵ2n ≤ log e p(θ)dθ ≤ −C0 nϵ2n , (2.30) Vϵcn (n) since p(Vϵcn ) ≤ 1. This completes the proof since P0 (An ) → 1 as n → ∞. Next, note that Vϵcn is given by the union of the following terms V1n = {(β, B) : β − β0 ≥ ϵn , B ≥ B0 }, V2n = {(β, B) : β − β0 ≥ ϵn , B < B0 } V3n = {(β, B) : β − β0 < −ϵn , B ≥ B0 }, V4n = {(β, B) : β − β0 < −ϵn , B < B0 } V5n = {(β, B) : β ≥ β0 , B − B0 ≥ ϵn }, V6n = {(β, B) : β < β0 , B − B0 ≥ ϵn } V7n = {(β, B) : β ≥ β0 , B − B0 < −ϵn }, V8n = {(β, B) : β < β0 , B − B0 < −ϵn } We shall now show for x ∈ An and θ ∈ V1n , L(θ)/L(θ0 ) ≤ e−C0 nϵn . The proof of other parts 2 follow similarly. (a) Let θ = (β, B) and θ0′ = (β0 + ϵ, B0 ), where β ≥ β0 + ϵ and B ≥ B0 . Also, define θt = θ0′ + t(θ − θ0′ ) where 0 < t < 1. Consider a function g: g(t) = f (θt ) = log L(θt ) − log L(θ0′ ) − ∆n (θ0′ )⊤ (θt − θ0′ ), where ∆n (θ) = (∇β log L(θ), ∇B log L(θ))⊤ . Note that g(t) is a function of t. We want to show g(t) ≤ g(0) provided t > 0. We shall instead show g ′ (t) ≤ 0. By Taylor expansion, g ′ (t) = g ′ (0) + g ′′ (t̃)t. for some t̃ ∈ [0, t]. Here, g ′ (0) = 0 and g ′′ (t̃) = −(θ − θ0′ )⊤ Hn (θt̃ | x)(θ − θ0′ ) ≤ 0 where Hn as in (2.17) is a positive definite matrix (by (2.22) in 7 and Tn (x) ≥ 0). 
Since g(t) is decreasing for 0 < t < 1, thus g(1) ≤ g(0) =⇒ f (θ) ≤ f (θ0′ ) 38 (b) Similarly, let θ = (β, B) θ0′′ = (β0 , B0 + ϵ), where β ≥ β0 and B ≥ B0 + ϵ. Define θt = θ0′ + t(θ − θ0′′ ) where 0 < t < 1. h(t) = f (θt ) = log L(θt ) − log L(θ0′′ ) − ∆n (θ0′′ )⊤ (θt − θ0′′ ). With similar argument in (a), we conclude that h(1) ≤ h(0) =⇒ f (θ) ≤ f (θ0′′ ). Therefore, sup (log L(θ) − log L(θ0 )) θ∈V1n ≤ sup (log L(θ) − log L(θ0 )) + sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B≥B0 } {β>β0 +ϵ,B≥B0 } ≤ sup (log L(θ) − log L(θ0 )) + (log L(θ0′ ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B≥B0 } ≤ sup (log L(θ) − log L(θ0 )) + sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} {β−β0 ∈[ϵn ,ϵ],B>B0 +ϵ} + log L(θ0′ ) − log L(θ0 )) ≤ sup (log L(θ) − log L(θ0 )) + sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} {β≥β0 ,B>B0 +ϵ} + log L(θ0′ ) − log L(θ0 )) ≤ sup (log L(θ) − log L(θ0 )) + log L(θ0′′ ) − log L(θ0 ) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} + log L(θ0′ ) − log L(θ0 ) ≤ sup 3(log L(θ) − log L(θ0 )) ≤ −C0 nϵ2n (2.31) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} where the second inequality follows from (a) and fifth inequality follows from (b) above. Finally for the last inequality, consider Taylor expansion for log L(θ) upto the second order 1 log L(θ) − log L(θ0 ) = Wn (θ0 | x)⊤ (θ − θ0 ) − (θ − θ0 )⊤ Hn (θ̃ | x)(θ − θ0 ) 2 where θ̃ = θ0 + c(θ − θ0 ), 0 < c < 1 and Wn and Hn are as defined in (2.16) and (2.17) respectively. By Cauchy Schwarz inequality, |Wn (θ0 | x)⊤ (θ − θ0 ))| ≤ ( W1n (θ0 | x)2 + W2n (θ0 | x)2 ∥θ − θ0 ∥22 + 1)  ≤ n2/3 ∥θ − θ0 ∥22 + 1 39 for every x ∈ An . Further 1 log L(θ) − log L(θ0 ) ≤ n2/3 ∥θ − θ0 ∥22 + 1 − (θ − θ0 )⊤ Hn (θ̃ | x)(θ − θ0 ) 2 sech4 (β̃γ + |B̃|) ≤ n2/3 ∥θ − θ0 ∥22 + 1 − nTn (x)∥θ − θ0 ∥22 1 + γ2 ! 1 sech 4 (β̃γ + |B̃|) n ≤ n2/3 + − ∥θ − θ0 ∥22 ||θ − θ0 ||22 1 + γ2 M where the second inequality is a consequence of the lower bound (2.20) and the third in- equality holds since x ∈ An . Taking sup over the set {β − β0 ∈ [ϵn , ϵ], B − B0 ∈ [0, ϵ]} on both sides, sup (log L(θ) − log L(θ0 )) {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} sech4 ((β0 + ϵ)γ + (B0 + ϵ)) n   2/3 1 ≤ sup n + 2 − 2 ∥θ − θ0 ∥22 {β−β0 ∈[ϵn ,ϵ],B−B0 ∈[0,ϵ]} ϵn 1 + γ M ≤ −C0 nϵ2n (2.32) for some C0 > 0 as n → ∞ since n2/3 and 1/ϵ2n = o(n). This completes the proof. 2.11 Technical details of Lemma 2 Lemma 10. Let L0 and L(θ0 ) represent the true likelihood (1.1) and the pseudo-likelihood (1.2) with the true parameters θ0 , respectively. Then, 1 (n) E (log L0 − log L(θ0 )) → 0, n → ∞. nϵ2n 0 Proof. efθ0 (x) efθ0 (x) L0 = P fθ0 (x) = x∈{−1,1}n e Zn (θ0 ) where fθ0 (x) = (β0 /2)x⊤ An x + B0 x⊤ 1. Define b(x; θ) = (b1 (x; θ), · · · , bn (x; θ)) where bi (x; θ) = E(Xi | Xj , j ̸= i) = tanh(βmi (x) + B). Then L(θ) = eg(x,b(x;θ)) where the function g for v, w ∈ [−1, 1]n is defined as n X 1 + vi 1 + wi 1 − vi 1 − wi g(v, w) = log + log i=1 2 2 2 2 40 Also, define I(v) = g(v, v). Now, observe that (n) (n) E0 (log L0 − log L(θ0 )) = E0 (fθ0 (x) − g(x, b(x; θ0 )) − log Zn (θ0 )) (n) (n) = E0 (fθ0 (x) − fθ0 (b(x; θ0 )) + E0 (fθ0 (b(x; θ0 )) − I(b(x; θ0 ))) (n) + E0 (I(b(x; θ0 )) − g(x, b(x; θ0 ))) − log Zn (θ0 ) (n) (n) ≤ (E0 (fθ0 (x) − fθ0 (b(x; θ0 )))2 )1/2 + (E0 (I(b(x; θ0 )) − g(x, b(x; θ0 )))2 )1/2 (n) + E0 (fθ0 (b(x; θ0 )) − I(b(x; θ0 ))) − log Zn (θ0 ), (2.33) where the last step is due to Hölder’s inequality. 
Under Assumption 2, mimicking the proof of Lemmas 3.2 and 3.3 in Basak and Mukherjee (2017) with n replaced by nϵ2n , we get (n) (E0 (fθ0 (x) − fθ0 (b(x; θ0 )))2 )1/2 = o(nϵ2n ) (2.34) (n) (E0 (I(b(x; θ0 )) − g(x, b(x; θ0 )))2 )1/2 = o(nϵ2n ) (2.35) Also for rn = supv∈[−1,1]n (fθ0 (v) − I(v)), we have (n) E0 (fθ0 (b(x, θ0 )) − I(b(x, θ0 ))) ≤ rn By Theorem 1.6 in Chatterjee and Dembo (2016) with the fact ∂ 2 fθ0 /∂x2i = 0, i = 1, · · · , n, we have − log Zn (θ0 ) ≤ −rn . Therefore, (n) E0 (fθ0 (b(x; θ0 )) − I(b(x; θ0 )) − log Zn (θ0 )) ≤ 0. (2.36) Using (2.34), (2.35) and (2.36) in (2.33) completes the proof. Lemma 11. Note that L(θ) is not a valid density function. So, we consider L̃(θ) = P P L(θ)/Jn (θ) where Jn (θ) = x∈{−1,1}n L(θ) such that x∈{−1,1}n L̃(θ) = 1. Then for ev- ery θ, p  √  Jn (θ) ≤ βϵn n(1 + γ 2 )/2 + o(nϵ2n ) log 3 2 − log ϵn . 41 √ Proof. Let Nn (ϵn ) := {i ∈ [n] : |λi (An )| > ϵn / 2} and with the mean field condition in the Assumption 2, it is easy to note that n |Nn (ϵn )| 2 X 2 X ≤ 2 λi (An )2 = 2 An (i, j)2 → 0, n → ∞ (2.37) n nϵn nϵn i,j=1 i∈[n] Set kn = |Nn (ϵn )| and let Dn,0 (ϵn ) be a ϵn n/2 net of the set {f ∈ Rkn : fi2 ≤ n} of size p P √ at most (3 2/ϵn )kn . The existence of such a net is standard (see for example Lemma 2.6 in Milman and Schechtman (1986)). Let {p1 , · · · , pn } be the eigen vectors of An . Then setting    X  Dn,1 (ϵn ) := ci λi (An )pi , c ∈ Dn,0 (ϵn )   i∈Nn (ϵn ) We claim Dn,1 (ϵn ) is ϵn n(1 + γ 2 )/2 of the set {An x : x ∈ {−1, 1}n }. Indeed any x ∈ p {−1, 1}n can be written as ni=1 fi pi where ni=1 fi2 = xi = n. In particular, it means P P P 2 i∈Nn (ϵn ) fi ≤ n, which implies there exists a c ∈ Dn,0 (ϵn ) such that ||c − f || ≤ ϵn P p n/2. Let i∈Nn (ϵn ) ci λi (An )pi ∈ Dn,1 (ϵn ), then P X X X ||An x − ci λi (An )pi ||22 = (ci − fi )2 λi (An )2 + λi (An )2 fi2 i∈Nn (ϵn ) i∈Nn (ϵn ) i∈N / n (ϵn ) γ 2 nϵ2n nϵ2n ≤ + 2 2 where the last inequality is a consequence of maxi∈[n] |λi (An )| ≤ maxi∈[n] Pn j=1 |An (i, j)| ≤ γ and the definition of the set Nn (ϵn ). In particular for any x ∈ {−1, 1}n , there exists at least one p ∈ Dn,1 (ϵn ) such that ||p − m(x)|| ≤ ϵn n(1 + γ 2 )/2. For any p ∈ Dn,1 (ϵn ), let p n p o P(p) := x ∈ {−1, 1}n : ||p − m(x)|| ≤ ϵn n(1 + γ 2 )/2 . Therefore, X X X eg(x,b(x;θ)) = eg(x,b(x;θ)) x∈{−1,1}n p∈Dn,1 (ϵn ) x∈P(p) 42 Setting u(p) := tanh(βp + B) if ||p − m(x)|| ≤ ϵn n(1 + γ 2 )/2, then we have p Xn p |g(x, b(x; θ)) − g(x, u(p))| ≤ 2β |mi (x) − pi | ≤ 2βϵn n(1 + γ 2 )/2 i=1 Finally, X √ X X n(1+γ 2 )/2 eg(x,b(x;θ)) ≤ e2βϵn eg(x,u(p)) x∈{−1,1}n p∈Dn,1 (ϵn ) x∈P(p) √ X X √ n(1+γ 2 )/2 n(1+γ 2 )/2 ≤ e2βϵn eg(x,u(p)) = e2βϵn |Dn,1 (ϵn )| p∈Dn,1 (ϵn ) x∈{−1,1}n where the last equality follows since eg(x,u) = 1 for any u ∈ [−1, 1]n . Therefore, P x∈{−1,1}n p log Jn (θ) ≤ βϵn n(1 + γ 2 )/2 + log |Dn,1 (ϵn )| Since |Dn,1 (ϵn )| = |Dn,0 (ϵn )|, therefore √ log |Dn,1 (ϵn )| ≤ |Nn (ϵn )|(log 3 2 − log ϵn ) The proof follows since |Nn (ϵn )| = o(nϵ2n ). Lemma 12. Define Vϵn := {θ : |β − β0 | < ϵn , |B − B0 | < ϵn }. Then, Vϵn ⊆ Kϵn , for n sufficiently large (n) where Kϵn := {θ : E0 (log(L(θ0 )/L(θ))) < 3nϵ2n }. Proof. For any θ ∈ Vϵn , using the decomposition in (2.43), we get   (n) (n) E0 (log L(θ0 ) − log L(θ)) = E0 − 1 − 2 + 3 − 4 ≤ 3nϵ2n where the last inequality is justified next. 
For some M > 0 using Lemma 4, we get 1/2 √  (n) (n) 1 (n) −E0 ( 1 ) = (β0 − β)E0 (W1n (θ0 |x) ≤ n|β0 − β| E0 (W1n (θ0 |x))2 n √ ≤ M nϵn 43 1/2 √  (n) (n) 1 (n) −E0 ( 2 ) = (B0 − B)E0 (W2n (θ0 |x) ≤ n|B0 − B| E0 (W2n (θ0 |x))2 n √ ≤ M nϵn By relation (2.21), we get sech4 (β0 γ + |B0 |) (n)   (n) E0 ( 3 ) ≤ n||θ − θ0 ||22 2 (1 + γ ) − E0 (Tn (x)) ≤ 2(1 + γ 2 )nϵ2n 1 + γ2 By relation (2.44), we get (n) 0.38n (n) 0.38γ 2 (1 + γ) 3 −E0 ( 4 ) ≤ E (T n (x)) ||θ − θ || 0 2 2 (|β − β 0 |γ + |B − B 0 |) ≤ nϵn 3(1 + γ 2 ) 0 3(1 + γ 2 ) Lemma 13. With prior distribution p(θ) as in (2.1), we have Z p(θ)dθ ≥ Cϵ2n , for some C > 0 Vϵn Proof. By mean value theorem with β ⋆ ∈ [β0 − ϵn , β0 + ϵn ] and B ⋆ ∈ [B0 − ϵn , B0 + ϵn ], Z Z β0 +ϵ Z B0 +ϵ 1 (log β)2 − 2 1 B2 p(θ)dθ = √ e dβ √ e− 2 dB Vϵn β0 −ϵ β 2π B0 −ϵ 2π 2ϵn − (log β⋆ )2 2ϵn − (B⋆ )2 = √ e 2 √ e 2 β ⋆ 2π 2π   1 ⋆ ⋆ 2 ⋆ 2  = exp −(log π − log 2 − 2 log ϵn ) − 2 log β + (log β ) + (B ) 2   1 ≥ exp −(log π − log 2 − 2 log ϵn ) − (2u1 + ũ1 + u2 ) 2 ≥ Ce2 log ϵn = Cϵ2n where the above result follow since ϵn → 0 implies u1 ≤ max(log(β0 + 1), log(β0 + 1)), ũ1 ≤ max((log(β0 − 1))2 , (log(β0 + 1))2 ), and u2 = max((B0 − 1)2 , (B0 + 1)2 ). Proof of Lemma 2. Let L∗ = L(θ)p(θ)dθ, Jn∗ = L∗ . Then, R P x∈{−1,1}n X Z Jn∗ = L(θ)p(θ)dθ. x∈{−1,1}n 44 Since L(θ)p(θ) > 0, Tonelli’s theorem allows for interchange of the order of summation and integral. Using Lemma 11 and − log ϵn = O(log n), we get Z X Z Jn∗ = L(θ)p(θ)dθ = Jn (θ)p(θ)dθ x∈{−1,1}n p √ = ϵn n(1 + γ 2 )/2EP (β) + o(nϵ2n )(log 3 2 − log ϵn ) √ (2.38) p = ϵn ne(1 + γ 2 )/2 + o(nϵ2n )(log 3 2 − log ϵn ) = o(nϵ2n log n) Also, by Lemma 11 and − log ϵn = O(log n), √ (2.39) p log Jn (θ0 ) = β0 ϵn n(1 + γ 2 )/2 + o(nϵ2n )(log 3 2 − log ϵn ) = o(nϵ2n log n)  Z  P0n log (L(θ)/L(θ0 ))p(θ)dθ > Cnϵ2n log n (2.40)  Z  1 (n) ≤ E log (L(θ)/L(θ0 ))p(θ)dθ Cnϵ2n log n 0 1 (n) = 2 E0 (|log (L∗ /L(θ0 ))|) Cnϵn log n Jn∗   1 ∗ 4 ≤ KL(L0 , L̃ ) + KL(L0 , L̃(θ0 )) + log + Cnϵ2n log n Jn (θ0 ) e 2  (n) (n) ≤ 2 E0 (log L0 − log L(θ0 )) + E0 (log L(θ0 ) − log L∗ ) Cnϵn log n 4 + 2(log Jn∗ + log Jn (θ0 )) + (2.41) e where the second last step follows from Lemma 5. 45 Then, using the set Kϵn in Lemma 12, we get (n) E0 (log L(θ0 ) − log L∗ )) Z (n) = E0 (log L(θ0 ) − log L(θ)p(θ)dθ) Z ! (n) ≤ E0 log(L(θ0 )/ L(θ)p(θ)dθ) Kϵ2 n !! p(Kϵ2n ) Z (n) ≤ E0 log L(θ0 ) − log L(θ)p(θ)dθ)) p(Kϵ2n ) Kϵ2 n   (n) (n) ≤ E0 (log L(θ0 )) − log(p(Kϵ2n ) + E0 Ep|Kϵ2 (− log L(θ)) n Z ! (n) ≤ − log(p(Kϵ2n )) + E0 log L(θ0 ) − log L(θ)p|Kϵ2n (θ)dθ Jensen’s Inequality Kϵ2 n Z (n) = − log(p(Kϵ2n )) + E0 (log(L(θ0 ) − L(θ))p|Kϵ2n (θ)dθ Kϵ2 n ≤ −2 log(C ′ ϵ2n ) + 3nϵ2n = o(nϵ2n log n) (2.42) where the last line follows from Lemma 12 and Lemma 13. The final order is because − log ϵn = O(log n) and nϵ2n → ∞ and log n → ∞. The proof follows by using relations (2.42), (2.39) and (2.38) in (2.40). 2.12 Technical details of Lemma 3 Lemma 14. Let q(θ) ∈ QM F with µ1 = log β0 , µ2 = B0 , and σ12 = σ22 = 1/n, then Z 1 (n) E0 (log L(θ0 ) − log L(θ))q(θ)dθ ≲ 0, n → ∞ nϵ2n Proof. Using the Taylor expansion of log L(θ) around θ = θ0 , we get log L(θ0 ) − log L(θ) = log L(θ0 ) − log L(θ0 ) − (β − β0 )W1n (θ0 |x) − W2n (θ0 |x)(B − B0 ) | {z } | {z } 1 2 1 + (θ − θ0 )⊤ Hn (θ0 |x)(θ − θ0 ) − Rn (θ̃, θ − θ0 |x) (2.43) |2 {z } | {z } 3 4 46 W1n , W2n is as in (2.16), Hn is as in (2.17) and Rn (θ̃, θ − θ0 |x) is defined in (2.23). 
Therefore, Z (n) E0 (log L(θ0 ) − log L(θ)) q(θ)dθ Z   Z   (n) (n) = − E0 1 q(θ)dθ − E0 2 q(θ)dθ Z   Z   (n) (n) + E0 3 q(θ)dθ − E0 4 q(θ)dθ, Z Z 1 (n)   1 (n) − 2 E0 1 q(θ)dθ = 2 (β0 − β)E0 (W1n (θ0 |x)) q(θ)dθ nϵn nϵn 1/2 √ Z  1 1 (n) 2 ≤ |β0 − β| n E (W1n (θ0 |x)) q(β)dβ, nϵ2n n 0 √ Z M n ≤ |β − β0 |qβ (β)dβ nϵ2n √ Z M n ≤ ( (β − β0 )2 qβ (β)dβ)1/2 nϵ2n √ M n 2 log β0 2/n 1/2n 1/2 M elog β0 = (e (e − 2e + 1)) ∼ →0 nϵ2n nϵ2n where the second inequality above above line holds by Hölder inequality and third inequality holds by Lemma 4, for some constant M . Finally the last convergence to 0 is nϵ2n → ∞. Similarly, Z Z 1 (n)   1 (n) − 2 E0 2 q(θ)dθ = 2 (B0 − B)E0 (W2n (θ0 |x)) q(θ)dθ nϵn nϵn √  1/2 Z n 1 (n) 2 ≤ 2 E (W2n (θ0 |x)) |B − B0 |qB (B)dB nϵn n 0 √ Z M n M 2 ( (B − B0 )2 q(B)dB)1/2 ∼ 2 → 0 nϵn nϵn Using the upper bound (2.21) and nϵ2n → ∞, we get Z 1 (n)   E0 3 q(θ)dθ nϵ2n sech4 (β0 γ + |B0 |) (n)  Z 1 2 (β − β0 )2 + (B − B0 )2 q(θ)dθ  ≤ 2 (1 + γ ) − 2 E0 (Tn (x)) 2ϵn 1+γ (1 + γ 2 ) 2 log β0 2/n 1/2n   ≤ e e − 2e + 1 + 1/n 2ϵ2n 2 log β0 (e + 1)(1 + γ 2 ) ∼ →0 2nϵ2n 47 where the second inequality holds since Tn (x) = (1/n) Pn i=1 (mi (x) − (1/n)mi (x))2 ≥ 0. For the remainder term, using relations (2.25) and (2.27) in relation (2.23), we get (n)   0.38n (n) 4 (β − β0 )2 + (B − B0 )2 {(β − β0 )γ + (B − B0 )}  −E0 ≤ E 0 (Tn (x)) 3(1 + γ 2 ) (2.44) Further, Z (β − β0 )2 + (B − B0 )2 {(β − β0 )γ + (B − B0 )} q(θ)dθ  n Z − elog β0 )2 + (B − B0 )2 (elog β − elog β0 )γ + (B − B0 ) q(θ)dθ  log β  =n (e | {z } 5 Z 5 =n e3 log β − 3e2 log β+log β0 + 3elog β+2 log β0 − e3 log β0 q(β)dβ (2.45)  Z B 3 − 3B0 B 2 + 3B02 B − B03 q(B)dB (2.46)  +n Z e2 log β − 2elog β+log β0 + e2 log β0 (B − B0 ) q(β)q(B)dβdB (2.47)  +n Z (B − B0 )2 elog β − elog β0 q(β)q(B)dβdB (2.48)  + nγ (2.45) = ne3 log β0 e9/2n − 3e2/n + 3e1/2n − e0 ∼ 0  (2.46) = n B03 + 3B0 /n − 3B0 (B02 + 1/n) + 3B03 − B03 = 0,  (2.47) = n B0 e2 log β0 e2/n − 2e1/2n + e0 − B0 e2 log β0 e2/n − 2e1/2n + e0 = 0,   γ γ 0 (2.48) = nelog β0 e1/2n − e = γelog β0 (e1/2n − e0 ) ∼ 0 n n Z 1 (n)   − 2 E0 4 q(θ)dθ ≲ 0 nϵn since Tn (x) ≤ γ 2 (since by Assumption 1, mi (x) ≤ γ) and nϵ2n → ∞. 48 Poof of Lemma 3. With the q as in Lemma 14, using Markov’s inequality, Z  n P0 q(θ) log(L(θ0 )/L(θ))dθ > Cnϵn log n 2 (2.49) Z 1 (n) ≤ E q(θ) log(L(θ0 )/L(θ))dθ Cnϵ2n log n 0 Z  1 (n) ≤ E q(θ) |log(L(θ0 )/L(θ))| dθ Cnϵ2n log n 0 Z 1 (n) ≤ q(θ)E0 (|log(L(θ0 )/L(θ))|) dθ Fubini’s theorem Cnϵ2n log n Z    1 (n) L0 L(θ0 ) = q(θ)E0 log dθ Cnϵ2n log n L(θ) L0 Z !! 1 (n) L 0 L̃(θ0 ) J (θ n 0 ) = q(θ)E0 log dθ Cnϵ2n log n L̃(θ) 0 L J n (θ) Z   !  ! 1 (n) L 0 L̃(θ 0 ) J (θ n 0 ) = q(θ)E0 log + log + log dθ Cnϵ2n log n L̃(θ) L 0 J n (θ) Z   !  ! 1 (n) L 0 L̃(θ 0 ) J (θ n 0 ) ≤ q(θ)E0 log + log + log dθ Cnϵ2n log n L̃(θ) L 0 Jn (θ) Z     1     Jn (θ0 ) 4 ≤ q(θ) KL L0 , L̃(θ) + KL L0 , L̃(θ0 ) + log + dθ (2.50) Cnϵ2n log n Jn (θ) e Therefore, Z  n 2 P0 q(θ) log(L(θ0 )/L(θ))dθ > Cnϵn log n  Z  1 (n) (n) = 2E0 (log L0 − log L(θ0 )) + q(θ)E0 (log L(θ0 ) − log L(θ))dθ Cnϵ2n log n  Z  1 4 + 2 log Jn (θ0 ) + 2 q(θ) log Jn (θ)dθ + →0 (2.51) Cnϵ2n log n e where the inequality in second last step is due to Lemma 5. The last convergence to 0 is explained next. 
By Lemma 10 and Lemma 14 respectively, we get (n) E0 (log L0 − log L(θ0 )) = o(nϵ2n ) Z (n) q(θ)E0 (log L(θ0 ) − log L(θ))dθ ≤ o(nϵ2n ) By (2.39), log Jn (θ0 ) = o(nϵ2n log n) and by Lemma 11 and − log ϵn = O(log n) Z p Z √ q(θ) log Jn (θ)dθ ≤ ϵn n(1 + γ )/2 βq(β)dβ + o(nϵ2n )(log 3 2 − log ϵn ) = o(nϵ2n ) 2 where we use EQ (β) = exp(log β0 + 1/n) → β0 49 2.13 Proof of Theorem 1 (n) In this section, with dominating probability term is used to imply that under P0 , the probability of the event goes to 1 as n → ∞. Z Z (n) KL(Q, Π(|X )) = q(θ) log q(θ)dθ − q(θ) log π(θ|X (n) )dθ Z Z L(θ)p(θ) = q(θ) log q(θ)dθ − q(θ) log R dθ L(θ)p(θ)dθ Z Z = KL(Q, P ) − log(L(θ)/L(θ0 ))q(θ)dθ + log (L(θ)/L(θ0 ))p(θ)dθ Z Z = KL(Q, P ) + log(L(θ0 )/L(θ))q(θ)dθ + log (L(θ)/L(θ0 ))p(θ)dθ (2.52) By Lemma 9, KL(Q, P ) = o(nϵ2n log n) ≤ (C/3)nϵ2n log n. By Lemma 3, with dominating probability Z log(L(θ0 )/L(θ))q(θ)dθ ≤ (C/3)nϵ2n log n for any C > 0. By Lemma 2, with dominating probability Z log (L(θ)/L(θ0 ))p(θ)dθ ≤ (C/3)nϵ2n log n Therefore, with dominating probability, for any C > 0, KL(Q, Π(|X (n) )) ≤ Cnϵ2n Further, R R Uεcn L(θ)p(θ)dθ Uc (L(θ)/L(θ0 ))p(θ)dθ Π(Uεcn |X (n) ) = R = Rεn L(θ)p(θ)dθ (L(θ)/L(θ0 ))p(θ)dθ By Lemma 1, with dominating probability, for any C > 0, as n → ∞ Z (L(θ)/L(θ0 ))p(θ)dθ ≤ exp(−C0 nε2n ) Uεcn By Lemma 2, with dominating probability Z (L(θ)/L(θ0 ))p(θ)dθ ≥ exp(−Cnϵ2n log n) 50 Therefore, with dominating probability Π(Uεcn |X (n) ) ≤ exp(−C0 nε2n (1 − C/Mn )) ≤ exp(−C1 nε2n ) for any C0 > C1 /2. This is because for n sufficiently large 1 − C/Mn > 1/2. This completes the proof. 2.14 Proof of Corollary 1 By Lemma 1, with dominating probability, there exists C0 (r) > 0 such that as n → ∞, Z (L(θ)/L(θ0 ))p(θ)dθ ≤ exp(−C0 (r)r2 nε2n ) Urεc n Let us assume, C0 (r) ≥ C2 /r for all r > 0 for some constant C2 > 0 (2.53) Numerical evidence for validity of this assumption been provided in the section 2.14.1. How- ever, the explicit theoretical derivation is technically involved and has been avoided in this thesis. By Lemma 2, with dominating probability Z (L(θ)/L(θ0 ))p(θ)dθ ≥ exp(−Cnϵ2n log n) Note, that R R c Urε L(θ)p(θ)dθ c Urε (L(θ)/L(θ0 ))p(θ)dθ c Π(Urε n |X (n) ) = R n = R n L(θ)p(θ)dθ (L(θ)/L(θ0 ))p(θ)dθ Therefore, with dominating probability c Π(Urε n |X (n) ) ≤ exp(−C2 rnε2n ) exp(Cnϵ2n log n) Following steps of proof of proposition 11 on page 2,111 in Van Der Vaart and Van Zanten (2011), Z Z ∞ (C2 /2)nεn ||θ−θ0 ||2 2 e dΠ(Uεcn |X (n) ) = e(C2 /2)nεn r Π(||θ − θ0 ||2 ≥ rεn |X (n) )dr 0 51 Therefore, Z e(C2 /2)nεn ||θ−θ0 ||2 dΠ(Uεcn |X (n) ) Z ∞ 2 = exp(Cnϵn log n) exp((C2 /2)rnε2n ) exp(−C2 rnε2n )dr Z0 ∞ 2 = exp(Cnϵ2n log n) exp(−(C2 /2)rnε2n )dr = 2 exp(Cnϵ2n log n) 0 C 2 nε n This completes the proof. 2.14.1 Proof of Relation (2.53) Lemma 15. With Urϵn same as in Lemma 1, we have for some C2 > 0, Z ! (n) L(θ) P0 sup p(θ)dθ ≤ −C2 rnϵ2n → 1, n → ∞ (2.54) θ∈Urϵnc L(θ 0 ) Proof. Following the proof of Lemma 1, we relate Urϵn to Vrϵn = {|β − β0 | < rϵn , |B − B0 | < rϵn }. Note, Vrϵn can be split into sets V1n , · · · , V8n as in the proof of Lemma 1. We study the behavior of sup (log L(θ) − log L(θ0 )) θ∈V1n where V1n = {β ≥ β0 + rϵn , B ≥ B0 }. The proof for other cases of V2n , · · · , V8n follows similarly. The proof of (2.54) will then follow as a consequence of (2.30). 
By a simple modification of the steps for the relation (2.31) used in the proof of Lemma 1, it can be shown that sup (log L(θ) − log L(θ0 )) ≤ O(log L((β0 + rϵn , B0 )) − log L(β0 , B0 )) θ∈V1n Next, we numerically demonstrate that log L((β0 + rϵn , B0 )) − log L(β0 , B0 ) ≤ −Crnϵ2n , n → ∞ | {z } | {z } LHS RHS 52 where C = 0.01. Note, for computing LHS in the above relation, x is generated using relation (1.1) with β = β0 and B = B0 . For varying values of r ∈ (0, 1000), we consider two ratios ρ1 = Proportion of r values where LHS > RHS ρ2 = Proportion of r values where LHS > RHS provided LHS < 0 The reason for considering the two proportion ρ1 and ρ2 is because RHS is a negative quantity and the upper bound is justified only when the LHS is als a negative quantity. The tables below give the values of ρ1 and ρ2 for varying values of n, d and ϵn . It is evident that both ρ1 and ρ2 approach zero as sample size increases. The fact that ρ1 approaches 0 as n → ∞ is also immediate from the proof of Lemma 1 which shows that LHS becomes negative as n → ∞,see relation (2.32). ρ1 for d = 5 (ρ2 for d = 5) ϵn β B n = 200 n = 1000 n = 5000 n−0.4 0.2 0.2 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.2 0.002(0.002) 0.0(0.0) 0.0(0.0) 0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.5 0.092(0.0) 0.0(0.0) 0.092(0.0) 0.5 0.2 0.092(0.0) 0.0(0.0) 0.0(0.0) -0.2 0.0(0.0) 0.0(0.0) 0.0(0.0) 0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) n−0.2 0.2 0.2 0.091(0.0) 0.040(0.009) 0.0(0.0) -0.2 0.0(0.0) 0.014(0.008) 0.0(0.0) 0.5 0.0(0.0) 0.056(0.008) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) 0.5 0.2 0.060(0.009) 0.0(0.0) 0.0(0.0) -0.2 0.036(0.009) 0.0(0.0 0.0(0.0) 0.5 0.066(0.009) 0.018(0.006) 0.014(0.007) -0.5 0.091(0.010) 0.021(0.007) 0.016(0.007) 53 ρ1 for d = 30 (ρ2 for d = 30) ϵn β B n = 200 n = 1000 n = 5000 n−0.4 0.2 0.2 0.0(0.0) 0.062(0.0) 0.0(0.0) -0.2 0.026(0.017) 0.092(0.0) 0.0(0.0) 0.5 0.056(0.008) 0.023(0.009) 0.0(0.0) -0.5 0.0(0.0) 0.025(0.008) 0.0(0.0) 0.5 0.2 0.091(0.0) 0.081(0.014) 0.091(0.0) -0.2 0.067(0.014) 0.0(0.0) 0.093(0.0) 0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) n−0.2 0.2 0.2 0.028(0.028) 0.0(0.0) 0.0(0.0) -0.2 0.091(0.0) 0.023(0.023) 0.0(0.0) 0.5 0.0(0.0) 0.021(0.008) 0.0(0.0) -0.5 0.0(0.0) 0.0(0.0) 0.0(0.0) 0.5 0.2 0.039(0.015) 0.0(0.0) 0.004(0.004) -0.2 0.044(0.013) 0.042(0.012) 0.0(0.0) 0.5 0.0(0.0) 0.046(0.009) 0.0(0.0) -0.5 0.015(0.007) 0.0(0.0) 0.0(0.0) 54 CHAPTER 3 BAYESIAN VARIABLE SELECTION IN A STRUCTURED REGRESSION MODEL In the history of statistics, estimation of β = (β1 , . . . , βp )⊤ in a regression model, y = Xβ+e, has been one of the most popular research topics, where, X ∈ Rn×p is a given design matrix, y ∈ Rn is a response vector, and e ∼ M V N (0, I) is an error term. We consider a high- dimensional problem n < p, with sparse regression coefficients in which some βi ’s are non- zero and others are exactly zero. In this sparse high-dimensional setting, many previous literatures have considered Bayesian approaches to variable selection (Ray and Szabó, 2021; Ročková and George, 2018; Martin et al., 2017; Ročková and George, 2014; Carbonetto and Stephens, 2012). In addition to the sparsity on β, we consider a structurally dependent feature space. Dependent feature vector commonly occurs in genetics, neuroimaging, and image analysis. For example, suppose that an image is given in which each pixel has a value. There are signal areas in the image such that the pixels in the areas have non-zero values and the others pixels are all zeros. 
The shape of the signal areas varies such as rectangles, circles, or letters (See Figure 3.1). The image can be represented by a matrix and corresponding β is the vectorization of the matrix. In such structured regression models, Figure 3.1: Example of the structured regression coefficients. White pixels denote the cor- responding βi ’s are zeros and darker pixels denote the corresponding βi ’s have larger values. problems may arise with the classic variable selection methods. In order to use the structural information and select signal features, Li and Zhang (2010) employed latent binary variables 55 γ = (γ1 , . . . , γp )⊤ ∈ {0, 1}p with Ising prior on it and a spike and slab prior on β for approximated the posterior selection probabilities by adapting Gibbs sampling which involves a matrix inversion. Due to the computational cost of the matrix inversion, Li and Zhang (2010)’s method are not scalable when p or the degree of the underlying graph is large. Chang et al. (2018) proposed a Bayesian shrinkage approach via EM algorithm which is scalable to high dimensional settings. We suggest a variational Bayes method (Jordan et al., 1999) to simultaneously select signal coefficients and estimate the magnitudes of the signals where a predetermined coupling matrix is considered to incorporate the network structure of features. The coupling matrix in our method does not necessarily represent the true connection between features, which allows us to utilize a common structure of a coupling matrix such as k-nearest neighbor in image analysis. 3.1 Model and methodology Throughout this chapter, we assume that pn is the number of covariates which depends on the number of observations n. In a linear regression model, let β ∈ Rpn and σ 2 > 0 denote the regression coefficients and the residual variance respectively. Since we consider high-dimensional setup, we assume that pn > n and pn → ∞ as n → ∞. Then, the linear regression model is written as: y = Xβ + e, (3.1) where y = (y1 , . . . , yn )⊤ is an observed response vector, X ∈ Rn×pn is a given design matrix, and e = (e1 , . . . , en )⊤ ∼ M V N (0, σ 2 In ). Since estimating σ 2 is not of our in- terest, with out loss of generality, we assume σ 2 = 1. Our Bayesian variable selection method is easily extendable to the case of unknown σ 2 with an appropriate prior distribu- tion. Now, there are pn unknown parameters in the regression model. In addition to the high-dimensional regression setup, we further assume structured covariates for which it is desirable to capture the intrinsic dependence among covariates. Introducing binary latent variables γ = (γ1 , . . . , γpn )⊤ ∈ {−1, 1}pn , we perform Bayesian variable selection using a 56 spike and slab method as follows: pn ! 1 X p(γ) = exp (b0 γi mi (γ) + a0 γi − log cosh(b0 mi (γ) + a0 )) , 2pn i=1 pn (3.2) Y p(β | γ) = p(βi | γi ), i=1     1 − γi 1 + γi 1(βi = 0) + N 0, τ 2 ,  p(βi | γi ) ∼ 2 2 where m(γ) := (m1 (γ), . . . , mpn (γ))⊤ = Jn γ, Jn ∈ Rpn ×pn is a coupling matrix, and N (0, τ 2 ) denotes a Gaussian distribution with probability density function as follows: β2   1 p(βi | γi = 1) = √ exp − i2 . 2πτ 2 2τ One can notice that p(γ) is a pseudo-likelihood of Ising model. To approximate the unknown posterior distributions, we define a variational family as follows: pn ! 1 X q(γ) = pn exp (bγi mi (γ) + ai γi − log cosh(bmi (γ) + ai )) , 2 i=1 pn (3.3) Y q(β | γ) = q(βi | γi ), i=1     1 − γi 1 + γi 1(βi = 0) + N µi , σi2 .  
q(βi | γi ) ∼ 2 2 Note that we have 3pn + 1 variational parameters needed to be updated. We point out that using the multiple threshold parameters a = (a1 , . . . , apn )⊤ , the posterior inclusion probability, that is, the probability that γi = 1 given y, can be different over i, which is desirable for variable selection. 57 3.1.1 ELBO optimization Let ν = (a⊤ , b, µ⊤ , σ ⊤ ) denote the set of variational parameters, where a = (a1 , . . . , apn )⊤ , µ = (µ1 , . . . , µpn )⊤ , and σ = (σ1 , . . . , σpn )⊤ . Then, the negative ELBO is: L(ν) = Eq(β,γ) [log q(β, γ) − log p(β, γ, y)] = Eq(β,γ) [− log p (y | β)] + Eq(γ) [log q(γ) − log p(γ)] | {z } | {z } 1 2 (3.4) + Eq(β,γ) [log q(β | γ) − log p(β | γ)] . | {z } 3 The first term is:   1 ⊤ ⊤ 1 = Eq(β,γ) ⊤ ⊤  β X Xβ − 2y Xβ + y y + C 2   !2   n pn pn 1 X X X 1 ⊤  X·i⊤ yβi  +  = Eq(β,γ)   Xki βi − 2 y y +C 2 k=1 i=1 i=1 2 " n p pn n X # n 1 XX 2 2 X ⊤ X = Eq(β,γ) Xki βi − 2 X·i yβi + Xki Xkl βi βl + C, 2 k=1 i=1 i=1 k=1 i̸=l where X·i is i-th column of X. We compute the closed form of the above expectation: " n p pn n X # n XX X X Eq(β,γ) Xki2 2 βi − 2 X·i⊤ yβi + Xki Xkl βi βl k=1 i=1 i=1 k=1 i̸=l n pn pn XX X = Xki2 ϕi Eβi ∼N (µi ,σi2 ) [βi2 ] −2 X·i⊤ yϕi Eβi ∼N (µi ,σi2 ) [βi ] k=1 i=1 i=1 X n X + Xki Xkl ϕi ϕl Eβi ∼N (µi ,σi2 ) [βi ]Eβl ∼N (µl ,σl2 ) [βl ] k=1 i̸=l n X pn pn n X X X X = Xki2 ϕi (µ2i + σi2 ) −2 X·i⊤ yϕi µi + Xki Xkl ϕi ϕl µi µl , k=1 i=1 i=1 k=1 i̸=l where ϕi = eai +bmi (γ) / eai +bmi (γ) + e−ai −bmi (γ) is the marginal probability q(γi = 1). Sim-  ilarly, let ϕi0 = ea0 +b0 mi (γ) / ea0 +b0 mi (γ) + e−a0 −b0 mi (γ) denote the marginal probability in-  duced from the prior p(γ). The closed form of the second term in the negative ELBO (3.4) 58 is: pn " # X 2 = Eq(γ) (log ϕi − log ϕ0i ) i=1 p X = ϕi (log ϕi − log ϕ0i ) + (1 − ϕi ) (log(1 − ϕi ) − log(1 − ϕ0i )) . i=1 The last one is: pn " # X 3 = Eq(β,γ) (log q(βi | γi ) − log p(βi | γi )) i=1 pn X ϕi KL N (µi , σi2 ), N (0, τ 2 )  = i=1 pn X ϕi  µ2 + σ 2  i i = − 1 + 2 log τ − 2 log σi . i=1 2 τ2 The last result of −( 1 + 2 + 3 ) provides the closed form of the ELBO. Let ϕ := √ (ϕ1 , . . . , ϕpn )⊤ and let ϕ0.5 := ( ϕ1 , . . . , ϕpn )⊤ . Then, in a matrix notation, we write: p n X pn p n X X X X 2· 1 = Xki2 ϕi (µ2i + σi2 ) − 2 X·i⊤ yϕi µi + Xki Xkl ϕi ϕl µi µl + C k=1 i=1 i=1 k=1 i̸=l ⊤ ⊤ = ϕ0.5 ◦ µ X ⊤ X ◦ Ipn ϕ0.5 ◦ µ + ϕ0.5 ◦ σ X ⊤ X ◦ Ipn ϕ0.5 ◦ σ     − 2y ⊤ X(ϕ ◦ µ) + (ϕ ◦ µ)⊤ X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ) + C, (3.5)  where ◦ denote the element-wise product and 1pn ×pn ∈ Rpn ×pn is a matrix whose elements are all ones. Define another vector ϕ0 := (ϕ01 , . . . , ϕ0pn )⊤ and let log(·) mean element-wise logarithm when the input is a vector of a matrix. Then, p X 2 = ϕi (log ϕi − log ϕ0i ) + (1 − ϕi ) (log(1 − ϕi ) − log(1 − ϕ0i )) i=1 = ϕ⊤ (log ϕ − log ϕ0 ) + (1 − ϕ)⊤ (log (1 − ϕ) − log (1 − ϕ0 )) , and pn ϕi µ2i + σi2 X   3 = − 1 + 2 log τ − 2 log σi i=1 2 τ2 = 0.5ϕ⊤ τ −2 (µ ◦ µ + σ ◦ σ) + (2 log τ − 1) · 1 − 2 log σ .  59 With the closed form of the ELBO, the gradients of the ELBO with respect to variational parameters are also easily computable in closed forms. First, we compute the gradients ∇ai ϕi and ∇b ϕi : 2e2(ai +bmi (γ)) ∇ai ϕi = 2, (e2(ai +bmi (γ)) + 1) 2mi (γ)e2(ai +bmi (γ)) ∇b ϕi = 2. (e2(ai +bmi (γ)) + 1) ⊤ Observe that ∇ai ϕj = 0 if i ̸= j. We define ∇a ϕ := ∇a1 ϕ1 , . . . , ∇apn ϕpn and ∇b ϕ := (∇b ϕ1 , . . . , ∇b ϕpn )⊤ . Then, by chain rule, we can compute the closed forms of all the gradients needed. 
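As a concrete illustration of this chain-rule step, a minimal Python sketch of φ and the gradients ∇a φ and ∇b φ is given below. It is not the implementation used for the experiments in this chapter, and the variable names (a, b, J, gamma) are our own placeholders; it simply evaluates the marginal probabilities φ_i and the two gradient formulas stated above for a given Gibbs draw γ.

```python
import numpy as np

# Minimal sketch of phi_i = q(gamma_i = 1) under the pseudo-Ising variational family (3.3),
# together with the gradients d phi_i / d a_i and d phi_i / d b used in the chain rule.
def phi_and_grads(a, b, J, gamma):
    m = J @ gamma                         # m_i(gamma) = sum_j J(i, j) gamma_j
    s = a + b * m
    phi = 1.0 / (1.0 + np.exp(-2.0 * s))  # = e^{2s} / (e^{2s} + 1)
    dphi_da = 2.0 * phi * (1.0 - phi)     # equals 2 e^{2s} / (e^{2s} + 1)^2
    dphi_db = m * dphi_da                 # extra factor m_i(gamma) for the b-gradient
    return phi, dphi_da, dphi_db

# Toy usage: p = 6 covariates on a fully connected coupling graph, one Gibbs draw gamma.
p = 6
J = (np.ones((p, p)) - np.eye(p)) / (p - 1)
gamma = np.array([1.0, 1.0, -1.0, 1.0, -1.0, -1.0])
phi, g_a, g_b = phi_and_grads(np.zeros(p), 1.0 / p, J, gamma)
```

These per-coordinate quantities are then combined, via the chain rule, with the closed forms of the three ELBO terms to produce the gradients listed next.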
Gradients with respect to a: 2 · ∇a 1 = X ⊤ X ◦ Ip (µ ◦ µ ◦ ∇a ϕ) + X ⊤ X ◦ Ip (σ ◦ σ ◦ ∇a ϕ)   − 2 X ⊤ y ◦ µ ◦ ∇a ϕ + 2 X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ) ◦ (µ ◦ ∇a ϕ) ,    ∇a 2 = ∇a ϕ ◦ (log ϕ − log ϕ0 ) − ∇a ϕ (log (1 − ϕ) − log (1 − ϕ0 )) , ∇a 3 = 0.5∇a ϕ ◦ τ −1 (µ ◦ µ + σ ◦ σ) + (2 log τ − 1) · 1 − 2 log σ .  Gradients with respect to b: 0.5∇b 1 ⊤ ⊤ = ∇b ϕ0.5 ◦ µ X ⊤ X ◦ Ipn ∇b ϕ0.5 ◦ µ + ∇b ϕ0.5 ◦ σ X ⊤ X ◦ Ipn ∇b ϕ0.5 ◦ σ     − 2y ⊤ X(∇b ϕ ◦ µ) + (∇b ϕ ◦ µ)⊤ X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ)  + (ϕ ◦ µ)⊤ X ⊤ X ◦ (1pn ×pn − Ipn ) (∇b ϕ ◦ µ),  ∇b 2 = ∇b ϕ⊤ (log ϕ − log ϕ0 ) − ∇b ϕ⊤ (log (1 − ϕ) − log (1 − ϕ0 )) , ∇b 3 = 0.5∇b ϕ⊤ τ −1 (µ ◦ µ + σ ◦ σ) + (2 log τ − 1) · 1 − 2 log σ .  60 Gradients with respect to µ: 0.5∇µ 1 = X ⊤ X ◦ Ip (µ ◦ ϕ) − 2(X ⊤ y) ◦ ϕ  + 2 X ⊤ X ◦ (1pn ×pn − Ipn ) (ϕ ◦ µ) ◦ ϕ,   ∇µ 3 = τ −2 ϕ ◦ µ. Gradients with respect to σ: 0.5∇σ 1 = X ⊤ X ◦ Ip (σ ◦ ϕ) ,  ∇µ 3 = τ −2 ϕ ◦ µ. 3.2 Implementation details Starting with initial variational parameters ν (0) , the first step for implementing the VB algorithm is to draw a γ from the current q(γ; b(t) , a(t) ) using Gibbs sampling. Note that we do not need to draw samples from q(β | γ). Given a γ, we can obtain the closed forms of ELBO and all the gradients ∇\ ν L(ν) described in the subsection 3.1.1. Then, we can update the current variational parameters as follows: ν (t+1) ← ν (t) + ηt (ν) · ∇\ ν L(ν). Note that, we allow that the learning rates ηt (ν) depend on variational parameters ν ∈ ν. Since we have 3pn + 1 variational parameters, a single learning rate does not guarantee the convergence of all the variational parameters. Therefore, we use adaptive learning rates such as Adam described in Algorithm 1.1. After the optimal variational parameters ν ∗ are achieved, we can empirically compute the marginal probability that γi = 1, i = 1, . . . , pn based on the samples: γ 1 , . . . , γ S ∼ q(γ; b∗ , a∗ ), 61 where γ s = (γs,1 , . . . , γs,pn )⊤ , s = 1 . . . , S. Then the marginal probability of γi is: 1 (γs,i = 1) PS ∗ q (γi = 1) = s=1 . S Based on the empirical marginal probabilities, we select the i-th feature if: q ∗ (γi = 1) > T, where T is a threshold (it could be a fixed number such as 0.5, or it could be the M -th largest marginal). For estimation of β, we use the (variational) posterior mode µ∗ . 3.3 Numerical results In this section, we numerically investigate our variable selection algorithm equipped with two different adaptive learning rates, Adam (Kingma and Ba, 2014) and RMSprop which is an unpublished algorithm first proposed in the Coursera course. “Neural Network for Machine Learning” by Geoff Hinton. Also, we compare our method with a MCMC based method (Li and Zhang, 2010). Li and Zhang (2010) approached the structural regression problem through the Bayesian variable selection framework, where the covariates lie on an undirected graph and formulate an Ising prior on the model space for incorporating structural information. Li and Zhang (2010) adopt the Gibbs sampling algorithms, first suggested by George and McCulloch (1993). 3.3.1 Li and Zhang (2010)’s Gibbs sampling scheme For the formulation of Li and Zhang (2010)’s Gibbs sampling method, we define γ (−i) = (γ1 , . . . , γi−1 , γi+1 , . . . , γp )⊤ ; let I(−i) be the set of indices {γj = 1 : j ̸= i}; I(i) = I(−i) ∪ {i}; p(i) = |I(i) | and p(−i) = |I(−i) |. With an Ising prior on γ ∈ {0, 1}p , p ! 
1 b0 ⊤ X p(γ) = exp γ Jγ + a0 γi , Z(a0 , b0 ) 2 i=1 the posterior distribution of γ given the data can be decomposed by Bayes formula as follows:  P γi = 1 | γ (−i) (3.6)  P γi = 1 | γ (−i) , y =  −1 , P γi = 1 | γ (−i) + BF i | γ (−i) · P γi = −1 | γ (−i) 62  P(y|γi =1,γ ) where BF i | γ (−i) = P y|γ =0,γ (−i) is the Bayes factor, and ( i (−i) )   P γi a0 +b0 j∈I γj  e (−i) P γi | γ (−i) = P a0 +b0 j∈I γj . 1+e (−i) The Bayes factor can be explicitly computed under the spike and slab prior on β: 1/2 y ⊤ y − y ⊤ X −1 ⊤ !n/2  |K (−i) | I (−i) K X (−i) I(−i) y BF i | γ (−i) = τ −1  −1 ⊤ , |K(i) | ⊤ ⊤ y y − y XI(i) K(i) XI(i) y where K(i) = XI⊤(i) XI(i) + τ −2 Ip(i) and K(−i) = XI⊤(−i) XI(−i) + τ −2 Ip(−i) . Based on Gibbs samples from the posterior probabilities in (3.6), Li and Zhang (2010) calculated the marginal posterior probabilities to find signal variables. 3.3.2 Hyper parameter selection To implement the algorithm with the prior distributions (3.2), three hyper parameters are needed to be selected. We follow Li and Zhang (2010) for the hyper parameter choices. Li and Zhang (2010) considered reparametrizations for the hyper parameters (a0 , b0 ) as follows: a0 = log(r/w02 ) and b0 = log(w1 · w0 ), where w0 = rw1 + 1 − r. Li and Zhang (2010) used fixed r = 0.03 and various w1 . Table 3.1 shows examples of hyper parameter choices. Note that w1 = 1 corresponds to the independent Bernoulli prior. w1 a0 b0 1 -3.5066 0 5 -3.7332 1.7228 9 -3.9368 2.4123 Table 3.1: Examples of hyper parameter choices 3.3.3 ROC curve In this subsection, we compute ROC curves to see how performance changes as hyper pa- rameters change in two scenarios. 63 Scenario 1: For the first scenario, we assume the γi s are arranged on a circle clockwise. Then, each γi has two neighbors, γi−1 and γi+1 , where γ0 = γp and γp+1 = γ1 , where n = 100 and p = 2000. Figure 3.2 shows the connectivity among γ. For the true regression coefficients, we use:  0.3, if i ∈ A and i is odd,        βi =  0.6, if i ∈ A and i is even,    0, otherwise,   where A = {i : i ∈ [245, 260] ∪ [745, 760] ∪ [1245, 1260] ∪ [1745, 1760]}. Figure 3.2: γ on a circle For design matrix X ∈ Rn×p , we first assume each component follows independent stan- dard Gaussian distribution, i.e., Xij ∼ N (0, 1). Figure 3.3 demonstrates the ROC curves corresponding to the hyper parameters w1 = 1 and w1 = 3. Figure 3.4 contains the results for higher w1 with independent X. 64 Figure 3.3: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the covariates are independent. Figure 3.4: ROC curves for the three variable selection methods with the hyper parameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are independent. For the second type of X ∈ Rn×p , we consider correlated X. In the blocks [241, 265], [741, 765], [1241, 1265], and [1741, 1765], we let corr (X·i , X·j ) = 0.75−0.03|i−j|, where X·i is i-th column of X. Also, as a noise, we let X be correlated as corr (X·i , X·j ) = 0.4−0.02|i−j|, in four blocks which do not include a signal, [41, 60], [941, 960], [1041, 1060], and [1941, 1960]. 65 With the hyper parameters w1 = 1, w1 = 3, w1 = 5, w1 = 7, and w1 = 9, the corresponding ROC curves are demonstrated Figure 3.5 and Figure 3.6. Figure 3.5: ROC curves for the three variable selection methods with the hyper parameter w1 = 1 (left) and w1 = 3 (right) respectively when the covariates are correlated. 
Figure 3.6: ROC curves for the three variable selection methods with the hyper parameter w1 = 5 (left), w1 = 7 (right), and w1 = 9 (bottom) respectively when the covariates are correlated. Scenario 2: For the second scenario, we assume the γi s are arranged on a lattice such that each γi is connected with less then or equal to four neighbors. The four γi s at the corner 66 of the lattice have two neighbors each, the γi s at the boundary of the lattice have three neighbors, and the others have four neighbors. The connectivity is shown in Figure 3.7. Figure 3.7: γ on a lattice. For the true regression coefficients, we consider a vectorization of an image in which there are three signal areas. Each pixel in the signal areas takes a nonzero value, βi ∈ {0.3, 0.6}, and all the other pixels are zeros. Figure 3.8 shows the signal areas in the image. The black pixels represent larger value of nonzero βi s (0.6), the grey pixels represent smaller value of nonzero βi s (0.3), and the white pixels represent zero βi s. We use the vecotrization of the image (matrix) as the vector of true regression coefficients in scenario 2. 67 Figure 3.8: Signal areas in an image For design matrix X ∈ Rn×p in this scenario, we also consider two types of X. We first use independent standard Gaussian distribution as in the scenario 1. Figure 3.9 demonstrates the ROC curves corresponding to the hyper parameters w1 = 1, w1 = 3, and w1 = 5. Figure 3.9: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are independent. 68 When w1 is higher than 5, Li and Zhang (2010)’s approach does not work because of a phase transition. A phase transition can occur when the space of γ is two-dimensional (lattice). One can see Stanley (1971) for more details of phase transition in Ising model. Figure 3.10 contains the results for higher w1 = 7 and w1 = 9 when X is independent. Figure 3.10: ROC curves in scenario 2 for the two VB methods with the hyper parameter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are independent. For the second type of design matrix in scenario 2, we consider the same structure of γ, that is, corr (X·i , X·j ) > 0 if γi and γj are connected:  0.3 if γi and γj are connected,   corr (X·i , X·j ) = 0, otherwise.   Using the correlated X, the ROC curves with w1 = 1, w1 = 3, and w1 = 5 are shown in Figure 3.11 and the ROC curves for w1 = 7 and w1 = 9 are shown in Figure 3.12. 69 Figure 3.11: ROC curves in scenario 2 for the three variable selection methods with the hyper parameter w1 = 1 (left), w1 = 3 (right), and w1 = 5 (bottom) respectively when the covariates are correlated. Figure 3.12: ROC curves in scenario 2 for the two VB methods with the hyper parameter w1 = 7 (left) and w1 = 9 (right) respectively when the covariates are correlated. 3.4 Theoretical results In this section, we describe theoretical results. To establish the selection consistency of our variational Bayes method, we first observe the true posterior with the true Ising prior on γ 70 as follows: p ! 1 b0 ⊤ X p(γ) = exp γ Jn γ + a0 γi , Zn (a0 , b0 ) 2 i=1 where Jn ∈ Rpn ×pn is a given coupling matrix. We define an activation set Aγ = {i : γi = 1} and, given γ = g, let β g be the vector of non-zero βi ’s and Xg be the corresponding design matrix of size n × |β g |. Defining |g| := |β g | = |Ag |, the joint posterior is: π(β, γ = g | y) ∝ p(γ = g)p(β | γ = g)p(y | β g ) pn !   
X b0 ⊤ 1 ⊤  ∝ exp a0 gi + g Jn g exp − y − Xg β g y − Xg β g i=1 2 2 Y Y × p(βi | γi = 1) p(βi | γi = −1) i∈Ag i∈Acg pn !   X b0 ⊤ 1 ⊤  ∝ exp a0 gi + g Jn g exp − y − Xg β g y − Xg β g i=1 2 2  Y  |βg | 1 ⊤ × 2πτ 2 − 2 exp − 2 β g β g 1 (βi = 0) 2τ i∈Acg pn (3.7) ! b0 ⊤ − |β2g | Y 1 (βi = 0) X ∝ exp a0 gi + g Jn g 2πτ 2 i=1 2 i∈Acg    1 ⊤  1 ⊤ × exp − y − Xg β g y − Xg β g + 2 β g β g 2 τ pn ! b0 − |β2g | Y 1 (βi = 0) X ∝ exp a0 gi + g ⊤ Jn g 2πτ 2 i=1 2 i∈Acg pn ! b0 ⊤ − |β2g | Y 1 (βi = 0) X ∝ exp a0 gi + g Jn g 2πτ 2 i=1 2 i∈Agc     1   ⊤ 1 ⊤ × exp − β g − β̃ g X g Xg + 2 I β g − β̃ g 2 τ      1 ⊤ ⊤ 1 × exp − y y − β̃ g Xg Xg + 2 I β̃ g , 2 τ 71 −1 where β̃ g = Xg⊤ Xg + τ 1 2 I Xg⊤ y. To get the marginal posterior of γ = g, we integrate β g out first: π(β −g , γ = g | y) Z = π(β, γ = g | y)dβ g pn   ! 1 bn ⊤ 1 (βi = 0) X Y ∝ exp − 2 Rg exp a0 gi + g Jg 2σ i=1 2 i∈Agc Z |β |     g 1   ⊤ 1 2 2 − 2 ⊤  × 2πσ τ exp − 2 β g − β̃ g X g Xg + 2 I β g − β̃ g dβ g 2σ τ pn   ! 1 1 b 1 (βi = 0) , 1 X n Y ∝ |Xg⊤ Xg + 2 I|− 2 exp − 2 Rg exp a0 gi + g ⊤ Jg τ 2σ i=1 2 i∈Ac g  −1  where Rg = y ⊤ I − Xg Xg⊤ Xg + 1 τ2 I Xg⊤ y. Now, we integrate β −g out: π(γ = g | y) Z = π(β −g , γ = g | y)dβ −g pn   !Z 1 1 b 1 (βi = 0) dβ−g 1 X 0 Y ∝ |Xg⊤ Xg + 2 I|− 2 exp − 2 Rg exp a0 gi + g ⊤ Jg τ 2σ i=1 2 i∈Ag c pn   ! 1 1 1 X b0 = |Xg⊤ Xg + 2 I|− 2 exp − 2 Rg exp a0 gi + g ⊤ Jg . τ 2σ i=1 2 −1/2 Let Qg = Xg⊤ Xg + 1 τ2 I . Then, the posterior ratio is: P R(g, t) π(γ = g | y) = π(γ = t | y)   pn pn ! (3.8) Qg 1 1 X b0 ⊤ X b0 ⊤ = exp − 2 Rg + 2 Rt exp a0 gi + g Jg − a0 ti − t Jt , Qt 2σ 2σ i=1 2 i=1 2 |{z} | {z }| a {z } b c where t denotes the true model. Our first result is described in the following subsection. 72 3.4.1 True posterior consistency with true Ising prior We let λnmax denote the maximum eigenvalue of the Gram matrix X ⊤ X/n, and for ν > 0, as like Narisetty and He (2014), we define: ! n Xg⊤ Xg mn (ν) = pn ∧ and λnmin (ν) := inf ϕ# min , (3.9) (2 + ν) log pn |g|≤mn (ν) n where ϕ# min (A) denotes the minimum nonzero eigenvalue of a matrix A. To establish the selection consistency of the true posterior under the true Ising prior, we need following regularity conditions (Yang and Shen, 2017): Condition 1. (On dimension pn ). pn = endn for some dn → 0 as n → ∞, that is, log pn = o(n). Condition 2. (On prior parameters). nτ 2 ∼ (n ∨ p2+3δ n ), for some δ > 0, a0 ∼ −ndn , and 1 b0 ∼ kmax pn , where kmax is the maximum row sum of Jn , that is, pn X kmax = max Jn (i, j). i∈{1,...,pn } j=1 Condition 3. (On true model). y | X ∼ Nn (Xt β t + Xtc β tc , σ 2 I), where the size of the true model |t| is fixed. Besides, |t|/2 < rt < |t|, where rt is the rank of Xt . For any fixed K > 0, define ∆n (K) := inf ||(I − Pg )Xt β t ||22 , {g:|g| 1 + 8/δ such that ∆n (K) > γn := 5σ 2 |t|(1 + √ δ) log( n ∨ pn ). Condition 5. (Regularity of the design). For some ν < δ, κ < (K − 1)δ/2, n ∨ p2+2δ   −κ λnmax ∼ O(1) and λnmin ⪰ ∨ pn . nτ 2 73 Theorem 2. Under the Conditions 1 - 5, the posterior probability of the true model goes to 1, that is, X P R(g, t) → 0, g̸=t as n goes to infinity. Narisetty and He (2014) separated the model space into four disjoint parts: M1 = {g : rg > mn }, M2 = {g : g ⊃ t, rg ≤ mn }, M3 = {g : g ̸⊃ t, K|t| < rg ≤ mn }, M4 = {g : g ̸⊃ t, rg ≤ K|t|}. To prove the Theorem 2, we show P R(g, t) → 0 in each subspace. P g Lemma 16. For a universal constant c′ > 0, Qg −(rg∗ −|t|)/2 n −|t|/2 ≤ c′ nτ 2 λnmin (λmin ) Qt Proof. 
Lemma 16. For a universal constant $c' > 0$,
\[
\frac{Q_g}{Q_t} \le c' \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t|)/2}\big(\lambda^n_{\min}\big)^{-|t|/2},
\]
where $r_g^* := r_g \wedge m_n$.

Proof. From Lemma 11.1 in Narisetty and He (2014), we have
\[
Q_g = \Big|I + \frac{1}{\tau^2}X_g^\top X_g\Big|^{-1/2} = \big|I + \tau^2 X_g X_g^\top\big|^{-1/2}.
\]
Note,
\[
a = \frac{Q_g}{Q_t} = \frac{Q_g}{Q_{g\wedge t}}\cdot\frac{Q_{g\wedge t}}{Q_t}.
\]
For the first factor,
\[
\begin{aligned}
\frac{Q_g}{Q_{g\wedge t}}
&= \big|I + \tau^2 X_g X_g^\top\big|^{-1/2}\big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big|^{1/2}\\
&\le \big(1 + \tau^2\lambda^n_{g,\min}\big)^{-r_g/2}\big(1 + \tau^2\lambda^n_{g\wedge t,\max}\big)^{r_{t\wedge g}/2}\\
&\le \big(\tau^2\lambda^n_{g,\min}\big)^{-r_g/2}\big(1 + n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g}/2}\\
&\le \big(n\tau^2\lambda^n_{\min}\big)^{-r_g/2}\big(1 + n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g}/2}\\
&\simeq \big(n\tau^2\lambda^n_{\min}\big)^{-r_g/2}\big(n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g}/2} \quad\text{for sufficiently large } n\\
&= \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g - r_{t\wedge g})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g}/2}\\
&\le \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_{t\wedge g})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g}/2}\\
&\le \big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_t)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g}/2}\\
&\le C\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_t)/2}\big(\lambda^n_{\min}\big)^{-r_{t\wedge g}/2}.
\end{aligned}
\]
For the second factor,
\[
\begin{aligned}
\frac{Q_{g\wedge t}}{Q_t}
&= \big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big|^{-1/2}\big|I + \tau^2 X_t X_t^\top\big|^{1/2}\\
&= \big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top + \tau^2 X_{g^c\wedge t}X_{g^c\wedge t}^\top\big|^{1/2}\big|I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big|^{-1/2}\\
&= \big|\big(I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big)^{-1}\big(I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top + \tau^2 X_{g^c\wedge t}X_{g^c\wedge t}^\top\big)\big|^{1/2}\\
&= \big|I + \tau^2 X_{g^c\wedge t}^\top\big(I + \tau^2 X_{g\wedge t}X_{g\wedge t}^\top\big)^{-1}X_{g^c\wedge t}\big|^{1/2}\\
&\le \big|I + \tau^2 X_{g^c\wedge t}^\top X_{g^c\wedge t}\big|^{1/2}
= \big|I + \tau^2 X_{g^c\wedge t}X_{g^c\wedge t}^\top\big|^{1/2}\\
&\le \big(1 + \tau^2\lambda^n_{g^c\wedge t,\max}\big)^{r_{t\wedge g^c}/2}
\le \big(1 + n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g^c}/2}\\
&\simeq \big(n\tau^2\lambda^n_{\max}\big)^{r_{t\wedge g^c}/2}\quad\text{for sufficiently large } n\\
&= \big(n\tau^2\lambda^n_{\min}\big)^{r_{t\wedge g^c}/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{r_{t\wedge g^c}/2}
\le C\big(n\tau^2\lambda^n_{\min}\big)^{r_{t\wedge g^c}/2}\big(\lambda^n_{\min}\big)^{-r_{t\wedge g^c}/2},
\end{aligned}
\]
using $\lambda^n_{\max} \sim O(1)$. From the above two inequalities,
\[
\begin{aligned}
a = \frac{Q_g}{Q_t}
&\le c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - r_{t\wedge g} - r_{t\wedge g^c})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{(r_{t\wedge g} + r_{t\wedge g^c})/2}\\
&\le c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t\wedge g| - |t\wedge g^c|)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{(|t\wedge g| + |t\wedge g^c|)/2}\\
&= c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t|)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}
\le c'\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^* - |t|)/2}\big(\lambda^n_{\min}\big)^{-|t|/2}.
\end{aligned}
\]

Using the same argument as in Lemma A.1 of Narisetty and He (2014), we can show that
\[
R_g = y^\top\Big(I - X_g\Big(X_g^\top X_g + \frac{1}{\tau^2}I\Big)^{-1}X_g^\top\Big)y
= y^\top\big(I + \tau^2 X_g X_g^\top\big)^{-1}y \ge 0, \quad \forall g.
\tag{3.10}
\]

1. Models in $M_1$. From (3.10) and the supplement to Narisetty and He (2014),
\[
P\big(R_t - R_g > n(1+2s)\sigma^2\big) \le P\big(R_t > n(1+2s)\sigma^2\big) \le 2e^{-c'n}, \quad\text{uniformly for all } g.
\tag{3.11}
\]
Observe that on $M_1$, $r_g^* = m_n \ge n/\log(p_n^{2+\nu}) \ge n/\log(p_n^{2+\delta})$. Therefore, on the high-probability event $\{R_t - R_g \le n(1+2s)\sigma^2\}$, by (3.11) and the regularity conditions,
\[
\begin{aligned}
\sum_{g\in M_1} PR(g,t)
&\preceq \sum_{g\in M_1}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_g^*-|t|)/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t)\\
&\preceq \sum_{g\in M_1} p_n^{-(1+\delta)(m_n - |t|)}\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{by Conditions 2 and 5}\\
&= \sum_{g\in M_1}\exp\big(-nd_n(1+\delta)(m_n - |t|)\big)\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{by Condition 1}\\
&\preceq \sum_{g\in M_1}\exp\Big(-\frac{n(1+\delta)}{2+\delta}\Big)\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{because } m_n \ge n/\log(p_n^{2+\delta})\\
&\preceq \sum_{g\in M_1}\exp\Big(-\frac{n(1+\delta)}{2+\delta}\Big)p_n^{\kappa|t|/2}\exp\Big(\frac{n(1+2s)}{2}\Big)\, c(g,t), &&\text{by Condition 5}\\
&\preceq \exp\Big(-\frac{n(1+\delta)}{2+\delta} + \frac{n(1+2s)}{2}\Big)p_n^{\kappa|t|/2}\exp\Big(-a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)\sum_{g\in M_1}\exp\Big(a_0\sum_{i=1}^{p_n}g_i + \frac{b_0}{2}g^\top J_n g\Big).
\end{aligned}
\]
Since the conditions on $a_0$ and $b_0$ bound the summation term, we have
\[
\sum_{g\in M_1} PR(g,t)
\preceq \exp\Big(\frac{nd_n\kappa|t|}{2}\Big)\exp\Big(-\frac{n(1+\delta)}{2+\delta} + \frac{n(1+2s)}{2}\Big)
\preceq \exp\Big(n\Big(\frac{1+2s}{2} - \frac{1+\delta}{2+\delta}\Big)\Big) \to 0,
\]
by Condition 1, when $(1+2s)/2 - (1+\delta)/(2+\delta) < 0$, i.e., $s < \delta/(2(2+\delta))$.

2. Models in $M_2$. Consider $0 < s \le \delta/8$, and define the events
\[
A(g) := \{R_t - R_g > 2\sigma^2(1+2s)(r_g - r_t)\log p_n\}, \qquad U(d) := \cup_{\{g:\, r_g = d\}} A(g).
\]
Recall that $P_g = X_g(X_g^\top X_g)^{-1}X_g^\top$. Since $R_g \ge R_g^* := y^\top(I - P_g)y$, we have
\[
\begin{aligned}
P[U(d)]
&= P\big[\cup_{\{g:r_g=d\}}\{R_t - R_g > 2\sigma^2(1+2s)(r_g - r_t)\log p_n\}\big]\\
&\le P\big[\cup_{\{g:r_g=d\}}\{R_t - R_g^* > 2\sigma^2(1+2s)(r_g - r_t)\log p_n\}\big]\\
&\le P\big[\cup_{\{g:r_g=d\}}\{R_t^* - R_g^* > \sigma^2(2+3s)(r_g - r_t)\log p_n\}\big]
 + P\big[R_t - R_t^* > s\sigma^2(d - r_t)\log p_n\big]\\
&\le c'p_n^{-(1+s)(d-r_t)}p_n^{(d-r_t)} + \exp(-c'n\log p_n)
\le 2c'p_n^{-s(d-r_t)}.
\end{aligned}
\]
Next, we consider the union of all such events $U(d)$, that is,
\[
P\big[\cup_{\{d>r_t\}}U(d)\big] \le \frac{2c'}{p_n^s - 1} \to 0.
\]
Observe that on $M_2$ we have $r_g^* = r_g$, $|t\wedge g| = |t|$, and $|t\wedge g^c| = 0$. On the high-probability event $\cap_{\{d>r_t\}}U(d)^c$,
\[
\begin{aligned}
\sum_{k\in M_2} PR(k,t)
&\preceq \sum_{k\in M_2}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_k^*-r_{t\wedge k})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(-\frac{1}{2\sigma^2}(R_k - R_t)\Big)\, c(k,t)\\
&\preceq \sum_{k\in M_2}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_k - r_t)/2}p_n^{(1+2s)(r_k - r_t)}\big(\lambda^n_{\min}\big)^{-|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_2}\big(p_n^{2+2\delta}\vee n\big)^{-(r_k - r_t)/2}p_n^{(1+2s)(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_2}\big(p_n^{-1-\delta}\wedge n^{-1/2}\big)^{(r_k - r_t)}p_n^{(1+2s)(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_2}\Big(p_n^{-\delta+2s}\wedge\frac{p_n^{1+2s}}{\sqrt{n}}\Big)^{(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \Big(p_n^{-3\delta/4}\wedge\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)p_n^{\kappa|t|/2}\exp\Big(-a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)\sum_{k\in M_2}\exp\Big(a_0\sum_{i=1}^{p_n}k_i + \frac{b_0}{2}k^\top J_n k\Big)\\
&\preceq p_n^{\kappa|t|/2 - 3\delta/4}\wedge\frac{p_n^{\kappa|t|/2 + 1 + \delta/4}}{\sqrt{n}} \to 0, \quad\text{for some } \delta.
\end{aligned}
\]

3. Models in $M_3$. We define
\[
B(k) := \{R_t - R_k > 2\sigma^2(1+2s)(r_k - r_t)\log p_n\}.
\]
Note that
\[
R_k = y^\top\big(I + \tau^2 X_k X_k^\top\big)^{-1}y, \qquad
R_{k\vee t} = y^\top\big(I + \tau^2 X_k X_k^\top + \tau^2 X_{k^c\wedge t}X_{k^c\wedge t}^\top\big)^{-1}y.
\]
Let $A = I + \tau^2 X_k X_k^\top$. Then
\[
\big(I + \tau^2 X_k X_k^\top + \tau^2 X_{k^c\wedge t}X_{k^c\wedge t}^\top\big)^{-1}
= \big(A + \tau^2 X_{k^c\wedge t}X_{k^c\wedge t}^\top\big)^{-1}
= A^{-1} - \tau^2 A^{-1}X_{k^c\wedge t}\big(I + \tau^2 X_{k^c\wedge t}^\top A^{-1}X_{k^c\wedge t}\big)^{-1}X_{k^c\wedge t}^\top A^{-1}.
\]
Therefore, $R_{k\vee t} \le R_k$. Let $V(d) := \cup_{\{k:\, r_k = d,\, k\in M_3\}} B(k)$. From Narisetty and He (2014),
\[
P[V(d)] \le P\big[\cup_{\{k:r_k=d,\,k\in M_3\}}\{R_t - R_{k\vee t} > 2\sigma^2(1+2s)(r_k - r_t)\log p_n\}\big] \le c'p_n^{-w'd}.
\]
Then
\[
P\big[\cup_{\{d>K|t|\}}V(d)\big] \le p_n^{-w'K|t|} \to 0, \quad\text{as } n \to \infty.
\]
Restricting our attention to the high-probability event $\cap_{\{d>r_t\}}V(d)^c$,
\[
\begin{aligned}
\sum_{k\in M_3} PR(k,t)
&\preceq \sum_{k\in M_3}\big(n\tau^2\lambda^n_{\min}\big)^{-(r_k - r_{t\wedge k})/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(-\frac{1}{2\sigma^2}(R_k - R_t)\Big)\, c(k,t)\\
&\preceq \sum_{k\in M_3}\big(p_n^{1+\delta}\vee\sqrt{n}\big)^{-(r_k - r_t)}\big(\lambda^n_{\min}\big)^{-|t|/2}p_n^{(1+2s)(r_k - r_t)}\, c(k,t)\\
&\preceq \sum_{k\in M_3}\big(p_n^{\delta-2s}\vee\sqrt{n}\,p_n^{-1-2s}\big)^{-(r_k - r_t)}\big(\lambda^n_{\min}\big)^{-|t|/2}\, c(k,t)\\
&\preceq \sum_{k\in M_3}\Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(r_k - r_t)}p_n^{\kappa|t|/2}\, c(k,t)\\
&\preceq \Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(K-1)r_t + 1}p_n^{\kappa|t|/2}\exp\Big(-a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)\sum_{k\in M_3}\exp\Big(a_0\sum_{i=1}^{p_n}k_i + \frac{b_0}{2}k^\top J_n k\Big)\\
&\preceq \Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(K-1)|t|/2}p_n^{\delta(K-1)|t|/4}\exp(-a_0|t|)\\
&\sim \Big(p_n^{-3\delta/4}\vee\frac{p_n^{1+\delta/4}}{\sqrt{n}}\Big)^{(K-1)|t|/2}\exp(nd_n|t|)\\
&\sim \exp\Big(-nd_n\frac{3\delta(K-1)|t|}{8}\Big)\exp(nd_n|t|)
= \exp\Big(-nd_n|t|\Big(\frac{3\delta(K-1)}{8} - 1\Big)\Big) \to 0.
\end{aligned}
\]

4. Models in $M_4$. If $c \in (0,1)$,
\[
P\big[\cup_{\{k\in M_4\}}\{R_k - R_t < \Delta_n(1-c)\}\big]
\le P\big[\cup_{\{k\in M_4\}}\{R_k - R_{k\vee t} < \Delta_n(1-c)\}\big]
\le 2\exp(-c'\Delta_n) \to 0.
\]
Observe that
\[
\exp\Big(a_0\sum_{i=1}^{p_n}k_i + \frac{b_0}{2}k^\top J_n k - a_0\sum_{i=1}^{p_n}t_i - \frac{b_0}{2}t^\top J_n t\Big)
= \exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^* - a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)
\exp\big(-b_0 k^\top J_n \mathbf{1} + b_0 t^\top J_n \mathbf{1}\big),
\]
where $k^* = k + \mathbf{1}$ and $t^* = t + \mathbf{1}$. By restricting to the high-probability event $C_n := \{R_k - R_t \ge \Delta_n(1-c),\ \forall k\in M_4\}$,
\[
\begin{aligned}
\sum_{k\in M_4} PR(k,t)
&\preceq \sum_{k\in M_4}\big(n\tau^2\lambda^n_{\min}\big)^{|t|/2}\Big(\frac{\lambda^n_{\max}}{\lambda^n_{\min}}\Big)^{|t|/2}\exp\Big(-\frac{1}{2\sigma^2}(R_k - R_t)\Big)\\
&\qquad\times
\exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^* - a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)
\exp\big(-b_0 k^\top J_n \mathbf{1} + b_0 t^\top J_n \mathbf{1}\big)\\
&\preceq \sum_{k\in M_4}\big(n\tau^2\lambda^n_{\min}\big)^{|t|/2}\big(\lambda^n_{\min}\big)^{-|t|/2}\exp\big(-\Delta_n(1-c)/2\sigma^2\big)
\exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^* - a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)\\
&\preceq \big(p_n^{2+3\delta}\vee n\big)^{|t|/2}p_n^{\delta|t|/2}\exp\big(-\Delta_n(1-c)/2\sigma^2\big)
\exp\Big(-a_0\sum_{i=1}^{p_n}t_i^* - \frac{b_0}{2}(t^*)^\top J_n t^*\Big)\sum_{k\in M_4}\exp\Big(a_0\sum_{i=1}^{p_n}k_i^* + \frac{b_0}{2}(k^*)^\top J_n k^*\Big)\\
&\sim \exp\Big(-\frac{1}{2\sigma^2}\Big(\Delta_n(1-c) - \sigma^2|t|\log\big(p_n^{2+3\delta}\vee n\big) - \sigma^2|t|\delta\log p_n + 2\sigma^2 a_0\sum_{i=1}^{p_n}t_i^*\Big)\Big)\\
&\sim \exp\Big(-\frac{1}{2\sigma^2}\Big(\Delta_n(1-c) - \sigma^2|t|\log\big(p_n^{2+3\delta}\vee n\big) - \sigma^2|t|(4+\delta)\log p_n\Big)\Big)\\
&\preceq \exp\Big(-\frac{1}{2\sigma^2}\big(\Delta_n(1-c) - \sigma^2|t|(6+4\delta)\log p_n\big)\Big)\\
&\preceq \exp\Big(-\frac{1}{2\sigma^2}\big(\Delta_n(1-c) - \sigma^2|t|(4+4\delta)\log p_n\big)\Big)\\
&\sim \exp\Big(-\frac{1}{2\sigma^2}\big(\Delta_n(1-c) - w'\gamma_n\big)\Big) \to 0,
\end{aligned}
\]
where $w' \in (0,1)$ and $c < 1 - w'$. Combining the four cases completes the proof of Theorem 2.

3.4.2 True posterior consistency with pseudo-Ising prior

In our variable selection algorithm, we used a pseudo-Ising prior on $\gamma$ instead of the true Ising prior. Replacing the true Ising prior with the pseudo-Ising prior we used, the posterior ratio becomes
\[
\widetilde{PR}(g,t) = \frac{\tilde\pi(\gamma = g\mid X, y, \sigma^2)}{\tilde\pi(\gamma = t\mid X, y, \sigma^2)}
= \underbrace{\frac{Q_g}{Q_t}}_{a}\;\underbrace{\exp\Big(-\frac{R_g}{2\sigma^2} + \frac{R_t}{2\sigma^2}\Big)}_{b}
\tag{3.12}
\]
\[
\times\exp\Big(a_0\mathbf{1}^\top g + b_0\, g^\top J_n g - \mathbf{1}^\top\log\cosh(b_0 J_n g + a_0\mathbf{1})
- a_0\mathbf{1}^\top t - b_0\, t^\top J_n t + \mathbf{1}^\top\log\cosh(b_0 J_n t + a_0\mathbf{1})\Big).
\tag{3.13, 3.14}
\]
Theorem 3 indicates that the posterior ratio above goes to zero for all $g \ne t$:

Theorem 3. Under Conditions 1–5, the posterior probability of the true model under the pseudo-Ising prior goes to 1, that is,
\[
\sum_{g\ne t}\widetilde{PR}(g,t) \to 0,
\]
as $n$ goes to infinity.

Note that the terms $a$ and $b$ in (3.12) are the same as in (3.8). Therefore, we only need to be able to control the last exponential term. Observe that
\[
\mathbf{1}^\top\log\cosh(b_0 J_n t + a_0\mathbf{1}) - \mathbf{1}^\top\log\cosh(b_0 J_n g + a_0\mathbf{1}) \to 0.
\]
The convergence above is due to Condition 2, and it allows us to use the same arguments as in Theorem 2 to prove Theorem 3.

3.4.3 Bounded KL divergence

In this subsection, we provide an upper bound on the KL divergence between $\tilde q(\beta,\gamma)$ and $\tilde\pi(\beta,\gamma\mid y)$ with appropriate choices of the variational parameters, where $\tilde q(\beta,\gamma)$ is a variational distribution with a pseudo-Ising distribution on $\gamma$ and $\tilde\pi(\beta,\gamma\mid y)$ is the true posterior with the pseudo-Ising prior on $\gamma$:
\[
\mathrm{KL}\big(\tilde q(\beta,\gamma)\,\|\,\tilde\pi(\beta,\gamma\mid y)\big)
= \mathrm{KL}\big(\tilde q(\gamma)\,\|\,\tilde p(\gamma)\big)
+ E_{\tilde q(\beta,\gamma)}\big[\log q(\beta\mid\gamma) - \log p(\beta\mid\gamma)\big]
- E_{\tilde q(\beta,\gamma)}\Big[\log\frac{p(y\mid\beta)}{p(y\mid\beta_t)}\Big] + C.
\tag{3.15}
\]

Theorem 4. There exists a variational distribution in the variational family (3.3) which satisfies
\[
\mathrm{KL}\big(\tilde q(\beta,\gamma)\,\|\,\tilde\pi(\beta,\gamma\mid y)\big) = o(n).
\]

To prove Theorem 4, observe the first term in (3.15):
\[
\begin{aligned}
\mathrm{KL}\big(\tilde q(\gamma)\,\|\,\tilde p(\gamma)\big)
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\big(\log\tilde q(\gamma) - \log\tilde p(\gamma)\big)\\
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i=1}^{p_n}(a_i - a_0)\gamma_i\Big)
+ \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i=1}^{p_n}\big(\log\cosh(b_0 m_i(\gamma) + a_0) - \log\cosh(b\, m_i(\gamma) + a_i)\big)\Big)\\
&\qquad + \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\big(b\,\gamma^\top J_n\gamma - b_0\,\gamma^\top J_n\gamma\big).
\end{aligned}
\tag{3.16}
\]
We consider the variational parameters below:
\[
b = \frac{1}{p_n}
\quad\text{and}\quad
a_i = \begin{cases} nd_n & \text{if } i\in A,\\ -nd_n & \text{if } i\in A^c.\end{cases}
\tag{3.17}
\]

Lemma 17. For any $g\ne t$, with the choices of the variational parameters in (3.17), $\tilde q(\gamma = g)\to 0$.

Proof. Since $b = \frac{1}{p_n}$, $b\, g^\top J_n g$ is bounded and $b\, m_i(g)$ decreases to zero at the rate $\frac{1}{p_n}$. Therefore,
\[
\begin{aligned}
\tilde q(\gamma = g)
&= 2^{-p_n}\exp\Big(b\, g^\top J_n g + \sum_{i=1}^{p_n}\big(a_i g_i - \log\cosh(b\, m_i(g) + a_i)\big)\Big)\\
&= 2^{-p_n}C\exp\Big(\sum_{i=1}^{p_n}\big(a_i g_i - \log\cosh(b\, m_i(g) + a_i)\big)\Big)\\
&\simeq 2^{-p_n}C\exp\Big(\sum_{i=1}^{p_n}\big(a_i g_i - \log\cosh(a_i)\big)\Big)\\
&= 2^{-p_n}C\exp\Big(\sum_{i\in A}a_i g_i + \sum_{i\in A^c}a_i g_i - \sum_{i=1}^{p_n}\log\cosh(nd_n)\Big)\\
&= 2^{-p_n}C\exp\Big(nd_n\sum_{i\in A}g_i - nd_n\sum_{i\in A^c}g_i - p_n\log\cosh(nd_n)\Big).
\end{aligned}
\tag{3.18}
\]
Observe that $nd_n\sum_{i\in A}g_i - nd_n\sum_{i\in A^c}g_i = nd_n(|B| - |B^c|)$, where $B = \{i : g_i = t_i\}$.
When $nd_n(|B| - |B^c|) < 0$, it is easy to show that (3.18) $\to 0$. Provided $g \ne t$, the upper bound of $|B| - |B^c|$ is $p_n - 2$, so that
\[
\begin{aligned}
2^{-p_n}C\exp\Big(nd_n\sum_{i\in A}g_i - nd_n\sum_{i\in A^c}g_i - p_n\log\cosh(nd_n)\Big)
&\le 2^{-p_n}C\exp\big(nd_n(p_n - 2) - p_n\log\cosh(nd_n)\big)\\
&= 2^{-p_n}C\exp\big(p_n(nd_n - \log\cosh(nd_n)) - 2nd_n\big)\\
&= C\Big(\frac{e^{nd_n - \log\cosh(nd_n)}}{2}\Big)^{p_n}e^{-2nd_n}
\simeq C\Big(\frac{e^{\log 2}}{2}\Big)^{p_n}e^{-2nd_n} \to 0.
\end{aligned}
\]
To derive the convergence above, we use the fact that $u - \log\cosh(u) \simeq \log 2$ for large $u$.

Using Lemma 17,
\[
\begin{aligned}
\sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i=1}^{p_n}(a_i - a_0)\gamma_i\Big)
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i\in A}(a_i - a_0)\gamma_i + \sum_{i\in A^c}(a_i - a_0)\gamma_i\Big)\\
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\Big(\sum_{i\in A}(2nd_n)\gamma_i + \sum_{i\in A^c}0\cdot\gamma_i\Big)\\
&\le (2nd_n|A|)\Big(\tilde q(t) + \sum_{\gamma\ne t}\tilde q(\gamma)\Big) = o(n).
\end{aligned}
\]
For the second term in (3.16), note that
\[
\sum_{i=1}^{p_n}\big(\log\cosh(b_0 m_i(\gamma) + a_0) - \log\cosh(b\, m_i(\gamma) + a_i)\big)
\sim \sum_{i=1}^{p_n}\log\frac{\exp\big(\frac{1}{p_n} + a_0\big) + \exp\big(\frac{1}{p_n} - a_0\big)}{\exp\big(\frac{1}{p_n} + a_i\big) + \exp\big(\frac{1}{p_n} - a_i\big)}.
\]
Combining the results above, we have
\[
\mathrm{KL}\big(\tilde q(\gamma)\,\|\,\tilde p(\gamma)\big) = o(n).
\]
Next, to control the other terms, we construct the variational parameters as follows:
\[
\sigma_i^2 = \begin{cases} 0 & \text{if } g_i = -1,\\ 1/p_n & \text{if } g_i = 1,\end{cases}
\tag{3.19}
\]
and
\[
\mu_i = \begin{cases} 0 & \text{if } g_i = -1,\\ \beta_{t,i} & \text{if } g_i = 1 \text{ and } t_i = 1,\\ 1/(n^{1/2}p_n) & \text{if } g_i = 1 \text{ and } t_i = -1.\end{cases}
\tag{3.20}
\]
Observe the second term in (3.15):
\[
\begin{aligned}
E_{\tilde q(\beta,\gamma)}\big[\log q(\beta\mid\gamma) - \log p(\beta\mid\gamma)\big]
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma)\,E_{q(\beta\mid\gamma)}\big[\log q(\beta\mid\gamma) - \log p(\beta\mid\gamma)\big]\\
&= \frac{1}{2}\sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)\big(\mathrm{tr}(\Sigma_g) - |g| + \mu_g^\top\mu_g - \log|\Sigma_g|\big),
\end{aligned}
\tag{3.21}
\]
where $\mu_g$ is the vector of the $\mu_i$ in (3.20), $\sigma_g$ is the vector of the $\sigma_i$ in (3.19), and $\Sigma_g = \mathrm{diag}(\sigma_g^2)$. With the variational parameters in (3.19) and (3.20), we can bound (3.21) by $o(n)$ using Lemma 17.

For the third term in (3.15), let $L_0 \sim N(X\beta_t, \sigma^2 I)$ and $L_\beta \sim N(X\beta, \sigma^2 I)$. Then
\[
\mathrm{KL}(L_0\,\|\,L_\beta) = (\beta - \beta_t)^\top X^\top X(\beta - \beta_t),
\]
and
\[
\begin{aligned}
E_{\tilde q(\beta,\gamma)}\big[\mathrm{KL}(L_0\,\|\,L_\beta)\big]
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\int(\beta - \beta_t)^\top X^\top X(\beta - \beta_t)\,\tilde q(\beta,\gamma)\,d\beta
= \sum_{\gamma\in\{-1,1\}^{p_n}}\int\|X(\beta - \beta_t)\|_2^2\,\tilde q(\beta,\gamma)\,d\beta\\
&= \sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)\,E_{\tilde q(\beta\mid\gamma = g)}\big[(\beta - \beta_t)^\top X^\top X(\beta - \beta_t)\big]\\
&= \sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)\Big[(\mu_g - \beta_t)^\top X^\top X(\mu_g - \beta_t) + \mathrm{tr}\big(X^\top X\Sigma_g\big)\Big].
\end{aligned}
\]
Note,
\[
\sum_{g\in\{-1,1\}^{p_n}}\tilde q(\gamma = g)(\mu_g - \beta_t)^\top X^\top X(\mu_g - \beta_t)
= \tilde q(\gamma = t)(\mu_t - \beta_t)^\top X^\top X(\mu_t - \beta_t)
+ \sum_{g\ne t}\tilde q(\gamma = g)(\mu_g - \beta_t)^\top X^\top X(\mu_g - \beta_t).
\]
Let $w := \mu_g - \beta_t$. Then
\[
X(\mu_g - \beta_t) = Xw = \sum_{i=1}^{p_n}w_i x_i,
\]
where $x_i$ is the $i$-th column of $X$, and
\[
\sum_{i=1}^{p_n}w_i x_i
= \sum_{\{i:\, g_i = -1\}}w_i x_i + \sum_{\{i:\, g_i = 1,\ t_i = 1\}}w_i x_i + \sum_{\{i:\, g_i = 1,\ t_i = -1\}}w_i x_i
= C + \sum_{\{i:\, g_i = 1,\ t_i = -1\}}w_i x_i.
\]
The order of the above summation is $o(n)$ due to (3.20).

3.4.4 Variational posterior consistency

In this subsection, we investigate a couple of ingredients for establishing variational posterior consistency, which is our ultimate goal.

Theorem 5. Let $q^*$ be the variational posterior obtained under the prior (3.2) and the variational family (3.3). Then, for any $g\ne t$, we have $q^*(\gamma = g)\to 0$ as $n$ goes to infinity.

First, we consider
\[
Z^{\tilde q} := Z_n^{\tilde q}(a,b) = \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma),
\qquad
Z^{\tilde p} := Z_n^{\tilde p}(a_0,b_0) = \sum_{\gamma\in\{-1,1\}^{p_n}}\tilde p(\gamma).
\]
$Z^{\tilde q}$ and $Z^{\tilde p}$ are the normalizing constants of $\tilde q(\gamma)$ and $\tilde p(\gamma)$, respectively. We define two valid probability mass functions $q_Z(\gamma) = \frac{1}{Z_n^{\tilde q}}\tilde q(\gamma)$ and $p_Z(\gamma) = \frac{1}{Z_n^{\tilde p}}\tilde p(\gamma)$. Then the posterior distribution based on $p_Z(\gamma)$ is
\[
\pi_Z(\gamma, \beta\mid y)
= \frac{(Z^{\tilde p})^{-1}\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)}{\sum_{\gamma}\int_\beta(Z^{\tilde p})^{-1}\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)\,d\beta}
= \frac{\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)}{\sum_{\gamma}\int_\beta\tilde p(\gamma)p(\beta\mid\gamma)L(\beta)\,d\beta}
= \tilde\pi(\gamma, \beta\mid y).
\]
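Since $Z^{\tilde q}$ and $Z^{\tilde p}$ are finite sums over $\{-1,1\}^{p_n}$, they can be computed exactly for small $p_n$, which is how values of the kind reported later in Table 3.2 can be checked. The sketch below (Python with NumPy; an illustration, not the code used for the dissertation) enumerates $\tilde q(\gamma)$ in the pseudo-likelihood form displayed in the proof of Lemma 17. The choice $m_i(\gamma) = (J_n\gamma)_i$ (the neighbour sum) and the particular 2×2-lattice coupling matrix J used in the example are assumptions made here; the exact $J_n$ behind Table 3.2 is not restated in this subsection, so the printed value need not match the table.

```python
import itertools
import numpy as np

def log_cosh(x):
    # log cosh(x), computed stably as logaddexp(x, -x) - log 2
    return np.logaddexp(x, -x) - np.log(2.0)

def pseudo_ising_log_mass(gamma, a, b, J):
    """log q~(gamma) in the form displayed in the proof of Lemma 17:
    q~(gamma) = 2^{-p} exp( b gamma^T J gamma
                            + sum_i [ a_i gamma_i - log cosh(b m_i(gamma) + a_i) ] ),
    with the neighbour sum m_i(gamma) = (J gamma)_i (an assumption made here)."""
    p = len(gamma)
    m = J @ gamma
    return (-p * np.log(2.0) + b * (gamma @ J @ gamma)
            + np.sum(a * gamma - log_cosh(b * m + a)))

def normalizing_constant(a, b, J):
    """Z^{q~}(a, b) = sum of q~(gamma) over gamma in {-1, +1}^p, by exact enumeration."""
    p = J.shape[0]
    return sum(np.exp(pseudo_ising_log_mass(np.array(conf), a, b, J))
               for conf in itertools.product([-1.0, 1.0], repeat=p))

# Example: p_n = 4 sites on a 2 x 2 lattice with nearest-neighbour couplings.
J = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
print(normalizing_constant(a=0.2 * np.ones(4), b=0.01, J=J))  # expected at or slightly above 1
```

For weak couplings b, the enumerated constant sits close to 1 from above, in line with Proposition 1 below.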
From the relation $\pi_Z(\gamma,\beta\mid y) = \tilde\pi(\gamma,\beta\mid y)$ above, we have
\[
\begin{aligned}
\mathrm{KL}\big(q_Z(\gamma,\beta)\,\|\,\pi_Z(\gamma,\beta\mid y)\big)
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\int_\beta q_Z(\gamma)q(\beta\mid\gamma)\big(\log q_Z(\gamma) + \log q(\beta\mid\gamma) - \log\pi_Z(\gamma,\beta\mid y)\big)\,d\beta\\
&= \sum_{\gamma\in\{-1,1\}^{p_n}}\frac{\tilde q(\gamma)}{Z^{\tilde q}}\int_\beta q(\beta\mid\gamma)\big(\log\tilde q(\gamma) - \log Z^{\tilde q} + \log q(\beta\mid\gamma) - \log\tilde\pi(\gamma,\beta\mid y)\big)\,d\beta\\
&= \frac{\mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big)}{Z^{\tilde q}} - \log Z^{\tilde q}.
\end{aligned}
\]
Next, using Corollary 4.15 in Boucheron et al. (2013), we have
\[
\mathrm{KL}\big(q_Z(\gamma,\beta)\,\|\,\pi_Z(\gamma,\beta\mid y)\big)
\ge \int f\,dQ_Z - \log\int e^f\,d\Pi_Z(\cdot\mid y)
= \frac{1}{Z^{\tilde q}}\int f\,d\tilde Q - \log\int e^f\,d\tilde\Pi(\cdot\mid y),
\]
where $f$ is any function. Reorganizing the relation above, we have
\[
\int f\,d\tilde Q
\le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big)
- Z^{\tilde q}\log Z^{\tilde q}
+ Z^{\tilde q}\log\int e^f\,d\tilde\Pi(\cdot\mid y).
\tag{3.22}
\]
To use the relation (3.22), we need to show that $Z^{\tilde q}$ is bounded below by one.

Proposition 1. Let $\tilde q(\gamma)$ be the pseudo-likelihood of an Ising model characterized by parameters $a = (a_1,\dots,a_{p_n})$ and $b$. Then
\[
\sum_{\gamma\in\{-1,1\}^{p_n}}\tilde q(\gamma) \ge 1,
\]
for any $a$ and $b$.

We provide numerical evidence for Proposition 1 (see Table 3.2).

Parameters        a = 0.2 · 1, b = 0.01    a = −0.2 · 1, b = 0.01
pn = 4                1.0004                   1.0004
pn = 9                1.0011                   1.0011
pn = 16               1.0022                   1.0022

                  a = 1, b = 0.1           a = −1, b = 0.1
pn = 4                1.006                    1.006
pn = 9                1.014                    1.014
pn = 16               1.025                    1.025

                  a = 0.2 · 1, b = 0.5     a = −0.2 · 1, b = 0.5
pn = 4                1.616                    1.616
pn = 9                2.879                    2.879
pn = 16               6.941                    6.941

                  a = 0.5 · 1, b = 0.5     a = −0.3 · 1, b = 0.5
pn = 4                1.417                    1.562
pn = 9                2.048                    2.635
pn = 16               3.603                    5.879

Table 3.2: Exact normalizing constants $Z^{\tilde q}(a,b)$ with varying $a$ and $b$.

With $f = 1$ in (3.22), we get
\[
Z^{\tilde q}\log Z^{\tilde q} \le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big).
\tag{3.23}
\]
With $f = n\mathbf{1}(\gamma\ne t)$ in (3.22) and (3.23), we complete the proof of Theorem 5 under Proposition 1:
\[
\begin{aligned}
\int n\mathbf{1}(\gamma\ne t)\,d\tilde Q
&\le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big) - Z^{\tilde q}\log Z^{\tilde q} + Z^{\tilde q}\log\int e^{n\mathbf{1}(\gamma\ne t)}\,d\tilde\Pi(\cdot\mid y)\\
&\le \mathrm{KL}\big(\tilde q(\gamma,\beta)\,\|\,\tilde\pi(\gamma,\beta\mid y)\big) + Z^{\tilde q}\log\int e^{n\mathbf{1}(\gamma\ne t)}\,d\tilde\Pi(\cdot\mid y)\\
&\le o(n) + Z^{\tilde q}\log\big(1 + e^n\,\tilde\pi(\gamma\ne t\mid y)\big),
\end{aligned}
\]
which implies $\tilde Q(\gamma\ne t)\to 0$.

CHAPTER 4
DISCUSSION AND FUTURE RESEARCH

4.1 Conclusion

Modeling discrete data is a basic problem in statistics and machine learning. Discrete data are rarely independent, and a fundamental modeling task is to model the dependencies among variables. A graphical model is a flexible tool for modeling such dependent discrete data. In this dissertation, we focused on a well-known graphical model, the Ising model. We provided a procedure for Ising model parameter estimation using variational Bayes methods. In order to tackle the issue of the intractable normalizing constant, we employed a pseudo-likelihood and placed it wherever the true likelihood was needed. We suggested two variational family choices and developed variational Bayes algorithms for each family of distributions under the pseudo-likelihood. In a variety of numerical studies, we compared our VB methods with two existing methods, PMLE and an MCMC-based method. Notably, the simulation results demonstrated the superiority of our VB methods in terms of accuracy and computational cost. In addition, we found that a two-parameter Ising model is suitable for characterizing network data. Using the Facebook example, we applied the estimation procedures to characterize the overall strength of interaction and the external influence of the network. This thesis also provides a theoretical justification of the VB algorithm under the mean-field family. We showed that the variational posterior is consistent as the data size increases, under three mild conditions on the coupling matrix $A_n$.
Specifically, we established that the variational posterior concentrates around shrinking neighborhoods of the true parameter, and we derived the rate of contraction for the variational posterior using a suitable bound on the Kullback-Leibler divergence between the variational and the true posterior.

In addition to Ising model parameter estimation, we developed a variable selection technique for a high-dimensional setup in which the feature space is structurally dependent. We captured the structural dependencies using a pseudo-Ising prior on latent binary variables and a pseudo-Ising variational distribution with multiple threshold parameters in the variational family, which enables us to perform variable selection. Through numerical experiments and some theoretical results, we validated the efficacy of the variable selection VB algorithm.

4.2 Directions for future research

Multiple observations from an Ising model: While we assumed that only one observation of x is available for inference on the two-parameter Ising model, another group of previous studies assumed that multiple observations of x are available. With i.i.d. copies of x, we would be able to develop a procedure to estimate the structure of the underlying graph, i.e., to estimate all the edges in the graph. Moreover, multiple observations would enable inference on multi-parameter Ising models beyond a single interaction parameter and a single threshold parameter.

Multivariate version of an Ising model: The Potts model is a versatile graphical model for discrete data that naturally extends the Ising model. Consider an undirected graph. Without restricting to binary variables, each node of the graph represents a categorical variable such as blood type, hair color, education level, etc. Variational Bayes would be a useful tool for inference on a Potts model and its applications.

Multiresponse regression: If the response variable in a regression model is multivariate, we need to account for the additional dependencies among the response variables. A researcher could use a graphical model to capture different types of dependencies in various ways. Also, to approximate more complex posterior distributions, a variational Bayes method would certainly help.

BIBLIOGRAPHY

Anandkumar, A., Tan, V. Y., Huang, F., Willsky, A. S., et al. (2012). High-dimensional structure estimation in ising models: Local separation criterion. The Annals of Statistics, 40(3):1346–1375.

Andersen, M. R., Vehtari, A., Winther, O., and Hansen, L. K. (2017). Bayesian inference for spatio-temporal spike-and-slab priors. The Journal of Machine Learning Research, 18(1):5076–5133.

Andersen, M. R., Winther, O., and Hansen, L. K. (2014). Bayesian inference for structured spike and slab priors. Advances in Neural Information Processing Systems, 27.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U., and Strawn, N. (2013). Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4):1011–1018.

Basak, A. and Mukherjee, S. (2017). Universality of the mean-field for the potts model. Probability Theory and Related Fields, 168(3):557–600.

Bhattacharya, B. B., Mukherjee, S., et al. (2018). Inference in ising models. Bernoulli, 24(1):493–525.

Bhattacharya, S. and Maiti, T. (2021). Statistical foundation of variational bayes neural networks. Neural Networks, 137:151–173.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians.
Journal of the American statistical Association, 112(518):859–877. Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymp- totic theory of independence. Oxford university press. Bresler, G. (2015). Efficiently learning ising models on arbitrary graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 771–782. Brush, S. G. (1967). History of the lenz-ising model. Reviews of modern physics, 39(4):883. Cao, X., Khare, K., and Ghosh, M. (2019). Posterior graph selection and estimation consis- tency for high-dimensional bayesian dag models. The Annals of Statistics, 47(1):319–348. Carbonetto, P. and Stephens, M. (2012). Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73–108. 95 Castillo, I. and Roquain, É. (2020). On spike and slab empirical bayes multiple testing. The Annals of Statistics, 48(5):2548–2574. Chang, C., Kundu, S., and Long, Q. (2018). Scalable bayesian variable selection for struc- tured high-dimensional data. Biometrics, 74(4):1372–1382. Chatterjee, S. and Dembo, A. (2016). Nonlinear large deviations. Advances in Mathematics, 299:396–450. Chatterjee, S. et al. (2007). Estimation in spin glasses: A first step. The Annals of Statistics, 35(5):1931–1946. Comets, F. (1992). On consistency of a class of estimators for exponential families of markov random fields on the lattice. The Annals of Statistics, pages 455–468. Comets, F. and Gidas, B. (1991). Asymptotics of maximum likelihood estimators for the curie-weiss model. The Annals of Statistics, pages 557–578. Fauske, J. (2009). An empirical study of the maximum pseudo-likelihood for discrete markov random fields. Master’s thesis, Institutt for matematiske fag. Gan, L., Narisetty, N. N., and Liang, F. (2019). Bayesian regularization for graphical models with unequal shrinkage. Journal of the American Statistical Association, 114(527):1218– 1231. Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical science, pages 163–185. George, E. I. and McCulloch, R. E. (1993). Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881–889. Geyer, C. J. (1994). Estimating normalizing constants and reweighting mixtures. Ghosal, P., Mukherjee, S., et al. (2020). Joint estimation of parameters in ising model. Annals of Statistics, 48(2):785–810. Ghosh, S., Khare, K., and Michailidis, G. (2018). High-dimensional posterior consistency in bayesian vector autoregressive models. Journal of the American Statistical Association. Gidas, B. (1988). Consistency of maximum likelihood and pseudo-likelihood estimators for gibbs distributions. In Stochastic differential systems, stochastic control theory and applications, pages 129–145. Springer. Guyon, X. and Künsch, H. R. (1992). Asymptotic comparison of estimators in the ising model. In Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, pages 96 177–198. Springer. Halim, S. (2007). Modified ising model for generating binary images. Jurnal Informatika, 8(2):115–118. Haslbeck, J. M., Epskamp, S., Marsman, M., and Waldorp, L. J. (2021). Interpreting the ising model: The input matters. Multivariate behavioral research, 56(2):303–313. Hurn, M. A., Husby, O. K., and Rue, H. (2003). A tutorial on image analysis. Spatial statistics and computational methods, pages 87–141. Ising, E. (1924). 
Beitrag zur theorie des ferro-und paramagnetismus. PhD thesis, Grefe & Tiedemann. Izenman, A. J. (2021). Sampling algorithms for discrete markov random fields and related graphical models. Journal of the American Statistical Association, pages 1–22. Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37(2):183–233. Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Lee, H. K. (2000). Consistency of posterior distributions for neural networks. Neural Net- works, 13(6):629–642. Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data. Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American statistical asso- ciation, 105(491):1202–1214. Li, F., Zhang, T., Wang, Q., Gonzalez, M. Z., Maresh, E. L., Coan, J. A., et al. (2015). Spatial bayesian variable selection and grouping for high-dimensional scalar-on-image regression. The Annals of Applied Statistics, 9(2):687–713. Lokhov, A. Y., Vuffray, M., Misra, S., and Chertkov, M. (2018). Optimal structure and parameter learning of ising models. Science advances, 4(3):e1700791. Martin, R., Mess, R., and Walker, S. G. (2017). Empirical bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, 23(3):1822–1847. 97 Milman, V. and Schechtman, G. (1986). Asymptotic theory of finite-dimensional normed spaces, lecture notes in mathematics 1200. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the american statistical association, 83(404):1023–1032. Molkaraie, M. (2014). An importance sampling algorithm for the ising model with strong couplings. arXiv preprint arXiv:1404.5666. Møller, J., Pettitt, A. N., Reeves, R., and Berthelsen, K. K. (2006). An efficient markov chain monte carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451–458. Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817. Okabayashi, S., Johnson, L., and Geyer, C. J. (2011). Extending pseudo-likelihood for potts models. Statistica Sinica, pages 331–347. Park, J., Jin, I. H., and Schweinberger, M. (2022). Bayesian model selection for high- dimensional ising models, with applications to educational data. Computational Statistics & Data Analysis, 165:107325. Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR. Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010). High-dimensional ising model selection using ℓ1 -regularized logistic regression. The Annals of Statistics, 38(3):1287–1319. Ray, K. and Szabó, B. (2021). Variational bayes for high-dimensional linear regression with sparse priors. Journal of the American Statistical Association, pages 1–12. Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407. Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. 
The Annals of Statistics, 46(1):401–437. Ročková, V. and George, E. I. (2014). Emvs: The em approach to bayesian variable selection. Journal of the American Statistical Association, 109(506):828–846. Ročková, V. and George, E. I. (2016). Fast bayesian factor analysis via automatic rotations to sparsity. Journal of the American Statistical Association, 111(516):1608–1622. Ročková, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American 98 Statistical Association, 113(521):431–444. Sriram, K., Ramamoorthi, R., Ghosh, P., et al. (2013). Posterior consistency of bayesian quantile regression based on the misspecified asymmetric laplace density. Bayesian Anal- ysis, 8(2):479–504. Stanley, H. E. (1971). Phase transitions and critical phenomena, volume 7. Clarendon Press, Oxford. Van Der Vaart, A. and Van Zanten, H. (2011). Information rates of nonparametric gaussian process methods. Journal of Machine Learning Research, 12(6). Wang, Y. and Blei, D. M. (2019). Frequentist consistency of variational bayes. Journal of the American Statistical Association, 114(527):1147–1161. Xi, R., Li, Y., and Hu, Y. (2016). Bayesian quantile regression based on the empirical likelihood with spike and slab priors. Bayesian Analysis, 11(3):821–855. Xue, L., Zou, H., Cai, T., et al. (2012). Nonconcave penalized composite conditional likeli- hood estimation of sparse ising models. The Annals of Statistics, 40(3):1403–1429. Yang, K. and Shen, X. (2017). On the selection consistency of bayesian structured variable selection. Stat, 6(1):131–144. 99