BAYESIAN VARIABLE SELECTION: EXTENSIONS OF NONLOCAL PRIORS

By

Guiling Shi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics—Doctor of Philosophy

2017

ABSTRACT

BAYESIAN VARIABLE SELECTION: EXTENSIONS OF NONLOCAL PRIORS

By

Guiling Shi

In the presence of high dimensional covariates, variable selection is an important technique for any further data analysis. Bayesian analysis can achieve model selection through shrinkage priors. First, I explain the Bayesian variable selection technique through three methods that have demonstrated plausible performance on high dimensional model selection problems, and I compare these methods using both simulation results and a real data application. I then extend the method based on the Dirichlet-Laplace prior from the normal means problem to the linear regression model, and show that the minimax contraction rate still holds under mild conditions. While most developments in the Bayesian model selection literature are based on local priors on the regression parameters, Johnson and Rossell (2012, 2013) proposed nonlocal prior distributions for model selection. Motivated by this idea, I apply a nonlocal prior within the spike and slab variable selection framework. I use a point mass density as the spike prior and a nonlocal prior as the slab density; this setting leaves very little overlap between the spike and slab priors, which yields efficient variable selection. I then prove the variable selection consistency of the proposed method. Finally, I extend Johnson and Rossell's nonlocal prior model selection method to logistic regression and to generalized linear models. Laplace approximations are used in the implementation because the likelihood is complicated, and a convergence rate is derived under some regularity conditions. Selection based on a nonlocal prior eliminates unnecessary variables and recommends a simple model. The method is validated by a simulation study and illustrated with a real data example.

ACKNOWLEDGMENTS

Firstly, I would like to express my sincere gratitude to my advisor Prof. Taps Maiti for the continuous support of my Ph.D. study and related research, and for his encouragement, motivation, and immense knowledge. His guidance helped me throughout the research and writing of this thesis. At the same time, I would like to express my special appreciation to my co-advisor Dr. Chae Young Lim, not only for her tremendous academic support, but also for her patience, genuine caring and concern. Without her contributions of time and ideas, it would not have been possible to make this thesis complete and comprehensive. Besides my advisors, I would like to thank my thesis committee, Dr. Jongeun Choi and Dr. Hyokyoung Hong, for their encouragement and for serving as committee members. Last but not least, I would like to thank my family for supporting me spiritually throughout the writing of this thesis and my life in general.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
Chapter 2 Comparison study on high-dimensional Bayesian variable selection methods
2.1 Introduction
2.2 Methodology
2.2.1 Dirichlet-Laplace prior (DL)
2.2.2 Shrinking and diffusing prior (SD)
2.2.3 Coupled MH algorithm (CMH)
2.2.4 Remark on methodology
2.3 Extension of DL to linear model
2.4 Simulation and application
2.4.1 Simulation on normal means problem
2.4.2 Simulation for linear regression case
2.4.3 Simulation on high-dimension case
2.4.4 Remark on simulation
2.4.5 Real data application
2.4.6 Remark on application result
2.5 Conclusion
2.6 Proof
Chapter 3 Variable Selection by mixing spike and a nonlocal slab prior
3.1 Introduction
3.2 Proposed model specification
3.2.1 Posterior median as thresholding estimator
3.2.2 Consistency
3.3 Gibbs samplers
3.3.1 Gibbs sampler for βj
3.3.2 Gibbs sampler for pj
3.3.3 Gibbs sampler for σ² and p0
3.3.4 Comment on τ
3.4 Simulation study
3.4.1 Estimation performance by Mean Squared Errors
3.4.2 Selection performance
3.4.3 Observation from simulation
3.4.4 Real data application
3.5 Discussion
3.6 Proof of theorem
3.7 Additional: calculation of Gibbs samplers
3.7.1 Gibbs sampler for βk
3.7.2 Gibbs sampler for pj
Chapter 4 Model selection for generalized linear model using nonlocal priors
4.1 Introduction
4.2 Methodology
4.2.1 Bayesian logistic regression
4.2.2 Model selection via nonlocal prior
4.2.3 Extension to GLM
4.2.4 Theoretical properties
4.3 Algorithm
4.4 Simulation and real data application
4.4.1 Simulation study
4.4.2 SNPs study
4.4.3 Real data application
4.5 Conclusion
4.6 Proof of theorem 3
4.6.1 Condition (O)
4.6.2 Condition (N)
4.6.3 Verification of Conditions (O) and (N) for nonlocal prior
4.6.4 Proof of theorem 1
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Summary of error on estimation in the normal means model. In each table, the true model contains 10 nonzero constant entries whose value equals A. The total number of entries is 100.

Table 2.2: Summary of error on estimation in the linear regression model. In each table, the true model contains 10 nonzero constant entries whose value equals A. The total number of possible covariates is 60.

Table 2.3: Summary of error on estimation in the high-dimensional case. In each table, the true model contains 10 nonzero constant entries whose value equals A. The total number of possible covariates is p.

Table 2.4: Order of covariates by highest posterior probability

Table 3.1: MSE for n = 100 with mass-nonlocal prior

Table 3.2: Case 1: Performance of MASS-nonlocal for n = P. The other columns of the table are as follows: pp0 and pp1 (when applicable) are the average posterior probabilities of inactive and active variables, respectively; Z = t is the proportion of replicates in which exactly the true model is selected; Z ⊃ t is the proportion in which the selected model contains all the active covariates; FDR is the false discovery rate; and MSPE is the mean squared prediction error of the selected models.

Table 3.3: Case 2: Performance of MASS-nonlocal in the high-dimensional setting.

Table 3.4: Case 3: Performance of MASS-nonlocal for low signal.

Table 4.1: Summary of variable selection for different parameter values. In each panel, results of gLASSO and EBLASSO are based on the tuning parameter chosen as the average of 20 cross-validation values. All results are averaged over 100 replications.

Table 4.2: Summary of variable selection for parameter value (2,-2,2,-2). Results of gLASSO and EBLASSO are based on the tuning parameter chosen as the average of 20 cross-validation values. All results are averaged over 100 replications.

Table 4.3: Summary of variable selection for parameter value (2,-2,2,-2,2,-2).
Results of gLASSO and EBLASSO are based on the tuning parameter chosen as the average of 20 cross-validation values. All results are based on 100 replications.

Table 4.4: Comparison of the nonlocal prior with gLASSO and BSR. Results are averaged over 10 replications.

Table 4.5: Comparison of prediction results based on the nonlocal prior and gLASSO. The gLASSO result is based on the tuning parameter chosen as the average of 20 cross-validation values. All results are averaged over 20 replications.

LIST OF FIGURES

Figure 2.1: Plots of MSPE for each method. For the results shown in this figure, all hyper-parameters are chosen at their default or recommended values as discussed in Section 2.2.4. CMH1 is the MSPE obtained by choosing the model with the highest posterior probability among models of a given size. CMH2 denotes the model including the covariates with the highest marginal posterior probabilities.

Figure 2.2: Plot of MSPE for each method. For the results shown in this figure, all hyper-parameters are chosen at their default or recommended values as discussed in Section 2.4.2, and estimation uses the median of the Gibbs samples.

Figure 3.1: Comparison between normal-normal and normal-nonlocal mixture priors

Figure 3.2: Mean squared prediction error versus model size for analyzing PEPCK.

Chapter 1

Introduction

With the emergence of high dimensional data in many industries, especially in clinical and genetic research, variable selection has become one of the most commonly used techniques. The aim of variable selection is to select the best subset of predictors. Removing redundant predictors is a good way to explain the data simply, since unnecessary predictors add noise to the estimation of the variables of interest. Much research on variable selection has been done under both the frequentist and the Bayesian framework. Under frequentist statistics, the simplest variable selection procedures are backward elimination, forward selection and stepwise regression. Some issues are associated with stepwise regression: for example, collinearity can be a major problem, and some variables may be removed from the model even when they are important. Criterion-based model selection procedures are popular in frequentist work; well-known criteria include the Akaike information criterion, the Bayesian information criterion and Mallows Cp, and a recommendation can be made by comparing these criteria across subset models. However, with p potential predictors there are 2^p possible models. If p is large, it is impossible to fit all these models and choose the best one according to some criterion. This limitation encouraged the emergence of many penalization-based ideas, among which some popular and demonstrably effective methods are the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), the elastic net and ridge regression. Variable selection methods for generalized linear models are likewise based on criterion-based procedures or on regularization paths. In Bayesian analysis, variable selection is achieved by shrinkage priors. Under appropriate priors, the Bayesian solution can be equivalent to a frequentist penalized approach.
For example, a Laplacian prior on each coefficient leads to the LASSO solution, and a Gaussian prior corresponds to the L2 penalty. However, instead of searching through the model space and using selection criteria to choose between competing models, Bayesian methods focus on the marginal posterior probability that a covariate should be in the model. Many shrinkage priors have been proposed in the Bayesian literature; most of them are local priors, meaning that the prior density is positive when the parameter value is zero. Examples of local priors include the Gaussian, Cauchy and Laplacian priors. The positive density around zero shrinks coefficient values toward zero and thereby achieves variable selection. Among local priors, one special example is the horseshoe prior. The horseshoe prior is unbounded with a singularity at zero; it is formulated to obtain marginals having a high concentration around zero together with heavy tails. The singularity at zero, coupled with its tail robustness, leads to the excellent empirical performance of the horseshoe. Compared with common shrinkage priors, the horseshoe concentrates more along sparse regions of the parameter space; see [Carvalho, Polson, and Scott(2009)] and [Carvalho, Polson, and Scott(2010)]. A prior that stands in contrast to the horseshoe is the nonlocal prior density. Nonlocal prior densities are exactly zero whenever a model parameter equals its null value. They were first defined by [Johnson and Rossell(2010)] in the context of hypothesis testing, and then extended to model selection problems in [Johnson and Rossell(2012)], where product moment and product inverse moment prior densities were introduced as priors on a vector of regression coefficients. Model selection consistency was demonstrated for model selection procedures based on these nonlocal priors when p ≤ n. More recently, [Rossell et al. (2013)] proposed the product exponential moment prior density, which behaves similarly to the product inverse moment prior. However, the model selection properties of nonlocal priors in p ≫ n settings remain understudied.

In high-dimensional data, one of the most useful techniques for Bayesian variable selection is the spike and slab prior, which introduces a latent variable for each covariate to indicate whether the covariate is active in the model or not. First proposed by [Mitchell and Beauchamp(1988)] and then generalized by [Ishwaran and Rao (2005)], various selection procedures with a spike and slab structure have been proposed; they essentially differ in the form of the spike and slab priors. For example, Gaussian distributions are used for both the spike and slab priors in [George and McCulloch (1993)], and a uniform distribution for the slab prior in [Mitchell and Beauchamp(1988)]. In addition, variable selection consistency has been established for spike and slab priors in [Ishwaran and Rao (2011)] and [Narisetty and He(2014)].

In this thesis, I study Bayesian variable selection methods thoroughly for both the linear model and the generalized linear model. First, I explain the Bayesian variable selection procedure by reviewing three methods with demonstrated performance, and extend one of them (the Dirichlet-Laplace prior) from the normal means problem to the linear model. In this review study, the nonlocal prior method outperforms the other methods in the application study and has some unique properties. This is the motivation for why I believe extensions of the nonlocal prior are an interesting topic. This is covered in Chapter 2.
In Chapter 3, I explore the spike and slab variable selection method when the slab prior is a nonlocal density. The performance of the proposed method is validated by simulation and data application results. In Chapter 4, I extend the variable selection method based on the nonlocal prior to the generalized linear model, and the convergence rate of the proposed method is derived under some regularity conditions. For clarity, each chapter uses its own notation.

Chapter 2

Comparison study on high-dimensional Bayesian variable selection methods

2.1 Introduction

Variable selection is the problem of identifying a subset of the covariates {x1, x2, ..., xp} that forms the causal features of the given data y. It becomes a great challenge in the small n, large p situation. The penalised likelihood approach finds a linear model in which a subset of the covariates minimizes a penalized likelihood; for example, the L1 penalty leads to the LASSO method. [Tibshirani(1996)] pointed out that the LASSO estimator can be interpreted as the maximum a posteriori (MAP) estimator when the regression parameters have independent and identical Laplace priors. For large n, small p regression, [Liang, Truong and Wong(2001)] established an explicit relationship between the Bayesian approach and the penalised likelihood approach for linear regression. They showed empirically that Bayesian subset regression (BSR) amounts to choosing priors such that the resulting negative log-posterior probability of a subset model can be approximately reduced to the frequentist subset model selection statistic up to a multiplicative constant.

Extending the idea of the spike and slab prior, [Narisetty and He(2014)] introduced shrinking and diffusing priors and established strong selection consistency of their approach for p = e^{o(n)}. Under the global-local prior framework, [Bhattacharya et al.(2014)] proposed a Dirichlet-Laplace prior and showed a minimax optimal rate of posterior contraction. [Johnson(2013)] proposed a coupled Metropolis-Hastings algorithm that allows moving between any two models. These three papers all focus on high-dimensional Bayesian variable selection, but through different priors and algorithms. In this chapter, we review the ideas of these three papers and compare them through simulation and application results.

2.2 Methodology

In this section, we present three Bayesian methods for the high-dimensional problem. They use the Dirichlet-Laplace prior (DL), the shrinking and diffusing prior (SD) and the product moment prior, respectively. The specific methods are described below.

2.2.1 Dirichlet-Laplace prior (DL)

This part is based on the paper on Dirichlet-Laplace (DL) priors for optimal shrinkage ([Bhattacharya et al.(2014)]), which proposed the DL prior to shrink some coefficients to zero. In high-dimensional settings, most penalization approaches have a Bayesian interpretation as the mode of a posterior distribution under a shrinkage prior; different penalties correspond to different priors. [Polson and Scott(2010)] showed that essentially all such shrinkage priors can be represented as global-local (GL) mixtures of Gaussians. In this part, we consider the normal means problem,

yi = θi + εi, εi ∼ N(0, 1), 1 ≤ i ≤ n.

We can describe the setting of the GL prior specifically.
The GL prior setting is

(θi | τ, ψi) ∼ N(0, ψi τ), ψi ∼ f(ψi), τ ∼ g(τ).

Here, ψ is a vector with components ψi, and each ψi is called a local variance component; it allows deviations in the degree of shrinkage. τ is the global variance component, which controls global shrinkage towards the origin; in the penalized-likelihood formulation, τ plays the role of the regularization parameter. If we choose f and g appropriately, many frequentist regularization procedures, such as ridge, lasso, bridge and the elastic net, can be explained under the GL prior. One example is the double exponential prior. If θi follows a double exponential prior, that is, θi ∼ (1/(2b)) exp(−|θi|/b), then the maximum a posteriori estimator corresponds to the L1, or LASSO, penalty. If f(ψi) is an exponential distribution, then after integrating out the local scales ψi, θi follows a double exponential prior, and the LASSO solution is obtained in this setting. In the specific case where f(ψi) is an exponential distribution with parameter 1/2, that is, ψi ∼ Exp(1/2), θi follows a double exponential distribution with parameter τ, that is, θi ∼ (1/(2τ)) exp(−|θi|/τ). We assume this is the case throughout this chapter for the DL prior. Another example is the horseshoe prior. If f is such that ψi^{1/2} ∼ Ca+(0, 1), a half-Cauchy prior, the resulting prior on θi is the horseshoe prior. The horseshoe prior is unbounded with a singularity at zero, which, along with its tail robustness, leads to its excellent empirical performance ([Carvalho, Polson, and Scott(2010)]).

In the proposed Dirichlet-Laplace prior setting, instead of the single global scale τ, we use a vector of scales (φ1 τ, . . . , φn τ), where φ = (φ1, . . . , φn) satisfies φj ≥ 0 and Σ_{j=1}^n φj = 1 and is assigned a Dirichlet prior, φ ∼ Dir(a, . . . , a). Additionally, we assume τ follows a Gamma prior, τ ∼ Gamma(λ, 1/2), with λ = na (in the normal means problem, the number of parameters p equals n). The full DL_a prior can then be represented as

θi ∼ N(0, ψi φi² τ²), ψi ∼ Exp(1/2), φ ∼ Dir(a, . . . , a), τ ∼ Gamma(na, 1/2).

The posterior sampler cycles between (a) θ | ψ, φ, τ, y and (b) ψ, φ, τ | θ, using the fact that ψ, φ, τ | θ is independent of y. Also, (ψ, φ, τ | θ) = (ψ | φ, τ, θ)(τ | φ, θ)(φ | θ), so the complete cycle for obtaining posterior samples is:

(1) Draw θ | ψ, φ, τ, y by sampling the θj independently from N(μj, σj²), with μj = yj / (1 + 1/(ψj φj² τ²)) and σj² = 1 / (1 + 1/(ψj φj² τ²)).
(2) Draw ψ | φ, τ, θ by sampling the ψi independently from ψi ∼ giG(1/2, 1, θi² / (φi² τ²)).
(3) Sample τ | φ, θ from τ ∼ giG(na − n, 1, 2 Σ_{i=1}^n |θi| / φi).
(4) Sample the Ti independently from Ti ∼ giG(a − 1, 1, 2|θi|) and set T = Σ_i Ti; then φi | θ has the same distribution as Ti / T.

Note that Y ∼ giG(λ, ρ, χ) if f(y) ∝ y^{λ−1} e^{−0.5(ρy + χ/y)} for y > 0. [Bhattacharya et al.(2014)] explain in detail how these posterior densities are derived, and also prove the minimax rate of posterior contraction under an appropriate choice of the Dirichlet prior parameter a. From the Gibbs samples, we can obtain posterior estimates and Bayesian credible regions.
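To make the sampling cycle concrete, the following is a minimal sketch of one possible implementation in Python, assuming scipy is available. The mapping from giG(λ, ρ, χ) to scipy's geninvgauss parameterization, the small numerical floor on χ, and all function and variable names are choices of this sketch, not part of [Bhattacharya et al.(2014)].

```python
import numpy as np
from scipy.stats import geninvgauss


def rgig(lam, rho, chi, rng, floor=1e-10):
    """Draw from giG(lam, rho, chi), i.e. f(y) proportional to y^(lam-1) exp(-0.5(rho*y + chi/y))."""
    chi = max(chi, floor)  # guard against chi = 0 when a theta_i is exactly zero
    b = np.sqrt(rho * chi)
    return geninvgauss.rvs(lam, b, scale=np.sqrt(chi / rho), random_state=rng)


def dl_gibbs_normal_means(y, a=None, n_iter=2000, seed=0):
    """Gibbs sampler for the DL prior in the normal means model y_i = theta_i + e_i."""
    rng = np.random.default_rng(seed)
    n = len(y)
    a = 1.0 / n if a is None else a          # default Dirichlet parameter a = 1/n
    psi, phi, tau = np.ones(n), np.full(n, 1.0 / n), 1.0
    draws = np.empty((n_iter, n))
    for it in range(n_iter):
        # (1) theta_j | psi, phi, tau, y ~ N(mu_j, s2_j)
        s2 = 1.0 / (1.0 + 1.0 / (psi * phi**2 * tau**2))
        theta = rng.normal(y * s2, np.sqrt(s2))
        # (2) psi_i | phi, tau, theta ~ giG(1/2, 1, theta_i^2 / (phi_i^2 tau^2))
        psi = np.array([rgig(0.5, 1.0, t**2 / (f**2 * tau**2), rng)
                        for t, f in zip(theta, phi)])
        # (3) tau | phi, theta ~ giG(na - n, 1, 2 * sum_i |theta_i| / phi_i)
        tau = rgig(n * a - n, 1.0, 2.0 * np.sum(np.abs(theta) / phi), rng)
        # (4) T_i ~ giG(a - 1, 1, 2|theta_i|); phi_i = T_i / sum_j T_j
        T = np.array([rgig(a - 1.0, 1.0, 2.0 * abs(t), rng) for t in theta])
        phi = T / T.sum()
        draws[it] = theta
    return draws
```

After discarding a burn-in period, the coordinatewise posterior median, e.g. np.median(draws[500:], axis=0), can serve as the point estimate used in the simulations later in this chapter.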
2.2.2 Shrinking and diffusing prior (SD)

This part follows the paper on Bayesian variable selection with shrinking and diffusing priors ([Narisetty and He(2014)]). That paper works with a spike and slab prior and calculates the posterior probability that each covariate is included in the model. A natural assumption in high dimensional settings is that the regression function is sparse, i.e. only a small number of covariates have nonzero coefficients; a covariate with a nonzero coefficient is active in the model. The purpose of the paper is to develop a Bayesian methodology for selecting the active covariates that is asymptotically consistent and computationally convenient. If the selected model equals the true model with probability converging to one, this is called selection consistency. In Bayesian methods, if the posterior probability of the true model converges to one, this is referred to as strong selection consistency by [Bondell and Reich(2012)]. In the hierarchical model, we can place prior distributions on the regression coefficients and also put a prior on the model space. The linear regression model considered is Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}, where the subscripts specify the dimensions. We introduce a latent binary variable for each covariate, denoted Z = (Z1, . . . , Zp); each Zi indicates whether the i-th covariate Xi is active in the model or not. The prior distribution on the regression coefficient βi under Zi = 0 is a point mass at zero, but a diffused prior under Zi = 1. The concentrated prior of βi under Zi = 0 is referred to as the spike prior, and the diffused prior under Zi = 1 is called the slab prior. The prior on the model space is imposed by assuming a prior distribution on the binary random vector Z. A Bayesian variable selection method then selects the model with the highest posterior probability. Different forms of spike and slab priors yield different selection procedures. In [Narisetty and He(2014)], shrinking and diffusing priors are introduced as the spike and slab priors, and strong selection consistency of the approach is established for p = e^{o(n)}. The approach is computationally advantageous because a standard Gibbs sampler can be used to sample from the posterior.

From here on we use pn to denote the number of covariates, to indicate that it grows with n. The specific model can be described as follows:

Y | (X, β, σ²) ∼ N(Xβ, σ² I),
βi | (σ², Zi = 0) ∼ N(0, σ² τ²_{0,n}),
βi | (σ², Zi = 1) ∼ N(0, σ² τ²_{1,n}),
P(Zi = 1) = 1 − P(Zi = 0) = qn,
σ² ∼ IG(α1, α2),

where i runs from 1 to pn, and qn, τ_{0,n} and τ_{1,n} are constants that depend on n. We use the posterior probabilities of the latent variables Z to identify the active covariates, after setting a threshold value. In general, if the posterior probability of Zi = 1 is greater than 0.5, then covariate Xi is included in the model.

In the simple case, we consider the setting where the number of covariates pn < n, assume that the design matrix X is orthogonal, that is, X'X = nI, and assume σ² is known. After some calculation, the posterior probability of Zi can be obtained from

P(Zi = 0 | σ², Y) = (1 − qn) E_{β̂i}(π0(B)) / [ (1 − qn) E_{β̂i}(π0(B)) + qn E_{β̂i}(π1(B)) ],

where β̂i is the OLS estimator of βi and, for k = 0 and 1,

E_{β̂i}(πk(B)) = (2π a_{k,n})^{−1/2} exp( −β̂i² / (2 a_{k,n}) ), with a_{k,n} = σ²/n + τ²_{k,n}.

First, assume all the parameters are fixed: fix τ²_{0n} = τ²_0 < τ²_1 = τ²_{1n} and qn = q = 0.5. Then the limiting value of P(Zi = 1 | σ², Y) will be less than 0.5 as n → ∞; this implies that even as n → ∞ we would not be able to identify the active coefficients in this case. Second, consider shrinking τ²_{0,n}, with τ²_{1,n} and qn fixed, so that τ²_{0,n} goes to 0 with n. After some argument, we can get P(Zi = 0 | σ², Y) →_P I(βi = 0). That is, for an orthogonal design matrix, the marginal posterior probability of including an active covariate or excluding an inactive covariate converges to one under shrinking τ²_{0,n} with τ²_{1,n} and qn fixed.
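As a quick numerical illustration of this inclusion probability, the short sketch below evaluates P(Zi = 0 | σ², Y) for a fixed β̂i as the spike variance shrinks; the parameter values are illustrative only and are not taken from the paper.

```python
import numpy as np


def prob_inactive(beta_hat, sigma2, n, tau0_sq, tau1_sq, qn):
    """P(Z_i = 0 | sigma^2, Y) under an orthogonal design, X'X = nI."""
    a0 = sigma2 / n + tau0_sq            # a_{0,n}
    a1 = sigma2 / n + tau1_sq            # a_{1,n}
    dens0 = np.exp(-beta_hat**2 / (2 * a0)) / np.sqrt(2 * np.pi * a0)
    dens1 = np.exp(-beta_hat**2 / (2 * a1)) / np.sqrt(2 * np.pi * a1)
    num = (1 - qn) * dens0
    return num / (num + qn * dens1)


# Shrinking tau0 while tau1 and qn stay fixed pulls the probability toward 0 or 1:
for tau0_sq in (1.0, 0.1, 0.001):
    print(tau0_sq, prob_inactive(beta_hat=0.8, sigma2=1.0, n=100,
                                 tau0_sq=tau0_sq, tau1_sq=1.0, qn=0.5))
```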
However, this does not assure the consistency of the overall model selection. Some arguments show that keeping τ²_{1,n} and qn fixed leads to inconsistent selection if the number of covariates is much greater than √n. So, to get consistency of model selection, τ²_{0,n} needs to be shrinking and τ²_{1,n} needs to be diffusing. Under the conditions described in the paper, we obtain consistency of model selection, P(Z = t | Y, σ²) →_P 1 as n → ∞; that is, the posterior probability of the true model goes to 1 as the sample size increases to infinity. Here we do not need the true σ² to be known: even for a misspecified σ̃² ≠ σ², we can still have this consistency under some conditions. The Gibbs sampler can be obtained from the posterior distributions of the parameters.

2.2.3 Coupled MH algorithm (CMH)

This part is based on the paper on numerical aspects of Bayesian model selection in high- and ultrahigh-dimensional settings ([Johnson(2013)]). This paper addresses not only how to obtain the marginal probability that each covariate is included in the model, but also how to calculate the posterior probability of a specific model. The method imposes a nonlocal prior density on the model parameters. A local prior density is positive at the null parameter value, which is typically 0 in model selection settings; a nonlocal prior density is identically zero whenever a model parameter equals the null value. The paper shows that model selection procedures based on nonlocal prior densities assign a posterior probability of 1 to the true model as the sample size n increases, when the number of possible covariates p is bounded by n and certain regularity conditions hold. The linear model is of the form

y | βk, σ² ∼ Nn(Xk βk, σ² In),

where k denotes a statistical model. Two classes of nonlocal prior densities are discussed. The first class of prior densities for β is the product moment (pMOM) density, defined as

π(β | τ, σ², r) = dp (2π)^{−p/2} (τσ²)^{−rp−p/2} |Ap|^{1/2} exp( −β' Ap β / (2τσ²) ) ∏_{i=1}^p βi^{2r}.

Here dp is a normalizing constant and τ > 0 is a scale parameter that determines the dispersion of the prior density of β around 0. For a specific model k with p covariates, Ap is taken to be the p × p identity matrix when there is no subjective information about the prior correlation between the regression coefficients in model k, and r, called the order of the density, can be any positive integer. For the second class of prior densities, β is assumed to follow a product inverse moment (piMOM) density, which has the general form

π(β | τ, σ², r) = ( (τσ²)^{rp/2} / Γ(r/2)^p ) ∏_{i=1}^p |βi|^{−(r+1)} exp( −τσ² / βi² ).

In the piMOM density, τ > 0 is a scale parameter with the same interpretation as in the pMOM density, and r can again be any positive integer. These two density classes are nonlocal at 0 because they are identically 0 when any component of β is 0. In model selection procedures this is a good property, in the sense that it efficiently eliminates regression models that contain unnecessary explanatory variables.
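For concreteness, the sketch below evaluates the two log-densities with Ap = I. The closed-form normalizing constant used for the pMOM case (a double-factorial term), the scipy helper, and the illustrative value τ = 0.35 are my own additions, so this should be read as an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln


def log_pmom(beta, tau, sigma2, r=1):
    """log pMOM density with A_p = I: prod_i beta_i^{2r} N(beta_i; 0, tau*sigma2), normalized."""
    beta = np.asarray(beta, float)
    v = tau * sigma2
    # log (2r-1)!! = log (2r)! - r log 2 - log r!
    log_doublefact = gammaln(2 * r + 1) - r * np.log(2) - gammaln(r + 1)
    return np.sum(2 * r * np.log(np.abs(beta))
                  - beta**2 / (2 * v)
                  - 0.5 * np.log(2 * np.pi * v)
                  - r * np.log(v) - log_doublefact)


def log_pimom(beta, tau, sigma2, r=1):
    """log piMOM density: prod_i (tau*sigma2)^{r/2}/Gamma(r/2) |b_i|^{-(r+1)} exp(-tau*sigma2/b_i^2)."""
    beta = np.asarray(beta, float)
    v = tau * sigma2
    return np.sum(0.5 * r * np.log(v) - gammaln(r / 2)
                  - (r + 1) * np.log(np.abs(beta)) - v / beta**2)


# Both log-densities fall toward -infinity as any coordinate approaches zero:
print(log_pmom([1.0, -0.5], tau=0.35, sigma2=1.0),
      log_pmom([1.0, -1e-4], tau=0.35, sigma2=1.0))
```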
For known variance σ², the marginal density of the data can be expressed as mk(yn) = ∫ p(yn | β) p(β) dβ. If the variance σ² is not known, a common inverse gamma density is assumed for σ², and the marginal density of the data under model k is

mk(yn) = ∫∫ p(yn | β, σ²) p(β) p(σ²) dβ dσ².

Under the pMOM prior, exact expressions for mk(yn) can be obtained, see [Kan(2008)], although the computational effort associated with the resulting expression increases exponentially with the model size. Under the piMOM prior, an analytic expression for mk(yn) is not available. To handle both issues, a Laplace approximation is recommended to approximate the marginal likelihood mk(yn) under each model. The posterior probability of a model t can then be calculated as

p(t | y) = p(t) mt(y) / Σ_{k∈J} p(k) mk(y)

based on the approximate marginal densities, where the prior p(k) of a model is assumed to follow a beta function. The model space J has 2^p elements, which makes it impossible to compute the marginal density for all possible models when p is large, so a Markov chain Monte Carlo (MCMC) scheme is applied to obtain posterior samples of models from the model space. A Metropolis-Hastings step is implemented in the scheme to decide whether to update the model when a new covariate is proposed. Based on the posterior samples from the model space, we can pick the model with the highest posterior probability and also obtain the posterior probabilities of other probable models. After selecting the model, we use ordinary least squares to estimate the coefficients.
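The Laplace approximation invoked above can be sketched generically as follows; here log_joint(β) stands for log p(yn | β) + log p(β) under the model being scored. The optimizer choice, the finite-difference Hessian and the function names are assumptions of this sketch, not the implementation used in [Johnson(2013)].

```python
import numpy as np
from scipy.optimize import minimize


def laplace_log_marginal(log_joint, beta0):
    """Approximate log of the integral of exp(log_joint(beta)) over beta."""
    neg = lambda b: -log_joint(b)
    mode = minimize(neg, beta0, method="Nelder-Mead").x
    d = mode.size
    # finite-difference Hessian of -log_joint at the mode
    eps = 1e-4
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (neg(mode + ei + ej) - neg(mode + ei - ej)
                       - neg(mode - ei + ej) + neg(mode - ei - ej)) / (4 * eps**2)
    _, logdet = np.linalg.slogdet(H)
    # log m ~ log_joint(mode) + (d/2) log(2*pi) - 0.5 log|H|
    return log_joint(mode) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```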
2.2.4 Remark on methodology

First, we compare the methods by their theoretical properties. All three methods target high-dimensional problems. DL emphasizes estimating the coefficient of each covariate and shrinks some coefficients to zero during estimation through a shrinking prior. SD calculates the posterior probability that each covariate is included in the model and sets a threshold on this posterior probability to decide whether a covariate should be in the true model. CMH calculates the posterior probability of each visited model, and the model with the highest posterior probability is the selected model. These methods have several desirable properties. SD has been proved to have consistency in model selection, that is, the posterior probability of the true model goes to 1 as the sample size increases to infinity. The minimax optimal rate of posterior contraction for DL has been established in [Bhattacharya et al.(2014)]. [Johnson and Rossell(2012)] also showed that the CMH method can consistently select the true model when p < n. SD allows the dimension to increase exponentially with the sample size, that is, log pn = o(n), and still enjoys model selection consistency, which is a very desirable property. Consistency of model selection for CMH does not hold when p > n, but the algorithm can still be applied in settings with p ≫ n. Until now, DL has only been shown to attain the minimax optimal convergence rate for the normal means problem, that is, for p = n; later we will show that the convergence rate also holds for p = O(n^{2−ε}).

Second, if we ponder the process of each method, we can obtain Gibbs samples from DL and SD; if we use the median of the Gibbs samples as the estimate of the coefficients, we are applying the same prior for both model selection and parameter estimation. This is a one-stage method. CMH instead first selects the model based on the pMOM (or piMOM) prior and then applies least squares to estimate the coefficients. Least squares estimation is the same as applying a flat prior in the Bayesian sense, so CMH is a two-stage method for model selection and estimation. This hints that CMH may gain some advantage from its two-stage setting, and we will show that this is the case in the application results in Section 2.4.

In addition, we explain how to choose the hyper-parameters of each model. There are different parameters in the prior setting of each method; some have default settings and some are recommended in the papers. Here is a summary of how to choose them. In the DL setting, the local variance ψi ∼ f(ψi), and the default distribution for f is an exponential distribution with mean 1/2; for the vector φ ∼ Dir(a, . . . , a), a = 1/n; the global variance τ ∼ g(τ), and the default for g is a Gamma distribution with parameters (na, 1/2). In the SD setting, for the spike prior variance τ²_{0n} it is suggested to use σ̂²/(10n), while for the slab prior variance τ²_{1n}, σ̂² max( p_n^{2.1}/(100n), log n ) is proposed. Here σ̂² is the sample variance of the response vector Y, and qn = P[Zi = 1] is chosen such that P[ Σ_{i=1}^{p} 1(Zi = 1) > K ] = 0.1, with default value K = max(10, log(n)). As stated in Section 2.2.2, even for a misspecified σ̃² ≠ σ² we can still have consistency, so the choice of α1 and α2 is not critical. In the CMH setting, the prior for σ² is an inverse Gamma distribution, and σ² ∼ IG(10^{−3}, 10^{−3}) is proposed. For model k, p(k) ∼ B(k + a, p − k + b), where B(·, ·) is the beta function, the default values a = b = 1 are recommended by [Scott and Berger.(2010)], and p is the number of covariates. The recommended value of τ is given in [Johnson and Rossell(2010)]. In practice, computing the inverse of X'_k X_k + (1/τ) A_k is involved, so we need to adjust the value of τ when X'_k X_k is singular.

2.3 Extension of DL to linear model

In [Bhattacharya et al.(2014)], Dirichlet-Laplace priors are proposed for the normal means problem, and the authors note that most of the ideas developed in that paper generalize directly to high-dimensional linear and generalized linear models. We extend the whole process to the linear regression model; the following arguments also hold when p > n. The linear model with Dirichlet-Laplace prior can be written as

y = Xβ + ε, ε ∼ N(0, In),
βi ∼ N(0, ψi φi² τ²), i = 1, . . . , p,
ψi ∼ Exp(1/2), i = 1, . . . , p,
φ ∼ Dir(a, . . . , a),
τ ∼ Gamma(pa, 1/2).    (2.1)

Denote Σ = diag(ψ1 φ1² τ², ψ2 φ2² τ², . . . , ψp φp² τ²). Then the conditional distributions can be expressed as follows:

(a) β | ψ, φ, τ, y ∝ exp{ −(1/2)(y − Xβ)'(y − Xβ) } exp{ −(1/2) β' Σ^{−1} β }
    ∼ MVNorm( (Σ^{−1} + X'X)^{−1} X'Y, (Σ^{−1} + X'X)^{−1} ).

(b) ψi | φi, τ, βi ∝ p(βi | φi, τ, ψi) p(ψi) ∝ ψi^{−1/2} exp{ −(1/2)[ βi²/(φi² τ² ψi) + ψi ] }
    ∼ giG( 1/2, 1, βi²/(φi² τ²) ).
    (Recall that Y ∼ giG(λ, ρ, χ) if f(y) ∝ y^{λ−1} e^{−0.5(ρy + χ/y)} for y > 0.)

(c) τ | φ, β ∝ p(β | φ, τ) p(τ) ∝ τ^{−p+pa−1} exp{ −(1/2)( 2 Σ_{i=1}^p |βi|/(φi τ) + τ ) }
    ∼ giG( pa − p, 1, 2 Σ_{i=1}^p |βi|/φi ).

(d) φ | β has the same distribution as (T1/T, . . . , Tp/T), where the Ti ∼ giG(a − 1, 1, 2|βi|) independently and T = Σ_{i=1}^p Ti.

In this scheme, steps (b), (c) and (d) are exactly the same as in the normal means problem, but in step (a) we draw β from a multivariate normal distribution whose covariance matrix is (Σ^{−1} + X'X)^{−1}. Rewriting (Σ^{−1} + X'X)^{−1} in the form (I + Σ X'X)^{−1} Σ, this matrix is well defined even when X'X is singular.
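A short sketch of step (a) is given below; using the rewritten form avoids forming Σ^{−1}, and the symmetrization line and the Generator-based draw are numerical choices of this sketch, not part of the derivation.

```python
import numpy as np


def draw_beta(X, y, psi, phi, tau, rng):
    """Draw beta | psi, phi, tau, y in the DL linear model (step (a))."""
    p = X.shape[1]
    Sigma = np.diag(psi * phi**2 * tau**2)          # prior covariance of beta
    XtX, Xty = X.T @ X, X.T @ y
    # (Sigma^{-1} + X'X)^{-1} rewritten as (I + Sigma X'X)^{-1} Sigma
    cov = np.linalg.solve(np.eye(p) + Sigma @ XtX, Sigma)
    cov = (cov + cov.T) / 2                         # symmetrize for numerical stability
    mean = cov @ Xty
    return rng.multivariate_normal(mean, cov)
```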
In [Bhattacharya et al.(2014)], this contraction property is developed for the normal means problem: under some restrictions on the true means θ0, the posterior arising from the DL setting contracts at the minimax rate of convergence for an appropriate choice of the Dirichlet concentration parameter a. Here we can get a similar result for the linear regression model. Under some conditions on the design matrix, we also obtain the minimax optimal rate of posterior contraction, meaning that the posterior concentrates most of its mass on a ball around β0 of squared radius of the order of qn log(p/qn).

Theorem 1. Consider model (2.1) with a = p^{−(1+β)} for some small β > 0. Assume β0 ∈ l0[qn; p] with qn = o(n) and ‖β0‖₂² ≤ qn log⁴ p. Also, suppose the elements of the design matrix X are bounded, that is, there is a constant K such that max_{i,j} |Xi,j| ≤ K, and that the dimension of β is of order at most n^{2−ε}, i.e. p = O(n^{2−ε}) for some ε > 0. Then, with sn² = qn log(p/qn) and for some constant M > 0,

lim_{n→∞} E_{β0} P( ‖β − β0‖₂ < M sn | y ) = 1.    (2.2)

If a = 1/p instead, then (2.2) holds when qn ≳ log n.

2.4 Simulation and application

To compare the performance of the three methods, we show results from a simulation study and from an application to real data.

2.4.1 Simulation on normal means problem

In this part, we investigate the performance of the different methods on the normal means problem. In each setting, the model is yi = θi + εi, εi ∼ N(0, 1), i = 1, . . . , n, with n = 100 and a true model of size 10. That is, y is sampled from a N100(θ0, I100) distribution, with θ0 having 10 nonzero entries, all set to a constant A > 0. We considered A = 0.75, 1.5, 3, 4, 5, 6, 7, 8, but for brevity the table only shows results for A = 0.75, 1.5, 4, 7, 8. We ran 50 replicates for each case and compare the average squared error and average absolute error. The results are shown in Table 2.1; Table 2.1a reports estimation based on the Gibbs sampler median. The squared error loss in the table is the squared deviation between the posterior median and the true coefficients, averaged across the 50 replicates; it can be written as Σ_{i=1}^{50} Σ_{j=1}^{100} (θ̂_{ij} − θ_{ij})² / 50. The absolute error loss is the absolute deviation between the estimator and the true parameters, also averaged across the 50 replicates: Σ_{i=1}^{50} Σ_{j=1}^{100} |θ̂_{ij} − θ_{ij}| / 50. To better understand the source of the error, we divide the squared error into two parts: the first part comes from the parameters with nonzero true coefficients, Σ_{i=1}^{50} Σ_{j=1}^{10} (θ̂_{ij} − θ_{ij})² / 50, and the second part from the parameters with zero true coefficients, Σ_{i=1}^{50} Σ_{j=11}^{100} (θ̂_{ij} − θ_{ij})² / 50. We denote them respectively as sq.error1 and sq.error2 in the table. In the same way, we divide the absolute error into two parts, denoted abs.error1 and abs.error2 in the table.
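The error decomposition just defined can be computed directly; the following small sketch (with illustrative array names) makes the bookkeeping explicit.

```python
import numpy as np


def error_decomposition(theta_hat, theta_true):
    """theta_hat: (replicates, p) estimates; theta_true: (p,) true mean vector."""
    sq = (theta_hat - theta_true)**2                # (replicates, p)
    active = theta_true != 0
    sq_error = sq.sum(axis=1).mean()                # total squared error, averaged over replicates
    sq_error1 = sq[:, active].sum(axis=1).mean()    # contribution of truly nonzero entries
    sq_error2 = sq[:, ~active].sum(axis=1).mean()   # contribution of truly zero entries
    return sq_error, sq_error1, sq_error2
```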
Table 2.1: Summary of error on estimation in the normal means model. In each table, the true model contains 10 nonzero constant entries with value A. The total number of entries is 100.

(a) Summary of error on estimation based on the Gibbs sampler median for each method. sq.error denotes the squared error of estimation and abs.error the absolute error of estimation. sq.error1 is the error from the truly nonzero covariates and sq.error2 the error from the truly zero covariates (similarly for abs.error1 and abs.error2).

         A             0.75     1.5      4        7       8
  DL     sq.error      5.625   22.500  122.659  26.276  14.233
         sq.error1     5.625   22.500  122.659  26.276  14.233
         sq.error2     0        0        0       0       0
         abs.error     7.500   15.000   32.593  11.289   9.376
         abs.error1    7.500   15.000   32.593  11.289   9.376
         abs.error2    0        0        0       0       0
  SD     sq.error      5.752   24.883   24.389   8.733   8.131
         sq.error1     5.234   18.735   24.389   8.733   8.131
         sq.error2     0.518    6.1478   0       0       0
         abs.error     7.905   17.188   11.504   7.081   8.452
         abs.error1    7.127   13.090   11.504   7.081   8.452
         abs.error2    0.778    4.098    0       0       0
  CMH    sq.error      5.625   22.500   18.411   5.643   5.540
         sq.error1     5.625   22.500    8.336   5.643   5.540
         sq.error2     0        0       10.075   0       0
         abs.error     7.500   15.000   11.120   5.890   5.584
         abs.error1    7.500   15.000    7.946   5.890   5.584
         abs.error2    0        0        3.174   0       0

(b) Squared error based on least squares estimation for each method.

         A      0.75     1.5      4        7       8
  DL            5.625   22.500  157.865  53.301   7.092
  SD           13.591   28.501   12.451   5.643   5.540
  CMH           5.625   22.5     18.411   5.643   5.540

(c) Variable selection performance for each method. FP represents the false positive rate and FN the false negative rate.

         A      0.75         1.5          4           7          8
                FP    FN     FP    FN     FP    FN    FP   FN    FP   FN
  DL            0     10     0     10     0    9.86   0   0.98   0   0.12
  SD            1.56  9.04   3.24  7.88   0.7  0      0   0      0   0
  CMH           0     10     0     10     1    1      0   0      0   0

From the results in Table 2.1a, we can see some interesting patterns. For DL, all the error comes from error1, that is, from the covariates with truly nonzero coefficients, and the error from the zero coefficients is exactly zero. This implies that DL always shrinks the zero coefficients to exactly zero, which is a good property in high-dimensional problems. For SD, in every case most of the error comes from error1; compared to DL, SD yields a smaller error1, which means SD gives better estimates of the nonzero coefficients. When θ is 4 or greater, CMH gives good estimates, and since it picks up the correct covariates when θ is 7 or 8, its error then comes entirely from error1; obviously OLS gives good results when the true nonzero covariates are identified. Note that when the true parameters are around A = 4, the squared error is rather large for all methods: a signal of this size does not seem strong enough to be reliably identified, while the estimates are still shrunk toward zero, which makes the squared error large.

Since error1 and error2 depend on which covariates are included in the model, we further examine how each method performs at selecting the true model. When selecting variables, the ideal result would include all the nonzero coefficients in the model and exclude all the zero coefficients, but in practice each method inevitably makes some mistakes. Table 2.1c summarizes the variable selection results. In this table, FP stands for the false positive rate, the mistake of estimating a zero coefficient as nonzero, and FN stands for the false negative rate, the mistake of estimating a nonzero coefficient as zero. Both FP and FN in this table are averaged across the 50 replicates. From Table 2.1c, DL never estimates a zero coefficient as nonzero, but it shrinks all nonzero covariates to zero when θ is small. When θ is 0.75 or 1.5, the signal is quite weak compared to the noise variance of 1, and all three methods make obvious mistakes when deciding which covariates should be included in the model: they essentially treat every coefficient as zero. If θ is 7 or 8, every method recognizes the correct model in most cases, while CMH gives better estimation, as shown in Table 2.1a.
We have explained that CMH uses different priors for model selection and for estimation; to compare the three methods on an equal footing, we also apply the least squares estimate, as in the CMH method, and see how the estimation results change. Table 2.1b gives the results. From Table 2.1b, the estimation does not improve, and even gets worse after applying the flat prior for both DL and SD, because in most cases, for example when the signal value is less than 7, either method has misspecified some predictors in the model. If the identified model is not the true model, then least squares gives poor estimates of the coefficients. In this case, DL and SD give better estimation with the same priors as applied in their papers.

2.4.2 Simulation for linear regression case

As discussed in Section 2.3, we can extend the DL prior to the linear model, even in the case when p > n. In this part, we first consider the linear regression case with p < n. The model is y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}, εi ∼ N(0, 1), with n = 100 and p = 60. For the design matrix X, covariates Xi and Xj are standard normal with correlation ρ^{|i−j|}, ρ = 0.5. The true model size is 10: the true β contains 10 identical nonzero entries and 50 zero entries, with the nonzero value set to A = 0.75, 1.5, 4, 7, 8. Table 2.2a summarizes the squared error loss of estimation; for the full table with more information, see the supplementary document.

Table 2.2: Summary of error on estimation in the linear regression model. In each table, the true model contains 10 nonzero constant entries whose value equals A. The total number of possible covariates is 60.

(a) Squared error of estimation based on the Gibbs sampler median for each method.

         A      0.75    1.5     4       7       8
  DL            0.620   0.745   0.451   0.625   0.521
  SD            0.204   0.176   0.332   0.252   0.389
  CMH           0.201   0.117   0.077   0.085   0.187

(b) Squared error based on least squares estimation for each method.

         A      0.75    1.5     4       7       8
  DL            0.185   0.442   0.077   0.085   0.187
  SD            0.132   0.117   0.077   0.085   0.187
  CMH           0.201   0.117   0.077   0.085   0.187

(c) Variable selection performance for each method. FP represents the false positive rate and FN the false negative rate.

         A      0.75       1.5        4         7         8
                FP   FN    FP   FN    FP   FN   FP   FN   FP   FN
  DL            0.6  0     1    0     0    0    0    0    0    0
  SD            0    0     0    0     0    0    0    0    0    0
  CMH           1    0     0    0     0    0    0    0    0    0

From Table 2.2a, CMH gives the best estimation among the three, and the results from SD are better than those from DL. Looking at the divided error, as long as the signal coefficient is greater than 1, the error of SD and CMH comes entirely from error1, that is, from the truly nonzero parameters. All the error from the zero-coefficient part is zero, which means these two methods always recognize the zero coefficients, i.e. the FP rate is zero. To check this, we can look at the variable selection performance summarized in Table 2.2c. From Table 2.2c, SD and CMH recognize all the correct covariates even when A = 1.5, which is a desirable result; combined with the information in Table 2.2a, all three methods give estimates quite close to the true parameter values. If we apply least squares for all three methods, the squared error of estimation is summarized in Table 2.2b. From Table 2.2b, the squared error loss decreases compared with the results based on the Gibbs sampler, which means that after applying the flat prior, the estimates move toward the true values; since in most cases every method picks up the correct model, least squares gives desirable results.
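For reference, the data-generating design used in this subsection can be reproduced with a short sketch like the one below; placing the ten active coefficients in the first ten positions and the particular seed are arbitrary choices of the sketch, since the thesis does not specify them.

```python
import numpy as np


def simulate_regression(n=100, p=60, rho=0.5, A=4.0, n_active=10, seed=0):
    """Correlated design with corr(X_i, X_j) = rho^|i-j|, 10 coefficients equal to A."""
    rng = np.random.default_rng(seed)
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    beta[:n_active] = A                 # active positions are an arbitrary choice here
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```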
2.4.3 Simulation on high-dimension case

All three methods are designed to solve high-dimensional variable selection problems, so we want to check how they work in the high-dimensional case, comparing DL, SD and CMH when p > n. The basic settings are similar to the previous part. The model is y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}, εi ∼ N(0, 1), now with p greater than n. In the simulation we consider two scenarios, n = 100, p = 120 and n = 100, p = 200. To construct the design matrix X, covariates Xi and Xj are standard normal with correlation ρ^{|i−j|}, ρ = 0.5. In each case the true model size is 10, and the true β contains 10 identical nonzero entries, with 110 zero entries when p = 120 and 190 zero entries when p = 200. The nonzero value is set to A = 0.75, 1.5, 4, 7, 8. The results are summarized in Table 2.3a. Each case has 50 replicates, and the squared error losses presented in Table 2.3a are averages over the 50 replicates.

Table 2.3: Summary of error on estimation in the high-dimensional case. In each table, the true model contains 10 nonzero constant entries whose value equals A. The total number of possible covariates is p.

(a) Squared error of estimation based on the Gibbs sampler median for each method.

                    p = 120                     p = 200
         A      DL      SD      CMH         DL      SD      CMH
         0.75   0.245   0.560   0.132       1.476   2.562   0.246
         1.5    0.086   0.334   0.147       0.177   0.205   0.070
         4      0.265   0.855   0.258       0.259   0.213   0.141
         7      0.067   1.415   0.088       0.280   0.215   0.266
         8      0.319   0.432   0.225       0.433   0.216   0.189

(b) Squared error based on least squares estimation.

                    p = 120                     p = 200
         A      DL      SD      CMH         DL      SD      CMH
         0.75   0.372   0.297   0.132       2.887   2.351   0.246
         1.5    0.147   0.147   0.147       0.070   0.070   0.070
         4      0.257   0.257   0.257       0.141   0.141   0.141
         7      0.087   0.087   0.087       0.266   0.266   0.266
         8      0.224   0.224   0.224       0.189   0.189   0.189

In Table 2.3a, whether p = 120 or p = 200, the estimation results are close to the true values, since all the errors in this table are quite small. If we look at the variable selection performance, all the FP and FN rates are zero for signal value A ≥ 1.5, which means all three methods recognize the true model. For p = 200, CMH can always pick out the correct predictors even when A = 0.75, which is an outstanding performance; DL and SD identify the correct model in most cases when A = 0.75, and the estimation is also comparable with the p < n case. So all three methods give excellent performance in high-dimensional problems. After applying least squares estimation, Table 2.3b is obtained for p = 120 and p = 200, respectively. Comparing Table 2.3a with Table 2.3b, the estimation improves slightly in some cases, since the true model can almost always be identified.

2.4.4 Remark on simulation

Each method gives acceptable estimation and variable selection results. When the true nonzero signal is strong, say greater than 7 as in the simulation study, CMH is always a good choice, since it picks up the correct model and gives estimates with small error. If the nonzero signal is moderate, SD is a good choice: it correctly tells which covariates should be included in the model in most cases, and its estimates are quite close to the true coefficients. If the signal is too weak, none of the methods can correctly estimate the model.
DL works well in the normal means problem and when we do not want too many covariates to appear in the model, since DL is the best at shrinking the zero parameters exactly to zero, and it can be nicely explained as relieving the reliance on a single global scale parameter. The computing time also differs across methods: with 8000 iterations, SD takes about 8 minutes when n = 100 and p = 200, while DL takes 16 minutes and CMH 17 minutes for the same setting. In the simulation study, SD always has an advantage in computing time.

2.4.5 Real data application

In this part, we apply the three variable selection methods to a real data set to examine how they work in practice. We use data from an experiment studying the genetics of two inbred mouse populations. The data include expression levels of 22,575 genes from 31 female and 29 male mice, resulting in a total of 60 arrays. Some physiological phenotypes are measured by quantitative real-time PCR. The gene expression data and the phenotypic data are available at GEO (http://www.ncbi.nlm.nih.gov/geo). Because this is an ultra-high dimensional problem with pn = 22,575, we first perform a simple screening of the genes based on the magnitude of their marginal correlations with the response. After the screening, the dataset for each response consists of p = 200 predictors (including the intercept and gender), obtained by taking 198 genes from the marginal screening. We choose GPAT (glycerol-3-phosphate acyltransferase) as the response. We performed variable selection with SD, DL and CMH, splitting the sample into a training set of 55 observations and a test set with the remaining five observations; we use the training set to fit the model and predict the response in the test set. By ordering the posterior inclusion probabilities for each method, we can list the top 10 variables. For DL, we use the rank of the Gibbs sampler median instead: since a Bayesian credible region is used to decide whether a covariate should be included in the model, the rank of the Gibbs sampler median provides information about the importance of a covariate. Table 2.4 lists the 10 covariates with the highest marginal inclusion probability for each method. The results are based on the average rank over 30 replicates; that is, we first compute the ranks in each replication and then average them over the 30 replicates, which gives the ranking in Table 2.4.

Table 2.4: Order of covariates by highest posterior probability

  DL     152, 149, 113, 25, 182, 183, 191, 139, 194, 125
  SD     152, 149, 113, 191, 25, 183, 182, 34, 199, 56
  CMH    152, 191, 194, 182, 199, 113, 149, 25, 183, 69

From Table 2.4, the variable with id 152 ("1457715_at") has the highest posterior inclusion probability. Variables with ids 25, 113, 149, 152, 182, 183 and 191 are proposed in common by all three methods. The model recommended by CMH is the size-2 model with covariates 152 and 194. Viewing this as a variable selection problem, we can first pick the covariates included in the model and then calculate the coefficients by least squares. We compare the mean squared prediction error for different model sizes for each method, choosing model sizes 2, 4, 6, 8 and 10. Since CMH gives the posterior probability of each model, for each model size we pick the model with the highest posterior probability among models of that size; it is an advantage of CMH that it provides the posterior probability of each model.
To be comparable with the other methods, we also compute the MSPE for models formed by including the covariates with the highest marginal posterior probabilities up to the required model size. Figure 2.1 compares the MSPE obtained by each method. CMH1 represents the model chosen by the highest model posterior probability among models of a given size; CMH2 represents the model including the covariates with the highest marginal posterior probabilities.

Figure 2.1: Plots of MSPE versus model size for each method. For the results shown in this figure, all hyper-parameters are chosen at their default or recommended values as discussed in Section 2.2.4. CMH1 is the MSPE obtained by choosing the model with the highest posterior probability among models of a given size. CMH2 denotes the model including the covariates with the highest marginal posterior probabilities.

From Figure 2.1, we can see that CMH gives the best prediction when the model is chosen based on the model posterior probability. Instead of a marginal inclusion probability for each covariate, it considers the probability of a whole model, which is advantageous since it combines all the covariates in a model. On the other hand, we would like to know what happens if we simply use the median of the posterior Gibbs samples as the coefficients for SD and DL. The result is given in Figure 2.2.

Figure 2.2: Plot of MSPE versus model size for each method. For the results shown in this figure, all hyper-parameters are chosen at their default or recommended values as discussed in Section 2.4.2, and estimation uses the median of the Gibbs samples.

From Figure 2.2, we can see that prediction is not good when the median of the Gibbs samples is used as the coefficient estimate, since both SD and DL over-shrink the estimator and make every coefficient close to zero. So from the application results, we may believe that the model obtained from the CMH method with the highest posterior probability is more reliable in terms of prediction.

2.4.6 Remark on application result

We can see that the model with the highest posterior probability based on CMH gives outstanding performance in most cases. If SD or DL is employed, the recommendation is to first choose the covariates and then apply least squares estimation to obtain the model coefficients. To explain this further, in DL and SD we assume a prior for all the model parameters, obtain the posterior distribution of each parameter, draw Gibbs samples at each step, and form the estimator from the Gibbs samples. Since the Gibbs samples are drawn under the assumed prior, we are using the same prior for model selection and for parameter estimation. This may not be the best idea, because for the purpose of model selection we are using a shrinking prior. In the CMH method, the coupled MH algorithm is used for model selection, and after the model is selected, least squares is used for estimation. Least squares estimation essentially assumes a flat prior for each parameter, which is more reasonable when we have no strong prior information about the parameters and we trust that the selected model is the true model. From these arguments, we can conjecture that CMH has smaller estimation error as long as the correct model is selected. Since CMH uses a two-stage approach to estimation, while SD and DL use a single stage for both model selection and estimation, this feature disadvantages SD and DL.
If we also used a two-stage procedure like CMH, that is, the same flat prior giving the least squares estimate once the model is decided, then the estimation would be comparable with CMH, as the earlier results suggest.

2.5 Conclusion

In this chapter we compared three different Bayesian methods for model selection; the comparison, in both theory and application, sheds some light on how to approach high-dimensional problems from a Bayesian perspective. We also extended the DL method to the general linear regression case and showed that the minimax convergence rate holds under some conditions. Each method has advantages in some respect, and how to develop a method that is advantageous in general remains an interesting topic.

2.6 Proof

This section gives a proof sketch for the theorem. The whole proof builds on the theorems of [Bhattacharya et al.(2014)]. First, we need an analogue of Theorem 3.2 of [Bhattacharya et al.(2014)]. Following the procedure in that paper, we obtain the analogue of their inequality (A.7). Then, by Lemma 5.2 in [Castillo and van der Vaart(2012)], we have $A_n = \{D_n \ge e^{-r_n^2} P(\|X\beta - X\beta_0\|_2 \le r_n)\}$. If $\max_{i,j}|X_{i,j}| \le M$, then with $t_n = r_n/M$ the same expression can be written as $A_n = \{D_n \ge e^{-r_n^2} P(\|\beta - \beta_0\|_2 \le t_n)\}$ with $P_{\beta_0}(A_n^C) \le e^{-r_n^2}$. Following the remaining steps, we obtain the analogue of Theorem 3.2 of [Bhattacharya et al.(2014)].

Next, we follow the steps in the proof of Theorem 3.1 of [Bhattacharya et al.(2014)]. Similarly to the previous argument, we get $A_n = \{D_n \ge e^{-4r_n^2} P(\|\beta - \beta_0\|_2 \le 2t_n)\}$ such that $P_{\beta_0}(A_n^C) \le e^{-r_n^2}$, where $t_n = r_n/M$. Then, constructing the net in the same way, we get
$$\|\beta^{S,j,i} - \beta\|_2^2 = \|\beta_S^{S,j,i} - \beta_S\|_2^2 + \|\beta_{S^C}\|_2^2 \le (j r_n)^2 + (p - q_n)\frac{r_n^2}{n^2} \le 4 j^2 r_n^2,$$
where the last inequality holds if $p = O(n^{2-\epsilon})$ for some $\epsilon > 0$. The proof is then finished by the same argument as in the proof of Theorem 3.1 of [Bhattacharya et al.(2014)].

Chapter 3

Variable Selection by Mixing a Spike and a Nonlocal Slab Prior

3.1 Introduction

The literature on Bayesian variable selection is growing rapidly. Bayesian variable selection comes equipped with natural measures of uncertainty, such as the posterior probability of each possible model and the marginal inclusion probability of each predictor. Given a model with a prior and a likelihood, there are formal justifications for choosing a particular model. Many Bayesian methods for variable selection have been proposed in recent years, including stochastic search variable selection ([George and McCulloch (1993)]), empirical Bayes variable selection ([George and Foster(2000)]), penalized credible regions ([Bondell and Reich(2012)]) and the nonlocal prior method ([Johnson and Rossell(2012)]), to name a few. The spike and slab selection method was proposed by [Mitchell and Beauchamp(1988)] and was further modified and developed by several authors, e.g., [Madigan and Raftery(1994)] and [George and McCulloch (1997)]. [Ishwaran and Rao (2005)] further generalized this model selection procedure with detailed computational steps. Although the spike and slab prior has been in the literature for a while, only recently did [Narisetty and He(2014)] develop model selection consistency in the high-dimensional setup. Another notable development of the spike and slab prior is by [Xu and Ghosh(2015)], who showed, in the context of bi-level selection, how to use spike and slab priors to select variables both at the group level and within a group.
The spike and slab prior assumes that the regression coefficients are mutually independent, each with a two-point mixture distribution made up of a flat distribution (the slab) and a degenerate distribution at zero (the spike). Several variations of spike and slab priors have been proposed in the literature. Zero-inflated mixture priors have been used in a Bayesian approach to variable selection ([Mitchell and Beauchamp(1988)]), and [George and McCulloch (1997)] used zero-inflated normal mixture priors in a hierarchical formulation for variable selection in the linear model.

The spike and slab prior is an efficient device for variable selection. A latent binary variable is introduced for each covariate, denoted Z = (Z1, . . . , Zp), where Zi indicates whether the i-th covariate is active in the model. The prior distribution of the regression coefficient βi under Zi = 0 is a point mass at zero, while a diffuse prior is preferred under Zi = 1. The concentrated prior of βi under Zi = 0 is called the spike prior, and the diffuse prior under Zi = 1 is called the slab prior. For the slab we would like a density that is flat with heavy tails; the most commonly used slab is a normal density with a large standard deviation. A normal slab, however, puts non-trivial density at zero, so it overlaps with the spike, which may cause identifiability issues. In this chapter a nonlocal prior is proposed as the slab. Since a nonlocal prior is zero when the parameter is zero, it avoids overlap with the spike, which is a naturally desirable property for a spike and slab prior. Figure 3.1 compares a normal spike with a normal slab against a normal spike with a nonlocal slab. In the extreme case where the standard deviation of the spike goes to zero, the spike degenerates to a point mass at zero. In this chapter we use the degenerate point-mass density as the spike and a nonlocal density as the slab.

Figure 3.1: Comparison between normal-normal and normal-nonlocal mixture priors (one panel shows a N(0, 0.1) spike with a N(0, 2) slab, the other a N(0, 0.1) spike with a nonlocal slab).

3.2 Proposed model specification

We have discussed the motivation for employing a nonlocal prior as the slab: a symmetric bimodal density whose value approaches 0 as the parameter value approaches 0. Johnson and Rossell (2010) proposed several forms of nonlocal densities, for example the product moment (pMOM) density, the t-moment (tMOM) density and the inverse moment (iMOM) density. Among these, the pMOM density is derived from the multivariate normal distribution and has the nonlocal property. We use the pMOM density to explain the nonlocal spike and slab model selection method. The functional form of the first-order, univariate pMOM density is
$$\pi(\beta\mid\tau,\sigma_1) = (2\pi)^{-1/2}(\tau\sigma_1^2)^{-3/2}\,\beta^2\exp\!\Big(-\frac{\beta^2}{2\tau\sigma_1^2}\Big). \qquad (3.1)$$
Suppose τ and σ1 are given hyperparameters. Denoting $(2\pi)^{-1/2}(\tau\sigma_1^2)^{-3/2}$ by C, the pMOM density can be written as
$$\pi(\beta\mid\tau,\sigma_1) = C\,\beta^2\exp\!\Big(-\frac{\beta^2}{2\tau\sigma_1^2}\Big).$$

Now we define the linear model. In the linear regression model we use P to denote the number of covariates; the response is an n × 1 vector and the design matrix is n × P, corresponding to the P covariates of interest. β is the regression coefficient vector. We assume β is sparse in the sense that only a few components of β are nonzero.
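Before specifying the regression model, a quick numerical illustration of (3.1) may help; this is a sketch with illustrative values of τ and σ1. It confirms that the first-order pMOM slab integrates to one and vanishes exactly at zero, so it has no overlap with a point-mass spike, in contrast to a normal slab.

import numpy as np

def pmom_density(beta, tau, sigma1):
    """First-order pMOM slab density of (3.1):
    (2*pi)^(-1/2) * (tau*sigma1^2)^(-3/2) * beta^2 * exp(-beta^2 / (2*tau*sigma1^2)).
    It vanishes at beta = 0, unlike a normal slab."""
    s2 = tau * sigma1 ** 2
    return (2 * np.pi) ** -0.5 * s2 ** -1.5 * beta ** 2 * np.exp(-beta ** 2 / (2 * s2))

grid = np.linspace(-6, 6, 4001)
step = grid[1] - grid[0]
dens = pmom_density(grid, tau=1.0, sigma1=1.0)
print("total mass :", float((dens * step).sum()))        # ~1: properly normalized
print("value at 0 :", pmom_density(0.0, 1.0, 1.0))       # exactly 0: no overlap with the spike
normal_slab = np.exp(-grid ** 2 / 8) / np.sqrt(8 * np.pi)  # normal slab with sd 2, for contrast
print("normal slab at 0:", float(normal_slab[2000]))     # strictly positive at the origin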
The goal of variable selection in high-dimensional data is dimension reduction: identifying the nonzero coefficients in order to recover the active covariates. The normal linear model is
$$Y_{n\times 1}\mid \beta_{P\times 1}, \sigma^2 \sim N\big(X_{n\times P}\,\beta_{P\times 1},\, \sigma^2 I_n\big). \qquad (3.2)$$
For each component of the regression vector, that is for each j in {1, 2, . . . , P}, we assign a spike and slab prior to βj,
$$\beta_j \mid Z_j \sim (1 - Z_j)\,\delta_0(\beta_j) + Z_j\,\pi(\beta_j), \qquad (3.3)$$
where δ0(βj) is a point mass at βj = 0 and π(βj) is shorthand for π(βj | τj, σ1j), the nonlocal density of (3.1),
$$\pi(\beta_j\mid\tau_j,\sigma_{1j}) = (2\pi)^{-1/2}(\tau_j\sigma_{1j}^2)^{-3/2}\,\beta_j^2 \exp\!\Big(-\frac{\beta_j^2}{2\tau_j\sigma_{1j}^2}\Big).$$
In addition, we define the following prior distributions, where p0 is the prior probability that a coefficient is zero:
$$Z_j \sim \mathrm{Bernoulli}(1 - p_0), \qquad p_0 \sim \mathrm{Beta}(a_1, b_1), \qquad \sigma^2 \sim \mathrm{InverseGamma}(\alpha_1, \alpha_2).$$
Here Zj = 0 means βj is excluded from the model and Zj = 1 means βj is included. Denote by β the vector (β1, . . . , βP) and by Z the vector (Z1, . . . , ZP). Under this setting, the joint density can be written as
$$p(\beta, Z, \sigma^2, p_0\mid Y) \propto (\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\Big(Y_i - \sum_p X_{ip}\beta_p\Big)^2\Big\} \times \prod_{j=1}^P\Big[Z_j\,(2\pi)^{-1/2}(\tau_j\sigma_{1j}^2)^{-3/2}\beta_j^2\exp\Big(-\frac{\beta_j^2}{2\tau_j\sigma_{1j}^2}\Big) + (1-Z_j)\,\delta_0(\beta_j)\Big]$$
$$\times \prod_{j=1}^P (1-p_0)^{Z_j} p_0^{1-Z_j} \times p_0^{a_1-1}(1-p_0)^{b_1-1} \times (\sigma^2)^{-\alpha_1-1}\exp\Big(-\frac{\alpha_2}{\sigma^2}\Big).$$

3.2.1 Posterior median as a thresholding estimator

[Xu and Ghosh(2015)] proposed a bi-level variable selection model with a spike and slab prior and pointed out the role of the posterior median as a thresholding rule. Motivated by their work, we propose a similar posterior median estimator and analyze the properties of the proposed method through it. Consider the case p < n with an orthogonal design matrix, and the model defined by (3.2) and (3.3) with fixed τj,n and σ²1j,n, j = 1, . . . , P; the subscript n emphasizes that τj and σ1j are allowed to depend on n for the asymptotic theory. Under this model and these assumptions, the marginal posterior distribution of βj given the observed data is again of spike and slab form,
$$\beta_j\mid Y, X \;\sim\; l_{j,n}\,\delta_0(\beta_j) + (1 - l_{j,n})\exp\!\left\{-\frac{\big(\beta_j - (1-B_{j,n})|\hat\beta_j^{LS}|\big)^2}{2\frac{\sigma^2}{n}(1-B_{j,n})}\right\}\beta_j^2\, D_j, \qquad (3.4)$$
where $\hat\beta_j^{LS}$ is the least squares estimator of βj, $B_{j,n} = \sigma^2/(\sigma^2 + n\,\tau_{j,n}\sigma_{1j,n}^2)$, and Dj is the normalizing constant
$$D_j = \Bigg\{\Big[\big((1-B_{j,n})|\hat\beta_j^{LS}|\big)^2 + \frac{\sigma^2(1-B_{j,n})}{n}\Big]\sqrt{\frac{2\pi\sigma^2(1-B_{j,n})}{n}}\Bigg\}^{-1}. \qquad (3.5)$$
In addition, the posterior probability of Zj = 0, which equals the probability that βj = 0 given the observed data, is
$$l_{j,n} = P(\beta_j = 0\mid Y, X) = \frac{\pi_0}{\pi_0 + (1-\pi_0)\Big(1 + \frac{n\,\tau_{j,n}\sigma_{1j,n}^2}{\sigma^2}\Big)^{-3/2}\big(1 + 2G_{j,n}\big)\exp(G_{j,n})}, \qquad (3.6)$$
where π0 is the prior probability of βj = 0 and $G_{j,n} = \frac{(1-B_{j,n})\big(\hat\beta_j^{LS}\big)^2}{2\sigma^2/n}$. The detailed expressions and derivations behind (3.4) and (3.6) are given in section 3.3.

Denote by Fj,n the cumulative distribution function of the slab component of (3.4), that is, of the density $\exp\big\{-\big(\beta_j - (1-B_{j,n})|\hat\beta_j^{LS}|\big)^2/\big(2\tfrac{\sigma^2}{n}(1-B_{j,n})\big)\big\}\,\beta_j^2\, D_j$, and define the quantile
$$Q_{j,n} = F_{j,n}^{-1}\!\left(\max\Big\{0,\ \frac{0.5 - l_{j,n}}{1 - l_{j,n}}\Big\}\right). \qquad (3.7)$$
The resulting posterior median of βj, a soft-thresholding estimator, is
$$\hat\beta_j^{Med} = \mathrm{Med}\big(\beta_j\mid Y, X\big) = \mathrm{sgn}\big(\hat\beta_j^{LS}\big)\,\big(Q_{j,n}\big)_+. \qquad (3.8)$$
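The rule (3.7)-(3.8) can be evaluated numerically by tabulating the slab component of (3.4) on a grid. The sketch below does exactly that, with the spike probability l_{j,n} supplied by the caller rather than computed from (3.6); it is an illustration of the thresholding rule, not the implementation used in the simulations, and the function name and example values are ours.

import numpy as np

def median_threshold(beta_ls, l_jn, sigma2, n, tau, sigma1):
    """Numerical sketch of the soft-thresholding posterior median (3.7)-(3.8),
    with the spike probability l_jn supplied by the caller.  The slab component
    of (3.4) is handled on a grid: unnormalized density -> CDF -> quantile."""
    B = sigma2 / (sigma2 + n * tau * sigma1 ** 2)            # shrinkage factor B_{j,n}
    m = (1.0 - B) * abs(beta_ls)                             # slab location
    s2 = sigma2 * (1.0 - B) / n                              # slab scale
    grid = np.linspace(m - 10 * np.sqrt(s2), m + 10 * np.sqrt(s2), 20001)
    dens = np.exp(-(grid - m) ** 2 / (2 * s2)) * grid ** 2   # unnormalized slab of (3.4)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    u = max(0.0, (0.5 - l_jn) / (1.0 - l_jn))                # quantile level in (3.7)
    Q = grid[np.searchsorted(cdf, u)]                        # F^{-1}(u)
    return np.sign(beta_ls) * max(Q, 0.0)                    # (3.8)

# A clearly nonzero effect survives; a near-zero one is thresholded to 0.
print(median_threshold(beta_ls=2.0, l_jn=0.05, sigma2=1.0, n=100, tau=0.01, sigma1=1.0))
print(median_threshold(beta_ls=0.1, l_jn=0.90, sigma2=1.0, n=100, tau=0.01, sigma1=1.0))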
3.2.2 Consistency

To further investigate the properties of the thresholding estimator in (3.8), we assume an orthogonal design matrix for the rest of this chapter, i.e., $X^TX = nI_P$; this assumption simplifies the proof of the following theorem. Let β1, β2, . . . , βP denote the true coefficient values of the P covariates. Define the model index vector A with element 1 if the corresponding covariate has a nonzero coefficient and 0 otherwise; in other words, $A_j = I\{\beta_j \neq 0\}$ for j = 1, 2, . . . , P. The model index vector selected by the thresholding estimator $\hat\beta_j^{Med}$ in (3.8) is $A_{n,j}^{Med} = I\{\hat\beta_j^{Med} \neq 0\}$ for j = 1, 2, . . . , P. Model selection consistency is achieved if $\lim_{n\to\infty} P\big(A_n^{Med} = A\big) = 1$. The following theorem states that model selection consistency holds under very mild assumptions.

Theorem 2. Assume an orthogonal design matrix, i.e., $X^TX = nI_P$. Suppose $\sqrt{n}\,\tau_{j,n}\sigma_{1j,n}^2 \to \infty$ and $\log\big(\tau_{j,n}\sigma_{1j,n}^2\big)/n \to 0$ as $n \to \infty$ for j = 1, . . . , P. Then the median thresholding estimator is variable selection consistent:
$$\lim_{n\to\infty} P\big(A_n^{Med} = A\big) = 1. \qquad (3.9)$$

Theorem 2 states that, for large enough sample sizes, the thresholding estimator selects the true model with probability tending to 1. The proof is given in section 3.6.

3.3 Gibbs samplers

In Bayesian variable selection, inference on the parameters is summarized from Gibbs samples of the posterior distribution. In the special case where the prior and posterior are conjugate, i.e., they belong to the same family of densities, it is easy to update a parameter from its posterior. In many cases, however, the posterior is not conjugate to the prior and we need to derive the specific form of the posterior distribution. In this part we give the exact Gibbs sampling formula for each parameter; the detailed derivations are collected in section 3.7.

3.3.1 Gibbs sampler for βj

For each coefficient βj, j ∈ {1, 2, . . . , P}, the full conditional distribution is again a spike and slab distribution, as a consequence of the spike and slab prior. The spike part is a point mass at zero, and the slab part follows from a direct calculation:
$$\mathrm{slab}\big(\beta_j\mid \mathrm{rest}\big) \propto \exp\Big\{-A\Big(\beta_j - \frac{B}{A}\Big)^2\Big\}\times\beta_j^2,$$
where
$$A = \frac{1}{2\tau_j\sigma_{1j}^2} + \frac{1}{2\sigma^2}\sum_{i=1}^n X_{ij}^2, \qquad B = \frac{\sum_{i=1}^n X_{ij}\big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\big)}{2\sigma^2}, \qquad C = \frac{1}{2\sigma^2}\sum_{i=1}^n\Big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\Big)^2.$$
This is a non-symmetric, non-standard density with a high peak around B/A. Its normalizing constant can be obtained by integration. Denote the density of slab(βj | rest) by DENj. A Gibbs draw from this density can be obtained by a sampling scheme similar to generating a random number from a distribution through its cumulative distribution function. The full conditional of βj is then
$$\beta_j\mid \mathrm{rest} \;\sim\; p_j\,\delta_0(\beta_j) + (1 - p_j)\,\mathrm{DEN}_j,$$
where pj is the conditional probability that βj follows the point mass at zero, i.e., the probability that βj = 0.
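A single Gibbs update for βj can be sketched as follows: with probability pj (whose expression is given in the next subsection and is passed in here as an argument) the spike value 0 is drawn, and otherwise a draw from DENj is obtained by inverse-CDF sampling on a grid, in the spirit of the sampling scheme just described. The function name, grid width and synthetic data are illustrative only.

import numpy as np

rng = np.random.default_rng(1)

def gibbs_update_beta_j(j, beta, X, Y, sigma2, tau_j, sigma1_j, p_j):
    """One Gibbs step for beta_j: with probability p_j draw the spike value 0,
    otherwise draw from DEN_j ∝ exp(-A (beta_j - B/A)^2) * beta_j^2 by
    inverse-CDF sampling on a grid."""
    if rng.random() < p_j:
        return 0.0
    resid = Y - X @ beta + X[:, j] * beta[j]                 # Y_i - sum_{p != j} X_ip beta_p
    A = 1.0 / (2 * tau_j * sigma1_j ** 2) + (X[:, j] ** 2).sum() / (2 * sigma2)
    B = (X[:, j] * resid).sum() / (2 * sigma2)
    center, width = B / A, 6.0 / np.sqrt(A)                  # bulk of the slab mass
    grid = np.linspace(center - width, center + width, 10001)
    dens = np.exp(-A * (grid - center) ** 2) * grid ** 2     # unnormalized DEN_j
    cdf = np.cumsum(dens); cdf /= cdf[-1]
    return float(grid[np.searchsorted(cdf, rng.random())])

# Illustrative usage on synthetic data; p_j = 0.2 stands in for the value of section 3.3.2.
X = rng.normal(size=(100, 5)); beta = np.zeros(5); beta[0] = 2.0
Y = X @ beta + rng.normal(size=100)
print(gibbs_update_beta_j(0, beta.copy(), X, Y, sigma2=1.0, tau_j=0.01, sigma1_j=1.0, p_j=0.2))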
3.3.2 Gibbs sampler for pj

Since pj is the conditional probability that βj = 0, j ∈ {1, 2, . . . , P}, it is critical for deciding which coefficients are significant. In practice we employ the latent variable Zj attached to βj: Zj is drawn from a Bernoulli distribution with success probability 1 − pj, so Zj takes value 0 or 1, Zj = 0 implies βj = 0, and Zj = 1 implies βj follows DENj. The calculation of pj is as follows; the detailed derivation is given in section 3.7:
$$p_j = P(Z_j = 0\mid \mathrm{rest}) = \frac{p_0}{p_0 + (1-p_0)\,T}, \qquad T = K\sqrt{2\pi}\left[\Big(\frac{\sigma^2}{N_j}\Big)^{3/2} + M^2\Big(\frac{\sigma^2}{N_j}\Big)^{1/2}\right],$$
where
$$K = C\exp\Big(\frac{N_j M^2}{2\sigma^2}\Big), \qquad N_j = \sum_{i=1}^n X_{ij}^2 + \frac{\sigma^2}{\tau_j\sigma_{1j}^2}, \qquad M = \frac{\sum_{i=1}^n X_{ij}\big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\big)}{N_j},$$
and $C = (2\pi)^{-1/2}(\tau_j\sigma_{1j}^2)^{-3/2}$ is the pMOM normalizing constant.

3.3.3 Gibbs samplers for σ² and p0

Gibbs draws for the error variance σ² and the global proportion p0 are easier to obtain, since their posteriors are conjugate to their priors. The posterior of σ² is again inverse gamma,
$$\sigma^2\mid \mathrm{rest} \sim \mathrm{InverseGamma}\Bigg(\frac{n}{2} + \alpha_1,\; \frac{1}{2}\sum_{i=1}^n\Big(Y_i - \sum_p X_{ip}\beta_p\Big)^2 + \alpha_2\Bigg),$$
while the posterior of p0 is still a beta distribution,
$$p_0\mid \mathrm{rest} \sim \mathrm{Beta}\big(\#\{j: \beta_j = 0\} + a_1,\; \#\{j: \beta_j \neq 0\} + b_1\big).$$

3.3.4 Comment on τ

Without any prior information we can choose the same value of τj for every j ∈ {1, 2, . . . , P}, i.e., the same prior for each coefficient; for simplicity we drop the subscript and write τ. τ is an important tuning parameter for the prior distribution. With large τ the nonlocal slab is flatter and puts most of its mass away from 0; since the nonlocal density approaches 0 as the parameter approaches 0, a large τ imposes a strong penalty on small parameter values. This helps avoid selecting unnecessary covariates, but carries the risk of missing important ones. With small τ, on the other hand, the prior assigns more density to values near 0 and can detect signals of small magnitude, but too small a τ may lead to over-selection. It is therefore important to choose τ so as to balance over-selection against the risk of omitting important covariates. In our empirical experiments, τ = 0.01 achieves satisfactory performance.

3.4 Simulation study

3.4.1 Estimation performance by mean squared error

In the linear regression case the model is $Y_{n\times 1} = X_{n\times P}\beta_{P\times 1} + \epsilon_{n\times 1}$ with $\epsilon_i \sim N(0, 1)$. We set n = 100 and consider three values of P: 60, 120 and 200. For the design matrix X, the covariates Xi and Xj are standard normal with correlation $\rho^{|i-j|}$, ρ = 0.3. The true model size is 10: the true β contains 10 identical nonzero entries, with all remaining entries equal to zero. The nonzero value is set to Coef = 1, 1.5, 4 and 7, respectively; for example, when Coef = 1 the true nonzero coefficients are (1, 1, . . . , 1) of length 10. Table 3.1 summarizes the mean squared error (MSE) of estimation. In these simulations there were no falsely selected covariates and no omitted important covariates, so the mass-nonlocal prior correctly identifies the true model, and the MSE values in Table 3.1 come entirely from estimating the 10 nonzero coefficients, with $MSE = \frac{1}{10}\sum_{i=1}^{10}\big(\hat\beta_i - \beta_i\big)^2$.

Table 3.1: MSE for n = 100 with the mass-nonlocal prior
  Coef       1        1.5      4        7
  dim=60     0.0635   0.0601   0.0766   0.0743
  dim=120    0.0798   0.0767   0.0785   0.0824
  dim=200    0.268    0.2475   0.2052   0.257

The MSE values in Table 3.1 are small, indicating that the method gives accurate estimates of the coefficient values.
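The simulation design of this subsection can be reproduced along the following lines; this is a sketch of the data-generating mechanism only (correlation ρ^|i−j|, ten equal nonzero coefficients, standard normal errors), with function names of our choosing.

import numpy as np

def simulate_linear(n=100, P=200, coef=1.0, rho=0.3, seed=0):
    """One replicate of the section 3.4.1 design: standard-normal covariates with
    corr(X_i, X_j) = rho^|i-j|, ten identical nonzero coefficients equal to `coef`,
    and N(0, 1) errors."""
    rng = np.random.default_rng(seed)
    cov = rho ** np.abs(np.subtract.outer(np.arange(P), np.arange(P)))
    X = rng.multivariate_normal(np.zeros(P), cov, size=n)
    beta = np.zeros(P)
    beta[:10] = coef                                    # true model of size 10
    y = X @ beta + rng.normal(size=n)
    return X, y, beta

def mse_nonzero(beta_hat, beta_true):
    """MSE over the ten true nonzero coefficients, as reported in Table 3.1."""
    idx = np.nonzero(beta_true)[0]
    return float(np.mean((beta_hat[idx] - beta_true[idx]) ** 2))

X, y, beta = simulate_linear(coef=1.5)
print(X.shape, y.shape, mse_nonzero(np.zeros_like(beta), beta))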
3.4.2 Selection performance

In this part we report simulation results for several (n, p) combinations, signal strengths and sparsity levels, following the simulation settings of [Narisetty and He(2014)]. We refer to the proposed method as mass-nonlocal, for its specification of the spike and slab prior. The Bayesian shrinking and diffusing prior of [Narisetty and He(2014)] is referred to as BASAD. The other methods under comparison are piMOM, the nonlocal prior of [Johnson and Rossell(2012)]; SpikeSlab of [Ishwaran and Rao (2005)]; and three penalization methods, LASSO, elastic net (EN) and SCAD, tuned by BIC and denoted LASSO.BIC, EN.BIC and SCAD.BIC, respectively.

Case 1: The sample size is n = 100 and the candidate dimension is P = n = 100. The covariates are generated from multivariate normal distributions with zero mean and unit variance, with compound symmetric covariance and pairwise correlation ρ = 0.25. Five covariates are active, with coefficients β = (0.6, 1.2, 1.8, 2.4, 3.0). Under this setting the covariates have moderate correlation and the coefficient signals are relatively strong.

Case 2: The scenario (n, P) = (100, 500), with the other parameters the same as in Case 1.

Case 3: The design covariance and |t| = 5 are kept, with low signals βt = (0.6, 0.6, 0.6, 0.6, 0.6).

MSPE denotes the mean squared prediction error based on n new observations used as test data (n = 100 in Case 2, Case 3 and the top panel of Case 1, and n = 200 in the bottom panel of Case 1). The simulation results are summarized in Tables 3.2, 3.3 and 3.4; the performance measures pp0, pp1, Z = t, Z ⊃ t, FDR and MSPE are defined in the caption of Table 3.2.

Table 3.2: Case 1: performance of mass-nonlocal for n = P. Here pp0 and pp1 (when applicable) are the average posterior probabilities of inactive and active variables, respectively; Z = t is the proportion of replications in which exactly the true model is selected; Z ⊃ t is the proportion in which the selected model contains all active covariates; FDR is the false discovery rate; and MSPE is the mean squared prediction error of the selected models.

  (n,p)=(100,100), |t|=5
  Method         pp0     pp1     Z=t     Z⊃t     FDR     MSPE
  mass-nonlocal  0.003   0.985   0.960   0.972   0.001   1.099
  BASAD          0.016   0.985   0.866   0.954   0.015   1.092
  piMOM          0.012   0.991   0.836   0.982   0.030   1.083
  SpikeSlab      -       -       0.005   0.216   0.502   1.660
  LASSO.BIC      -       -       0.01    0.992   0.430   1.195
  EN.BIC         -       -       0.398   0.982   0.154   1.134
  SCAD.BIC       -       -       0.356   0.990   0.160   1.157

  (n,p)=(200,200), |t|=5
  Method         pp0     pp1     Z=t     Z⊃t     FDR     MSPE
  mass-nonlocal  0.001   1.000   0.996   1.000   0.001   1.028
  BASAD          0.002   1.000   0.944   1.000   0.009   1.087
  piMOM          0.003   1.000   0.900   1.000   0.018   1.038
  SpikeSlab      -       -       0.008   0.236   0.501   1.530
  LASSO.BIC      -       -       0.014   1.000   0.422   1.101
  EN.BIC         -       -       0.492   1.000   0.113   1.056
  SCAD.BIC       -       -       0.844   1.000   0.029   1.040

Table 3.3: Case 2: performance of mass-nonlocal in the high-dimensional setting, (n,p)=(100,500), |t|=5.

  Method         pp0     pp1     Z=t     Z⊃t     FDR     MSPE
  mass-nonlocal  0.007   0.967   0.682   0.876   0.054   1.152
  BASAD          0.001   0.948   0.730   0.775   0.011   1.130
  SpikeSlab      -       -       0.000   0.040   0.626   3.351
  LASSO.BIC      -       -       0.005   0.845   0.4661  1.280
  EN.BIC         -       -       0.135   0.835   0.283   1.223
  SCAD.BIC       -       -       0.045   0.980   0.328   1.260

Table 3.4: Case 3: performance of mass-nonlocal with low signal, (n,p)=(100,500), |t|=5.

  Method         pp0     pp1     Z=t     Z⊃t     FDR     MSPE
  mass-nonlocal  0.014   0.975   0.572   0.956   0.520   1.143
  BASAD          0.002   0.622   0.185   0.010   0.561   2.319
  SpikeSlab      -       -       0.000   0.195   0.345   2.466
  LASSO.BIC      -       -       0.000   0.066   0.478   1.555
  EN.BIC         -       -       0.040   0.000   0.340   1.552
  SCAD.BIC       -       -       0.045   0.857   0.464   1.561

3.4.3 Observations from the simulation

Table 3.2 gives the results for Case 1. The three Bayesian methods, mass-nonlocal, BASAD and piMOM, perform better than the other methods, and the mass-nonlocal prior is the best in terms of Z = t, Z ⊃ t, FDR and MSPE. The results for Case 2 are in Table 3.3; the mass-nonlocal method again outperforms the other methods on most of the measures.
When the signal is low, as in Case 3, the performance of mass-nonlocal is even more impressive. Table 3.4 reports the simulation performance for Case 3. All five competing methods have difficulty identifying the true model: the proportion of selecting exactly the true model is 18.5% for BASAD and below 5% for the four penalization methods, whereas mass-nonlocal attains 57.2%, more than three times the rate of BASAD. Mass-nonlocal also gives the smallest FDR and MSPE.

3.4.4 Real data application

In this section we apply the proposed variable selection method to a real data set to examine how it works in practice. We consider data from an experiment studying the genetics of two inbred mouse populations. The data include expression levels of 22,575 genes for 31 female and 29 male mice, resulting in a total of 60 arrays. The amount of phosphoenolpyruvate carboxykinase (PEPCK) is measured by quantitative real-time PCR. The gene expression data and the phenotypic data are available at GEO (http://www.ncbi.nlm.nih.gov/geo; accession number GSE3330). [Narisetty and He(2014)] used these data to illustrate their method.

Following [Narisetty and He(2014)], we first performed a simple screening of the genes based on the magnitude of their marginal correlations with the response; [Fan and Lv(2008)] explain the power of marginal screening. After screening, the data set consists of P = 118 predictors (including the intercept and gender). We performed variable selection with mass-nonlocal along with BASAD. Following [Narisetty and He(2014)], the samples are randomly split into a training set of 55 observations and a test set of the remaining five observations. The model fitted on the training set is used to predict the response in the test set, and this process is repeated 100 times to estimate prediction performance. In Figure 3.2 we plot the average mean squared prediction error (MSPE) for models of various sizes chosen by BASAD and mass-nonlocal. The MSPE of mass-nonlocal is mostly smaller than that of BASAD. About half of the genes chosen by mass-nonlocal overlap with those chosen by BASAD.

Figure 3.2: Mean squared prediction error versus model size for analyzing PEPCK (BASAD and mass-nonlocal).

3.5 Discussion

In this chapter we proposed a mass-nonlocal form of the spike and slab prior for variable selection in high-dimensional data. This prior puts a point mass at 0 as the spike and employs a nonlocal prior, which avoids the point 0, as the slab. This property yields strong variable selection performance and also gives more accurate estimation than most other variable selection methods, as the simulation results show. However, the selection performance is sensitive to the tuning parameter τ. We recommend tuning its value by practical experimentation; a rigorous method for choosing the tuning parameter, for example a variant of empirical Bayes, could be investigated in a future study.

In addition, the simulation results show the superior predictive performance of the mass-nonlocal prior: it yields a much smaller mean squared prediction error, which can be explained by its precise estimation of the parameter values. In this chapter we established the variable selection consistency property; we strongly believe that stronger properties, such as the oracle property, should also hold.
3.6 Proof of theorem

In this part we give the proof of Theorem 2. The following notation is used throughout: $\hat\beta_j^{Med}$ is defined in (3.8); $F_{j,n}$ is the cumulative distribution function of the slab density $\exp\big\{-\big(\beta_j - (1-B_{j,n})|\hat\beta_j^{LS}|\big)^2/\big(2\tfrac{\sigma^2}{n}(1-B_{j,n})\big)\big\}\,\beta_j^2\, D_j$; $Q_{j,n}$ is defined in (3.7); $F_{j,n}^{-1}$ is the inverse of $F_{j,n}$; and $l_{j,n}$ is defined in (3.6).

Proof. For j such that βj = 0, we have $\sqrt{n}\,\hat\beta_j^{LS} = O_p(1)$, and from the assumption $\sqrt{n}\,\tau_{j,n}\sigma_{1j,n}^2 \to \infty$ it follows that $l_{j,n} \to 1$ as $n \to \infty$. The probability of correctly classifying this coefficient is
$$P\big(|\hat\beta_j^{Med}| = 0\big) = P\big(Q_{j,n} \le 0\big) \to 1. \qquad (3.10)$$
Indeed, as $n\to\infty$ and $l_{j,n} \to 1$, $Q_{j,n} \to F_{j,n}^{-1}(0)$, which is negative with probability 1, so (3.10) holds.

For j such that βj ≠ 0, since $\hat\beta_j^{LS} \to_p \beta_{j0} \neq 0$ and, by assumption, $\log\big(\tau_{j,n}\sigma_{1j,n}^2\big)/n \to 0$, we have $l_{j,n} \to_p 0$ as $n \to \infty$. The probability of correctly identifying this coefficient is
$$P\big(|\hat\beta_j^{Med}| \neq 0\big) = P\big(Q_{j,n} > 0\big), \qquad (3.11)$$
so we need to show that
$$Q_{j,n} = F_{j,n}^{-1}\!\left(\max\Big\{0,\ \frac{0.5 - l_{j,n}}{1 - l_{j,n}}\Big\}\right) > 0. \qquad (3.12)$$
With $l_{j,n}\to 0$, $\max\{0, (0.5 - l_{j,n})/(1 - l_{j,n})\} = (0.5 - l_{j,n})/(1 - l_{j,n}) \to 1/2$, which implies $\lim_{n\to\infty} Q_{j,n} = F_{j}^{-1}(1/2)$. Hence (3.12) is satisfied if we can show $t = F_j^{-1}(1/2) > 0$, which follows from
$$\int_{-\infty}^{0}\exp\!\left\{-\frac{\big(\beta_j - (1-B_{j,n})|\hat\beta_j^{LS}|\big)^2}{2\frac{\sigma^2}{n}(1-B_{j,n})}\right\}\beta_j^2\, D_j\, d\beta_j < \frac{1}{2}, \qquad (3.13)$$
where $D_j$ is given in (3.5).

We now show that (3.13) holds. Write $H_{j,n} = (1-B_{j,n})|\hat\beta_j^{LS}| > 0$ and $s_n^2 = \frac{\sigma^2}{n}(1-B_{j,n})$. Substituting $u = \beta_j - H_{j,n}$ and expanding $\beta_j^2 = (u + H_{j,n})^2$ gives
$$\int_{-\infty}^{0}\exp\Big\{-\frac{(\beta_j - H_{j,n})^2}{2s_n^2}\Big\}\beta_j^2\, d\beta_j = \int_{-\infty}^{-H_{j,n}} e^{-u^2/(2s_n^2)}\big(u^2 + 2H_{j,n}u + H_{j,n}^2\big)\, du.$$
The three pieces are
$$\int_{-\infty}^{-H_{j,n}} u^2 e^{-u^2/(2s_n^2)}\, du = s_n^2 H_{j,n} e^{-H_{j,n}^2/(2s_n^2)} + s_n^2\sqrt{2\pi s_n^2}\,\Phi\!\Big(-\frac{H_{j,n}}{s_n}\Big),$$
$$2H_{j,n}\int_{-\infty}^{-H_{j,n}} u\, e^{-u^2/(2s_n^2)}\, du = -2H_{j,n} s_n^2 e^{-H_{j,n}^2/(2s_n^2)}, \qquad H_{j,n}^2\int_{-\infty}^{-H_{j,n}} e^{-u^2/(2s_n^2)}\, du = H_{j,n}^2\sqrt{2\pi s_n^2}\,\Phi\!\Big(-\frac{H_{j,n}}{s_n}\Big).$$
Adding these and multiplying by $D_j = \big\{(H_{j,n}^2 + s_n^2)\sqrt{2\pi s_n^2}\big\}^{-1}$ from (3.5) gives
$$\int_{-\infty}^{0}\exp\Big\{-\frac{(\beta_j - H_{j,n})^2}{2s_n^2}\Big\}\beta_j^2\, D_j\, d\beta_j = \Phi\!\Big(-\frac{H_{j,n}}{s_n}\Big) - \frac{s_n^2 H_{j,n}\, e^{-H_{j,n}^2/(2s_n^2)}}{(H_{j,n}^2 + s_n^2)\sqrt{2\pi s_n^2}}.$$
Since $-H_{j,n}/s_n < 0$, we have $\Phi(-H_{j,n}/s_n) < \Phi(0) = 1/2$, and the second term is nonnegative and subtracted, so the left-hand side of (3.13) is always less than 1/2. Hence (3.13) holds, and consequently (3.12) holds. This proves Theorem 2.
3.7 Additional: calculation of the Gibbs samplers

3.7.1 Gibbs sampler for βj

$$\mathrm{slab}\big(\beta_j\mid\mathrm{rest}\big) \propto \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\Big(Y_i - \sum_p X_{ip}\beta_p\Big)^2\Big\}\times \exp\Big\{-\frac{\beta_j^2}{2\tau_j\sigma_{1j}^2}\Big\}\,\beta_j^2$$
$$= \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\Big(Y_i - \sum_{p\neq j}X_{ip}\beta_p - X_{ij}\beta_j\Big)^2 - \frac{\beta_j^2}{2\tau_j\sigma_{1j}^2}\Big\}\,\beta_j^2$$
$$= \exp\Big\{-\Big(\frac{1}{2\tau_j\sigma_{1j}^2} + \frac{1}{2\sigma^2}\sum_{i=1}^n X_{ij}^2\Big)\beta_j^2 + \frac{1}{\sigma^2}\sum_{i=1}^n X_{ij}\Big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\Big)\beta_j - \frac{1}{2\sigma^2}\sum_{i=1}^n\Big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\Big)^2\Big\}\,\beta_j^2.$$
With
$$A = \frac{1}{2\tau_j\sigma_{1j}^2} + \frac{1}{2\sigma^2}\sum_{i=1}^n X_{ij}^2, \qquad B = \frac{\sum_{i=1}^n X_{ij}\big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\big)}{2\sigma^2}, \qquad C = \frac{1}{2\sigma^2}\sum_{i=1}^n\Big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\Big)^2,$$
this becomes
$$\mathrm{slab}\big(\beta_j\mid\mathrm{rest}\big) \propto \exp\big\{-A\beta_j^2 + 2B\beta_j - C\big\}\,\beta_j^2 = \exp\Big\{-A\Big(\beta_j - \frac{B}{A}\Big)^2\Big\}\,\beta_j^2 \times \exp\Big\{\frac{B^2}{A} - C\Big\}.$$
Denoting slab(βj | rest) by DENj, the full conditional is
$$\beta_j\mid\mathrm{rest} \sim p_j\,\delta_0(\beta_j) + (1 - p_j)\,\mathrm{DEN}_j.$$

3.7.2 Gibbs sampler for pj

$$p_j = P(Z_j = 0\mid\mathrm{rest}) = \frac{p_0}{p_0 + (1-p_0)\times T},$$
where T is the ratio of the slab contribution, integrated over βj, to the spike contribution:
$$T = \frac{\int \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\big(Y_i - \sum_p X_{ip}\beta_p\big)^2\Big\}\, C\exp\Big\{-\frac{\beta_j^2}{2\tau_j\sigma_{1j}^2}\Big\}\beta_j^2\, d\beta_j}{\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\big(Y_i - \sum_{p\neq j}X_{ip}\beta_p\big)^2\Big\}},$$
with $C = (2\pi)^{-1/2}(\tau_j\sigma_{1j}^2)^{-3/2}$. Writing $R_i = Y_i - \sum_{p\neq j}X_{ip}\beta_p$ and completing the square in βj with
$$N_j = \sum_{i=1}^n X_{ij}^2 + \frac{\sigma^2}{\tau_j\sigma_{1j}^2}, \qquad M = \frac{\sum_{i=1}^n X_{ij}R_i}{N_j},$$
we obtain
$$T = C\exp\Big\{\frac{N_j M^2}{2\sigma^2}\Big\}\int \exp\Big\{-\frac{N_j}{2\sigma^2}(\beta_j - M)^2\Big\}\beta_j^2\, d\beta_j.$$
Decomposing $\beta_j^2 = (\beta_j - M)^2 + 2M(\beta_j - M) + M^2$, the three integrals are
$$\int \exp\Big\{-\frac{N_j}{2\sigma^2}(\beta_j - M)^2\Big\}(\beta_j - M)^2 d\beta_j = \sqrt{2\pi}\Big(\frac{\sigma^2}{N_j}\Big)^{3/2}$$
(obtained by integrating the density of the nonlocal MOM prior),
$$2M\int \exp\Big\{-\frac{N_j}{2\sigma^2}(\beta_j - M)^2\Big\}(\beta_j - M)\, d\beta_j = 0, \qquad M^2\int \exp\Big\{-\frac{N_j}{2\sigma^2}(\beta_j - M)^2\Big\} d\beta_j = M^2\sqrt{\frac{2\pi\sigma^2}{N_j}}.$$
Therefore, with $K = C\exp\{N_j M^2/(2\sigma^2)\}$,
$$T = K\sqrt{2\pi}\left[\Big(\frac{\sigma^2}{N_j}\Big)^{3/2} + M^2\Big(\frac{\sigma^2}{N_j}\Big)^{1/2}\right],$$
which is the expression used in section 3.3.2.

Chapter 4

Model Selection for Generalized Linear Models Using Nonlocal Priors

4.1 Introduction

In modern statistical practice, variable selection is one of the most commonly used techniques, especially in clinical and genetic research, because of the complex and high-dimensional nature of the data. A good amount of work has been done recently from both frequentist and Bayesian perspectives. This chapter concentrates on Bayesian model selection for generalized linear models, in particular logistic regression, based on a nonlocal prior distribution.

In classical statistics, many approaches have been proposed for model selection. Popular methods include F tests, the Akaike information criterion ([Akaike (1973)]), the Bayes information criterion ([Schwarz (1978)]), Mallows' Cp, exhaustive search, and stepwise, backward and forward selection procedures. Many frequentist methods based on penalization have also been developed with good properties.
Among these, the least absolute shrinkage and selection operator (LASSO) ([Tibshirani(1996)]), based on an L1-norm penalty, is one of the most popular and effective model selection procedures. The elastic net ([Zou and Hastie (2005)]) is derived from a linear combination of L1- and L2-norm penalties. The smoothly clipped absolute deviation (SCAD) penalty ([Fan and Li (2001)]) is nonconcave and leads to the oracle property. The adaptive LASSO ([Zou (2006)]), which uses adaptive weights to penalize different coefficients in the L1 norm, is consistent under some conditions. The Dantzig selector ([Candes and Tao (2007)]) solves an L1-regularization problem and also converges efficiently.

In generalized linear regression, computation can be tedious because the associated likelihood function is complicated. Earlier literature on variable selection in logistic regression used criterion-based methods such as AIC or BIC. With suitable approximation techniques, however, penalized methods can also be applied to generalized linear regression ([Van de Geer (2008)], [Huang et al. (2008)]). With the popularity of the LASSO, fitting generalized linear models along a LASSO or elastic-net regularization path was proposed by [Friedman, Hastie and Tibshirani (2010)].

Classical methods have some limitations, for example a lack of interpretability, consistency or uncertainty quantification, which motivates the use of Bayesian methods. Bayes factors and posterior probabilities are easy to understand, and Bayesian model selection is consistent if one of the entertained models is actually the true model and enough data are observed. The Bayesian approach can also account for model uncertainty, and under appropriate priors the Bayesian solution can be equivalent to a frequentist penalized approach. These advantages make Bayesian methods popular, aided by the development of efficient computational algorithms. Many Bayesian methods have been proposed, such as the Bayesian LASSO ([Park and Casella (2008)]) using Laplacian shrinkage, Bayesian model averaging ([George (1999)]), the spike and slab prior ([Ishwaran and Rao (2005)]) via a latent variable, Gibbs variable selection ([Dellaportas et al. (1997)]) with a mixture prior for each variable, and stochastic search variable selection ([George and McCulloch (1993)]; [George and McCulloch (1997)]) with a spike prior centered at zero with small variance.

When it comes to generalized linear models, the computational problem arises again because of the intrinsically complicated likelihood function, and it remains with Bayesian methods because of non-conjugacy. With modern computational platforms and efficient algorithms, fully Bayesian approaches can be applied. For example, an informative prior is proposed by [Chen, Ibrahim and Yiannoutsos (1999)]; [Nott and Leonte (2012)] discuss an efficient sampling algorithm; and [Jiang (2006)] considers logistic regression with priors that, under some conditions, lead to consistent convergence toward the true model, with the theory extended to generalized linear models in [Jiang (2007)]. [Sha et al.(2004)] applied probit regression to classify binary responses based on microarray data, and [Zhou, Liu and Wong(2004)] used Bayesian variable selection for logistic regression to achieve excellent cross-validated classification errors. In another direction, Bayesian subset modeling is proposed by [Liang, Song and Yu (2013)].
Most of these Bayesian methods place substantial mass on the null parameter value in order to achieve shrinkage and hence variable selection; [Johnson and Rossell(2012)] refer to this as model selection based on local prior densities. In this chapter we are interested in nonlocal priors ([Johnson and Rossell(2010)]). A nonlocal prior density is identically zero whenever a model parameter equals its null value, typically 0 in model selection settings, whereas most current Bayesian model selection procedures employ local prior densities, which are positive at the null value. Since a nonlocal prior density is zero if any component of the parameter is zero, it leads to parsimonious model selection. A Bayesian model selection procedure for the linear regression model with a nonlocal prior density was proposed by [Johnson and Rossell(2012)], who also compared it with popular local prior densities and with frequentist approaches. In this chapter we extend their approach to generalized linear models, specifically the logistic regression model. The main contributions of this work are the use of a nonlocal prior density for logistic regression and the study of its validity: we prove posterior convergence in the high-dimensional setting along the lines of [Jiang (2007)] and derive the convergence rate for logistic regression with a nonlocal prior density. The numerical results are promising.

The rest of the chapter is organized as follows. Section 4.2 presents the proposed methodology, section 4.3 describes the algorithm for implementation, and section 4.4 provides numerical results along with a real data example. The proofs are given in section 4.6.

4.2 Methodology

4.2.1 Bayesian logistic regression

Let μ be the success probability of a binary random variable Y. The logistic regression model is logit(μ) = Xβ, where X is the n × pn design matrix, β is the pn × 1 regression coefficient vector, and logit(z) = log(z/(1 − z)) for z ∈ (0, 1). With data points {(Xi, yi)}, i = 1, . . . , n, the likelihood function is
$$p(y\mid\beta) = \prod_{i=1}^n\left(\frac{1}{1 + e^{-\sum_s X_{is}\beta_s}}\right)^{y_i}\left(\frac{1}{1 + e^{\sum_s X_{is}\beta_s}}\right)^{1-y_i}.$$
In the Bayesian setting, suppose β has a prior density p(β). The joint density of the data and β is p(y, β) = p(y|β)p(β), and the marginal density of the data is $p(y) = \int p(y\mid\beta)p(\beta)\,d\beta$. Since this does not depend on β, we denote the marginal density by Z. Bayesian inference is based on the posterior distribution of β,
$$p(\beta\mid y) = \frac{p(y\mid\beta)p(\beta)}{p(y)} = \frac{1}{Z}\,p(y\mid\beta)p(\beta).$$
Denote $F(\beta) = -\log p(y, \beta) = -\log\{p(y\mid\beta)p(\beta)\}$, so the posterior of β can be written as $p(\beta\mid y) = \frac{1}{Z}e^{-F(\beta)}$. Assume there exists a posterior mode β*. Expanding F(β) around β* gives
$$F(\beta) \approx F(\beta^*) + (\beta - \beta^*)^T g(\beta^*) + \tfrac{1}{2}(\beta - \beta^*)^T H(\beta^*)(\beta - \beta^*), \qquad (4.1)$$
where $g(\beta) = \partial F(\beta)/\partial\beta$ and $H(\beta) = \partial^2 F(\beta)/\partial\beta\,\partial\beta^T$, both evaluated at the posterior mode β*. Since β* maximizes −F(β), the gradient g(β*) is 0 and the Hessian matrix H(β*) is positive definite. We then have
$$p(\beta\mid y) \approx \frac{1}{Z}\,e^{-F(\beta^*)}\exp\Big\{-\tfrac{1}{2}(\beta - \beta^*)^T H(\beta^*)(\beta - \beta^*)\Big\},$$
so the posterior of β is approximated by a normal distribution, N(β*, H^{-1}). This is a general scheme for Bayesian analysis of the logistic regression model that avoids more complicated computational schemes such as the Metropolis-Hastings algorithm ([Hoff (2009)]).
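The mode-plus-Hessian scheme above can be sketched as follows. For simplicity the sketch uses a N(0, s²I) prior as a stand-in for a generic p(β) (the pMOM prior of the next subsection changes only F, g and H) and Newton-Raphson to locate β*; the function name and settings are illustrative, not the implementation used later.

import numpy as np

def laplace_logistic(X, y, prior_var=10.0, iters=50):
    """Normal approximation to the posterior of a logistic regression: find the
    mode beta* of -F(beta) = log p(y|beta) + log p(beta) by Newton-Raphson and
    return (beta*, H), with H the Hessian of F at the mode.  A N(0, prior_var*I)
    prior stands in for p(beta) here."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))                 # fitted probabilities
        g = -X.T @ (y - mu) + beta / prior_var               # gradient of F
        W = mu * (1.0 - mu)
        H = X.T @ (X * W[:, None]) + np.eye(p) / prior_var   # Hessian of F
        step = np.linalg.solve(H, g)
        beta = beta - step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta, H

# The posterior is then approximated by N(beta*, H^{-1}).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)); true = np.array([1.0, -1.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true)))
mode, H = laplace_logistic(X, y)
print(mode, np.sqrt(np.diag(np.linalg.inv(H))))              # mode and approximate posterior sds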
Having approximated the posterior density of the regression vector β by a member of a standard family, we can easily sample from it for posterior evaluation; the Newton-Raphson algorithm may be used to find β* and H, as in the sketch above. To achieve dimension reduction, we use a prior density designed to shrink each coefficient.

4.2.2 Model selection via nonlocal priors

We consider the nonlocal prior of [Johnson and Rossell(2012)] as the prior specification, in particular the product moment (pMOM) density they proposed:
$$p(\beta\mid\tau,\sigma^2, r) = d_p\,(2\pi)^{-p/2}(\tau\sigma^2)^{-rp - p/2}\,|M_p|^{1/2}\exp\Big\{-\frac{\beta'M_p\beta}{2\tau\sigma^2}\Big\}\prod_{i=1}^p\beta_i^{2r}. \qquad (4.2)$$
Here σ² is a dispersion parameter; for logistic regression without over-dispersion we set σ² = 1. The positive scale parameter τ determines the dispersion of the prior on β around 0. Mp is a p × p nonsingular scale matrix; in the absence of subjective information about the prior correlation between regression coefficients we set Mp = Ip. The positive integer r is called the order of the density; we set r = 1 for simplicity. dp is the normalizing constant.

For logistic regression with the pMOM prior, and continuing with the notation introduced above, the expression for F(β) is
$$F(\beta) = -\log p(y,\beta) = \sum_i\Big[y_i\log\big(1 + e^{-X_i\beta}\big) + (1 - y_i)\log\big(1 + e^{X_i\beta}\big)\Big] - \log(d_p) + \frac{p}{2}\log(2\pi)$$
$$\qquad\qquad + \Big(\frac{p}{2} + rp\Big)\log(\tau\sigma^2) - \frac{1}{2}\log|M_p| + \frac{\beta'M_p\beta}{2\tau\sigma^2} - 2r\sum_i\log|\beta_i|. \qquad (4.3)$$
With a numerical optimization method we can find the maximizer of log p(y, β), equivalently the minimizer β* of F(β), and the value of the joint density at β* is then easy to compute. The Hessian matrix of F(β) is available in closed form: for i ≠ j, the (i, j)-th element is
$$\frac{\partial^2 F}{\partial\beta_i\partial\beta_j} = \sum_m y_m X_{mi}X_{mj}\frac{e^{\sum_s X_{ms}\beta_s}}{\big(1 + e^{\sum_s X_{ms}\beta_s}\big)^2} + (1 - y_m)X_{mi}X_{mj}\frac{e^{-\sum_s X_{ms}\beta_s}}{\big(1 + e^{-\sum_s X_{ms}\beta_s}\big)^2},$$
while the i-th diagonal element is
$$\frac{\partial^2 F}{\partial\beta_i^2} = \sum_m y_m X_{mi}^2\frac{e^{\sum_s X_{ms}\beta_s}}{\big(1 + e^{\sum_s X_{ms}\beta_s}\big)^2} + (1 - y_m)X_{mi}^2\frac{e^{-\sum_s X_{ms}\beta_s}}{\big(1 + e^{-\sum_s X_{ms}\beta_s}\big)^2} + \frac{1}{\tau} + \frac{2}{\beta_i^2}.$$
With these expressions we can easily obtain the determinant of the Hessian evaluated at β*, and the marginal density of y can then be approximated by the Laplace approximation, which gives p(y) in closed form:
$$Z = p(y) = \int p(y,\beta)\,d\beta = \int e^{-F(\beta)}\,d\beta \approx \det\!\Big(\frac{H(\beta^*)}{2\pi}\Big)^{-1/2} e^{-F(\beta^*)} = e^{-F(\beta^*)}(2\pi)^{k/2}\,|H(\beta^*)|^{-1/2}, \qquad (4.4)$$
where k denotes the dimension of β under the model.
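Putting (4.3) and (4.4) together, the log marginal likelihood of a given model can be approximated as in the following sketch, which assumes r = 1, σ² = 1 and Mp = Ip (so that dp = 1, since the prior is then a product of univariate first-order pMOM densities). A derivative-free optimizer is used for the mode and the Hessian is the analytic expression above; this is an illustration rather than the implementation used later, and the starting-value heuristic is ours.

import numpy as np
from scipy.optimize import minimize

def neg_log_joint(beta, X, y, tau):
    """F(beta) of (4.3) with r = 1, sigma^2 = 1, M_p = I_p (so d_p = 1)."""
    theta = X @ beta
    loglik = y @ theta - np.sum(np.logaddexp(0.0, theta))
    p = beta.size
    logprior = (-0.5 * p * np.log(2 * np.pi) - 1.5 * p * np.log(tau)
                - beta @ beta / (2 * tau) + 2 * np.sum(np.log(np.abs(beta))))
    return -(loglik + logprior)

def log_marginal_laplace(X, y, tau=0.384):
    """Laplace approximation (4.4): log p(y) ~ -F(beta*) + (k/2) log(2*pi) - 0.5 log|H(beta*)|."""
    p = X.shape[1]
    score = X.T @ (y - 0.5)                                  # crude sign heuristic for a start
    start = np.where(score >= 0, 0.5, -0.5)                  # keeps the start away from 0
    opt = minimize(neg_log_joint, start, args=(X, y, tau), method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-10})
    b = opt.x
    w = np.exp(X @ b) / (1.0 + np.exp(X @ b)) ** 2           # logistic weights at the mode
    H = X.T @ (X * w[:, None]) + np.diag(1.0 / tau + 2.0 / b ** 2)
    sign, logdet = np.linalg.slogdet(H)
    return -opt.fun + 0.5 * p * np.log(2 * np.pi) - 0.5 * logdet

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2)); yprob = 1 / (1 + np.exp(-X @ np.array([1.5, -1.0])))
y = rng.binomial(1, yprob)
print(log_marginal_laplace(X, y))    # approximate evidence for the model with both covariates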
Now, for any model k, where k can be any subset of the full model, we can assign a prior probability p(k). A reasonable and simple choice is the uniform prior, under which p(k) is the same for all k. Another possibility is a binomial prior, in which each covariate is included in the model with probability π, for example π = 0.5. A third candidate is the beta-binomial prior, in which each covariate is included with probability π and π follows a beta density with parameters a and b, that is, $p(\pi\mid a, b) = \pi^{a-1}(1-\pi)^{b-1}/B(a, b)$, where B(a, b) is the beta function.

Given a model prior p(k) and the marginal density of the data, the posterior probability of any model k is
$$p(k\mid y) = \frac{p(k)\,p_k(y)}{\sum_t p(t)\,p_t(y)}.$$
When the number of covariates is large it is impossible to enumerate all candidate models, which makes the denominator of p(k|y) hard to compute. But we can compare the posteriors of any two models k1 and k2: since p(k1|y) and p(k2|y) share the same denominator, we only need to compare the numerators p(k1)p_{k1}(y) and p(k2)p_{k2}(y), and these two terms are easy to calculate, with expression (4.4) approximating p(y) and p(k) coming from the prior specification. Since posterior probabilities of models can be compared, a Metropolis-Hastings algorithm can be employed to obtain a sequence of sampled models, from which we can identify the maximum a posteriori (MAP) model and estimate the posterior probabilities of the MAP and other high-probability models.

4.2.3 Extension to GLM

We have described model selection for logistic regression in the previous section; in this part we extend the approach to generalized linear models (GLM). Suppose the data can be modeled by a GLM with density function
$$p(y\mid\theta) = \exp\{a(\theta)y + b(\theta) + c(y)\}, \qquad (4.5)$$
where a(θ) and b(θ) are continuously differentiable functions of θ, c(y) is a function of y only, a(θ) has nonzero derivative, and θ is the natural parameter relating y to the predictors through the linear function
$$\theta = \beta_1 x_1 + \cdots + \beta_p x_p, \qquad (4.6)$$
where β1, . . . , βp are regression coefficients. The mean function is $\mu = E(y\mid x_1, \ldots, x_p) = -b'(\theta)/a'(\theta) \equiv \psi(\theta)$.

We now sketch the GLM model selection procedure. The joint density is $p(y,\beta) = p(y\mid\beta)p(\beta) = \exp\{\sum_{i=1}^n a(\theta_i)y_i + b(\theta_i) + c(y_i)\}\cdot p(\beta)$, where (4.2) is used for p(β). For a specific GLM we obtain the corresponding functions a(θ), b(θ) and c(y), and then write down F(β) in the same way as in (4.3):
$$F(\beta) = -\log p(y,\beta) = -\sum_i\big(a(\theta_i(\beta))y_i + b(\theta_i(\beta)) + c(y_i)\big) - \log(d_p) + \frac{p}{2}\log(2\pi)$$
$$\qquad\qquad + \Big(\frac{p}{2} + rp\Big)\log(\tau\sigma^2) - \frac{1}{2}\log|M_p| + \frac{\beta'M_p\beta}{2\tau\sigma^2} - 2r\sum_i\log|\beta_i|, \qquad (4.7)$$
where θi(β) emphasizes the linear relationship between θ and β. With specific forms of a(θ), b(θ) and c(y), computing a minimizer β* of F(β) is feasible and the Hessian matrix is available, even though the calculation can be tedious depending on the forms of a(θ) and b(θ). The (i, j)-th element of the Hessian is $-\sum_m\big(a''(\theta_m)y_m + b''(\theta_m)\big)X_{mi}X_{mj}$, and the i-th diagonal element is $-\sum_m\big(a''(\theta_m)y_m + b''(\theta_m)\big)X_{mi}^2 + \frac{1}{\tau\sigma^2} + \frac{2r}{\beta_i^2}$ (for Mp = Ip). After obtaining the minimizer β* of F(β) and the Hessian H(β*), the marginal density $p(y) \approx e^{-F(\beta^*)}(2\pi)^{k/2}|H(\beta^*)|^{-1/2}$ is easily computed. With a model prior p(k) and the marginal densities pk(y), the posterior probabilities p(k1|y) and p(k2|y) of any two models can be compared, so a sequence of candidate models can be generated by the Metropolis-Hastings algorithm, and the MAP model and the posterior probabilities of highly likely candidate models can be derived from the sequence of sampled models.
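As concrete instances of (4.5)-(4.6), standard exponential-family algebra (a general fact, not specific to this chapter's data) gives, for example,
$$\text{Bernoulli/logistic:}\quad p(y\mid\theta) = \exp\{\theta y - \log(1+e^\theta)\},\qquad a(\theta)=\theta,\; b(\theta)=-\log(1+e^\theta),\; c(y)=0,\; \psi(\theta)=\frac{e^\theta}{1+e^\theta};$$
$$\text{Poisson:}\quad p(y\mid\theta) = \exp\{\theta y - e^\theta - \log y!\},\qquad a(\theta)=\theta,\; b(\theta)=-e^\theta,\; c(y)=-\log y!,\; \psi(\theta)=e^\theta.$$
In the logistic case a''(θ) = 0 and −b''(θ) = e^θ/(1 + e^θ)², which recovers the Hessian terms of section 4.2.2.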
4.2.4 Theoretical properties

We would like the model t attaining the maximum posterior probability to converge to the true model t* as the sample size n increases. Define f* as the true density under model t* and f as the proposed density under model t. To investigate the convergence rate we follow the results of [Jiang (2007)]. We assume the nonlocal prior specification is used and $\lim_{n\to\infty}\sum_{j=1}^{p_n}|\beta_j^*| < \infty$, where pn is the number of covariates, which is allowed to increase with the sample size n. γ denotes a subset of covariate indices for which |βj| > 0, and |γ| is the cardinality of γ. Also, let ch1(M) denote the largest eigenvalue of a matrix M. For two positive sequences an and bn, an ≺ bn means $\lim_{n\to\infty} a_n/b_n = 0$.

For the nonlocal prior with density (4.2) we consider a diagonal matrix Mp, that is, no prior correlation between regression coefficients is assumed, since the prior normalizing constant dp can be difficult to evaluate when Mp is not diagonal; the theory below, however, holds for any Mp satisfying the stated assumptions. When Mp is diagonal it is proportional to a covariance matrix Aγ. In the following we place assumptions on Mγ. We first introduce some notation: rn is the prior expectation of the model size,
$$\Delta(r_n) = \inf_{\gamma:|\gamma|=r_n}\sum_{j\notin\gamma}|\beta_j^*|, \qquad B(r_n) = \sup_{\gamma:|\gamma|=r_n} ch_1\big(M_\gamma^{-1}\big), \qquad \bar B(r_n) = \sup_{\gamma:|\gamma|=r_n} ch_1\big(M_\gamma\big), \qquad \tilde B_n = \sup_{\gamma} ch_1\big(M_\gamma^{-1}\big),$$
and $D(R) = 1 + R\cdot\sup_{|h|\le R}|a'(h)|\cdot\sup_{|h|\le R}|\psi(h)|$. Without loss of generality we assume $B(r_n)$, $\bar B(r_n)$ and $\tilde B_n$ are bounded. Let $\epsilon_n \in (0, 1]$ for each n with $n\epsilon_n^2 \ge 1$, and assume the following conditions hold.

Assumption 1
(C1) $p_n\log(1/\epsilon_n^2) \prec n\epsilon_n^2$;
(C2) $p_n\log(p_n) \prec n\epsilon_n^2$;
(C3) $p_n\log D\big(2p_n\sqrt{\tilde B_n\, n\epsilon_n^2}\big) \prec n\epsilon_n^2$;
(C4) $r_n \prec p_n$;
(C5) $r_n\log\bar B(r_n) \prec n\epsilon_n^2$;
(C6) $\Delta(r_n) \prec \epsilon_n^2$ and $\log\frac{r_n}{p_n} \le -\frac{4n\epsilon_n^2}{p_n}$.

We then prove the following theorem, which establishes model selection consistency and gives the convergence rate.

Theorem 3. Suppose the prior given in (4.2) is employed and the conditions of Assumption 1 hold. Let P{·} denote the probability measure of the data Dn. Then
(a) for some c0 > 0, $\lim_{n\to\infty} P\big\{\pi\big[d(f, f^*) \le \epsilon_n\mid D_n\big] \ge 1 - e^{-c_0 n\epsilon_n^2}\big\} = 1$, where $d(f, f^*) = \big(\int\!\!\int(\sqrt f - \sqrt{f^*})^2\,\nu_y(dy)\,\nu_x(dx)\big)^{1/2}$ is the Hellinger distance between f and f*;
(b) for some c1 > 0 and all sufficiently large n, $P\big\{\pi\big[d(f, f^*) > \epsilon_n\mid D_n\big] \ge e^{-0.5 c_1 n\epsilon_n^2}\big\} \le e^{-0.5 c_1 n\epsilon_n^2}$.

The proof is given in section 4.6.

4.3 Algorithm

When the number of covariates p increases with the sample size, the number of candidate models grows exponentially as 2^p, so it is practically impossible to compute the posterior probability of every candidate model. Following [Johnson and Rossell(2012)], a Metropolis-Hastings algorithm is employed to generate MCMC samples from the model space, as described below.

Step 1: Choose an initial model k^curr.
Step 2: For i = 1, 2, . . . , p,
  (a) Define a model k^cand by excluding or including βi from the model k^curr, according to whether βi is currently included in or excluded from k^curr. Specifically, if k^curr includes βi, deleting βi from k^curr gives k^cand; conversely, if k^curr does not include βi, adding βi to k^curr gives k^cand.
  (b) Compute
$$\alpha = \frac{p_{k^{cand}}(y)\,p(k^{cand})}{p_{k^{cand}}(y)\,p(k^{cand}) + p_{k^{curr}}(y)\,p(k^{curr})}, \qquad (4.8)$$
  where $p_{k^{curr}}(y)$ and $p_{k^{cand}}(y)$ are calculated from equation (4.4), and p(k^curr) and p(k^cand) depend on the model prior.
  (c) Draw u ∼ U(0, 1). If α > u, update k^curr = k^cand.
Step 3: Repeat Step 2 until a sufficiently long chain is acquired.

After the sequence of sampled models is obtained, we can use the chain to identify the maximum a posteriori (MAP) model and the posterior probabilities of the MAP and other high-probability models; the marginal inclusion probability of each covariate can also be computed from the sampled models.
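Steps 1-3 can be sketched as follows. Here log_marginal stands for any routine returning log p_k(y), for example the Laplace sketch of section 4.2.2, and it must also handle the empty (null) design; the beta-binomial model prior is integrated over ρ in closed form. Function names and defaults are ours, and this is an illustration of the algorithm rather than the implementation used in section 4.4.

import numpy as np
from scipy.special import betaln

def mh_model_search(X, y, log_marginal, n_sweeps=200, a=1.0, b=1.0, seed=0):
    """Sketch of Steps 1-3: sweep over coordinates, propose adding/removing one
    covariate, and accept with the probability alpha of (4.8).  The model prior
    is beta-binomial with parameters (a, b)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    log_prior = lambda k: betaln(k.sum() + a, p - k.sum() + b) - betaln(a, b)
    curr = np.zeros(p, dtype=bool)                       # Step 1: start from the null model
    curr_score = log_marginal(X[:, curr], y) + log_prior(curr)
    visits = {}
    for _ in range(n_sweeps):                            # Step 3: repeat the sweep
        for i in range(p):                               # Step 2(a): flip covariate i
            cand = curr.copy(); cand[i] = ~cand[i]
            cand_score = log_marginal(X[:, cand], y) + log_prior(cand)
            delta = np.clip(curr_score - cand_score, -700, 700)
            alpha = 1.0 / (1.0 + np.exp(delta))          # (4.8), computed in log space
            if rng.random() < alpha:                     # Step 2(c)
                curr, curr_score = cand, cand_score
        key = tuple(np.where(curr)[0])
        visits[key] = visits.get(key, 0) + 1
    return max(visits, key=visits.get), visits           # MAP estimate and visit counts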
4.4 Simulation and real data application

In this section we evaluate the performance of the proposed nonlocal prior in high-dimensional logistic regression models; we call this approach the nonlocal prior method. We compare it with the LASSO for generalized linear models ([Friedman, Hastie and Tibshirani (2010)]) and with the empirical Bayesian LASSO, denoted gLASSO and EBLASSO respectively, where the empirical Bayesian LASSO is an efficient algorithm for fitting Bayesian LASSO models ([Huang, Xu and Cai(2013)]). In the simulation study we adopt the null model (k = 0) as the initial model k^curr to avoid bias. For the model prior p(k) we use a beta-binomial prior on the model space: with ρ the inclusion probability of each covariate, the prior probability assigned to a model k is
$$p(k\mid\rho) = \rho^{|k|}(1-\rho)^{p-|k|}, \qquad \rho\sim \mathrm{Beta}(a, b).$$
We further assume a = 1 and b = 1/√n, so that the prior expectation of the model size satisfies the conditions of Assumption 1. For the prior on the regression coefficients we use the first-order pMOM density, with the hyper-parameter τ set to 0.384 following the recommendation of [Johnson and Rossell(2012)].

4.4.1 Simulation study

We first investigate how the nonlocal prior method performs in some basic settings. We generate the design matrix from a multivariate normal distribution with moderate correlation (correlation coefficient 0.3). In the first simulation setting we generate 100 observations with various numbers of covariates (p = 60, 120 and 200, respectively): the 100 rows of the design matrix X are drawn from a multivariate normal distribution with mean 0 and covariance matrix whose (i, j)-th element is δ^{|i−j|} with δ = 0.3. First we let the true regression parameter β contain only 2 nonzero values, with 0 for the rest, and consider several settings for the two nonzero coefficients: (2, −2), (1.5, −1.5) and (1, −1). The response is generated from the binomial distribution with success probability $e^{X\beta}/(1 + e^{X\beta})$. The variable selection results are summarized in Table 4.1.

Table 4.1: Summary of variable selection for different parameter values. In each panel, the results of gLASSO and EBLASSO are based on the tuning parameter chosen as the average of 20 cross-validation values. All results are averaged over 100 replications.

(a) Parameter value (2, −2)
            Nonlocal Prior        gLASSO              EBLASSO
  dim       True    False         True    False       True    False
  60        2       0.88          2       10.8        2       6.04
  120       2       0.2           2       12.91       2       8.74
  200       1.99    0.2           1.99    15.22       1.99    10.5

(b) Parameter value (1.5, −1.5)
            Nonlocal Prior        gLASSO              EBLASSO
  dim       True    False         True    False       True    False
  60        1.97    1.86          1.98    9.32        1.99    6.25
  120       1.98    0.35          1.97    11.48       1.98    8.18
  200       1.96    0.26          1.9     12.28       1.99    9.08

(c) Parameter value (1, −1)
            Nonlocal Prior        gLASSO              EBLASSO
  dim       True    False         True    False       True    False
  60        2       0.88          1.83    6.94        1.85    5.27
  120       1.71    0.66          1.59    8.58        1.73    7.49
  200       1.72    0.54          1.49    7.52        1.76    8.26

We use true selection and false selection as the criteria: true selection is the number of selected truly nonzero coefficients, and false selection is the number of falsely selected coefficients, i.e., coefficients included in the model whose true value is zero. A method with good selection performance should have true selection close to 2 and false selection close to 0.
When the true parameter value is (2, −2) or (1.5, −1.5), a moderately strong signal, the nonlocal prior method recovers the true model even in relatively high-dimensional cases (e.g., p = 200 and n = 100), while gLASSO selects the largest models; EBLASSO selects larger models than the nonlocal prior method but smaller than gLASSO. When the coefficient values are large it is easier for most methods to identify the significant variables. When the coefficient values are relatively small (panel (c) of Table 4.1), i.e., the signal is relatively weak, all methods select some noise covariates; however, the model size under gLASSO is much larger than under the proposed method, with EBLASSO intermediate among the three.

We next extend the model so that β contains 4 nonzero components, set to (2, −2, 2, −2), with 0 for the rest. The results are shown in Table 4.2; the target value for true selection is 4 and for false selection 0. The nonlocal prior method works well with 4 nonzero coefficients, while gLASSO again includes extra covariates; the number of falsely selected covariates under EBLASSO is higher than under the proposed nonlocal method but lower than under gLASSO.

Table 4.2: Summary of variable selection for the parameter value (2, −2, 2, −2). Results of gLASSO and EBLASSO are based on the tuning parameter chosen as the average of 20 cross-validation values. All results are averaged over 100 replications.

            Nonlocal Prior        gLASSO              EBLASSO
  dim       True    False         True    False       True    False
  60        4       1.13          4       14.92       4       6.4
  120       3.89    0.18          3.97    19.32       3.97    10.84
  200       3.88    0.24          3.9     20.62       3.81    11.77

We also tested selection performance on models with 6 nonzero coefficients, set to (2, −2, 2, −2, 2, −2) with 0 for the rest; the results are displayed in Table 4.3. The nonlocal prior method tends to include some noise coefficients or omit some important covariates, which may be caused by correlation among the 6 active covariates weakening the effects of the important variables; gLASSO, on the other hand, includes even more noise covariates.

Table 4.3: Summary of variable selection for the parameter value (2, −2, 2, −2, 2, −2). Results of gLASSO and EBLASSO are based on the tuning parameter chosen as the average of 20 cross-validation values. All results are based on 100 replications.

            Nonlocal Prior        gLASSO              EBLASSO
  dim       True    False         True    False       True    False
  60        5.82    1.05          5.99    16.94       5.95    7.58
  120       5.19    0.48          5.73    20.81       5.71    9.83
  200       4.26    0.62          5.31    22.13       5.27    11.26

4.4.2 SNPs study

This simulation study mimics a case-control genetic association study. The response variable y represents the disease status of a subject, taking value 1 for a case and 0 for a control. The explanatory variables are generated as SNPs in the human genome, so xij is the genotype of SNP j for subject i, taking value 0, 1 or 2, and xi collects all genotype expressions for subject i. Following [Chen and Chen (2012)], the data were generated by the following procedure. Let n1 and n2 denote the numbers of cases and controls, respectively, and let s = {1, 2, . . . , k} denote the causal SNPs for the disease, with xi(s) the causal genotype set {xi1, xi2, . . . , xik}; there are 3^k possible genotype profiles for the k SNPs.
For the SNPs belonging to s, the disease risk model is
$$\mathrm{logit}\,P\big(y_i = 1\mid x_i(s)\big) = \sum_{j=1}^k \beta_j x_{ij}$$
for prespecified values of β1, . . . , βk. For the non-causal SNPs x_{k+1}, . . . , x_p, each xij is generated from a binomial distribution with parameters (2, pj), where pj represents the frequency of one allele and is generated from Beta(2, 2). This example consists of 10 simulated data sets, each generated with n1 = n2 = 500, p = 10,000, k = 8 and (β1, . . . , β8) = (0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3). This setting is the same as for the Bayesian subset regression (BSR) method in [Liang, Song and Yu (2013)]. The numerical results are summarized in Table 4.4.

Table 4.4: Comparison of the nonlocal prior with gLASSO and BSR. Results are averaged over 10 replications.
  Method          size    fsr(%)   nsr(%)
  Nonlocal prior  5.75    0        28.13
  gLASSO          44.9    81.8     0
  BSR             8.2     3.66     12.22

To measure the performance of each method we compute the false selection rate (fsr) and the negative selection rate (nsr) over the 10 replications. Let s*_i denote the set of selected features in data set i. Then
$$fsr = \frac{\sum_{i=1}^{10}|s_i^*\setminus s|}{\sum_{i=1}^{10}|s_i^*|}, \qquad nsr = \frac{\sum_{i=1}^{10}|s\setminus s_i^*|}{\sum_{i=1}^{10}|s|}.$$
A variable selection method with low fsr and nsr has good performance. The nonlocal prior has the smallest fsr and the highest nsr, and it also chooses the smallest model size among the three methods, which indicates that the nonlocal prior tends to choose a simpler model. By choosing as simple a model as possible, the nonlocal prior may omit some causal variables, which explains the high nsr in Table 4.4. This suggests the proposed method is effective in this regression setting, but should be adopted with care when missing causal variables is a concern.

4.4.3 Real data application

In this part we test the nonlocal prior Bayesian variable selection method on the colon gene expression data ([Alon et al. (1999)]) and compare the results with gLASSO. The colon data set consists of 62 samples, 40 colon tumor samples and 22 normal colon tissue samples, analyzed on 2000 human genes. The response variable has two levels: 0 for normal colon tissue and 1 for colon tumor. To see which genes are closely associated with colon tumors, we use a logistic regression model and select variables with our nonlocal prior.

To evaluate the method we divide the data into two groups, using 52 samples as a training set and 10 samples as a test set. To reduce potential bias from a particular split, we consider 20 repetitions, in each of which 10 samples are randomly selected as the test set. In each replication we first apply the nonlocal prior Bayesian variable selection method to the 52 training samples and use the maximum likelihood estimator for the regression coefficients of the selected covariates. We then apply the fitted logistic regression model to classify the 10 test samples, to check the prediction performance of the method. Since gLASSO is a well-known method for variable selection, we compare the prediction results of the nonlocal prior with those of gLASSO; to make the comparison consistent, the same training set is used for gLASSO and the nonlocal prior in each replication. The summarized prediction results are listed in Table 4.5; all results in Table 4.5 are averaged over the 20 replications.

We consider true positives and false positives as the performance measures. A true positive (TP) is a sample observed as colon tumor and predicted as colon tumor (response value 1).
4.4.3 Real data application

In this part, we test the nonlocal prior Bayesian variable selection method on the colon gene expression data ([Alon et al. (1999)]) and compare the results with gLASSO. The colon data set consists of 62 samples, composed of 40 colon tumor samples and 22 normal colon tissue samples, each measured on 2000 human genes. The response variable has two levels: 0 for normal colon tissue and 1 for colon tumor. To see which genes are closely associated with colon tumor, we use a logistic regression model and select variables with our nonlocal prior.

To evaluate the method, we divide the data into two groups, using 52 samples as a training set and 10 samples as a test set. To reduce potential bias from any particular split, 20 repetitions are considered, in which 10 samples are randomly selected as the test set each time. In each replication, using the 52 training samples, we first apply the nonlocal prior Bayesian variable selection method and then use the maximum likelihood estimator to estimate the regression coefficients of the selected covariates. We then apply this fitted logistic regression model to classify the 10 test samples, to assess the method in terms of prediction. Since gLASSO is a well-known variable selection method, we compare the nonlocal prior prediction results with gLASSO; to make the comparison consistent, we keep the same training set for gLASSO and the nonlocal prior in each replication. The summarized prediction results are listed in Table 4.5, where all results are averaged over 20 replications.

We consider the following performance measures. A true positive (TP) is a sample observed as colon tumor and predicted as colon tumor (response value 1). A false positive (FP) is a sample predicted as colon tumor while observed as normal colon tissue (response value 0). A true negative (TN) is a sample predicted as normal colon tissue and observed as normal colon tissue. A false negative (FN) is a sample predicted as normal colon tissue while observed as colon tumor. The true positive rate (TPR), also called sensitivity, measures the test's ability to correctly identify patients who do have the condition. It is defined as

    TPR (sensitivity) = TP / (TP + FN),

and the false negative rate (FNR) = 1 − sensitivity. The true negative rate (TNR), also called specificity, reflects the test's ability to correctly identify patients who do not have the condition. It is defined as

    TNR (specificity) = TN / (TN + FP),

and the false positive rate (FPR) = 1 − specificity. Higher sensitivity and specificity indicate better predictive power. In Table 4.5 (b), the true rate (TR) is the percentage of all correct predictions,

    TR = (TP + TN) / (TP + FP + TN + FN),

and the false rate (FR) = 1 − TR. Again, we expect a high TR for a model with good prediction performance.

Table 4.5: Comparison of prediction results based on the nonlocal prior and gLASSO. The gLASSO result is based on the tuning parameter chosen as the average of 20 cross-validation values. All results are averaged over 20 replications.

(a) Prediction accuracy summary
                     Sensitivity    FNR       Specificity    FPR
  Nonlocal prior     86.96%         13.04%    65.85%         34.15%
  gLASSO             86.96%         13.04%    70.73%         29.27%

(b) Overall prediction performance. TR is the total correct prediction rate and FR the total incorrect prediction rate; detailed definitions are given in the text.
                     Average model size    TR        FR
  Nonlocal prior     1.25                  79.09%    20.91%
  gLASSO             12.82                 80.91%    19.09%

(c) Average deviance for the nonlocal prior and gLASSO
                     Deviance
  Nonlocal prior     -23.97
  gLASSO             -21.15

From the prediction results in Table 4.5, we see that the nonlocal prior and gLASSO have the same sensitivity, so they perform equally well at predicting positive cases. gLASSO is more accurate at predicting negative cases, since it has higher specificity. Comparing the overall prediction performance in Table 4.5 (b), gLASSO has a higher correct prediction rate than the nonlocal prior (in terms of TR), by about 2 percentage points. In most cases, the nonlocal prior recommends a simpler model: the average model size for the nonlocal prior is 1.25, while gLASSO includes 12.82 covariates on average. The nonlocal prior thus performs only slightly worse than gLASSO, but with a much simpler model.

We also compare goodness of fit, measured by the deviance. Here the deviance is defined as −2(logLike − logLike_sat), where logLike is the log-likelihood of the fitted model and logLike_sat is the log-likelihood of the saturated model. The results are listed in Table 4.5 (c); as before, the numbers are averaged over 20 replications. The nonlocal prior shows the smaller deviance, indicating a better fit.

When we apply the method to the full data set, the recommended model contains two genes, gene 493 and gene 1884. gLASSO, with the tuning parameter λ chosen by cross-validation, selects a model of size 8. Compared with gLASSO, the nonlocal prior Bayesian method recommends a simpler model.
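For reference, the accuracy measures and the deviance used above can be computed from the observed test labels and the fitted class probabilities as in the small sketch below. This is illustrative only, not the exact code behind Table 4.5; it uses the standard fact that the saturated log-likelihood is zero for ungrouped binary responses, so the deviance reduces to −2 times the fitted log-likelihood.

    import numpy as np

    def prediction_summary(y_obs, p_hat, threshold=0.5):
        """Sensitivity, specificity, TR/FR and deviance as defined in the text.

        y_obs : observed 0/1 labels; p_hat : fitted probabilities P(y = 1 | x).
        """
        y_obs = np.asarray(y_obs, dtype=float)
        p_hat = np.asarray(p_hat, dtype=float)
        y_pred = (p_hat >= threshold).astype(float)

        tp = np.sum((y_pred == 1) & (y_obs == 1))
        tn = np.sum((y_pred == 0) & (y_obs == 0))
        fp = np.sum((y_pred == 1) & (y_obs == 0))
        fn = np.sum((y_pred == 0) & (y_obs == 1))

        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        tr = (tp + tn) / (tp + tn + fp + fn)

        # log-likelihood of the fitted logistic model; saturated log-lik is 0
        eps = 1e-12                                  # guard against log(0)
        loglik = np.sum(y_obs * np.log(p_hat + eps)
                        + (1 - y_obs) * np.log(1 - p_hat + eps))
        deviance = -2.0 * loglik

        return {"sensitivity": sensitivity, "specificity": specificity,
                "TR": tr, "FR": 1 - tr, "deviance": deviance}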
4.5 Conclusion

The unique characteristic of the nonlocal prior provides a new model selection method for generalized linear models, which can efficiently eliminate unnecessary covariates and lead to a parsimonious model. The convergence rate is derived under attainable assumptions, and a Laplace approximation is applied to overcome the computational difficulty. Based on the application and simulation results, our proposed nonlocal prior method leads to a simpler interpretation of the model and its coefficients without the added issue of over-selection. In the real data application, our method is comparable with gLASSO in terms of prediction rate, but results in a simpler model. The simulation results show that our proposed method can identify the true model with fewer nonsignificant covariates included, compared with gLASSO and the empirical Bayesian LASSO. Our method also shows better goodness of fit. The proposed method is well defined in theory and comes with a clear algorithm. However, one limitation of the method is its intensive computing time: in applications, the proposed method is much slower than gLASSO, especially when the candidate dimension is large. In future work, a more efficient algorithm needs to be developed to address this computational burden.

4.6 Proof of theorem 3

[Jiang (2007)] provides general conditions on the prior under which the posterior distribution concentrates, at a given rate, on a Hellinger neighborhood of the true density. We check that these conditions are satisfied in our setting. In particular, it is enough to show that conditions (O) and (N) of [Jiang (2007)] hold. Condition (O) limits the tail behavior of the prior, and Condition (N) requires sufficient prior mass on an approximating neighborhood of the truth. In the following description of Condition (O), Kn is the same as the candidate dimension pn; to be consistent with the notation of [Jiang (2007)], we use Kn to denote pn.

4.6.1 Condition (O)

Let D(R) = 1 + R · sup_{|h|≤R} |a'(h)| · sup_{|h|≤R} |ψ(h)| for any R > 0. Define r̄n as the maximal model size, satisfying 1 ≤ r̄n < Kn. There exists some Cn > 0 such that

    r̄n log(1/εn²) ≺ n εn²,   (4.9)
    r̄n log Kn ≺ n εn²,   (4.10)
    r̄n log D(r̄n Cn) ≺ n εn².   (4.11)

Furthermore, for all large enough n, the following two inequalities hold:

    π(|γ| > r̄n) ≤ e^{−4n εn²}   (4.12)

and, for all γ such that |γ| ≤ r̄n and all j ∈ γ,

    π(|βj| > Cn | γ) ≤ e^{−4n εn²}.   (4.13)

4.6.2 Condition (N)

Assume that a sequence of models γn exists such that, as n increases,

    Σ_{j∉γn} |βj*| ≺ εn²   (4.14)

and, for any sufficiently small η > 0, there exists Nη such that, for all n > Nη,

    π(γ = γn) ≥ e^{−n εn²/8}   (4.15)

and

    π(βγ ∈ M(γn, η) | γ = γn) ≥ e^{−n εn²/8},   (4.16)

where M(γn, η) = Π_{j∈γn} (βj* − η εn²/|γn|, βj* + η εn²/|γn|).

4.6.3 Verification of Conditions (O) and (N) for the nonlocal prior

We first check Condition (O). Conditions (4.9) and (4.10) are satisfied automatically by (C 1) and (C 2) in Assumption 1. Recall Kn = pn. By setting r̄n = pn − 1, we have 1 ≤ r̄n < Kn. Let rn denote the prior expectation of the model size, with rn ≺ Kn. Then

    π(|γ| > r̄n) = π(|γ| = pn) = (rn/Kn)^{Kn},   (4.17)
    log π(|γ| = pn) = Kn log(rn/Kn) ≤ −4n εn²   (4.18)

by (C 6) in Assumption 1, so that (4.12) holds.

Next we need to show that π(|βj| > Cn | γ) ≤ e^{−4n εn²} holds. For c > 0,

    π(|βj| > c | γ) ≤ M ∫_c^∞ β² exp(−β²/(2B̃n)) dβ   (4.19)
      = M B̃n ( [−β exp(−β²/(2B̃n))]_c^∞ + ∫_c^∞ exp(−β²/(2B̃n)) dβ )
      < M ( c exp(−c²/(2B̃n)) + (1/c) exp(−c²/(2B̃n)) )
      = M (c + 1/c) exp(−c²/(2B̃n)).   (4.20)

In (4.19), M is the normalization constant arising when the marginal distribution is obtained from the joint distribution; the constant M in (4.20) absorbs the additional factor B̃n. If c = Cn = 2(B̃n n ε̃n²)^{1/2}, with n ε̃n² → ∞ and B̃n bounded, then M(c + 1/c) exp(−c²/(2B̃n)) ≤ exp(−n ε̃n²) for large n. Taking εn = ε̃n/2, we get

    π(|βj| > Cn | γ) ≤ e^{−4n εn²},   (4.21)

so that (4.13) is satisfied.
Also, condition (4.11) is satisfied with Cn = 2(B̃n n ε̃n²)^{1/2} by assumption (C 3). Thus, Condition (O) is checked.

Now we verify Condition (N) for the nonlocal prior. Take the sequence of models γn such that, for each n, γ = γn attains the infimum in inf_{γ:|γ|=rn} Σ_{j∉γ} |βj*|. Then Σ_{j∉γn} |βj*| = inf_{γ:|γ|=rn} Σ_{j∉γ} |βj*| ≺ εn².

For the condition on the prior mass π[β ∈ (βj* ± η εn²/rn)_{j∈γn} | γn], we consider two cases.

(a) If, for every j ∈ γn, 0 is not covered by the interval (βj* ± η εn²/rn), then

    π[β ∈ (βj* ± η εn²/rn)_{j∈γn} | γn] ≥ |2π M_{γn}^{−1}|^{−1/2} e^{−(1/(2τ)) β̄ᵀ M_{γn} β̄} (η εn²/rn)^{rn} Π_i β̄_i² = T1 · Π_i β̄_i².

Here β̄ is some intermediate value at which the density attains its minimum over (βj* ± η εn²/rn)_{j∈γn}, and we denote |2π M_{γn}^{−1}|^{−1/2} e^{−(1/(2τ)) β̄ᵀ M_{γn} β̄} (η εn²/rn)^{rn} by T1. Define c1, c2, c3 as positive constants with c2 > c1. We need to show T1 Π_i β̄_i² ≥ e^{−c2 n εn²}. Since [Jiang (2007)] showed that T1 ≥ e^{−c1 n εn²} when (C 5) holds, it is sufficient to show Π_i β̄_i² ≥ e^{−(c2−c1) n εn²}; consequently, we need to show Π_i β̄_i² ≥ e^{−c3 n εn²}.

Without loss of generality, suppose βi* is positive; then the minimum of β̄_i² over the interval is (βi* − η εn²/rn)², so Π_i β̄_i² ≥ Π_i (βi* − η εn²/rn)². Our task is then to show Π_i (βi* − η εn²/rn)² ≥ e^{−c n εn²}. For this, since the βi* are constants, it is enough to control the order of the term η εn²/rn; that is, we want to show

    (η εn²/rn)^{2 rn} ≥ e^{−c n εn²}.   (4.22)

Since rn log(1/εn²) ≤ pn log(1/εn²) ≺ n εn² and rn log rn ≤ pn log pn ≺ n εn² hold by (C 1) and (C 2), we can derive −2 rn log(η εn²/rn) ≺ c n εn², which implies e^{2 rn log(η εn²/rn)} ≥ e^{−c n εn²}. This is equivalent to (4.22).

(b) Suppose there is at least one j ∈ γn such that 0 is covered by the interval (βj* ± η εn²/rn). We separate the index set into two groups: I1 collects the indices whose intervals do not cover 0, and I2 collects all other indices. Consider ε = (ε, ε, . . . , ε) of length |γn| with ε close to 0. Then

    π[β ∈ (βj* ± η εn²/rn)_{j∈γn} | γn]
      ≥ π[ ∩_{j∈I1} {βj ∈ (βj* ± η εn²/rn)} ∩ ∩_{j∈I2} {βj ∈ (βj* − η εn²/rn, −ε) ∪ (ε, βj* + η εn²/rn)} | γn ]   (4.23)
      ≥ |2π M_{γn}^{−1}|^{−1/2} e^{−0.5 εᵀ M_{γn} ε} (η εn²/rn − 2ε)^{rn} ε^{2 rn}.   (4.24)

The inequality in (4.24) holds because the density of β goes to 0 as β approaches 0, so f(ε) can be made very small when ε is very small. With the constraint lim_{n→∞} Σ_{j=1}^{pn} |βj*| < ∞, we know that βj* is bounded for every j. In this case, we can always find a very small ε such that f(ε) is smaller than any f(β) over the stated intervals. Since εᵀ M_{γn} ε ≤ rn ε² B(rn), we have e^{−0.5 εᵀ M_{γn} ε} ≥ e^{−0.5 rn ε² B(rn)}; with B(rn) bounded and ε → 0, this term is bounded away from 0.

For the factor ε^{2rn}, we need to show that it is not smaller than e^{−c n εn²}, that is, ε^{2rn} ≥ e^{−c n εn²}, i.e., ε ≥ e^{−c n εn²/(2 rn)}. At the same time, ε needs to be smaller in order than εn²/rn, since η εn²/rn is the interval half-width. The following shows that both requirements can be met. By (C 1) and (C 2), we have

    rn log(rn/εn²) ≺ n εn²,   (4.25)

hence log(rn/εn²) ≺ n εn²/rn, and therefore

    e^{−n εn²/rn} ≺ εn²/rn.   (4.26)

(4.26) can be derived from (4.25) since n εn²/rn → ∞ by (C 2) in Assumption 1. The logic here is: if A/B → 0 and B → ∞, then (B − A) → ∞ and e^{A−B} → 0, which means e^A ≺ e^B.

Since n εn²/rn → ∞, we can find some small ε such that ε ≥ e^{−c n εn²/(2rn)}, and by (4.26) we can also take such an ε small enough that ε ≤ εn²/rn. Moreover, (4.26) implies (εn²/rn)^{rn} ≥ e^{−n εn²}, and hence ≥ e^{−c n εn²}. With ε ≤ εn²/rn, the term η εn²/rn − 2ε remains of the order εn²/rn, so (η εn²/rn − 2ε)^{rn} ≥ e^{−c n εn²} holds as a consequence of (εn²/rn)^{rn} ≥ e^{−c n εn²}, while ε^{2rn} ≥ e^{−c n εn²} by the choice of ε. Thus the condition π(βγ ∈ M(γn, η) | γ = γn) ≥ e^{−n εn²} in (4.16) is checked.

Next we check the inequality π(γ = γn) ≥ e^{−n εn²/8} in (4.15).
Notice that γn is chosen such that |γn| = rn, so π(γ = γn) = (rn/Kn)^{rn} (1 − rn/Kn)^{Kn − rn}. Since rn/Kn ≺ 1 by (C 4), we have log π(γ = γn) ∼ rn log(rn/Kn) ≥ −rn log Kn, since rn log rn is positive; and rn log Kn ≺ n εn² by (C 4) and (C 2), so π(γ = γn) ≥ e^{−n ε̃n²/8} by taking ε̃n = √8 εn. If we take εn = ε̃n/√8, the inequality in (4.21) still holds. So we can apply Theorem 1 with εn replaced by √8 εn, so that the Hellinger neighborhood takes a radius of √8 εn. Condition (N) is checked.

4.6.4 Proof of theorem 1

By the verification of Conditions (O) and (N) for the nonlocal prior above, Theorem 1 holds as a consequence of Theorem 4 of [Jiang (2007)].

BIBLIOGRAPHY

[Akaike (1973)] Akaike, H. "Information theory and an extension of the maximum likelihood principle." In 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, eds., Akademiai Kiado, Budapest, 1973, pp. 267-281.

[Alon et al. (1999)] Alon, Uri, et al. "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays." Proceedings of the National Academy of Sciences 96.12 (1999): 6745-6750.

[Bhattacharya et al. (2014)] Bhattacharya, Anirban, et al. "Dirichlet-Laplace priors for optimal shrinkage." Journal of the American Statistical Association just-accepted (2014): 00-00.

[Bondell and Reich (2012)] Bondell, Howard D., and Brian J. Reich. "Consistent high-dimensional Bayesian variable selection via penalized credible regions." Journal of the American Statistical Association 107.500 (2012): 1610-1624.

[Candes and Tao (2007)] Candes, Emmanuel, and Terence Tao. "The Dantzig selector: Statistical estimation when p is much larger than n." The Annals of Statistics (2007): 2313-2351.

[Carvalho, Polson, and Scott (2009)] Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. "Handling sparsity via the horseshoe." International Conference on Artificial Intelligence and Statistics, 2009.

[Carvalho, Polson, and Scott (2010)] Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. "The horseshoe estimator for sparse signals." Biometrika (2010): asq017.

[Castillo and van der Vaart (2012)] Castillo, Ismael, and Aad van der Vaart. "Needles and straw in a haystack: Posterior concentration for possibly sparse sequences." The Annals of Statistics 40.4 (2012): 2069-2101.

[Chen and Chen (2012)] Chen, Jiahua, and Zehua Chen. "Extended BIC for small-n-large-P sparse GLM." Statistica Sinica (2012): 555-574.

[Chen, Ibrahim and Yiannoutsos (1999)] Chen, M.-H., Joseph G. Ibrahim, and Constantin Yiannoutsos. "Prior elicitation, variable selection and Bayesian computation for logistic regression models." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61.1 (1999): 223-242.

[Dellaportas et al. (1997)] Dellaportas, Petros, Dimitris Karlis, and Evdokia Xekalaki. "Bayesian analysis of finite Poisson mixtures." Manuscript (1997).

[Fan and Lv (2008)] Fan, Jianqing, and Jinchi Lv. "Sure independence screening for ultrahigh dimensional feature space." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008): 849-911.

[Fan and Li (2001)] Fan, Jianqing, and Runze Li. "Variable selection via nonconcave penalized likelihood and its oracle properties." Journal of the American Statistical Association 96.456 (2001): 1348-1360.

[Friedman, Hastie and Tibshirani (2010)] Friedman, Jerome, Trevor Hastie, and Rob Tibshirani.
"Regularization paths for generalized linear models via coordinate descent." Journal of Statistical Software 33.1 (2010): 1.

[George (1999)] George, Edward I. "Bayesian model selection." Encyclopedia of Statistical Sciences Update 3 (1999): 39-46.

[George and Foster (2000)] George, Edward I., and Dean P. Foster. "Calibration and empirical Bayes variable selection." Biometrika (2000): 731-747.

[George and McCulloch (1993)] George, Edward I., and Robert E. McCulloch. "Variable selection via Gibbs sampling." Journal of the American Statistical Association 88.423 (1993): 881-889.

[George and McCulloch (1997)] George, Edward I., and Robert E. McCulloch. "Approaches for Bayesian variable selection." Statistica Sinica (1997): 339-373.

[Hoff (2009)] Hoff, Peter D. "Nonconjugate priors and Metropolis-Hastings algorithms." A First Course in Bayesian Statistical Methods. Springer New York, 2009. 171-193.

[Huang, Xu and Cai (2013)] Huang, Anhui, Shizhong Xu, and Xiaodong Cai. "Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping." BMC Genetics 14.1 (2013): 5.

[Huang et al. (2008)] Huang, J., J. L. Horowitz, and S. Ma. "Asymptotic properties of bridge estimators in sparse high-dimensional regression models." The Annals of Statistics (2008): 587-613.

[Ishwaran and Rao (2005)] Ishwaran, Hemant, and J. Sunil Rao. "Spike and slab variable selection: frequentist and Bayesian strategies." The Annals of Statistics (2005): 730-773.

[Ishwaran and Rao (2011)] Ishwaran, Hemant, and J. Sunil Rao. "Consistency of spike and slab regression." Statistics and Probability Letters 81.12 (2011): 1920-1928.

[Jiang (2006)] Jiang, Wenxin. "On the consistency of Bayesian variable selection for high dimensional binary regression and classification." Neural Computation 18.11 (2006): 2762-2776.

[Jiang (2007)] Jiang, Wenxin. "Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities." The Annals of Statistics 35.4 (2007): 1487-1511.

[Johnson (2013)] Johnson, Valen E. "On numerical aspects of Bayesian model selection in high and ultrahigh-dimensional settings." Bayesian Analysis 8.4 (2013): 741-758.

[Johnson and Rossell (2010)] Johnson, Valen E., and David Rossell. "On the use of nonlocal prior densities in Bayesian hypothesis tests." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72.2 (2010): 143-170.

[Johnson and Rossell (2012)] Johnson, Valen E., and David Rossell. "Bayesian model selection in high-dimensional settings." Journal of the American Statistical Association 107.498 (2012): 649-660.

[Kan (2008)] Kan, Raymond. "From moments of sum to moments of product." Journal of Multivariate Analysis 99.3 (2008): 542-554.

[Liang, Song and Yu (2013)] Liang, Faming, Qifan Song, and Kai Yu. "Bayesian subset modeling for high-dimensional generalized linear models." Journal of the American Statistical Association 108.502 (2013): 589-606.

[Liang, Truong and Wong (2001)] Liang, Faming, Young K. Truong, and Wing Hung Wong. "Automatic Bayesian model averaging for linear regression and applications in Bayesian curve fitting." Statistica Sinica 11.4 (2001): 1005-1030.

[Madigan and Raftery (1994)] Madigan, David, and Adrian E. Raftery. "Model selection and accounting for model uncertainty in graphical models using Occam's window." Journal of the American Statistical Association 89.428 (1994): 1535-1546.

[Mitchell and Beauchamp (1988)] Mitchell, Toby J., and John J. Beauchamp.
"Bayesian variable selection in linear regression." Journal of the American Statistical Association 83.404 (1988): 1023-1032.

[Narisetty and He (2014)] Narisetty, Naveen Naidu, and Xuming He. "Bayesian variable selection with shrinking and diffusing priors." The Annals of Statistics 42.2 (2014): 789-817.

[Nott and Leonte (2012)] Nott, David J., and Daniela Leonte. "Sampling schemes for Bayesian variable selection in generalized linear models." Journal of Computational and Graphical Statistics (2012).

[Park and Casella (2008)] Park, Trevor, and George Casella. "The Bayesian lasso." Journal of the American Statistical Association 103.482 (2008): 681-686.

[Polson and Scott (2010)] Polson, Nicholas G., and James G. Scott. "Shrink globally, act locally: sparse Bayesian regularization and prediction." Bayesian Statistics 9 (2010): 501-538.

[Rossell et al. (2013)] Rossell, David, Donatello Telesca, and Valen E. Johnson. "High-dimensional Bayesian classifiers using non-local priors." Statistical Models for Data Analysis. Springer, Heidelberg, 2013. 305-313.

[Schwarz (1978)] Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics 6.2 (1978): 461-464.

[Scott and Berger (2010)] Scott, James G., and James O. Berger. "Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem." The Annals of Statistics 38.5 (2010): 2587-2619.

[Sha et al. (2004)] Sha, Naijun, et al. "Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage." Biometrics 60.3 (2004): 812-819.

[Tibshirani (1996)] Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) (1996): 267-288.

[Van de Geer (2008)] Van de Geer, Sara A. "High-dimensional generalized linear models and the lasso." The Annals of Statistics (2008): 614-645.

[Xu and Ghosh (2015)] Xu, Xiaofan, and Malay Ghosh. "Bayesian variable selection and estimation for group lasso." Bayesian Analysis 10.4 (2015): 909-936.

[Zhou, Liu and Wong (2004)] Zhou, Xiaobo, Kuang-Yu Liu, and Stephen T. C. Wong. "Cancer classification and prediction using logistic regression with Bayesian gene selection." Journal of Biomedical Informatics 37.4 (2004): 249-259.

[Zou (2006)] Zou, Hui. "The adaptive lasso and its oracle properties." Journal of the American Statistical Association 101.476 (2006): 1418-1429.

[Zou and Hastie (2005)] Zou, Hui, and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301-320.