BAYESIAN VARIABLE SELECTION AND FUNCTIONAL DATA ANALYSIS: APPLICATION TO BRAIN IMAGING

By

Asish Kumar Banik

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2019

ABSTRACT

BAYESIAN VARIABLE SELECTION AND FUNCTIONAL DATA ANALYSIS: APPLICATION TO BRAIN IMAGING

By

Asish Kumar Banik

High-dimensional statistics is one of the most studied topics in the field of statistics. The most interesting problem to arise in the last 15 years is variable selection or subset selection. Variable selection is a strong statistical tool that can be explored in functional data analysis. In the first part of this thesis, we implement a Bayesian variable selection method for automatic knot selection. We propose a spike-and-slab prior on knots and formulate a conjugate stochastic search variable selection for significant knots. The computation is substantially faster than existing knot selection methods, as we use Metropolis-Hastings algorithms and a Gibbs sampler for estimation. This work focuses on a single nonlinear covariate, modeled as regression splines. In the next stage, we study Bayesian variable selection in additive models with high-dimensional predictors. The selection of nonlinear functions in models is highly important in recent research, and the Bayesian method of selection has more advantages than contemporary frequentist methods. Chapter 2 examines Bayesian sparse group lasso theory based on spike-and-slab priors to determine its applicability for variable selection and function estimation in nonparametric additive models.

The primary objective of Chapter 3 is to build a classification method using longitudinal volumetric magnetic resonance imaging (MRI) data from five regions of interest (ROIs). A functional data analysis method is used to handle the longitudinal measurement of ROIs, and the functional coefficients are later used in the classification models. We propose a Pólya-gamma augmentation method to classify normal controls and diseased patients based on functional MRI measurements. We obtain fast posterior sampling by avoiding the slow and complicated Metropolis-Hastings algorithm. Our main motivation is to determine the important ROIs that have the highest separating power to classify our dichotomous response. We compare the sensitivity, specificity, and accuracy of the classification based on single ROIs and with various combinations of them. We obtain a sensitivity of over 85% and a specificity of around 90% for most of the combinations.

Next, we work with Bayesian classification and selection methodology. The main goal of Chapter 4 is to employ longitudinal trajectories in a significant number of sub-regional brain volumetric MRI data as statistical predictors for Alzheimer's disease (AD) classification. We use logistic regression in a Bayesian framework that includes many functional predictors. The direct sampling of regression coefficients from the Bayesian logistic model is difficult due to its complicated likelihood function. In high-dimensional scenarios, the selection of predictors is paramount with the introduction of either spike-and-slab priors, non-local priors, or Horseshoe priors. We seek to avoid the complicated Metropolis-Hastings approach and to develop an easily implementable Gibbs sampler. In addition, the Bayesian estimation provides proper estimates of the model parameters, which are also useful for building inference.
Another advantage of working with logistic regression is that it calculates the log odds of relative risk for AD compared to normal control based on the selected longitudinal predictors, rather than simply classifying patients based on cross-sectional estimates. Ultimately, however, we combine approaches and use a probability threshold to classify individual patients. We employ 49 functional predictors consisting of volumetric estimates of brain sub-regions, chosen for their established clinical significance. Moreover, the use of spike-and-slab priors ensures that many redundant predictors are dropped from the model.

Finally, we present a new approach of Bayesian model-based clustering for spatiotemporal data in Chapter 5. A simple linear mixed effects model (LME) derived from a functional model is used to model spatiotemporal cerebral white matter data extracted from healthy aging individuals. The LME provides us with prior information for the spatial covariance structure and brain segmentation based on white matter intensity. This motivates us to build stochastic model-based clustering to group voxels considering their longitudinal and location information. The cluster-specific random effect causes correlation among repeated measures. The problem of finding partitions is dealt with by imposing a prior structure on cluster partitions in order to derive a stochastic objective function.

Copyright by ASISH KUMAR BANIK 2019

I dedicate this dissertation to my late grandparents for their endless love, support and blessings.

ACKNOWLEDGEMENTS

I would like to first thank my supervisor Dr. Tapabrata Maiti for his continuous support, encouragement and assistance. His expertise and experience helped me to formulate the research problems and kept me motivated all the time. I am also grateful to the Department of Statistics and Probability for providing me the opportunity to grow my knowledge over the years.

Next, I would like to thank Dr. Alla Sikorskii. Dr. Sikorskii supported me as a Research Assistant with the College of Nursing for over four years. I have received a great deal of support from her to build my expertise around clinical trials research. I would also like to thank Professor Dr. Andrew Bender for providing us with the brain image data and helping me to understand the nitty-gritty details of human brain mechanisms. His research questions motivated me to build various statistical methods for this dissertation.

My sincerest gratitude to the other committee members, Dr. R.V. Ramamoorthi and Dr. Seungik Baek, for providing important suggestions and invaluable advice. I also would like to thank all Department of Statistics professors for their valuable lessons through various courses.

Last but not least, I would like to devote my deepest gratitude to my parents. They believed in me like nobody else. I would like to thank my brother and sister-in-law for their encouragement throughout the writing of this dissertation. My final gratitude is for my friends and peers. I always received much-needed important suggestions from them throughout my research journey.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Variable selection
  1.1.1 Bayesian Variable Selection and Spike-and-Slab Prior
  1.1.2 Penalization Methods
  1.1.3 Bayesian Group Lasso
1.2 Bayesian logistic regression
1.3 Functional data smoothing
1.4 Bayesian model based clustering
1.5 ADNI Longitudinal Data
CHAPTER 2 NONPARAMETRIC FUNCTION ESTIMATION USING BAYESIAN VARIABLE SELECTION
2.1 Univariate function estimation
  2.1.1 Introduction
  2.1.2 Variable selection using Spike-Slab Prior
    2.1.2.1 Model description and Prior
    2.1.2.2 The Gibbs Sampler
    2.1.2.3 Updating β and σ²
  2.1.3 Regression Splines
    2.1.3.1 Simulated example
    2.1.3.2 Chloride concentration Data
    2.1.3.3 Discussion
2.2 Functional additive model estimation with bi-level selection
  2.2.1 Introduction
  2.2.2 Bayesian Sparse group Lasso in nonparametric additive models
  2.2.3 Simulated Example
  2.2.4 Conclusion
CHAPTER 3 BAYESIAN CLASSIFICATION OF ALZHEIMER'S DISEASE STAGES FROM LONGITUDINAL VOLUMETRIC MRI DATA
3.1 Introduction
3.2 Methodology
  3.2.1 Smoothing functional data
  3.2.2 Classification using Pólya-Gamma Augmentation
3.3 Data Description
3.4 Application results
  3.4.1 Classification using Hippocampus
  3.4.2 Classification with Single ROI
  3.4.3 Classification using combination of ROIs
  3.4.4 Conclusion
3.5 Discussion
CHAPTER 4 BAYESIAN PENALIZED MODEL FOR CLASSIFICATION AND SELECTION OF FUNCTIONAL PREDICTORS USING LONGITUDINAL MRI DATA FROM ADNI
4.1 Introduction
4.2 Bayesian Variable selection
  4.2.1 Spike-Slab prior
  4.2.2 Bayesian Group lasso
4.3 Functional smoothing for longitudinal data
4.4 Simultaneous Classification of binary response with selection of functional predictors
  4.4.1 Classification using Pólya-Gamma Augmentation
  4.4.2 Selection using Bayesian Group Lasso
4.5 Median thresholding and Theoretical properties
  4.5.1 Marginal Prior for β_j
  4.5.2 Median thresholding as posterior estimates
  4.5.3 Posterior Consistency
4.6 Simulation results
4.7 Application on ADNI MRI data
4.8 An alternate modeling Proposal
4.9 Discussion
CHAPTER 5 BAYESIAN SPATIOTEMPORAL CLUSTERING MODEL FOR ANALYZING WHITE MATTER DATA
5.1 Spatiotemporal Linear mixed effects modeling for the distribution of cerebral white matter on MRI scans
  5.1.1 Introduction
  5.1.2 Methods
    5.1.2.1 Voxel-wise linear mixed effect model
    5.1.2.2 Clustering voxels into homogeneous regions using K-means algorithm
    5.1.2.3 Cluster based Spatiotemporal linear mixed effect models
  5.1.3 WM Data
  5.1.4 Results
  5.1.5 Discussion
5.2 Bayesian Model based clustering in application to Spatiotemporal data
  5.2.1 Introduction
  5.2.2 Methodology
    5.2.2.1 The Ewens-Pitman Prior on Cluster partitions
    5.2.2.2 Spatiotemporal Linear Mixed Effects model
    5.2.2.3 Objective function derivation
CHAPTER 6 FUTURE WORK
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Log(ISE) comparison with Kernel methods of approximation
Table 3.1: Patients Baseline Characteristics
Table 3.2: Data Characteristics
Table 3.3: Classification performance using single ROI
Table 4.1: Classification and selection performance Table
Table 4.2: Patients Baseline Characteristics
Table 5.1: Subject Characteristics
Table A1: Combination two ROIs
Table A2: Combination three ROIs
Table A3: Combination four ROIs

LIST OF FIGURES

Figure 1.1: Spike-and-slab prior
Figure 1.2: Horseshoe prior
Figure 1.3: Lasso vs Ridge solution path
Figure 1.4: Threshold functions for orthonormal design
Figure 1.5: Comparison between Lasso, Bayesian Lasso and Ridge regression solution paths
Figure 1.6: B-spline smoothing under varying orders and knots
Figure 2.1: Nonparametric function estimation comparison between variable selection, linear regression, and fixed knot approximation
Figure 2.2: Example 1 simulation plot
Figure 2.3: Example 1 simulation plot
Figure 2.4: Example 2 simulation plot
Figure 2.5: Example 2 simulation plot
Figure 3.1: Longitudinal volume of ROIs for dementia patients and normal controls. The first five plots (plots in the first column and the top two plots in the second column) are scaled ROIs of interest. The last plot has MMSE scores of patients. On the X-axis, we have the number of visits for patients. Blue lines are for the dementia group and orange lines are for the normal group. Thin lines represent each patient's data, and thick lines are the pooled mean for the AD and CN groups.
Figure 3.2: In (a), the colored points represent hippocampus values for each time point, and the black dot signifies the mean value of the smoothed curve for that time point. Orange is for CN and blue is for AD. In (b), the points are the predicted class probability for patients in the test data set. (c) presents the spread of the estimated regression coefficients from the Bayesian logistic model in 100 repeated runs. (d) shows the ROC curve after classifying patients using the hippocampus only.
Figure 4.1: Plots based on Example 1
Figure 4.2: Brain volume changes of Left Hippocampus, Left Lateral Orbitofrontal cortex, and Left Posterior Cingulate over time for Normal and Dementia patients
Figure 4.3: Acceptance probability of MCMC sample for Left Hippocampus and Left Lateral Orbitofrontal brain regions
Figure 4.4: Pictorial representation of selected brain ROIs discriminating diseased group from normal control
Figure 5.1: Change of WM across Z-axis from top-left to bottom-right, Z-slices 15, 30, 45, 60
Figure 5.2: White matter histogram with density plot
Figure 5.3: White matter measurement greater than 0.5
Figure 5.4: Voxels' directional density
Figure 5.5: 2-dimensional distributions of T1 & T2 imaging sequence
Figure 5.6: Residual vs predicted plot
Figure 5.7: Within sum of squares by number of clusters
Figure 5.8: Segmentation of brain into similar regions
Figure 5.9: Fisher's scoring method convergence
Figure 5.10: Estimated parameters plot for all clusters
Figure 5.11: Histogram of Intercepts
Figure 5.12: Histogram of Gender coefficients

CHAPTER 1

INTRODUCTION

1.1 Variable selection

Statistical modeling with large numbers of predictors is one of the most studied topics in the field of statistics. The most interesting problem to arise in the last 15 years is variable selection or subset selection. Statisticians are excited about variable selection for various reasons. Most often, when a model contains a large number of predictors, we are not satisfied with prediction accuracy, interpretation, over-fitting, estimation, or testing of regression coefficients, among others. Numerous variable selection methods have been proposed in the Bayesian and non-Bayesian literature, such as forward and backward stepwise selection, shrinkage methods, principal components regression, and partial least squares. The problem of choosing an optimal model from a subset of all possible models has led to a variety of algorithms and methodology building. The following discusses Bayesian variable selection and its application to various nonparametric regression problems.

1.1.1 Bayesian Variable Selection and Spike-and-Slab Prior

We start by defining the setup,

    y = Σ_{i=1}^p X_i β_i + ε

where p ≫ n and y is an n × 1 vector. It is mathematically intuitive that the parameter vector β can only be estimated from n observations. If β is sparse in terms of the l0 norm, i.e. ||β||_0 = |{i : β_i ≠ 0}| < n, then we can achieve reasonably good estimates. Various authors have worked on this problem. Miller, 2002[89] used a spike-and-slab prior with a spike on the zero coefficients and a slab part on the estimable coefficients. A spike-and-slab prior was introduced by [43], where β_i follows a normal mixture distribution:

    β_i | γ_i ∼ (1 − γ_i) N(0, τ_i²) + γ_i N(0, c_i² τ_i²)

where γ_i is a latent variable that controls the number of significant coefficients or model size.

Figure 1.1: Spike-and-slab prior (plot obtained from on-line resources)
Figure 1.2: Horseshoe prior (plot obtained from on-line resources)

The motivation for the frequently used Dirac-distributed spike-and-slab prior came from the above structure. We observe a point mass on the spike part when τ_i² tends to 0. Model complexity is a required property. The degree of sparseness can be set based on an optimality criterion.
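A tiny simulation, purely for intuition, showing how the two mixture components of this prior separate negligible and estimable coefficients; the values of τ_i, c_i, and the inclusion probability are arbitrary choices of ours, not values used in the dissertation.

    import numpy as np

    rng = np.random.default_rng(6)
    p, tau, c = 1000, 0.01, 100.0                        # illustrative spike sd and slab inflation factor

    gamma = rng.binomial(1, 0.5, size=p)                 # latent inclusion indicators gamma_i
    beta = np.where(gamma == 1,
                    rng.normal(0.0, c * tau, size=p),    # slab: N(0, c^2 tau^2)
                    rng.normal(0.0, tau, size=p))        # spike: N(0, tau^2)

    # Spike draws are tightly concentrated near zero; slab draws are far more spread out.
    print("spike sd:", beta[gamma == 0].std().round(4), " slab sd:", beta[gamma == 1].std().round(4))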
Burnham and Anderson, 2004[3] mentioned the best predictive ability as an optimal criterion in their book, "Model Selection and Multi-model Inference." As we have noted, P(γ_i = 1) controls the sparsity of the model; George & McCulloch, 1993[43] proposed to use a Bernoulli distribution with 0.5 probability. Bayesian variable selection is not limited to the spike-and-slab prior. For example, Liang et al., 2008 [78] used a mixture of g-priors for variable selection. Carvalho et al., 2010[18] introduced the horseshoe prior as a global-local shrinkage prior. The horseshoe is another popular prior choice for variable selection, where a half-Cauchy prior is used to estimate significant predictors. We have seen that this method penalizes the strong signals less, and parameters are estimated with fewer constraints. However, it comes with a computational cost, as in most cases it is difficult to determine a proper posterior distribution. Casella and Moreno, 2006[21] presented an objective Bayesian criterion for variable selection. In general, the literature on variable selection is vast, but we focused on the spike-and-slab prior in this study.

The application of a spike-and-slab prior can be extended to various problems, such as nonparametric function estimation or curve estimation. Smith and Kohn, 1996[115] applied a similar approach for univariate nonparametric regression. We closely studied their work for function estimation. Although their prior structure does not directly resemble a spike-and-slab setup, they also used an indicator variable γ, such that

    β_γ ∼ N(0, c σ² (X_γ' X_γ)⁻¹),   p(γ_i) = π_i

where X_γ is the design matrix with the columns whose regression coefficients are nonzero. Once we find p(γ | y), it is easy to sample using Gibbs sampling. Smith and Kohn, 1996[115] used this approach to select significant knots for an unknown function f(x) approximated by a cubic spline. Later, Ishwaran and Rao, 2005[57] started reshaping the spike-and-slab prior. They described the following setup in their paper:

    (Y_i | x_i, β, σ²) ind∼ N(x_i' β, σ²), (i = 1, ..., n)
    (β | γ) ∼ N(0, Γ),
    γ ∼ π(dγ),
    σ² ∼ µ(dσ²)

The above formulation is a general setup, where Γ is a diagonal matrix. The authors developed continuous bimodal priors which have separate spike and slab parts. Malsiner-Walli and Wagner[82] performed a thorough review of spike-and-slab priors. They argued that the selected indicator variables are independent conditional on the prior inclusion probability, but that marginal dependence is not a logical condition. Instead, they used an individual inclusion probability for each regression coefficient.

1.1.2 Penalization Methods

Variable selection has been extensively studied in the non-Bayesian literature too. Lasso (Tibshirani, 1996 [120]) is the most popular variable selection technique, initially developed for linear regression models. If p > n, the ordinary least squares estimator is not unique, and Tibshirani introduced the idea of l1 penalization on the regression coefficients, as formulated below:

    β̂_lasso(λ) = arg min_β { (1/n) ||y − Xβ||_2² + λ ||β||_1 }

where λ ≥ 0 is the penalty parameter. This estimator can drag many coefficients towards zero, depending on the sparsity of the design matrix and the extent of the penalization. The solution is a convex optimization problem and easy to implement. Lasso can perform variable selection in the sense that it can provide an exact zero solution for some components.
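To see that exact zeros really occur, here is a short sketch fitting the lasso on simulated sparse data with scikit-learn; the penalty value and problem dimensions are arbitrary illustrative choices, not settings from the dissertation.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    n, p = 100, 20
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [2.0, -1.5, 1.0]                     # only three active coefficients
    y = X @ beta_true + 0.5 * rng.standard_normal(n)

    # The l1 penalty sets many estimated coefficients exactly to zero.
    fit = Lasso(alpha=0.1).fit(X, y)
    print("indices of nonzero coefficients:", np.flatnonzero(fit.coef_))

Only a handful of coefficients survive the penalty, which is exactly the behavior exploited for variable selection.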
On the other hand, another old and popular penalization method, ridge regression, shrinks regression coefficients by penalizing their size. The ridge coefficients minimize a penalized residual sum of squares,

    β̂_ridge(λ) = arg min_β { (1/n) ||y − Xβ||_2² + λ ||β||_2² }

Similarly, λ, the complexity parameter, controls the amount of shrinkage. Ridge regression performs proportional shrinkage, and lasso truncates coefficients at zero. To understand the relationship, we use the famous picture of the lasso and ridge regression solution paths for two parameters in Figure 1.3. The solution path has elliptical contours centered around the OLS estimates. The constraint region for ridge is the disk, while for lasso it is the diamond. The solution is obtained when the elliptical contour first hits the constraint region. As the diamond has corners, lasso can have one solution exactly equal to zero. For p > 2, we can have more solutions exactly equal to zero. In the case of an orthonormal design matrix, we can have explicit solutions for lasso and ridge: ridge shrinks proportionally, while lasso uses soft-thresholding, as depicted in Figure 1.4.

Figure 1.3: Lasso vs Ridge solution path (plot obtained from on-line resources)
Figure 1.4: Threshold functions for orthonormal design (plot obtained from on-line resources)

Lasso and ridge regression estimates can both be viewed as Bayes estimates under different priors. Tibshirani, 1996 [120] suggested that lasso estimates can be obtained if the regression parameters have Laplace or double-exponential priors. Park and Casella, 2008[96] provided an easy-to-compute Gibbs sampler for the Bayesian lasso by introducing a scale mixture of normal priors on the regression coefficients:

    β | σ², τ_1², ..., τ_p² ∼ N_p(0, σ² D_τ),   D_τ = diag(τ_1², ..., τ_p²)
    τ_1², ..., τ_p² ∼ ∏_{i=1}^p (λ²/2) exp(−λ² τ_i²/2) dτ_i²
    π(σ²) ∝ 1/σ²

Although this method helps us computationally, we do not obtain exact zero estimates from the posterior median solution. Figure 1.5 comes from Park and Casella, 2008's paper, where it is clearly visible that by using lasso we can have a zero solution, in contrast to Bayesian lasso or ridge. The Bayesian lasso provides very small estimates for insignificant predictors, and one needs to impose one more level of thresholding to drop redundant variables from the model. This problem can be solved using a point-mass spike-and-slab prior on the regression coefficients. Xu and Ghosh, 2015[128] used this phenomenon for the Bayesian group lasso, which we discuss in the next section.

Figure 1.5: Comparison between Lasso, Bayesian Lasso and Ridge regression solution paths (plot obtained from on-line resources)

1.1.3 Bayesian Group Lasso

Grouping structures among predictors occur in many situations, such as when a model has categorical variables or functional predictors. The categorical predictors are represented by a group of dummy variables and, in the case of functional predictors, sometimes a group of basis functions is used in the model. Gene expression data forms groups in the form of gene pathways, and these groups have a natural correlation structure. The lasso solution is not entirely satisfactory in such scenarios, as it only selects individual predictors instead of the whole group. It also depends on how the dummy variables are orthogonalized. To deal with this problem, Yuan et al. [132] introduced the 'group lasso', which can select important grouped variables as a whole. Consider this model:

    Y_{n×1} = Σ_{g=1}^G X_g β_g + ε

where ε follows a normal distribution and β_g is the g-th grouped predictor among G total group variables. The group size can vary. Yuan et al.
[132] proposed group lasso, which is an extension of lasso, to handle grouped variables. The group lasso solution for linear regression is defined as

    min_{β_g} ||Y − Σ_{g=1}^G X_g β_g||_2² + λ Σ_{g=1}^G ||β_g||_2

In this dissertation, we deal with a generalized linear model with binary response. Hence, we are interested in how lasso and group lasso have been developed for logistic regression. The lasso penalty can be applied to logistic regression or multinomial logistic regression. Genkin et al. [42] applied the l1 penalty to their sparse model and used a Laplace prior to avoid overfitting. Instead of the residual sum of squares, they used a negative log-likelihood criterion and MAP estimates of the regression parameters. The first introduction of the group lasso penalty for the logistic regression model was observed in Kim et al. [66], where they proposed a gradient descent algorithm for estimation. Meier et al. [87] presented group lasso for a "large p" setting, which minimizes the negative log-likelihood function with a group penalty. The group lasso solution in their paper is defined as follows:

    min_{β_g} { − Σ_{i=1}^n [ y_i Σ_{g=1}^G X_{i,g} β_g − log(1 + exp(Σ_{g=1}^G X_{i,g} β_g)) ] + λ Σ_{g=1}^G √m_g ||β_g||_2 }

where m_g is the size of β_g. They propose a block coordinate gradient descent algorithm to find the solution to this minimization problem.

Despite the more common use of frequentist group lasso methods, Bayesian approaches may be more appropriate for the same problem. This is because standard lasso estimators do not provide meaningful standard errors that can be used for hypothesis testing or constructing confidence intervals. The lasso estimator has a complex limiting distribution, and inference based on it is complicated (Knight et al. [68], Chatterjee et al. [23]). On the other hand, a Bayesian prior-based formulation of lasso can provide reliable standard errors of estimates by obtaining the MAP estimators. In his seminal lasso paper, Tibshirani[120] stated that the posterior mode with an independent double-exponential prior for the regression coefficients is the same as a lasso estimator. Later, Park and Casella[96] developed a highly efficient Gibbs sampler for the Bayesian lasso by introducing scale mixture priors for the regression coefficients. The regression coefficients follow a normal prior, and the variance components of the normal distribution follow a Gamma prior, giving us a highly efficient way of posterior sampling. Based on Park and Casella's[96] work, Kyung et al.[69] provided a Bayesian hierarchical model for the Bayesian group lasso. As in penalized linear regression with grouped variables, the conditional prior of β given σ² can be written as

    π(β | σ²) ∝ exp( −(λ/σ) Σ_{g=1}^G ||β_g||_2 )

Kyung et al.[69] expressed this same prior as

    β_g | τ_g², σ² ∼ N_{m_g}(0, τ_g² σ² I_{m_g})
    τ_g² ∼ Gamma((m_g + 1)/2, λ²/2),   g = 1, ..., G

Li et al., 2015[75] worked with a high-dimensional varying-coefficient model equipped with Bayesian group lasso. They obtained group lasso estimators as posterior mode estimates under multivariate i.i.d. Laplace priors on the regression parameters. Although it is highly computationally efficient, a significant disadvantage of Park and Casella's[96] normal mixture prior setup is that it does not ensure an exact zero solution for the regression coefficients. This is also the case if we obtain the posterior mean or median from the prior setup by Kyung et al.[69] – it will not provide exact 0 estimates for the β_g's. Therefore, to impose sparsity at the group level, Xu et al.
[128] proposed a multivariate zero-inflated mixture prior for each β_g. The following is a hierarchical structure with an independent spike-and-slab prior for each β_g:

    Y | X, β, σ² ∼ N(Xβ, σ² I)
    β_g | τ_g², σ² ∼ (1 − π_0) N_{m_g}(0, σ² τ_g² I_{m_g}) + π_0 δ_0(β_g),   g = 1, ..., G
    τ_g² ∼ Gamma((m_g + 1)/2, λ²/2),   g = 1, ..., G
    σ² ∼ IG(α, γ)
    π_0 ∼ Beta(a, b)

where δ_0(β_g) denotes a point mass at 0. The mixing probability π_0 can be defined as a function of the number of predictors to impose more sparsity as the feature size increases. The choice of λ is very critical to control sparsity. Large values of λ will produce biased estimates, and very small λ values will impose a diffuse distribution for the slab part. Xu et al. [128] mentioned an empirical Bayes approach to estimate λ. Due to the intractability of the marginal likelihood, they proposed a Monte Carlo EM (Expectation Maximization) algorithm for the estimation of λ. Moreover, they have shown theoretically and numerically that median thresholding of the posterior β_g samples provides exact zero estimates for insignificant group predictors.

The class of spike-and-slab zero-inflated mixture priors was first introduced by Mitchell et al.[91]. It is highly useful for variable selection. In the zero-inflated mixture prior, the slab part assumes some known distribution at nonzero coefficients, and the spike part has point mass at zero coefficients. Later, George et al. ([43],[44]) used a zero-inflated normal mixture prior to build Gibbs sampling for variable selection. Narisetty et al.[94] worked with shrinking and diffusing priors. The predictors with zero or very small coefficients have variance tending to 0, as those coefficients reach a point mass at zero (spike), and the active predictors' variance tends to infinity as a diffuse prior (slab). On a different note, Lykou et al.[79] worked with Bayesian lasso variable selection; they assumed independent double-exponential distributions for the regression coefficients and concentrated on the shrinkage parameter λ. Bayes factor criteria are used to choose the selection vector and the shrinkage parameter. Zhang et al.[134] generalized this prior using a Dirac spike and slab where each coefficient group follows a normal distribution, as follows:

    β_g | γ_g, σ², τ_g² ∼ γ_g N(0, σ² D_{τ_g}) + (1 − γ_g) δ_0(β_g)

where D_{τ_g} = diag(τ_{g1}², ..., τ_{gm_g}²) and γ_g follows a Bernoulli distribution. The more interesting advancement in their work is the simultaneous selection of groups and the members within those groups. In addition, they incorporated group serial correlations with a Bayesian fused lasso technique for within-group selection. Xu et al. [128] also proposed an algorithm for bi-level selection in group lasso with a two-way sparsity assumption: among grouped predictors and within selected groups.
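Before turning to applications, a small numerical illustration of the median-thresholding property mentioned above: the sketch simulates posterior draws of one grouped coefficient under a zero-inflated mixture and reports the element-wise posterior median. All names and values here are our own illustrative choices, not the dissertation's code.

    import numpy as np

    rng = np.random.default_rng(0)
    n_draws, m_g = 2000, 3           # number of MCMC draws and group size (illustrative)
    pi0 = 0.7                        # posterior probability that the whole group is zero

    # Simulate posterior draws of beta_g: with probability pi0 the draw is exactly 0 (spike),
    # otherwise it comes from a normal slab.
    spike = rng.random(n_draws) < pi0
    slab_draws = rng.normal(loc=0.5, scale=0.2, size=(n_draws, m_g))
    beta_g_draws = np.where(spike[:, None], 0.0, slab_draws)

    # The posterior mean is never exactly 0, but the element-wise posterior median is
    # exactly 0 whenever more than half of the draws fall in the spike.
    print("posterior mean  :", beta_g_draws.mean(axis=0))
    print("posterior median:", np.median(beta_g_draws, axis=0))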
The number of predictors required for estimation often exceeds the number of observations, which introduces the problem of dimensionality. A regression framework is often the most suitable to model all possible longitudinal effects across ROIs, where the proposed method will select the important predictors. Moreover, many biomedical studies have shown that a limited number of specific brain regions or ROIs are essential for AD classification. Thus, dimension reduction tech- niques can be applied, and classification can be limited to the reduced feature set. Zhu et al. [136] proposed a method for classification and selection of functional predictors: first, functional prin- ciple component scores are calculated for each functional predictor; then, the functional principle component scores are used for the classification of each individual observation. They proposed using Gaussian priors for selection and developed a hybrid Metropolis-Hastings/Gibbs sampler algorithm. Although the method reported in the present study was inspired by this method, we developed a simple Gibbs sampler where MCMC samples are drawn from standard distributions. We also focused on applying penalized regression for dimension reduction. In the Bayesian variable selection literature, the spike-and-slab prior has widespread applications due to its superior selec- tion power. George et al. ([43],[44]) initially proposed that each coefficient β can be modeled either from the “spike” distribution, where most of its mass is concentrated around zero, or from the “slab” distribution, which resembles a diffuse distribution. Instead of imposing the spike-and-slab prior directly on regression coefficients, Ishwaran et al. [57] introduced a method in which they place 10 a spike-and-slab prior on the variance of Gaussian priors. The Bayesian variable selection meth- ods also include different Bayesian regularization methods, such as Bayesian lasso [96], Bayesian Group Lasso, and Bayesian elastic net [76]. We opted for spike-slab Bayesian group lasso which is extensively discussed in Xu et al. [128]. The group structure among the coefficients in our model comes from functional smoothing of the coefficients, and group lasso facilitates the selection of the important functional predictors. Thus, our proposed method uses the idea of Bayesian variable selection in a generalized functional linear model with binary responses. We model binary response yi ∈ {0,1} (i = 1, .., n) and predictors with logistic regression. We know that posterior sampling of logistic regression coefficients is difficult due to the model’s com- plicated likelihood function. In addition, we assume a Gaussian prior for regression coefficients, which makes the likelihood function analytically inconvenient. Full posterior sampling of param- eters requires candidate density following the Metropolis-Hastings algorithm. Bayesian inference for a probit model is comparatively easier [2]. Different sampling algorithms motivated by Albert et al.’s work [2] have been proposed. For instance, Holmes et al. [54] presented an indirect sampling method by introducing auxiliary variables for binary and multinomial regression. Later, more methods based on latent variables for logistic regression were advanced by Frühwirth-Schnatter et al. [41], Gramacy et al. [45] and Polson et al. [97]. Among all these works, Polson et al.’s algorithm is the most interesting to us due to its ease of computational implementation and its sampling efficiency. 
Our aim is to avoid the complex Metropolis algorithm while sampling from the posterior distributions of the regression coefficients. Holmes et al. [54] introduced an auxiliary variable to avoid the conditional non-conjugacy when updating β. The prior structure they used is

    y_i = 1 if z_i > 0, and 0 otherwise
    z_i = x_i β + ε_i
    ε_i ∼ N(0, λ_i)
    λ_i = (2ψ_i)²
    ψ_i ∼ Kolmogorov–Smirnov distribution
    β ∼ normal distribution

The above prior is interesting because the marginal likelihood L(β | data) is the same as the likelihood of the logit model. One can achieve a conjugate full conditional distribution of β given the data, but the conditional distribution π(λ_i | z_i, β) does not have any standard form. One has to use a complicated rejection sampling method to sample the conditional λ_i. Hence, adding an auxiliary variable does not give us a significant computational improvement compared to using the Metropolis-Hastings algorithm. Frühwirth-Schnatter et al. [41] addressed the problem with the same approach, but instead of using a single auxiliary variable they used a two-stage augmentation method. In the first stage, they assumed the existence of a latent variable, where the binary response variable is conditional on the sign of the auxiliary variable. They assumed the error part of the model to follow a type I extreme value distribution, which is a non-normal density. Then, in the second stage of data augmentation, this non-normal error distribution was approximated by a mixture of normal distributions. Finally, they obtained a multivariate normal distribution for the posterior of β. Among all the popular methods, Pólya-gamma augmentation is the most interesting to us, and it is easy to apply due to the availability of an R package.

1.3 Functional data smoothing

Here we discuss some of the key features of functional data or functional observations in the regression context. Throughout this thesis, we use linear combinations of basis functions to represent functions. The pivotal philosophy of functional data analysis is to consider observed data functions as single observations. Measurement error is a frequent problem in functional observations. In the following, we discuss the simplest case of the nonparametric regression problem: we observe (x_i, y_i), i = 1, ..., n, with

    y_i = f(x_i) + σ ε_i

where the x_i's follow a Uniform(0,1) distribution (i.i.d.). To estimate f̂(x), we must minimize E[ ∫_0^1 (f̂(x) − f(x))² dx ]. We need basis functions to estimate the unknown function. Basis functions have mathematical properties that help statisticians to approximate any function by taking a linear combination. Some popular basis functions are the Haar basis, Fourier basis, spline basis, wavelets, and polynomial bases, among others. In general, basis functions have the property of orthonormality. Now, if we have a set of basis functions φ_j in L²(0,1), then we can define f_J(x) = Σ_{j=0}^J θ_j φ_j(x), where θ_j = ∫_0^1 f(x) φ_j(x) dx. We plug in the estimator θ̂_j = (1/n) Σ_{i=1}^n y_i φ_j(x_i), such that E(θ̂_j) = θ_j. The choice of basis function is crucial in functional data analysis. We focus on B-spline basis functions in this thesis, and we discuss their properties in the following.

Spline functions are most often used to approximate non-periodic functional data. Splines are easy to handle computationally and provide better results than polynomial approximations. One can approximate most functions using a moderate number of basis functions.
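A small numerical sketch of this series estimator, using a cosine basis on (0,1) as the orthonormal system; the basis choice and the true function are illustrative choices of ours (the thesis itself works with B-splines).

    import numpy as np

    rng = np.random.default_rng(1)
    n, J = 500, 8
    x = rng.uniform(0, 1, n)
    f = lambda t: np.sin(2 * np.pi * t) + 0.5 * t            # true function (illustrative)
    y = f(x) + 0.2 * rng.standard_normal(n)                  # y_i = f(x_i) + sigma * eps_i

    # Orthonormal cosine basis on (0,1): phi_0 = 1, phi_j(x) = sqrt(2) cos(j*pi*x).
    def phi(j, t):
        return np.ones_like(t) if j == 0 else np.sqrt(2) * np.cos(j * np.pi * t)

    # theta_hat_j = (1/n) sum_i y_i phi_j(x_i), the plug-in estimator described above.
    theta_hat = np.array([np.mean(y * phi(j, x)) for j in range(J + 1)])

    grid = np.linspace(0, 1, 200)
    f_hat = sum(theta_hat[j] * phi(j, grid) for j in range(J + 1))   # reconstructed f_J
    print("max absolute error on the grid:", np.max(np.abs(f_hat - f(grid))))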
The first step in constructing splines is to divide the input space using breakpoints, which are called knots. Let ξ_0 and ξ_{K+1} be two boundary knots, which define the range of the domain over which we approximate our function. Now, we define:

• τ_1 ≤ τ_2 ≤ · · · ≤ τ_M ≤ ξ_0
• τ_{j+M} = ξ_j, j = 1, ..., K
• ξ_{K+1} ≤ τ_{K+M+1} ≤ · · · ≤ τ_{K+2M}

Although the additional knots are defined outside of the boundary points, in practice they are all set equal to ξ_0 and ξ_{K+1}, respectively. We define B_{i,k}(x) as the i-th B-spline basis function of order k with knot sequence τ, and they are constructed as follows:

    B_{i,1}(x) = 1 if τ_i ≤ x < τ_{i+1}, and 0 otherwise,   for i = 1, ..., K + 2M − 1

    B_{i,k}(x) = ((x − τ_i)/(τ_{i+k−1} − τ_i)) B_{i,k−1}(x) + ((τ_{i+k} − x)/(τ_{i+k} − τ_{i+1})) B_{i+1,k−1}(x),   for i = 1, ..., K + 2M − k

To understand the construction, see Figure 1.6, which we obtained from a paper by Dertimanis et al., 2018[30].

Figure 1.6: B-spline smoothing under varying orders and knots (plot obtained from online resources)

Now we can replace the general φ_j(x) basis functions with the B_{i,k}(x) basis functions, and a function f(x) can be approximated as f(x) = Σ_{j=1}^K β_j B_j(x). The least squares method can be used to estimate the unknown βs.

Knot selection is another domain of research that statisticians have already explored. Automatic knot selection can be achieved using model selection criteria such as cross-validation and Mallows' Cp in a linear regression framework. The aim is to choose a set of knots from a large number of candidate knots such that an optimization criterion is satisfied. The number of all possible models increases with the cardinality of the candidate knots. Some works to mention on this topic are Smith and Kohn, 1996[115], Denison et al., 1997[24], and DiMatteo et al., 2001[31]. Smith and Kohn, 1996[115] and Denison et al., 1997[24] utilized Bayesian variable selection to build MCMC methods for function estimation.

The problem of univariate function estimation can be extended to function selection in additive models. Many authors are interested in selecting significant functions for

    y_i = Σ_{j=1}^p f_j(X_{ij}) + ε_i

when p ≫ n. Scheipl et al., 2012[106] proposed a normal-mixture-of-inverse-gamma prior by introducing an indicator variable similar to the one used in the spike-and-slab prior. Their setup is for a generalized linear model, with the continuous response replaced by link functions. Scheipl et al., 2013[107] mentioned more than one method in their paper, including penalized likelihood with smoothness priors, and an indicator selection approach after transforming functions into a linear combination of centered B-spline expansions. Lan Xue[129] proposed a penalized polynomial spline method for simultaneous model estimation and variable selection in additive models, exploring the spline SCAD (Smoothly Clipped Absolute Deviation) penalty as a regularization method. Although Fan et al., 2015[33] did not engage in variable selection in their work, they did utilize a penalized least squares optimization technique to efficiently deal with high-dimensional functional predictors. They minimized a loss function with a penalized l2 norm of functions to find the estimated functions. McLean et al., 2013[85] explored both MCMC and variational Bayes algorithms to fit a functional generalized additive model. Huang et al.[56] tackled the same problem in a conventional non-Bayesian variable selection approach by applying adaptive group lasso. To start, they also decomposed functional predictors as linear combinations of B-spline basis functions and later penalized the unknown coefficients as groups. The Bayesian group lasso we built is inspired by their work, but we deal with a binary response, which leads to a generalized linear model.
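The B-spline recursion given above translates directly into code. The sketch below builds an order-k B-spline basis matrix by the Cox–de Boor recursion and fits a curve by ordinary least squares; it is a minimal illustration with our own knot choices and data, not the dissertation's implementation.

    import numpy as np

    def bspline_basis(x, knots, k):
        """Evaluate all order-k B-spline basis functions at the points x.

        knots is the full (augmented) non-decreasing knot sequence tau."""
        x = np.asarray(x)
        # Order 1: indicator of the knot interval, B_{i,1}(x).
        B = np.zeros((len(x), len(knots) - 1))
        for i in range(len(knots) - 1):
            B[:, i] = (knots[i] <= x) & (x < knots[i + 1])
        # Cox-de Boor recursion up to order k (terms with zero denominator are dropped).
        for order in range(2, k + 1):
            B_new = np.zeros((len(x), len(knots) - order))
            for i in range(len(knots) - order):
                left_den = knots[i + order - 1] - knots[i]
                right_den = knots[i + order] - knots[i + 1]
                left = (x - knots[i]) / left_den * B[:, i] if left_den > 0 else 0.0
                right = (knots[i + order] - x) / right_den * B[:, i + 1] if right_den > 0 else 0.0
                B_new[:, i] = left + right
            B = B_new
        return B

    # Cubic (order 4) basis with interior knots on (0,1) and repeated boundary knots.
    interior = np.linspace(0.1, 0.9, 9)
    knots = np.concatenate([[0.0] * 4, interior, [1.0] * 4])
    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0, 1, 200))
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
    X = bspline_basis(x, knots, 4)                           # n x (number of basis functions)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)         # least squares spline fit
    print("residual sd:", (y - X @ beta_hat).std().round(3))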
To start, they also decomposed functional predictors as a linear combination of B-spline basis functions and later penalized the unknown coefficients as groups. The Bayesian group lasso we built is inspired by their work, but we deal with binary response that is equivalent to a generalized linear model. 15 1.4 Bayesian model based clustering Cluster analysis is the modern data mining tool used to group or segment objects into clusters based on similarity measurements. Many available algorithms have been built based on a particular criterion that objects within a cluster are more similar to each other than to objects that belong to other clusters. Some popular clustering methods are K-means, agglomerative clustering, hierar- chical clustering, and model-based clustering. We focus more on model-based clustering in this chapter. A disadvantage of hierarchical clustering or K-means methods is that they are model-free heuristic methods. Model-based clustering is an alternative to provide an option to formulate a model while grouping objects. Model-based clustering was first introduced Banfield and Raftery, 1993[6], who assumed that observations come from multivariate normal distribution and that max- imum likelihood yields the estimates . Fraley and Raftery, 2002[37] introduced the finite mixture models as a formal setting for model based clustering. Let y1, .., yn are independently distributed p-dimensional observations from a K-component mixture distribution- f(y; τ, θ) = τk fk(yi|θk) (1.1) n K i=1 k=1 Here τk is the probability that a particular object belongs to kth cluster and θk parametrizes the density fk. In general, fk is multivariate normal density where fk ≡ MV N(µk, Σk). The EM algorithm helps us to get the estimates of unknown parameters. In a later study, Melnykov and Maitra, 2010[88] provided a thorough reference for model-based clustering. A point of subjectivity that is always attached to cluster analysis is choosing the number of clusters. Similar to K-means, model-based clustering needs a specified number of clusters at the start of the algorithm. Fraley and Raftery, 1998[36] tried to answer this question using model selection criteria. They suggested studying BIC over all possible cluster numbers and choosing the one with lowest BIC. Another study to mention is that of Yeung et al., 2001 [131], who applied model-based clustering to group gene expression data. Over time, researchers included other factors inside the model-based approach, such as variable selection[101] and penalizing the parameter space θ[95]. All the aforementioned studies have been done in a conventional statistical framework without using any prior information 16 on the cluster structure, the number of clusters, or unknown parameters (τ, θ). Only Bayesian model selection has been used to select optimal number of clusters. Handcock et al., 2007[46] described two methods to estimate unknown parameters: first, a two-stage maximum likelihood method, and second, a Bayesian estimation method. In the Bayesian part, regression parameters have a multivariate normal prior, and the mixture probability τ has a Dirichlet prior with MCMC estimation. Furthermore, Fraley and Raftery, 2007[38] discussed the problems of using the EM algorithm and where it can fail to converge. Instead of MLE, they preferred to work with the MAP estimator with conventional priors on mean and covariance parameters, such as MVN on mean and an Inverse-Wishart distribution on a covariance matrix. 
Medvedovic et al., 2004 [86] proposed a contrasting formulation of a Bayesian mixture-based clustering algorithm to group gene expression data with replicates. Interestingly, the Poisson-Dirichlet process is the most logical and convenient way of assuming priors on the number of clusters and the mixture probability if one does not want to limit the number of clusters when starting the algorithm [70]. The methods and literature described above have various advantages and can be applied to different complicated problems faced by researchers. Here we should reiterate the problem with which we are dealing: we aim to build a Bayesian model-based clustering method with the same white matter data used for the spatiotemporal modeling discussed in Section 5.1. We have longitudinal voxel-wise white matter measurements for healthy aging subjects, and we would like to group voxels into homogeneous regions considering the longitudinal and spatial information. To this end, we review the literature on spatiotemporal clustering methods. Authors in the computer science literature have examined this topic extensively. Kalnis et al., 2005[65] dealt with a highly complicated problem of detecting clusters among moving objects that change locations over time. For instance, they clustered trajectories and mined movement patterns for a group of migrating animals or a convoy of cars moving in a city. The complexity of this problem is manifold; we simplify our situation by assuming that the spatial locations of voxels are fixed over time. Kisilevich et al., 2009[67] described a detailed study of spatiotemporal clustering on trajectories and provided in-depth research development on this topic.

1.5 ADNI Longitudinal Data

Alzheimer's disease (AD) is the most common form of age-related neurodegeneration and dementia. More than 25 million people in the world are currently affected by dementia, with most suffering from AD, and around 5 million new cases occur every year. The number of people with dementia is anticipated to double every 20 years [99]. Some suggest that earlier diagnosis and intervention may offer the greatest potential for treatment, making the early detection of AD of the utmost clinical importance. AD symptoms are diagnosed via clinical neuropsychological and cognitive measures, including the Clinical Dementia Rating (CDR) scale. In addition to other standardized clinical neuropsychological measures, such as the Rey Auditory Verbal Learning Task or the Mini Mental Status Examination (MMSE), biomarkers from MRI and PET imaging are also used in diagnostic classification. Based in part on these scores, patients are diagnosed or classified into one of three primary categories: cognitively normal (CN), mild cognitive impairment (MCI), and AD. In this dissertation, we demonstrate a method to differentiate AD patients from CN persons via the differential nonlinear longitudinal trajectories in regional brain volumes. Various high-dimensional classification and regression methods have been proposed for biostatistical applications to neurological diagnostic methods for early-stage AD detection. The present study focuses on applying volumetric MRI data at the region of interest (ROI) level, and limiting the included features to more clinically established sub-regions of brain volumes in AD and aging. Unlike commonly used Bayesian classification methods, we emphasize the selection of the longitudinal trajectories of brain volumes as predictors for classification of patients as AD vs. CN. MRI is a valuable tool for in vivo assessment of brain biomarkers related to disease progression. Longitudinal patterns of brain atrophy follow a common pattern, with prominent volumetric loss in established and canonical regions, such as the hippocampus or the entorhinal cortex, as well as ventricular and sulcal expansion. The currently available, fully processed ADNI longitudinal MRI data includes cortical and white matter (WM) parcellation, surface area, and cortical thickness of
MRI is a valuable tool for in vivo assessment of brain biomarkers related to disease progression. Longitudinal patterns of brain atrophy follow a common pattern, with prominent volumetric loss in established and canonical regions, such as the hippocampus or the entorhinal cortex, as well as ventricular and sulcal expansion. The currently available, fully processed ADNI longitudinal MRI data includes cortical and white matter (WM) parcellation, surface area, and cortical thickness of 18 different brain regions. Instead of using all possible regional measurements, we focus on volumetric longitudinal changes in a more theoretically guided and limited number of brain sub-regions to classify patients. We obtained nearly 115 different sub-regions’ volume measurements from ADNI 1, ADNI 2, & ADNI GO. Leung et al. [74] compared MCI converters and MCI nonconverters considering longitudinal hippocampal volumes, and showed that MCI converters have higher rates of hippocampal atrophy. Reduced hippocampal volume has been proposed to diagnose AD earlier than clinical diagnosis [40]. We believe that measuring volumetric change over multiple occasions rather than at a single time-point will provide a more reliable estimate of quantitative change. In this dissertation, we concentrate on the selection of key brain regions that differ in longitudinal change between AD patients and CN older adults using a Bayesian variable selection method. In addition to prior findings related to the selection of volumetric MRI measurements, there are also numerous studies involving the selection of functional predictors. The key advantage of using longitudinal MRI data is that it captures the atrophy rates with time as patients’ progress through multiple disease stages. FreeSurfer software is able to measure accurate brain volumes and cortical thick- ness, and the ADNI website has 1.5T structural MRI data with parcellation performed by FreeSurfer. The pivotal contribution of this dissertation is that it explores functional data analysis with Bayesian variable selection. We extensively refer to the methodologies described in this introduction in the development of later chapters. Chapter 2 introduces the applicability of spike-and-slab priors for knot selection in function estimation. The research interest in function estimation is still prevalent in the statistics literature. Spike-and-slab group lasso, which was discussed earlier, works well for group selection and outperforms other methodologies. Chapter 4 presents extensive application of Bayesian group lasso to function selection. Moreover, it helps to establish some statistical properties. In addition, Bayesian classification is another field that is discussed later in this dissertation. The aim of this introduction was to help the reader better understand the perspective of this dissertation. 19 CHAPTER 2 NONPARAMETRIC FUNCTION ESTIMATION USING BAYESIAN VARIABLE SELECTION Bayesian variable selection for nonparametric function estimation is a challenging research field. We have tried to explore it with a large number of functional predictors in an additive model. We started with the knot selection problem for a single function estimation. 2.1 Univariate function estimation The literature on function estimation is quite old and has already established excellent mathe- matical properties. Bayesian methods have worked well for estimating functions with a penalized spline. However, we propose a new approach to function estimation by assuming uncertainty with knots. 
We describe our methodology in the following.

2.1.1 Introduction

Consider a very simple regression problem with data observed as (y_i, X_i), i = 1, ..., n, where X is a single covariate and Y is an n × 1 response variable. We regard a nonlinear model as a potential setup, such that the mean E(y|X) varies nonlinearly with the predictor. Then, we can replace the conventional regression problem of y_i = X_i β + ε_i with

    y_i = f(X_i) + ε_i, i = 1, ..., n

where f(.) belongs to some class of nonlinear functions. There are numerous methods available to estimate the unknown function, one of which is basis function expansion. This single covariate has a nonlinear component with respect to the response, which is modeled with a regression spline basis and can be written as

    f(x) = Σ_{j=1}^K β_j φ_j(x)   (2.1)

where the φ_j(x) are some orthonormal basis functions. Now, if we limit our study only to spline functions, the value K depends on the number of internal knots we are using. Let us assume that we have τ_1, ..., τ_K internal knots to approximate the function. Then, equation (2.1) can be written with a pth-degree spline model as

    f(x) = Σ_{j=1}^{K+p+1} β_j φ_j(x) = β_0 + β_1 X + · · · + β_p X^p + Σ_{j=1}^K β_{p+j} (X − τ_j)_+^p

Here (X − τ_j)_+ = max(0, X − τ_j), and as (X − τ_j)_+^p has p − 1 continuous derivatives, the smoothness of the spline function increases with the value of p. We work with cubic spline bases here.

To start the statistical problem, we assume that we have a large number of candidate knots, and we would like to build a data-driven algorithm that automatically selects important knots to accurately approximate a given function. The knots' locations are another point of interest. Various model selection criteria have been used before, such as cross-validation and Mallows' Cp. The number of possible models increases as the cardinality of candidate knots expands. There is the potential to over-fit the data if we use a large number of knots. One advantage of using spline bases is that the design matrix for the linear model is sparse, and we can exploit variable selection methods in the computation. We examine Bayesian variable selection as a potential method for knot selection in this problem. Our method implicitly assumes that f is well approximated by a linear combination of spline basis functions with some number of knots. In practice, we assume that f can be represented by a linear combination of basis functions. This class of basis splines is large and approximates any locally smooth function arbitrarily well. We note that our approach determines which knots are effective in estimating the spline function, and we closely follow the work on variable selection in linear regression models by George and McCulloch, 1993, 1997[43][44]. These authors first introduced the spike-and-slab prior with the following setup:

    β_i | γ_i ∼ (1 − γ_i) N(0, τ_i²) + γ_i N(0, c_i² τ_i²)

where γ_i is a latent variable that controls the number of significant coefficients or model size. Smith and Kohn, 1996[115] implemented another method of g-priors on the regression coefficients and applied this prior to univariate function estimation. On the other hand, DiMatteo et al., 2001[31] quantified two unknown quantities, the number of knots and the locations of knots, with a Poisson prior and a Dirichlet prior, respectively. We focus on George and McCulloch, 1993, 1997[43][44]'s approach in this section, as the conventional spike-and-slab prior has not been considered for function estimation.
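To make the setup concrete, the sketch below assembles the cubic truncated power basis design matrix for a dense grid of candidate knots; which knot columns survive is exactly what the spike-and-slab selection developed below decides. The knot placement, data, and names are illustrative choices of ours.

    import numpy as np

    def truncated_power_design(x, candidate_knots, degree=3):
        """Design matrix [1, x, ..., x^p, (x - tau_1)_+^p, ..., (x - tau_K)_+^p]."""
        x = np.asarray(x)
        poly = np.vander(x, degree + 1, increasing=True)                       # 1, x, ..., x^p
        trunc = np.maximum(x[:, None] - np.asarray(candidate_knots)[None, :], 0.0) ** degree
        return np.hstack([poly, trunc])

    rng = np.random.default_rng(5)
    x = np.sort(rng.uniform(0, 1, 150))
    y = np.sin(4 * np.pi * x) + 0.2 * rng.standard_normal(150)                 # y_i = f(X_i) + eps_i
    candidate_knots = np.linspace(0.05, 0.95, 30)                              # many candidate knots
    X = truncated_power_design(x, candidate_knots)                             # n x (p + 1 + K) matrix
    print(X.shape)   # (150, 34): 4 polynomial columns plus 30 knot columns to select from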
The Gibbs sampler is computationally feasible and easy to implement when the model has a large number of predictors. Once we integrate out the corresponding variance and regression coefficients, the resulting Bayesian variable selection converges much faster than alternative approaches.

2.1.2 Variable selection using the Spike-and-Slab Prior

A spike-and-slab model is often represented by a Bayesian hierarchical model. The most conventional definition is given below:

(Y_i | x_i, \beta, \sigma^2) \stackrel{ind}{\sim} N(x_i' \beta, \sigma^2), (i = 1, ..., n),
(\beta | \gamma) \sim N(0, \Gamma),
\gamma \sim \pi(d\gamma),
\sigma^2 \sim \mu(d\sigma^2),

where 0 is a K-dimensional zero vector, Γ is the K × K diagonal matrix diag(γ_1, ..., γ_K), π is the prior measure for γ = (γ_1, ..., γ_K)' and μ is the prior measure for σ². George and McCulloch (1993) [43] first developed the most widely used version of the spike-and-slab model. It identifies zero and nonzero β_i's by using zero-one indicator variables γ_i, assuming a scale mixture of two normal distributions:

(\beta_i | \gamma_i) \stackrel{ind}{\sim} (1 - \gamma_i) N(0, \tau_i^2) + \gamma_i N(0, c_i^2 \tau_i^2), \quad i = 1, .., K,

where τ_i² > 0 is some suitably small value, while c_i > 0 is some suitably large value. γ_i = 1 marks the β_i's that are significant; the variances of these coefficients are large, yielding larger posterior β_i values. The opposite occurs when γ_i = 0. The prior hierarchy for β is completed by assuming a prior for γ_i. As τ_i² tends to zero, more mass is placed on 0, which acts as the prior for the insignificant βs. The prior distribution for the regression coefficients can then be written as

(\beta_i | \gamma_i) \stackrel{ind}{\sim} (1 - \gamma_i) I_0 + \gamma_i N(0, \nu^2), \quad (*)

with I_0 a point mass at 0, where ν² is the limit of c_i² τ_i² when τ_i² tends to zero with a large enough c_i². We use prior (*) to select significant knots from the candidate knots.

2.1.2.1 Model description and Prior

We start with the simple linear model

y = X\beta + \varepsilon,

where y is the n × 1 vector of observations, X is the n × r design matrix, ε ∼ N(0, σ² I_n) is the error vector, and β = (β_1, ..., β_r)' is the r × 1 vector of regression coefficients. Let γ be the r × 1 vector of indicator variables with i-th element γ_i such that

γ_i = 1 when β_i ≠ 0, and γ_i = 0 when β_i = 0.

Given γ, let β_γ consist of all the nonzero coefficients of β and let X_γ be the columns of X corresponding to the components of γ that are equal to one. We then make the following prior assumptions:

1. Given γ and σ², the prior for β_γ is N(0, c σ² (X_γ' X_γ)^{-1}), where c is a positive scale factor. For our simulation study we used c = 100. This formulation of the βs is identical to the spike-and-slab prior defined in equation (*). For large values of c, the prior for β_γ contains very little information about β_γ compared to the likelihood. Taking the prior variance of β_γ proportional to σ² (X_γ' X_γ)^{-1} makes the Gibbs sampler very fast, and we believe this prior also carries the associative relationships among the βs.

2. The γ_i are assumed to be a priori independent with P(γ_i = 1) = 1 − P(γ_i = 0) = p_i (i = 1, ..., r), so that p(γ | Y) ∝ p(Y | γ) p(γ) contains the information relevant to variable selection. We started with p_i = 1/2, which implies no bias towards any model size.

3. Finally, for σ² we use the inverse gamma conjugate prior p(σ² | γ) ∼ IG(ν_γ/2, ν_γ λ_γ/2). We note that ν_γ and λ_γ may depend on γ to incorporate dependence between β and σ². For our practical example, we choose ν_γ ≡ 0 and λ_γ ≡ 0, which gives the commonly used prior for σ²: p(σ² | γ) ∝ 1/σ².
2.1.2.2 The Gibbs Sampler

The primary advantage of applying a conjugate hierarchical setup is that it enables analytical marginalization of β_γ and σ² from

p(β_γ, σ², γ | Y) ∝ p(Y | β_γ, σ²) p(β_γ | σ², γ) p(σ²) p(γ).

For a given γ, let q_γ = \sum_{i=1}^{r} γ_i be the number of nonzero elements of β and

S(γ) = y'y − \frac{c}{1 + c} y' X_γ (X_γ' X_γ)^{-1} X_γ' y.

Then

p(y | γ) ∝ \int \left\{ \int p(y | β_γ, σ²) p(β_γ | σ²) dβ_γ \right\} p(σ²) dσ² ∝ (1 + c)^{-q_γ/2} S(γ)^{-n/2};

β_γ is integrated out as a normal integral and σ² is integrated out as an inverse gamma integral. The posterior distribution of γ is

p(γ | y) ∝ p(y | γ) p(γ) ∝ (1 + c)^{-q_γ/2} S(γ)^{-n/2} \prod_{i=1}^{r} p_i^{γ_i} (1 − p_i)^{1−γ_i} ≡ g(γ).

The normalizing constant could be obtained by computing g(γ) for all values of γ, but to draw posterior samples we do not actually need it.

Gibbs sampler: We start with an initial vector γ^0 = (γ_1^0, .., γ_r^0) and then draw each member of γ from a Bernoulli distribution with probability

p(γ_i = 1 | y, γ_{j≠i}) = \frac{1}{1 + h}, where h = \frac{1 − π_i}{π_i} (1 + c)^{1/2} \left( \frac{S(γ^1)}{S(γ^0)} \right)^{n/2},

with γ^1 = (γ_1, .., γ_{i−1}, γ_i = 1, .., γ_r) and γ^0 = (γ_1, .., γ_{i−1}, γ_i = 0, .., γ_r). We repeat this step over a warm-up period and a sampling period until the sequence converges.

For large r, the above method takes a long time to execute, and the computational complexity is O(r³). However, an MCMC chain can still be used to find high-probability γ values. We can construct simple MCMC algorithms to simulate a Markov chain γ^1, ...., γ^K that converges in distribution to p(γ | y).

Metropolis-Hastings algorithms: Another way of sampling from g(γ) is to use Metropolis-Hastings (MH) algorithms. We start with a proposal density q(γ^0, γ^1), also called the candidate density, which for each γ^0 value is a probability distribution over γ^1 values. Once we determine the candidate proposal density q(γ^0, γ^1), the algorithm below populates the posterior sample of γ:

1. Generate γ^new with probability q(γ^0, γ^1).
2. Set γ^{(j+1)} = γ^new with probability

α_MH(γ^{(j)}, γ^new) = min\left\{ \frac{q(γ^new, γ^{(j)})}{q(γ^{(j)}, γ^new)} \frac{g(γ^new)}{g(γ^{(j)})}, 1 \right\},

and γ^{(j+1)} = γ^{(j)} otherwise.

Under weak conditions on q(γ^0, γ^1), the sequence obtained by this algorithm will be a Markov chain that converges to p(γ | y). In the case of a symmetric proposal density, i.e. if q(γ^0, γ^1) = q(γ^1, γ^0), Metropolis-Hastings becomes the simple Metropolis algorithm, and α_MH simplifies to

α_M(γ^{(j)}, γ^new) = min\left\{ \frac{g(γ^new)}{g(γ^{(j)})}, 1 \right\}.

One simple proposal density is

q(γ^0, γ^1) = \frac{1}{r}, if \sum_{i=1}^{r} |γ_i^1 − γ_i^0| = 1,

which alters the above MH algorithm to:

1. Generate γ^new by changing one element of γ^{(j)}.
2. Set γ^{(j+1)} = γ^new with probability α_M(γ^{(j)}, γ^new); otherwise, γ^{(j+1)} = γ^{(j)}.

This algorithm was proposed in Raftery, Madigan and Hoeting (1993) [53]. From the above algorithm, we estimate the γ with the highest probability of occurrence. We can use the estimated γ̂ to obtain the estimates of β and σ²: if we plug this γ̂ into the model and select only those covariates for which γ_i = 1, estimation of β and σ² becomes much easier. This is a huge advantage of using the spike-and-slab prior, and it is the main motivation behind using it.
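To make the sampler concrete, the following is a minimal R sketch (our own illustration, not the exact code used in this dissertation) of the simple Metropolis algorithm over the knot-inclusion indicators, using the marginalized quantity g(γ) ∝ (1 + c)^{-q_γ/2} S(γ)^{-n/2} ∏ p_i^{γ_i}(1 − p_i)^{1−γ_i} derived above. The response y and the spline design matrix X are assumed to be given, and the names log_g and ssvs_metropolis are ours.

```r
# Minimal sketch: simple Metropolis sampler over the knot-inclusion
# indicators gamma, using the marginal quantity g(gamma) derived above.
log_g <- function(gamma, y, X, c = 100, p_incl = 0.5) {
  q_g <- sum(gamma)
  if (q_g == 0) {
    S <- sum(y^2)                        # S(gamma) = y'y when no column is included
  } else {
    Xg  <- X[, gamma == 1, drop = FALSE]
    Xgy <- crossprod(Xg, y)              # X_g' y
    S   <- sum(y^2) - (c / (1 + c)) *
      drop(t(Xgy) %*% solve(crossprod(Xg), Xgy))
  }
  -0.5 * q_g * log(1 + c) - 0.5 * length(y) * log(S) +
    q_g * log(p_incl) + (length(gamma) - q_g) * log(1 - p_incl)
}

ssvs_metropolis <- function(y, X, n_iter = 1600, c = 100) {
  r <- ncol(X)
  gamma <- rep(1, r)                     # start with all candidate knots included
  draws <- matrix(NA, n_iter, r)
  cur <- log_g(gamma, y, X, c)
  for (it in seq_len(n_iter)) {
    i <- sample.int(r, 1)                # flip one randomly chosen indicator
    prop <- gamma
    prop[i] <- 1 - prop[i]
    new_lg <- log_g(prop, y, X, c)
    if (log(runif(1)) < new_lg - cur) {  # Metropolis acceptance step
      gamma <- prop
      cur <- new_lg
    }
    draws[it, ] <- gamma
  }
  draws
}
```

The most frequently visited configuration of γ across the sampled iterations can then serve as the estimate γ̂ that is plugged into the conditional updates of β and σ² described next.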
2.1.2.3 Updating β and σ²

We use George and McCulloch's (1993) [43] SSVS procedure to simulate a full parameter sequence. This method is based on the Gibbs sampler. At each step, we use the full conditional distributions of β and σ² to generate new samples. Let γ̂ be the final posterior estimate of γ:

p(y | β_γ̂, σ²) p(β_γ̂ | σ²) p(σ²) ∝ \left( \frac{1}{2πσ²} \right)^{n/2} \exp\left( −\frac{1}{2σ²} (y − X_γ̂ β_γ̂)'(y − X_γ̂ β_γ̂) \right) × \left( \frac{1}{σ²} \right)^{q_γ̂/2} \exp\left( −\frac{1}{2cσ²} (X_γ̂ β_γ̂)'(X_γ̂ β_γ̂) \right) × \frac{1}{σ²}.

The full conditional for β_γ̂ is multivariate normal with mean A^{-1} X_γ̂' y and variance σ² A^{-1}, where A = (1 + 1/c) X_γ̂' X_γ̂. Symbolically,

p(β_γ̂ | y, σ²) ∼ N(A^{-1} X_γ̂' y, σ² A^{-1}).

The full conditional distribution of σ² is inverse gamma. The terms in the joint distribution involving σ² are

(σ²)^{−(n/2 + q_γ̂/2 + 1)} \exp\left( −\frac{1}{2σ²} \left[ (y − X_γ̂ β_γ̂)'(y − X_γ̂ β_γ̂) + \frac{1}{c} (X_γ̂ β_γ̂)'(X_γ̂ β_γ̂) \right] \right),

so σ² is conditionally inverse gamma with shape parameter n/2 + q_γ̂/2 and scale parameter (1/2)[(y − X_γ̂ β_γ̂)'(y − X_γ̂ β_γ̂) + (1/c)(X_γ̂ β_γ̂)'(X_γ̂ β_γ̂)]. The Gibbs sampler cyclically samples from the distributions of β and σ², each conditional on the current value of the other parameter. Finally, we obtain the estimates of β and σ² from the posterior median.

2.1.3 Regression Splines

Let us assume

y_i = f(x_i) + ε_i, i = 1, ..., n,

where y_i is the i-th response, the error term ε_i follows N(0, σ²), and f(x) is the unknown smooth function that needs to be estimated. Under suitable smoothness assumptions, we propose to approximate f(x) using a normalized B-spline basis. One of the research aims here is to quantify the location and the number of knots. Badly placed knots will miss the function's features and provide poor approximations. On the other hand, if we use a large number of knots, we will have low bias but very high variance. The most convenient way to approach this problem is to introduce variable selection: one can assume that there are initially a large number of candidate knots and then select a small set of significant knots using variable selection (e.g., Friedman and Silverman, 1989 [39]). The problem can therefore be viewed as a variable selection problem, since the columns of the design matrix correspond to individual knots. This is what guides us to use the spike-and-slab prior. In the univariate case, we place knots at roughly a quarter of the observation positions, depending on the number of observations, but we try to stay below 40 knots. We want to ensure that the placed knots capture the curvature of the function. For our simulation study, we place knots after every fourth sorted observation.

2.1.3.1 Simulated example

In this section we study the performance of the Bayesian nonparametric estimators using simulated data. We used the following functions in the simulation:

1. f_1(x) = sin(2x) + 2 exp(−16x²)
2. f_2(x) = (φ(x, 0.15, 0.05) + φ(x, 0.6, 0.2))/4,

both of which are nonlinear functions. A well-tuned nonparametric kernel-based method can estimate these functions; we approximate them with a cubic B-spline basis. One hundred response observations are generated as y_i = f(x_i) + e_i with x_i ∼ U(0, 1) and e_i ∼ N(0, σ²). The choice of σ is made such that the signal-to-noise ratio sd(f)/sd(e) stays around 3. The number of initial knots is directly proportional to the number of observations, so that when n increases, so does the number of candidate knots. For our simulation, we start with n/4 candidate knots, which implies that as n increases, the selected set becomes relatively sparser. We place knots at every fourth position, which generates a total of r = 28 columns in the design matrix X with a B-spline of degree 3. The Gibbs sampler is run for 1,600 iterations, with sampling starting after the 100th iteration.
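As a sketch of the simulation setup just described (with our own choices for details the text leaves open, such as the exact knot positions), the R code below generates data from f_1, places candidate knots at every fourth sorted observation, and builds the cubic B-spline design matrix whose columns the knot-selection sampler then screens.

```r
library(splines)

set.seed(1)
n <- 100
x <- runif(n)                                   # x_i ~ U(0, 1)
f1 <- function(x) sin(2 * x) + 2 * exp(-16 * x^2)
sigma <- sd(f1(x)) / 3                          # signal-to-noise ratio around 3
y <- f1(x) + rnorm(n, sd = sigma)

# Candidate knots at every fourth sorted observation (roughly n/4 knots)
knots <- sort(x)[seq(4, n - 4, by = 4)]

# Cubic B-spline design matrix; its columns are the candidate basis functions
X <- bs(x, knots = knots, degree = 3, intercept = TRUE)
ncol(X)   # 24 interior knots + degree + intercept = 28 columns, as in the text
```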
If f̂(x) is the estimated unknown function, then we need a statistic to measure the accuracy of estimation. The integrated squared error (ISE) = ∫ {f(x) − f̂(x)}² dx over the unit interval is the statistic on which we focus. We created 400 equally spaced grid points z_i = i/400, i = 1, ..., 400, and compute

ISE = \frac{1}{400} \sum_{i=1}^{400} \left( f(z_i) − f̂(z_i) \right)^2.

We repeat the simulation 100 times. The ISE results are shown in the table below and compared to the kernel method of estimation. The simulation results with Bayesian variable selection show better performance than kernel estimation. We use a Gaussian kernel with bandwidth 0.10.

Table 2.1: Log(ISE) comparison with kernel methods of approximation

        r    Bayes mode log(ISE), Mean (sd)    Kernel log(ISE), Mean (sd)
f1(x)   28   -2.7 (0.05)                       -1.28 (0.03)
f2(x)   28   -0.92 (0.02)                      -0.49 (0.01)

Next, we illustrate the above method using the 'Chloride' data from Bates and Watts, 1988 [8].

2.1.3.2 Chloride concentration data

The Chloride data contain 54 observations of chloride concentration taken over an interval of time. We implement the same variable selection algorithm for knot selection discussed above, and we compare it to a fixed-knot basis approximation and linear regression estimates, as shown in Figure 2.1. We can state that the B-spline basis approximation with four internal knots works better than the Bayesian variable selection method. We obtain a mean log(ISE) of 0.23, and the posterior mode selects 11 of the 17 basis columns.

2.1.3.3 Discussion

The main drawback of this method is its sensitivity to the initial choice of basis. We need to account for the uncertainty in selection based on the properties of the corresponding basis functions. Here we work with cubic B-spline basis functions, which are assumed to be smooth. However, this assumption is not always valid when dealing with spike functions. In these situations, one requires prior information or needs to work with the Bayesian penalized spline method. Hence, prior information should be given on both the basis choice and the corresponding basis coefficients.

Figure 2.1: Nonparametric function estimation comparison between variable selection, linear regression, and fixed knot approximation

2.2 Functional additive model estimation with bi-level selection

We examine a different problem in this section: we seek to explore how Bayesian variable selection can perform function selection in additive models.

2.2.1 Introduction

Modern life sciences raise applied questions, driven by technological innovation, about the selection of influential components in regression models. The statistics literature is full of feature selection methods and their corresponding theories for linear models in which the number of predictors is very large compared to the sample size. The literature on variable selection is also rich for generalized linear models, hazard rate models, and so forth. On the other hand, the selection of significant functions in additive models is very recent. Consider the nonparametric additive model

Y_i = μ + \sum_{j=1}^{p} f_j(X_{ij}) + ε_i,

where μ is the intercept term, X_{ij} is the ith observation of the jth covariate X_j, the f_j's are unknown functions, and ε_i is an unobserved random variable with mean 0 and variance σ². The questions we aim to answer are: Are all f_j(X_j)'s significant? And based on the significant ones, can we develop an algorithm that handles the feature selection and estimates the true functions simultaneously?
We assume that not all f_j(X_j)'s are important and that we are dealing with a much larger number of functions than the sample size, i.e., p is larger than n. We further assume that the number of true functions in the model is still less than n. The estimation of the true functions, whether they are linear or nonlinear, is an important component of this chapter.

Both frequentist and Bayesian statisticians deal with model selection or function selection in additive models. Most of the ideas have come from variable selection in regression models with a large number of predictors. The most popular methods of variable selection are the lasso [Tibshirani, 1996][120], the SCAD penalty [Fan and Li, 2001][32], the adaptive lasso [Zou, 2006][138], and the group lasso [Yuan and Lin, 2006][132]. Bayesian variable selection methods, particularly for linear models, can be divided into two segments: one based on the spike-and-slab prior, and another that introduces selection indicators with the regression coefficients. George and McCulloch [1993, 1997][43][44] proposed the basic idea that each coefficient β_j can be modeled either from the "spike" distribution, where most of the mass is concentrated around zero, or from the "slab" distribution, which is a diffuse distribution. The introduction of selection indicators for variable selection is a closely related idea, where we decide whether a coefficient is in the model or not by using selection indicators. Smith and Kohn [1996][115] developed this notion by attaching selection indicators to the columns of the design matrix, which eventually selects the significant covariates. Furthermore, unlike George and McCulloch's spike-and-slab prior, which focuses on the β coefficients, Ishwaran and Rao [2005][57] placed a spike-and-slab prior on the variances of the Gaussian priors. These two ideas of Bayesian variable selection also extend to group selection. Ghosh and Xu [2015][128] described how spike-and-slab priors can be used to select groups of covariates as a form of Bayesian group lasso. Our method of Bayesian function selection implements Ghosh and Xu's [2015][128] ideas. First, we represent each function in our model by B-spline basis function coefficients, with a correspondingly large number of knots placed in each function. Next, we introduce two-stage spike-and-slab priors to find the significant functions and the significant knots within the true functions. In both situations, we assume that the number of functions in the model is large and that they are sparse (large p, small n), and that the knots introduced to estimate the functions are also sparse. Chen et al. [2016][24] published a paper on Bayesian sparse group selection using selection indicators, which is an extension of variable selection to grouping structures.

In the next section, we describe the construction of the functions in terms of B-spline basis functions and the application of the Bayesian sparse group lasso with a spike-and-slab prior for selecting significant functions and estimating them in "large p, small n" settings. Posterior median thresholding, as shown in Ghosh and Xu [2015][128], estimates both the null functions and their members as zero. As we use a large number of internal knots to estimate the functions, the spike-and-slab prior with median thresholding estimates the insignificant knots as zero based on the structure of the true functions. Section 2.2.3 presents simulation studies to evaluate the performance of our method.
2.2.2 Bayesian Sparse Group Lasso in nonparametric additive models

Suppose we have data (Y, X_j), j = 1, .., p, with X_j ∈ [a, b] and a, b < ∞. For the identifiability of the functions f_j, we assume that the f_j's are centered around zero, i.e., E f_j(X_j) = 0, 1 ≤ j ≤ p. For the model Y_i = μ + \sum_{j=1}^{p} f_j(X_{ij}) + ε_i, after centering the covariates and responses, the intercept term can be dropped for simplicity. The corresponding nonlinear components can be transformed into a linear regression model with spline coefficients. Thus, the problem of detecting a significant function in the additive model is equivalent to selecting groups of variables in a linear regression setup with a predefined grouping structure. We hence define each function f_j through basis functions.

We define the construction of B-splines as follows. Let d denote the degree of the B-spline, which implies an order of d + 1. Our next assumption is that a sequence of knots is placed in each function, where the number of knots is large but does not exceed the sample size; as the sample size increases, so does the number of knots. To define the sequence of knots, let a = ξ_0 < ξ_1 < · · · < ξ_K < ξ_{K+1} = b be a partition of [a, b] into K + 1 subintervals. In addition, define d knots ξ_{−d} = ξ_{−d+1} = .. = ξ_{−1} = ξ_0 and another set of d knots ξ_{K+1} = ξ_{K+2} = .. = ξ_{K+d+1}. The B-spline basis functions are defined as

B_{i,1}(x) = 1 if ξ_i ≤ x < ξ_{i+1}, and 0 otherwise,

B_{i,d+1}(x) = \frac{x − ξ_i}{ξ_{i+d} − ξ_i} B_{i,d}(x) + \frac{ξ_{i+d+1} − x}{ξ_{i+d+1} − ξ_{i+1}} B_{i+1,d}(x), for i = −d, .., K.

With the use of the additional knots we get precisely K + d + 1 basis functions. Thus any function f_j, defined on the function space F_j, can be written as

f_j(x) = \sum_{k=1}^{m_n} β_{jk} B_{jk}(x),

and the vector f_j of function values can be expressed as f_j = X_j β_j, with the basis function values B_{jk}(X_{ij}) as elements of the design matrix X_j. Finally, the model can be written as

Y = \sum_{j=1}^{p} X_j β_j + ε,

where the β_j's are the grouped coefficients corresponding to the basis functions in X_j. The error ε has mean 0 and variance σ².
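To illustrate the construction above, the following R sketch (our own, with hypothetical helper names) centers a B-spline basis for each covariate and stacks the p group design matrices X_1, .., X_p side by side, recording which columns belong to which function f_j; selecting the jth group of columns then corresponds to selecting f_j.

```r
library(splines)

# Build the grouped design for an additive model: one centered B-spline block
# per covariate, with group labels recording which columns belong to f_j.
make_group_design <- function(X, n_knots, degree = 3) {
  p <- ncol(X)
  blocks <- vector("list", p)
  for (j in seq_len(p)) {
    knots <- quantile(X[, j], probs = seq_len(n_knots) / (n_knots + 1))
    Bj <- bs(X[, j], knots = knots, degree = degree)
    blocks[[j]] <- scale(Bj, center = TRUE, scale = FALSE)  # enforce E f_j(X_j) = 0
  }
  list(design = do.call(cbind, blocks),
       group  = rep(seq_len(p), vapply(blocks, ncol, integer(1))))
}

# Example: p = 20 covariates, n = 100 observations, n/4 internal knots each
X  <- matrix(runif(100 * 20), 100, 20)
gd <- make_group_design(X, n_knots = 25)
dim(gd$design)   # n rows; n_knots + degree columns per function
table(gd$group)  # number of basis columns assigned to each f_j
```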
We now introduce the method of Bayesian sparse group selection with a spike-and-slab prior. The importance of this method is that we assume sparsity in the number of functions and also in the number of knots within each function. Simon et al. [2013][113] proposed the sparse group lasso to produce exact 0 coefficients at the group level and within a group. The sparse group lasso estimator of β is given by

\min_β \left\{ ||Y − \sum_{g=1}^{G} X_g β_g||_2^2 + λ_1 ||β||_1 + λ_2 \sum_{g=1}^{G} ||β_g||_2 \right\}.

A corresponding prior can be constructed as

π(β) ∝ \exp\left\{ − λ_1 ||β||_1 − λ_2 \sum_{g=1}^{G} ||β_g||_2 \right\},

which can be expressed as a scale mixture of normals. However, to select variables both at the group level and within a group, Ghosh and Xu [2015][128] developed a hierarchical spike-and-slab prior structure that shrinks coefficients to exactly 0 through posterior median thresholding. Two sets of spike-and-slab distributions, one at the group level and another at the individual level, combined with a posterior median estimator, have excellent variable selection and prediction performance. We describe the model specification exactly as Ghosh and Xu [2015][128] did in their paper. The coefficients β_j are reparameterized to handle two different degrees of sparsity: β_j = V_j^{1/2} b_j, where V_j^{1/2} = diag(τ_{j1}, .., τ_{jm_j}), τ_{jk} ≥ 0, j = 1, .., p; k = 1, .., m_j. To select variables at the group stage, the first spike-and-slab prior is introduced on b_j:

b_j \stackrel{ind}{\sim} (1 − π_0) N_{m_j}(0, I_{m_j}) + π_0 δ_0(b_j), j = 1, .., p.

When τ_{jk} = 0, β_{jk} drops from the model even though b_{jk} ≠ 0; τ_{jk} controls the magnitude of the elements of β_j. Hence, to choose variables within a group, Ghosh and Xu [2015][128] placed another spike-and-slab prior on the τ_{jk}'s:

τ_{jk} \stackrel{ind}{\sim} (1 − π_1) N^+(0, s²) + π_1 δ_0(τ_{jk}), j = 1, .., p; k = 1, .., m_j,

where N^+(0, s²) is the normal distribution N(0, s²) truncated below at 0, with mean s\sqrt{2/π} and variance s². The error variance σ² follows an inverse gamma distribution with shape α and scale γ:

σ² ∼ InverseGamma(α, γ).

Ghosh and Xu [2015][128] set up a hierarchical structure to decide the values of the hyperparameters π_0 and π_1:

π_0 ∼ Beta(a_1, a_2), π_1 ∼ Beta(c_1, c_2).

For s², a conjugate inverse gamma prior is placed, s² ∼ InverseGamma(1, t). We can update t with a Monte Carlo EM algorithm (Casella, 2001 [20]; Park and Casella, 2008 [96]). For the kth EM update,

t^{(k)} = \frac{1}{E_{t^{(k−1)}}[1/s² | Y]},

where the posterior mean of 1/s² can be obtained from the mean of the Gibbs samples of the (k − 1)th iteration.

Gibbs Sampler: Let β_{(j)} denote the β vector without the jth group, β_{(j)} = (β_1^T, .., β_{j−1}^T, β_{j+1}^T, .., β_p^T)^T, and let X_{(j)} be the design matrix corresponding to β_{(j)}. Similarly, let β_{(jk)} denote the whole coefficient vector without the kth element of the jth group:

β_{(jk)} = (β_{11}, .., β_{1m_n}, .., β_{j1}, .., β_{j,k−1}, β_{j,k+1}, .., β_{jm_n}, .., β_{p1}, .., β_{pm_n}).

The corresponding full conditional posterior distributions are:

b_j | rest ∼ l_j δ_0(b_j) + (1 − l_j) N_{m_j}(μ_j, Σ_j),

where l_j = P(b_j = 0 | rest),

l_j = \frac{π_0}{π_0 + (1 − π_0) |Σ_j|^{1/2} \exp\left\{ \frac{1}{2σ^4} ||Σ_j^{1/2} V_j^{1/2} X_j^T (Y − X_{(j)} V_{(j)}^{1/2} b_{(j)})||_2^2 \right\}},

with

μ_j = \frac{1}{σ²} Σ_j V_j^{1/2} X_j^T (Y − X_{(j)} V_{(j)}^{1/2} b_{(j)}), and Σ_j = \left( I_{m_n} + \frac{1}{σ²} V_j^{1/2} X_j^T X_j V_j^{1/2} \right)^{-1}.

The full conditional posterior of τ_{jk} is a spike-and-slab distribution:

τ_{jk} | rest ∼ q_{jk} δ_0(τ_{jk}) + (1 − q_{jk}) N^+(u_{jk}, ν_{jk}²), j = 1, .., p; k = 1, .., m_n,

where

u_{jk} = \frac{1}{σ²} ν_{jk}² (Y − X_{(jk)} β_{(jk)})^T X_{jk} b_{jk}, ν_{jk}² = \left( \frac{1}{s²} + \frac{1}{σ²} X_{jk}^T X_{jk} b_{jk}² \right)^{-1},

and q_{jk} = P(τ_{jk} = 0 | rest),

q_{jk} = \frac{π_1}{π_1 + 2(1 − π_1)(s²)^{-1/2} (ν_{jk}²)^{1/2} \exp\left\{ \frac{u_{jk}²}{2ν_{jk}²} \right\} Φ(u_{jk}/ν_{jk})}.

The error variance is updated as

σ² | rest ∼ Inverse Gamma\left( \frac{n}{2} + α, \frac{1}{2} ||Y − Xβ||_2^2 + γ \right).

With the conjugate beta and inverse gamma priors, the remaining posteriors are:

π_0 | rest ∼ Beta(#(b_j = 0) + a_1, #(b_j ≠ 0) + a_2),
π_1 | rest ∼ Beta(#(τ_{jk} = 0) + c_1, #(τ_{jk} ≠ 0) + c_2),
s² | rest ∼ Inverse Gamma\left( 1 + \frac{1}{2} #(τ_{jk} ≠ 0), t + \frac{1}{2} \sum_{j,k} τ_{jk}² \right).
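The practical output of this sampler is bi-level selection through the posterior median: a function f_j is estimated as null when the posterior medians of all of its coefficients β_jk = τ_jk b_jk are zero, and individual basis terms within a retained function are dropped when their own medians are zero. A minimal R sketch of this summary step, assuming the Gibbs draws of β have already been collected (the names are ours):

```r
# beta_draws: (number of Gibbs samples) x (total number of basis coefficients)
# group:      vector mapping each column of beta_draws to its function index j
# Both are assumed to come from a run of the bi-level Gibbs sampler above.
summarize_selection <- function(beta_draws, group) {
  beta_med <- apply(beta_draws, 2, median)               # posterior median thresholding
  selected_terms <- beta_med != 0                        # within-group (knot) selection
  selected_funcs <- tapply(selected_terms, group, any)   # group (function) selection
  list(beta_hat       = beta_med,
       functions_kept = which(selected_funcs),
       terms_kept     = which(selected_terms))
}
```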
2.2.3 Simulated Example

We apply Ghosh and Xu's [2015] bi-level Bayesian sparse group selection method to function selection and estimation in two simulated examples.

Example 1. We define four functions:

• f_1(x) = 5x
• f_2(x) = 3(5x − 1)²
• f_3(x) = 10 sin(2πx)/(2 − sin(2πx))
• f_4(x) = 6(0.1 sin(2πx) + 0.2 cos(2πx) + 0.3 sin(2πx)² + 0.4 cos(2πx)³ + 0.5 sin(2πx)³)

This generating model is the same as Example 1 of Huang et al. [2010]. We use p = 20 and n = 100. We employ n/4 internal knots for each function, so that in total we have p·n/4 coefficients. Again, our assumption is that those n/4 internal knots are sparse, and we will select the important knots from them. We use B-spline basis functions of degree 3 with n/4 internal knots. We set f_5(x) = f_6(x) = .. = f_20(x) = 0 as null functions. We generate the covariates as x_j = (w_j + t·u)/(1 + t) for j = 1, 2, 3, 4 and x_j = (w_j + t·v)/(1 + t) for j = 5, .., 20, where (w_j, u, v) ~ind N^+(0, 1) and t controls the amount of correlation. The covariates from the zero components and the non-zero components are independent. We work with t = 0 and t = 0.7, and a signal-to-noise ratio of 9.

Figure 2.2: Example 1 simulation plot

We run this simulation setup 100 times and examine the selection performance of the method, computing the number of true positives and true negatives over the 100 repetitions. In this model we have four true functions and the rest are null functions. Our first function is detected 20% of the time, and the other three functions are correctly detected 100% of the time. In contrast, the null functions are not selected in the model a single time in our 100 repetitions. We compute the ISE for each true function and plot them in a boxplot for the 100 repetitions, with the error computed for the overall response as well. A few other plots are included, such as actual vs. predicted plots for the response and plots of rejection probabilities for the true and null functions. All the results above are for the setup in which the predictors are not correlated.

In our next step, we run the same simulation setup with correlated predictors. The response is generated by additive models with the same four functions mentioned above and the same signal-to-noise ratio, but we set the correlation between covariates to 0.7. We run this setup for 100 repetitions and study the results. Out of 100 repetitions, the selection of the first function is very low at 6%, while the other three true functions are selected in 100% of the iterations. On the other hand, one null function comes up as significant once, i.e., the selection percentage for that null function is 1%. Below are the same plots that were given for the uncorrelated setup.

Figure 2.3: Example 1 simulation plot

Example 2. In the second example we take four different functions, defined as:

• f_1(x) = 3x
• f_2(x) = x + (2x − 2)²/5.5
• f_3(x) = −x + π sin(πx)
• f_4(x) = 0.5x + 15φ(2(x − 0.2)) − φ(x + 0.4),

where φ(·) is the standard normal density function. Again, in this example we use p = 20 and n = 100. We utilize n/4 internal knots for each function, so that in total we have p·n/4 coefficients, and again our assumption is that those n/4 internal knots are sparse for function estimation. We use B-spline basis functions constructed from the predictors. The predictors are generated independently from Uniform(−2, 2), and the signal-to-noise ratio is kept at 5. The simulation setup runs for 100 repetitions. Out of 100 repetitions, the four true functions are selected 75%, 28%, 85%, and 92% of the iterations, respectively, whereas the null functions are not selected a single time in the 100 repetitions. A few plots are included below to provide a better picture.

Figure 2.4: Example 2 simulation plot

In the next step, we generate the covariates from an AR(1) process with ρ = 0.7. In the case of correlated covariates, the relationship between predictors and response is still of theoretical interest. We repeat the simulation for the same set of functions mentioned above and gather the outputs. We run this setup for 100 repetitions and study the results. Out of 100 repetitions, the selection percentages of the true functions are 97%, 50%, 13%, and 97%, respectively. The selection of false positives is essentially zero: only one of the null functions appears as significant, in 1% of the repetitions. Below are the same plots as in the uncorrelated scenario.
Figure 2.5: Example 2 simulation plot

2.2.4 Conclusion

We explored an application of Bayesian variable selection for a sparse additive model. The motivation behind applying bi-level selection for function selection and estimation is to find the true nonzero functions and estimate them through knot selection. Our idea worked for a small number of functions: if we have 20 functions and 4 of them are significant, the proposed algorithm performs satisfactorily. However, the initial problem was to apply this to a large number of functional spaces. Therefore, more research is needed to draw a final conclusion.

CHAPTER 3
BAYESIAN CLASSIFICATION OF ALZHEIMER'S DISEASE STAGES FROM LONGITUDINAL VOLUMETRIC MRI DATA

3.1 Introduction

Alzheimer's disease (AD) is the most frequent neurodegenerative and age-related form of dementia. AD patients suffer loss of memory and difficulty with speech, and over time become unable to perform daily tasks such as bathing, dressing, eating, and using the bathroom. AD is becoming a significant societal and financial burden among elderly people. As many treatments are being developed and evaluated, it is important to be able to determine early and accurately which individuals are relatively more likely to progress clinically. Based on signs and symptoms, physicians usually track AD using the Clinical Dementia Rating (CDR), and subjects are classified into three states: cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer's disease (AD). AD accounts for between 60% and 80% of dementia cases. Brain MRI scans provide useful information regarding dementia and its progression. Many group studies based on volumetric regions of interest (ROI) ([105],[118]), voxel-based morphometry ([124],[125]), or group comparisons of cortical thickness ([55],[84]) have shown that many brain regions, such as the entorhinal cortex, the hippocampus, lateral and inferior temporal structures, and the anterior and posterior cingulate, are involved in brain atrophy in AD. The atrophy patterns vary across disease stages and across regions. According to Leung et al. [74], MCI converters have higher rates of brain volume decline than nonconverters do. Hence, hippocampal atrophy measured by MRI is a marker of AD pathology.

MRI provides scientifically accurate and easy-to-collect data on brain sub-regional volumes, surface area, and cortical thickness. Besides measuring the cross-sectional volume of the hippocampus or structural changes in other regions, there is great interest in longitudinal brain volume changes. The greatest advantage of using longitudinal data is that it can capture the high correlation between observations and the time trends, and it can track the rate of change over time. Significant hippocampal atrophy over time has been observed in AD groups ([7],[49],[60],[119]), whereas healthy aging control groups do not show such high rates of atrophy. In this chapter, we focus on brain regional volumetric changes over time for five ROIs measured by MRI scans: the hippocampus (H), entorhinal cortex (EC), middle temporal cortex (MTC), fusiform gyrus (FG), and the whole brain (WB). The key aim of this chapter is to apply a novel Bayesian classification method for classifying patients into disease groups by observing their longitudinal brain regional changes. Throughout this chapter, we work with dichotomous target variables: AD and CN.
A number of high-dimensional classification and regression methods have been developed for classifying patients at different stages of the disease, and also for predicting future clinical changes in MCI patients. Several machine learning algorithms are used to deal with high-dimensional data, such as support vector machines (SVMs) and linear discriminant analysis. Different approaches have been proposed in various papers ([50],[80],[90],[100]) to separate patients with dementia from those at other stages of the disease. Zhang et al. [133] applied a multi-kernel SVM for classification of patients using a longitudinal feature selection method; for classification between MCI and AD patients, they achieved 78.4% accuracy, 79% sensitivity, and 78% specificity. Lee et al. 2016 [72] used logistic regression with fused lasso regularization to predict the conversion from MCI to Alzheimer's. Seixas et al. [108] developed a Bayesian network model that combined expert knowledge and data-oriented modeling; the authors assumed a known, structured Bayesian network model and estimated its parameters using the EM algorithm.

In this chapter, we develop a Bayesian classification method using longitudinal volumetric data. The data were obtained from the ADNI database (http://www.loni.ucla.edu/ADNI). Longitudinal MRI data have variability in the number of observations for each subject. However, we assume that the longitudinal volumetric measurements are functional observations, and we consider patients with at least three data points. The longitudinal observations of each subject are smoothed using basis splines even if only a few observations are available. Finally, a Bayesian classification method is employed to classify individuals between AD and CN. Our classification method is unique and easy to implement, and it works with functional predictors. In the Bayesian literature, very few classification methods have been developed using functional predictors. In this context, we mention Zhu et al. [136], who worked with functional predictors for classification. They developed a selection method to obtain important functional predictors for a binary model and proposed two MCMC algorithms for posterior sampling, which are Metropolis-Hastings/Gibbs sampler hybrids. We propose the Bayesian Pólya-gamma augmentation method for classification; it is easy to implement using a Gibbs sampler algorithm. Functional predictors are smoothed by cubic basis splines, and the basis coefficients are used as predictors in the classification model. We obtain very good classification results in terms of sensitivity, specificity, and accuracy.

This chapter is divided into five sections. Section 3.2 discusses the functional smoothing and the classification method developed for the analysis. In Section 3.3, we explain the data in detail. Section 3.4 presents the numerical results and some of the key points we extract from our analysis. Section 3.5 concludes with a short discussion.

3.2 Methodology

In this section, we develop a supervised classification technique to handle functional predictors.

3.2.1 Smoothing functional data

Classification with functional data is a challenge. Functional predictors have high correlation between observations from the same individual, and the sole purpose of using functional predictors is to capture the time trend present in the data. Therefore, we take the approach of smoothing the observed curves with a cubic basis spline.
After we smooth a particular longitudinal functional curve, the resulting coefficient vector can be used as a predictor in the next stage of classification. To keep the problem tractable, we use data from patients who have observations from at least three time periods, so that the smoothed curves are comparable. [61] applied a similar approach to obtain the estimates of a generalized linear model with functional predictors; however, his approach was frequentist, whereas ours is Bayesian.

Let us assume that we observe n patients with their functional observations, where each patient has p functions (covariates). Let x_{ij}(t) be the j-th function observed at time point t from the i-th patient. Let T be the compact domain of x_{ij}(t) and x_{ij}(t) ∈ L²[T]. With the functional predictors x_{ij}(t) we assume that we have a binary response variable y_i that takes the values 0 and 1. Therefore, a logistic regression equation would look like

\log\left\{ \frac{P(y_i = 1 | x_{i1}, .., x_{ip})}{1 − P(y_i = 1 | x_{i1}, .., x_{ip})} \right\} = \sum_{j=1}^{p} \int_T x_{ij}(t) β_j(t) dt. (3.1)

To fit the discrete observations x_{ij}(t), we assume that at any given time t the observed value x_{ij}(t) is a noisy version of an underlying smooth curve X_{ij}(t),

x_{ij}(t) = X_{ij}(t) + e(t),

where e(t) is a zero-mean Gaussian process. We use a basis function expansion for X_{ij}(t) of the form

X_{ij}(t) = \sum_{k=1}^{q} c_{ijk} φ_{jk}(t) = c_{ij}' φ_j(t),

where φ_j(t) is the q-dimensional spline basis at time t for the jth function, and c_{ij} are the q-dimensional spline coefficients for the jth predictor from the ith patient. We use ordinary least squares estimates for the spline coefficients. A simple linear smoother is obtained by minimizing the least squares criterion ||x_{ij} − Φ c_{ij}||² as

\hat{c}_{ij} = (Φ'Φ)^{-1} Φ' x_{ij}. (3.2)

Once the basis coefficients have been estimated, we can combine (3.1) and (3.2) by plugging x̂_{ij}(t) into (3.1), which gives

\log\left\{ \frac{P(y_i = 1 | x_{i1}, .., x_{ip})}{1 − P(y_i = 1 | x_{i1}, .., x_{ip})} \right\} = \sum_{j=1}^{p} \int_T \hat{c}_{ij}' φ_j(t) β_j(t) dt = \sum_{j=1}^{p} \hat{c}_{ij}' β_j = c_i' β, (3.3)

where β_j = \int_T β_j(t) φ_j(t) dt is the coefficient vector for the jth functional predictor. Here c_i is the vector whose first element is 1, followed by the spline coefficients for the ith patient, and β contains the intercept of the model as its first element.
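A short R sketch of the smoothing step in equation (3.2): one patient's irregular longitudinal measurements of a single ROI are regressed on a cubic B-spline basis evaluated at the visit times, and the fitted coefficients ĉ_ij become that patient's predictor vector in (3.3). The function and variable names are ours, and the choice of four internal knots with an intercept (eight basis functions) matches the application in Section 3.4.

```r
library(splines)

# Least-squares smoothing of one patient's longitudinal ROI curve (eq. 3.2):
# regress the observed values on a cubic B-spline basis evaluated at the visit
# times and keep the fitted basis coefficients as that patient's predictors.
smooth_curve <- function(times, values, n_knots = 4) {
  knots <- quantile(times, probs = seq_len(n_knots) / (n_knots + 1))
  Phi <- bs(times, knots = knots, degree = 3, intercept = TRUE)
  drop(solve(crossprod(Phi), crossprod(Phi, values)))   # (Phi'Phi)^{-1} Phi' x_ij
}

# One hypothetical patient with ten visits (months) and made-up volumes;
# with very few visits the least-squares system can be rank deficient, so in
# practice the number of visits should not be far below the basis dimension.
times  <- c(0, 6, 12, 18, 24, 30, 36, 48, 60, 72)
volume <- 7200 - 5 * times + rnorm(10, sd = 20)
c_hat  <- smooth_curve(times, volume)
length(c_hat)   # n_knots + 4 = 8 basis coefficients per ROI per patient
```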
Functional principal component (FPC) analysis is also a popular method that could be applied here. Instead of least squares basis estimates, one can work with FPC scores for classification. [93] worked on functional modeling that extends the applicability of FPC analysis to longitudinal data; in particular, FPC scores can be used when we have few, irregularly observed, repeated data points. In our functional smoothing method, we expand the functional observation with spline basis functions and use the basis coefficients for classification; the same intuition applies to FPC scores. For functional principal component analysis, we assume that the longitudinal observations come from a smooth random function X(t) with mean function μ(t) = E X(t) and covariance function G(s, t) = cov(X(s), X(t)). The covariance function can be represented as

G(s, t) = \sum_{k=1}^{∞} λ_k φ_k(s) φ_k(t),

where the φ_k's are eigenfunctions and the λ_k's are eigenvalues. Then the underlying process can be written as

X(t) = μ(t) + \sum_{k=1}^{∞} ξ_k φ_k(t),

where the ξ_k's are frequently referred to as FPC scores. These scores can be used later in the classification model. We do not work with an infinite number of scores; instead, the above sum is approximated with a finite K that explains the majority of the variance in the functional observations. In most cases, the first two FPC scores are enough to build a good classification model. Zhu et al. [136] also used FPC scores in their classification model; they chose functional predictors with significant FPC scores. There are relatively few counterparts to functional PCA in the Bayesian literature. Behseta et al. [9] proposed Bayesian FPCA using two methods, one with a random-coefficient model and another with a hierarchical Gaussian process model, and found better performance for the hierarchical model. Later, in 2008, van der Linde [121] introduced a variational algorithm for one-parameter exponential families to obtain approximate Bayesian inference in functional PCA. Recently, Suarez and Ghosal [117] proposed a prior structure on the covariance function of functional observations; their model simultaneously smooths functional observations while estimating principal components. In this chapter, we work with the basis spline smoothing method due to its ease of implementation in statistical software. In R, the splines package fits cubic basis splines to longitudinal data with equally placed knots. We do not investigate any findings using FPC scores instead of basis spline coefficients, as our main focus is on the classification algorithm and the basis spline coefficients work very well for our classification model.

3.2.2 Classification using Pólya-Gamma Augmentation

In this section, we present the relationship between predictors and outcome using a logistic regression model. The binary data are denoted y_i ∈ {0, 1} (i = 1, .., n). Posterior sampling of logistic regression coefficients is difficult due to the model's complicated likelihood function. The assumption of a Gaussian prior for the regression coefficients is highly important, but the full posterior distribution of the regression coefficients then becomes analytically inconvenient. In comparison to the logistic model, it is computationally easier to carry out Bayesian inference using a probit model [2], and different sampling algorithms have been proposed [2]. Holmes et al. [54] developed an indirect sampling method by introducing auxiliary variables for binary and multinomial regression. Later, more methods based on latent variables for logistic regression were reported by Frühwirth-Schnatter et al. [41], Gramacy et al. [45] and Polson et al. [97]. Among all these works, Polson et al.'s algorithm is most interesting to us due to its ease of computational implementation as well as its sampling efficiency. Our aim is to avoid the complex Metropolis algorithm while sampling from the posterior distributions of the regression coefficients.

A vast body of literature is concerned with the analysis of Bayesian logistic models. As mentioned earlier, Holmes et al. [54] proposed using an auxiliary variable to avoid the conditional non-conjugacy when updating β. The prior structure they used was:

y_i = 1 if z_i > 0, and 0 otherwise,
z_i = x_i β + ε_i,
ε_i ∼ N(0, λ_i),
λ_i = (2ψ_i)²,
ψ_i ∼ Kolmogorov-Smirnov distribution,
β ∼ normal distribution.

The above prior is interesting because the marginal likelihood L(β | data) is the same as the likelihood for the logit model.
However, the main disadvantage of using this prior structure is that although we obtain a conjugate full conditional distribution of β given the data, the conditional distribution π(λ_i | z_i, β) does not have any standard form. One must use a complicated rejection sampling method to sample the conditional λ_i. Hence, adding an auxiliary variable does not give us a significant computational improvement compared to using the Metropolis-Hastings algorithm. Frühwirth-Schnatter et al. [41] addressed this problem with the same approach, but instead of a single auxiliary variable they used a two-stage augmentation method. In the first stage, they assumed the existence of a latent variable, with the binary response variable determined by the sign of the auxiliary variable. The error part of the model was assumed to follow a type I extreme value distribution, which has a non-normal density. Then, in the second stage of data augmentation, they approximated this non-normal error distribution using a mixture of normal distributions. Finally, they obtained a multivariate normal distribution for the posterior of β.

Among all popular methods, Pólya-gamma augmentation is the most interesting to us, and it is easy to apply due to the availability of an R package. We now discuss Polson et al.'s [97] algorithm in detail. They showed how a Gaussian variance mixture distribution with a Pólya-gamma mixing density can represent the logit likelihood. We start by defining the Pólya-gamma density. A random variable X follows PG(b, c), a Pólya-gamma distribution with parameters b > 0 and c ∈ ℝ, if

X \stackrel{d}{=} \frac{1}{2π²} \sum_{k=1}^{∞} \frac{g_k}{(k − 1/2)² + c²/(4π²)},

where the g_k ∼ Gamma(b, 1) are independent gamma random variables and \stackrel{d}{=} indicates equality in distribution. In their work, Polson et al. [97] showed that Bernoulli likelihoods parametrized by log-odds can be represented as mixtures of Gaussians with respect to a Pólya-gamma distribution. Assume we have latent variables ω such that ω ∼ PG(b, 0), which is an infinite sum of gammas:

ω \stackrel{d}{=} \frac{1}{2π²} \sum_{k=1}^{∞} \frac{g_k}{(k − 1/2)²}.

Now if ω ∼ PG(b, 0), the following identity can be proved:

\frac{(e^ψ)^a}{(1 + e^ψ)^b} = 2^{-b} e^{κψ} \int_0^∞ e^{−ωψ²/2} p(ω) dω, (3.4)

where κ = a − b/2.

Motivated by Polson et al.'s [97] integral result, we construct a Bayesian prior formulation targeted at binary logistic regression. Equation (3.3) has a Bernoulli likelihood function with a logit link, and we proposed a normal prior for the β vector in equation (3.3). To derive our Gibbs sampler, we introduce a latent variable ω. Our prior setup is

y_i | c_i, β ∼ Bernoulli\left( \frac{e^{c_i'β}}{1 + e^{c_i'β}} \right), i = 1, .., n,
ω_i ∼ PG(1, 0), i = 1, .., n,
β ∼ Normal(b, B). (3.5)

Gibbs Sampler: The likelihood for the ith observation is

L_i(β) = \frac{(e^{c_i'β})^{y_i}}{1 + e^{c_i'β}} ∝ e^{κ_i c_i'β} \int_0^∞ \exp\left\{ −\frac{ω_i (c_i'β)²}{2} \right\} p(ω_i) dω_i,

where κ_i = y_i − 1/2. We combine the likelihood function with the β prior, given ω = (ω_1, .., ω_n):

p(β | ω, y, c) ∝ p(β) \prod_{i=1}^{n} \exp\left\{ κ_i c_i'β − \frac{ω_i (c_i'β)²}{2} \right\}
∝ \exp\left\{ −\frac{1}{2}(z − Cβ)^T Ω (z − Cβ) \right\} × \exp\left\{ −\frac{1}{2}(β − b)^T B^{-1} (β − b) \right\}
∝ \exp\left\{ −\frac{1}{2}\left(β − C_ω(C^T Ω z + B^{-1} b)\right)^T C_ω^{-1} \left(β − C_ω(C^T Ω z + B^{-1} b)\right) \right\},

where z = (κ_1/ω_1, .., κ_n/ω_n), Ω = diag(ω_1, .., ω_n) and C_ω = (C^T Ω C + B^{-1})^{-1}. Finally, posterior samples for the regression coefficients can be computed with Pólya-gamma augmentation. The steps are

(ω_i | β) ∼ PG(1, c_i^T β),
(β | y, ω) ∼ N(C_ω(C^T Ω z + B^{-1} b), C_ω). (3.6)
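The two conditional draws in (3.6) are short enough to write out directly. Below is a minimal R sketch using the rpg function from the BayesLogit package for the Pólya-gamma draws; C is the n × q matrix holding the intercept and stacked basis coefficients, y the binary response, and (b, B) the normal prior on β. This is an illustration under our own naming choices rather than the exact code used for the analysis.

```r
library(BayesLogit)   # provides rpg() for Polya-gamma draws

# Gibbs sampler for Bayesian logistic regression via Polya-gamma augmentation
# (equation 3.6).  C: n x q design matrix; y: binary response in {0, 1}.
pg_logit_gibbs <- function(y, C, n_iter = 10000, burn = 5000,
                           b = rep(0, ncol(C)), B = diag(100, ncol(C))) {
  n <- nrow(C); q <- ncol(C)
  B_inv <- solve(B)
  kappa <- y - 0.5                                   # kappa_i = y_i - 1/2
  beta  <- rep(0, q)
  draws <- matrix(NA_real_, n_iter, q)
  for (it in seq_len(n_iter)) {
    omega <- rpg(n, h = 1, z = drop(C %*% beta))     # omega_i | beta ~ PG(1, c_i'beta)
    V <- solve(crossprod(C * omega, C) + B_inv)      # C_omega = (C'Omega C + B^-1)^-1
    m <- V %*% (crossprod(C, kappa) + B_inv %*% b)   # note C'Omega z = C'kappa
    beta <- drop(m + t(chol(V)) %*% rnorm(q))        # beta | y, omega ~ N(m, V)
    draws[it, ] <- beta
  }
  draws[(burn + 1):n_iter, , drop = FALSE]
}

# Usage sketch: posterior median of beta, then predicted probabilities
# beta_hat <- apply(pg_logit_gibbs(y, C), 2, median)
# p_hat    <- plogis(C_test %*% beta_hat)   # classify as AD when p_hat > 0.5
```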
Once we obtain the posterior samples of β, we can calculate the predicted probability for each patient. We use the posterior median estimate for β; that is, we calculate the median of the posterior samples of β and use those estimates for validation in the test data set. Finally, we need a threshold to classify each patient into either the AD or the CN group; we use a 0.5 probability threshold to convert the predicted probabilities into the dichotomous {1, 0} prediction. The classification method is implemented in the R package BayesLogit, whose sampler is highly efficient for drawing from the Pólya-gamma distribution with the first parameter equal to one. In 2013, Choi et al. [26] published a paper showing that "the Pólya-gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic". This result is crucial, as it guarantees that Monte Carlo averages of posterior samples obey a central limit theorem, and it is a strong reason to adopt this method for our analysis.

3.3 Data Description

The MRI data used in the analysis of this thesis were collected from the Alzheimer's Disease Neuroimaging Initiative (ADNI) website. We consider five main longitudinal predictors as potential markers of differences between AD patients and normally aging persons. Both AD patients and normally aging persons show brain region atrophy, and our aim is to find the essential markers of the atrophy that leads to dementia after a certain age. We have longitudinal volumetric measurements of the hippocampus (H), entorhinal cortex (EC), middle temporal cortex (MTC), fusiform gyrus (FG), and the whole brain (WB). We believe that these are the potential predictors that can distinguish AD patients from normally aging patients.

From the ADNI database, we have information on a total of 1,279 patients diagnosed as either AD or CN over their visits for tests. We discard five patients who changed their disease status from CN to AD over their longitudinal measurements. At the start of our analysis, we decided to keep patients who have at least three data points in their longitudinal measurements. After filtering by number of data points and excluding patients whose predictor-related information is missing, 528 patients are included in the study, divided into two groups: AD (n = 206) and CN (n = 322). The rest of the patients have the status of mild cognitive impairment (MCI). All patients went through an initial clinical evaluation to measure their baseline scores for different tests, such as the Mini-Mental State Examination (MMSE) and the 11-item Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS11). In addition, patients underwent a baseline structural MRI scan, giving us baseline measurements of all the predictors of interest, and at the start of the study, subjects provided apolipoprotein E (APOE) genotyping information. Data were collected from patients' visits at specific time points, such as after 6, 12, 18, 24, and 36 months. Patients' visits were irregular, and we do not have uniform gaps between consecutive measurements.

We start by comparing the baseline measurements between the AD and the CN group, as shown in Table 3.1. Although the age distribution and gender ratio are not significantly different between these two groups, the other three features, namely baseline MMSE, ADAS11 score, and APOE proportion, show significant differences between the two disease classes at the 5% level of significance. This implies that these three features can be used as strong predictors for classifying patients.
Furthermore, we examine the patterns and spread of the number of visits per patient. For all the patients we consider, the minimum observed time span is 10 months (with only three observations collected), while the longest is 10 years. In terms of the number of visits or data points, there is a minimum of three data points due to our threshold, and the maximum is 11; the median number of data points is around 4. Table 3.2 provides a detailed breakdown of this information for the AD and the CN group. In the spaghetti plot in Figure 3.1, we plot the predictor variables for each subject, segmented by disease status, along with another panel for patients' MMSE scores. The mean values for each group are highlighted in the plots. In the first five plots, we can see that the hippocampus and the other important variables have much higher values for the CN group than for dementia patients. We can also visually observe the MMSE score patterns, and Table 3.1 likewise shows much higher MMSE scores for normal patients. The goal is to predict the probability of a patient belonging to either of these groups from the measured MRI scans.

Table 3.1: Patients' Baseline Characteristics

                      AD              CN              p-value
n                     206             322
Age (Mean ± sd)       73.53 ± 7.67    74.15 ± 5.67    0.325
Gender (F/M)          97/109          155/167         0.814
MMSE (Mean ± sd)      24.38 ± 2.5     29.07 ± 1.13    <0.0001
ADAS11 (Mean ± sd)    16.8 ± 6.42     5.85 ± 2.87     <0.0001
APOE (+/-)            143/63          85/237          <0.0001

Notes: Comparison of baseline age, gender ratio, MMSE score, ADAS11 score and APOE ratio between the AD and CN groups.

Table 3.2: Data Characteristics

            Span (Months)            #Data Points (Nobs)
      n     Max    Min    Median     Max    Min    Median
AD    206   71     10     24         7      3      4
CN    322   120    11     36         11     3      4

* Duration of patients' visits and number of visits available for each group.

3.4 Application results

This section presents the numerical results of applying the method proposed in Section 3.2 to the ADNI data. We first deal with a single potential covariate, the hippocampus, as a representation of a single ROI's volume. Section 3.4.1 lists the results we obtain using the hippocampus as a single predictor. We repeat the same procedure in Section 3.4.2, listing all classification results using single ROIs. Section 3.4.3 presents the results for combinations of the ROIs, to determine the best model with the greatest classification power. We work with five potential predictors and a significant number of combinations of the five ROIs.

3.4.1 Classification using the Hippocampus

In this section, we consider the volumetric MRI data of the hippocampus to demonstrate the applicability of the proposed method with a single predictor.

Figure 3.1: Longitudinal volume of ROIs for dementia patients and normal controls. The first five plots (plots in the first column and the top two plots in the second column) are scaled ROIs of interest. The last plot has MMSE scores of patients. On the X-axis, we have the number of visits for patients. Blue lines are for the dementia group and orange lines are for the normal group. Thin lines represent each patient's data, and thick lines are the pooled mean for the AD and CN groups.

We start our analysis by smoothing the longitudinally observed values of the hippocampus. A simple least squares approximation is sufficient, as we assume that the residuals of the true curve are independently and identically distributed with mean 0 and constant variance. We use cubic B-spline basis functions for spline smoothing of the observed hippocampus volume.
Four internal knots are used for spline smoothing with an intercept, which gives us eight basis functions. We want to ensure that the smoothed estimated curve yields a good fit to each patient's observed curve. As we do not have a large number of data points for each patient, we do not attempt to control overfitting of the estimated curve. Figure 3.2(a) shows the observed hippocampus volume of patients in both groups for all visits. The black points represent the mean values of the smoothed curves for all patients at each time point or visit. The smoothed curve appears to track the variation in the volumetric measurements accurately. Besides least squares smoothing, functional principal component scores could also be used for this analysis.

At the start of the analysis, we divide the data set into two parts: three-quarters of the patients (457) are reserved as a training data set, and the rest (152) are kept for testing. For each patient in the training data set, we gather the basis coefficients used as predictors for classification. For classification, the β prior is a multivariate normal distribution with mean vector 0 and a diagonal covariance matrix with diagonal elements equal to 100. The large diagonal entries of the covariance matrix represent the absence of prior knowledge about β and correspond to a non-informative prior. The Pólya-gamma augmentation method is run for 10,000 iterations, and the initial 5,000 iterations are treated as the burn-in period for the Markov chain Monte Carlo. Figure 3.2(b) shows the predicted probability plot, with a red horizontal line at a probability value of 0.5. A probability threshold of 0.5 is used to classify patients as either AD or CN: patients with more than 0.5 probability are tagged as AD patients. Figure 3.2(c) presents a boxplot of the β coefficients estimated from Pólya-gamma augmentation. The coefficients are very small and mostly close to 0, except for the intercept, which is not shown in the plot.

We apply the method described in Section 3.2 to the training data and obtain the estimated β coefficients. We then apply the functional smoothing to the hippocampus volumes in the test data set. Using both the estimated coefficients and the basis matrices for this data set, we obtain the predicted probabilities. To check the robustness of our classification method, we repeat the procedure 100 times and examine the corresponding sensitivity, specificity, accuracy, and area under the ROC curve (AUC). Table 3.3 lists the results for all the important predictors we consider when used as single ROIs in the model, including the sensitivity, specificity, accuracy, and AUC means with standard deviations. We obtain 82% sensitivity, 88% specificity, and 86% accuracy for the classification using the hippocampus, which gives us strong confidence in the validity of our method. The corresponding ROC curve is shown in Figure 3.2(d).

3.4.2 Classification with a Single ROI

In the previous section, we showed the applicability of our proposed method using the hippocampus as a single ROI. We now evaluate the classification performance when we consider each of the five ROIs: H, WB, EC, FG, and MTC. We repeat our proposed procedure 100 times and calculate sensitivity, specificity, accuracy, and AUC on the one-quarter test data set, setting the threshold probability to 0.5 in all cases. The means and standard deviations of sensitivity, specificity, accuracy, and AUC are listed in Table 3.3.
Table 3.3: Classification performance using single ROIs

              Hippocampus    Whole Brain    Entorhinal Cortex   Fusiform Gyrus   Middle Temporal Cortex
              (Mean ± sd)    (Mean ± sd)    (Mean ± sd)         (Mean ± sd)      (Mean ± sd)
sensitivity   0.82 ± 0.01    0.75 ± 0.01    0.80 ± 0.01         0.73 ± 0.01      0.75 ± 0.01
specificity   0.88 ± 0.01    0.74 ± 0.01    0.90 ± 0.001        0.87 ± 0.01      0.87 ± 0.01
accuracy      0.86 ± 0.001   0.75 ± 0.001   0.86 ± 0.01         0.81 ± 0.01      0.83 ± 0.01
AUC           0.92 ± 0.02    0.83 ± 0.03    0.92 ± 0.02         0.89 ± 0.03      0.89 ± 0.03

Notes: Sensitivity is the proportion of correct AD predictions; specificity is the proportion of correct CN predictions; AUC is the area under the ROC curve. The means and standard deviations are based on 100 repeated results on test data sets. The probability threshold is 0.5.

It is evident from Table 3.3 that single ROIs are critical for distinguishing dementia patients from normally aging people. We obtain an accuracy rate of around 0.80 for single ROIs, and the AUC is around 0.9, which are excellent statistics. We also examine the sensitivity and specificity using single ROIs. A high sensitivity of 0.80 gives us confidence in our proposed method and implies that volumetric MRI data from a single ROI does have strong classification power.

Figure 3.2: (a) Mean smoothed hippocampus scores; (b) predicted class probability; (c) box plot of posterior median estimates of model coefficients using the hippocampus; (d) ROC curve for the hippocampus model. In (a), the colored points represent hippocampus values for each time point, and the black dot signifies the mean value of the smoothed curve for that time point. Orange is for CN and blue is for AD. In (b), the points are the predicted class probability for patients in the test data set. (c) presents the spread of the estimated regression coefficients from the Bayesian logistic model in 100 repeated runs. (d) shows the ROC curve after classifying patients using the hippocampus only.

3.4.3 Classification using combinations of ROIs

In the final stage of our analysis, we consider all combinations of ROIs. We first check the classification performance of our model using combinations of the five ROIs, and we later include the baseline MMSE, ADAS11 score, and APOE status. Table 3.1 showed that patients had significant differences in MMSE and ADAS11 scores at the beginning of the study; we include these variables to make our model stronger in terms of classification power. We describe our analysis in the following.

Before applying our method directly to the data, we make one change to the volumetric MRI ROIs: we normalize all the volumetric ROIs by dividing them by the intracranial volume (ICV), to bring each patient's ROI measurements to the same level. That is, we divide each patient's ROI volume at each time point by the corresponding ICV at that time point. We then calculate the basis coefficients from the longitudinal measurements of the normalized volumetric ROIs and use those coefficients, together with baseline MMSE, ADAS11, and APOE, in the Bayesian logistic model for each subject. The same Pólya-gamma augmentation is used to avoid the complicated Metropolis-Hastings algorithm. The following is the model for classification, with p_i = P(y_i = 1 | x_{ij}(t)), using the combination of H, WB, EC, FG and MTC:

logit(p_i) = β_0 + β_1 c_i^H + β_2 c_i^WB + β_3 c_i^EC + β_4 c_i^FG + β_5 c_i^MTC + β_6 MMSE_i + β_7 ADAS11_i + β_8 APOE_i, i = 1, ..., n. (3.7)

We use this model with our normalized data to classify patients into two groups: AD and CN.
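For completeness, a small R helper (our own illustration) that converts the predicted probabilities from the fitted model into the sensitivity, specificity, and accuracy figures reported in the tables, using the 0.5 probability threshold described above.

```r
# Sensitivity, specificity and accuracy at a 0.5 probability threshold.
# y_true: 1 = AD, 0 = CN;  p_hat: predicted probability of AD.
classification_metrics <- function(y_true, p_hat, threshold = 0.5) {
  y_pred <- as.integer(p_hat > threshold)
  tp <- sum(y_pred == 1 & y_true == 1)   # correctly identified AD patients
  tn <- sum(y_pred == 0 & y_true == 0)   # correctly identified CN subjects
  fp <- sum(y_pred == 1 & y_true == 0)
  fn <- sum(y_pred == 0 & y_true == 1)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(y_true))
}
```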
We obtain markedly better results than those observed using single ROIs; the classification power of the model is very high, with sensitivity reaching 93%.

3.4.4 Conclusion

We collected results for 26 different combinations of ROIs, excluding the results for single ROIs. The important points and conclusions based on those results are listed below:

• Overall, we obtain around 80% sensitivity, 90% specificity, and 85% accuracy using single ROIs as potential predictors. This implies that each individual ROI has strong classification power in differentiating AD patients from normally aging people. Among the five potential volumetric MRI measures, the whole brain yields the lowest sensitivity, specificity, and accuracy, while the hippocampus performs best.

• To compare different model results using combinations of ROIs, we start with the findings obtained from combinations of two ROIs. The overall sensitivity moves to around 80% and the specificity to around 90%. The hippocampus and fusiform gyrus together have the best performance. We do not use the normalized volumes of ROIs for these models.

• In comparing results between different combinations, we mostly emphasize sensitivity, as it indicates the percentage of patients correctly identified as Alzheimer's patients. We achieve more than 80% sensitivity for most of the combinations, and more than 85% sensitivity for the majority of combinations of three ROIs. Specifically, H+WB+FG has 91% sensitivity and 84% accuracy.

• Next, among combinations of four ROIs, H+WB+EC+FG has 0.91 sensitivity and 0.79 specificity, whereas H+WB+EC+MTC has 0.77 sensitivity and 0.89 specificity, indicating that including the middle temporal cortex (MTC) and excluding the fusiform gyrus (FG) decreases sensitivity and increases specificity.

• We obtain the best result when using all five ROIs together with baseline MMSE, ADAS11, and APOE in the model. This model achieves 96% specificity and 95% accuracy, which is high compared to the other combinations. The addition of the MMSE and ADAS11 scores clearly improves the performance of the method. In contrast, the five potential ROIs alone yield 81% sensitivity, 77% specificity, and 79% accuracy.

3.5 Discussion

We have proposed a Bayesian classification method motivated by a practical problem in the domain of Alzheimer's disease. The differentiation of Alzheimer's patients based on their MRI inputs is crucial in medical science. The importance of this problem is two-fold: first, doctors need to know the main factors involved with dementia in aging patients; and second, a computationally feasible and less time-consuming classification method is proposed here. Bayesian logistic regression with functional predictors is itself a challenge due to its analytically complicated likelihood function. The Metropolis-Hastings algorithm is a popular tool used by many researchers to tackle this problem, but the choice of candidate density is highly subjective and convergence can be slow. In this situation, Pólya-gamma augmentation provides sampling efficiency and is easily integrated into hierarchical Bayesian modeling. One more challenging question arises when working with functional predictors: how should the predictors themselves be selected? In our study, we work with five important ROIs and a few baseline score variables. The problem can be extended to a situation in which one has many functional predictors, some of them redundant.
It is crucial in modern Bayesian methodology to find a proper variable selection method for classification that also works with functional predictors. In the context of Alzheimer's disease, not all MRI measurements over time are important for predicting AD patients. In our next study, we build a data-driven method to provide a feasible solution to this problem.

In this chapter, we applied a classification method to distinguish AD patients from healthy controls. However, in the Alzheimer's disease domain, the differentiation of MCI converters (MCI-c) from MCI nonconverters (MCI-nc) is even more interesting. The main problem we faced in dealing with this situation was the low ratio of MCI-c to MCI-nc. The data set was sparse, and we did not find a sufficient number of disease cases in the test data on which to validate our method. In addition, most of the baseline MCI patients had very few data points in their longitudinal measurements.

Aside from the points mentioned above, our method performs very well for the given data. The best classification model is H+WB+EC+FG+MTC+MMSE_bl+ADAS11_bl+APOE_bl, with sensitivity = 0.93, specificity = 0.96, accuracy = 0.95, and AUC = 0.98; this is almost an oracle classifier. Our Bayesian logistic model uses very few functional predictor variables and performs much better than other existing methods. Our method also favors a cost-effective and less time-consuming data collection process, as fewer MRI measurements are needed for the analysis. In summary, the method proposed in this chapter shows good performance in distinguishing Alzheimer's patients from normally aging controls.

Acknowledgment
We would like to thank Yingjie Li [77] (liyingj1@stt.msu.edu) for providing us with the data used in this study.

CHAPTER 4
BAYESIAN PENALIZED MODEL FOR CLASSIFICATION AND SELECTION OF FUNCTIONAL PREDICTORS USING LONGITUDINAL MRI DATA FROM ADNI

4.1 Introduction

The research literature on applied mathematical approaches and classification methods using longitudinal MRI data has seen massive growth over the past decade. Among the broad range of methods applied with varying degrees of success, several warrant mention. Misra et al. [90] implemented a high-dimensional pattern recognition method on baseline and longitudinal MRI scans to predict conversion from MCI to AD over a 15-month period. Zhang et al. [133] used a multi-kernel SVM for classification of patients between MCI and AD, achieving 78.4% accuracy, 79% sensitivity, and 78% specificity. Lee et al. [72] applied logistic regression to predict conversion from MCI to Alzheimer's, using fused lasso regularization to select important features. Seixas et al. [108] proposed a Bayesian network decision model for detecting AD and MCI that accounted for the uncertainty and causality underlying different disease stages; their network combined expert knowledge and data-driven modeling, and the parameters were estimated with an EM algorithm. Adaszewski et al. [1] employed classical group analyses and automated SVM classification of longitudinal MRI data at the voxel level. Arlt et al. [4] studied the correlation between test scores over time and fully automated MRI-based volume estimates at baseline. However, few studies to date have developed methods that raise the sensitivity, accuracy, and specificity of classification of AD diagnosis or progression above 80%. Classification using longitudinal data can be a challenge with a large number of predictors.
The first significant step in handling longitudinal predictors is to consider each multiple-occasion observation as a single function observed over a time interval. Functional predictors are highly correlated with adjacent measurements, and the observational space is high-dimensional. The number of predictors required for estimation often exceeds the number of observations, which introduces the problem of dimensionality. A regression framework is often the most suitable way to model all possible longitudinal effects across ROIs, with the proposed method selecting the important predictors. Moreover, many biomedical studies have shown that a limited number of specific brain regions or ROIs are essential for AD classification. Thus, dimension reduction techniques can be applied, and classification can be limited to the reduced feature set.

Zhu et al. [136] advanced a method for classification and selection of functional predictors that entails calculating functional principal component scores for each functional predictor and then using these scores to classify each individual observation. They proposed Gaussian priors for selection and created a hybrid Metropolis-Hastings/Gibbs sampler algorithm. Although the method reported in the present study is inspired by this approach, we develop a simple Gibbs sampler in which MCMC samples are drawn from standard distributions. We also focus on applying penalized regression for dimension reduction. In the Bayesian variable selection literature, the spike-and-slab prior has widespread applications due to its strong selection power. George et al. ([43],[44]) initially proposed that each coefficient β can be modeled either from the "spike" distribution, where most of the mass is concentrated around zero, or from the "slab" distribution, which resembles a diffuse distribution. Instead of imposing the spike-and-slab prior directly on the regression coefficients, Ishwaran et al. [57] introduced a method in which they placed a spike-and-slab prior on the variance of Gaussian priors. Bayesian variable selection methods also include various Bayesian regularization methods, such as the Bayesian lasso [96], the Bayesian group lasso, and the Bayesian elastic net [76]. We employ a Bayesian group lasso algorithm blended with a spike-and-slab prior, following Xu and Ghosh, 2015[128]. The group structure among the coefficients in our model arises from the functional smoothing of the coefficients, and the group lasso facilitates the selection of the important functional predictors. Thus, our proposed method carries the idea of Bayesian variable selection over to a generalized functional linear model with binary responses.

The fundamental challenge of this work is to perform logistic regression in a Bayesian framework while using a large number of functional predictors. Direct sampling of the regression coefficients from the Bayesian logistic model is difficult due to its complicated likelihood function. In high-dimensional scenarios, the selection of predictors becomes crucial with the introduction of either a spike-and-slab prior, non-local priors, or horseshoe priors. For all such priors, the full posterior distribution of the regression coefficients is analytically inconvenient. We combine the Pólya-gamma augmentation method with the priors proposed by Xu and Ghosh, 2015[128], which yields full conditionals that can be sampled from standard distributions. Our aim is to avoid the complications of Metropolis-Hastings and to develop an easily implementable Gibbs sampler.
In addition, Bayesian estimation provides proper estimates of the model parameters, which are also useful for building inference. A key advantage of this method is that it calculates the log odds of AD with respect to CN based on the selected longitudinal predictors. Moreover, we use a probability threshold for classifying individual patients to validate the modeling performance. We obtained the data used in this dissertation from the ADNI server. The volumetric MRI brain data include parcellated sub-regions of the whole brain, with separate subdivisions for the left and right hemispheres. Volumetric measurements of brain sub-regions over multiple occasions demonstrate differential patterns of brain atrophy between AD patients and normally aging people. Because not all brain regions are equally related to AD, redundant features derived from unrelated brain regions can be removed by limiting the selection to brain sub-regions important for classification. The problem of identifying important brain sub-regions from a large number of functional predictors or longitudinal measurements is far from simple. Various variable selection methods have been designed for single-time-point data with corresponding target variables. We apply a Bayesian variable selection method to select longitudinal features or functional predictors for our data set. We work with 49 functional predictors consisting of longitudinal volumetric measurements in different sub-regional brain ROIs. The use of the spike-and-slab prior ensures that a large number of redundant predictors are dropped from the model. The ROI sub-regions selected by our method will be helpful for future studies aimed at detecting the progression of dementia.

The chapter is organized as follows. In Section 2, we introduce Bayesian variable selection with a spike-and-slab prior. Section 3 discusses functional smoothing of the longitudinal predictors. In Section 4, we introduce our methodology and algorithm for simultaneous selection and classification. Theoretical properties and consistency results are shown in Section 5. We then discuss the application results with simulated data and real data in Sections 6 and 7. Finally, Section 8 examines another potential modeling approach, and Section 9 covers the overall development and limitations of the methodology.

4.2 Bayesian Variable selection

We briefly discuss Bayesian variable selection below.

4.2.1 Spike-and-slab prior

A Bayesian model with a spike-and-slab prior can be constructed as follows:
\[
(Y_i \mid x_i, \beta, \sigma^2) \stackrel{\text{ind}}{\sim} N(x_i'\beta, \sigma^2), \quad i = 1, \dots, n,
\]
\[
(\beta \mid \gamma) \sim N(0, \Gamma), \qquad \gamma \sim \pi(d\gamma), \qquad \sigma^2 \sim \mu(d\sigma^2),
\]
where 0 is a p-dimensional zero vector, Γ is the p × p diagonal matrix diag(γ_1, ..., γ_p), π is the prior measure for γ = (γ_1, ..., γ_p)ᵗ, and µ is the prior measure for σ². Ishwaran et al. [57] proposed this setup and developed optimality properties based on the prior choice for (β | γ). A popular version of the spike-and-slab model, introduced by George et al. ([43],[44]), identifies zero and non-zero β_i's by using zero-one indicator variables γ_i and assuming a scale mixture of two normal distributions:
\[
(\beta_i \mid \gamma_i) \stackrel{\text{ind}}{\sim} (1 - \gamma_i)\,N(0, \tau_i^2) + \gamma_i\,N(0, c_i^2\tau_i^2), \quad i = 1, \dots, p.
\]
The value τ_i² > 0 is chosen suitably small, while c_i > 0 is chosen suitably large. γ_i = 1 corresponds to the β_i's that are significant; these coefficients have large posterior hypervariances and large posterior values of β_i. The opposite occurs when γ_i = 0.
The prior hierarchy for β is completed by assuming a prior for γ_i. As τ_i² tends to zero, more mass is placed at 0, which serves as the prior for insignificant β's. The prior distribution for the regression coefficients can then be written as
\[
(\beta_i \mid \gamma_i) \stackrel{\text{ind}}{\sim} (1 - \gamma_i)\,I_0 + \gamma_i\,N(0, \nu^2),
\]
with I_0 a point mass at 0, and ν² the limit of c_i²τ_i² as τ_i² tends to zero with c_i² sufficiently large.

4.2.2 Bayesian Group Lasso

We discussed the Bayesian group lasso extensively in the introduction. The form of Bayesian group lasso we work with originates in Xu and Ghosh, 2015[128]. A multivariate zero-inflated mixture prior induces sparsity at the group level, as discussed in detail in Xu and Ghosh, 2015[128]. They consider the following hierarchical structure, with an independent spike-and-slab prior for each β_g:
\[
Y \mid X, \beta, \sigma^2 \sim N(X\beta, \sigma^2 I),
\]
\[
\beta_g \mid \tau_g^2, \sigma^2 \sim (1 - \pi_0)\,N_{m_g}(0, \sigma^2\tau_g^2 I_{m_g}) + \pi_0\,\delta_0(\beta_g), \quad g = 1, \dots, G,
\]
\[
\tau_g^2 \sim \mathrm{Gamma}\!\left(\frac{m_g + 1}{2}, \frac{\lambda^2}{2}\right), \quad g = 1, \dots, G,
\]
\[
\sigma^2 \sim \mathrm{IG}(\alpha, \gamma), \qquad \pi_0 \sim \mathrm{Beta}(a, b),
\]
where δ_0(β_g) denotes a point mass at 0. The mixing probability π_0 can be defined as a function of the number of predictors to impose more sparsity as the feature size increases. The choice of λ is critical for Xu and Ghosh's prior setup: large values of λ produce biased estimates, while very small values of λ make the slab distribution too diffuse. Xu and Ghosh, 2015[128] mentioned an empirical Bayes approach to estimating λ; due to the intractability of the marginal likelihood, they proposed a Monte Carlo EM algorithm for its estimation. Moreover, they showed theoretically and numerically that median thresholding of the posterior β_g samples provides exactly zero estimates for insignificant group predictors.

4.3 Functional smoothing for longitudinal data

Classification with the selection of significant functional predictors is challenging. Researchers commonly observe high correlations between functional predictors. In this dissertation, we work under the assumption of independence between predictors; hence, we later propose a corresponding prior on the coefficient space. The main advantage of using functional predictors is that they allow us to capture the time trends present in the data. We start our methodology by smoothing the functional observations using cubic basis splines. We restrict our data set to patients with at least four time points of observation, so that the smoothed curves are comparable. Gareth M. James [61] used a similar approach to obtain the estimates of a generalized linear model with functional predictors. Assume that we observe n patients with their functional observations and that each patient has p functions; we assume that not all p functional observations are important. Let x_{ij}(t) be the jth function observed for the ith patient, let T be the compact domain of x_{ij}(t), and let x_{ij}(t) ∈ L²[T]. With the functional predictors (x_{i1}(t), ..., x_{ip}(t)), we assume a binary response variable y_i taking values 0 and 1. We also assume that the predictors have been centered, so that we can ignore the intercept term.
Therefore, we have the following logistic regression equation:
\[
\log\left\{ \frac{P(y_i = 1 \mid x_{i1}, \dots, x_{ip})}{1 - P(y_i = 1 \mid x_{i1}, \dots, x_{ip})} \right\} = \sum_{j=1}^{p} \int_{T} x_{ij}(t)\,\beta_j(t)\,dt. \tag{4.1}
\]
Next, we construct an orthonormal basis φ_k(t) that can be used to decompose the functional predictors and the corresponding logistic regression coefficients:
\[
x_{ij}(t) = \sum_{k=1}^{q} c_{ijk}\,\varphi_k(t), \qquad \beta_j(t) = \sum_{k=1}^{q} \beta_{jk}\,\varphi_k(t),
\]
where c_{ijk} and β_{jk} are the coefficients of x_{ij}(t) and β_j(t) with respect to the kth orthonormal basis function φ_k(t). For notational convenience, we denote the basis coefficients by β_{jk}; these are distinct from the functional coefficients β_j(t). We use cubic basis splines as the orthonormal basis for our simulation examples and real data applications, so the choice of q depends entirely on the number of internal knots used in the basis spline construction. The jth component in equation (4.1) can thus be written as
\[
\int_{T} x_{ij}(t)\,\beta_j(t)\,dt = \sum_{k=1}^{q} c_{ijk}\,\beta_{jk} = c_{ij}'\,\beta_j. \tag{4.2}
\]
To fit the discrete observations, we assume that at any given time t the observed value x_{ij}(t) is a noisy version of an underlying smooth curve X_{ij}(t):
\[
x_{ij}(t) = X_{ij}(t) + e(t),
\]
where e(t) is a zero-mean Gaussian process. We use the same basis function expansion for X_{ij}(t),
\[
X_{ij}(t) = \sum_{k=1}^{q} c_{ijk}\,\varphi_k(t) = c_{ij}'\,\varphi(t),
\]
where φ(t) is the q-dimensional spline basis at time t and c_{ij} is the q-dimensional vector of spline coefficients for the jth predictor of the ith patient. We use ordinary least squares to estimate the spline coefficients; a simple linear smoother is obtained by minimizing the least squares criterion ‖x_{ij} − Φc_{ij}‖² as
\[
\hat c_{ij} = (\Phi'\Phi)^{-1}\Phi' x_{ij}. \tag{4.3}
\]
Once the orthonormal basis coefficients have been estimated, we can combine (4.1), (4.2), and (4.3) by plugging $\hat x_{ij}(t)$ into (4.1), which yields
\[
\log\left\{ \frac{P(y_i = 1 \mid x_{i1}, \dots, x_{ip})}{1 - P(y_i = 1 \mid x_{i1}, \dots, x_{ip})} \right\}
= \sum_{j=1}^{p} \int_{T} \hat c_{ij}'\,\varphi(t)\,\beta_j(t)\,dt
= \sum_{j=1}^{p} \hat c_{ij}'\,\beta_j
= c_i'\,\beta, \tag{4.4}
\]
where $\beta_j^T = \int_T \beta_j(t)\,\varphi^T(t)\,dt$ is the coefficient vector for the jth functional predictor. Here the vector c_i has 1 as its first element, followed by the spline coefficients for the ith patient, and β contains the model intercept as its first element. For our real data and simulation applications we use the no-intercept form, in which $c_i' = (\hat c_{i1}', \dots, \hat c_{ip}')$ does not have 1 as its first element and $\beta_{pq\times 1} = (\beta_1^T, \dots, \beta_p^T)^T$ has a group structure with each group of size q. Our selection method drops the redundant β's and selects the important coefficient groups.

Functional principal component (FPC) analysis is another popular method that can be applied here: instead of least squares basis estimates, one can work with FPC scores for classification. Zhu et al. [136] also used FPC scores in their classification model and selected the functional predictors whose FPC scores were significant. Müller [93] extended the applicability of FPC analysis to modeling longitudinal data; in particular, FPC scores can be used when we have few, irregularly observed, repeated data points. In our functional smoothing method, we expanded the functional observations in spline basis functions and used the basis coefficients for classification; the same intuition also applies to FPC scores. For functional principal component analysis, we assume that the longitudinal observations come from a smooth random function X(t) with mean function µ(t) = E X(t) and covariance function G(s, t) = cov(X(s), X(t)).
The covariance function can be represented as $G(s, t) = \sum_{k=1}^{\infty} \lambda_k\,\varphi_k(s)\,\varphi_k(t)$, where the φ_k's are eigenfunctions and the λ_k's are eigenvalues. The underlying process can then be written as
\[
X(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_k\,\varphi_k(t),
\]
where the ξ_k's are commonly referred to as FPC scores. These scores can be used later in the classification model. We do not work with an infinite number of scores; instead, the above sum is approximated with a finite K that explains the majority of the variance in the functional observations. In most cases, the first two FPC scores are enough to build a good classification model. In this chapter, we work with the basis spline smoothing method due to its ease of implementation in statistical software: in R, the splines package fits cubic basis splines to longitudinal data with equally placed knots. We do not investigate any findings using FPC scores instead of basis spline coefficients, as our main focus is the classification algorithm, and the basis spline coefficients work very well for our classification model.

4.4 Simultaneous classification of binary response with selection of functional predictors

4.4.1 Classification using Pólya-Gamma Augmentation

In the following, we discuss Polson et al.'s [97] algorithm; these authors showed how a Gaussian variance mixture with a Pólya-gamma mixing density represents the logistic likelihood. We start by defining the Pólya-gamma density. A random variable X follows a Pólya-gamma distribution with parameters b > 0 and c ∈ ℝ, written X ∼ PG(b, c), if
\[
X \stackrel{d}{=} \frac{1}{2\pi^2}\sum_{k=1}^{\infty} \frac{g_k}{(k - \tfrac{1}{2})^2 + \tfrac{c^2}{4\pi^2}},
\]
where the g_k ∼ Gamma(b, 1) are independent gamma random variables and $\stackrel{d}{=}$ denotes equality in distribution. Polson et al.'s [97] main result parametrizes the log odds of the logistic likelihood as a mixture of Gaussians with respect to the Pólya-gamma distribution. The fundamental integral identity, which integrates naturally into a Gaussian prior hierarchy, is that for b > 0,
\[
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = 2^{-b} e^{\kappa\psi} \int_{0}^{\infty} e^{-\omega\psi^{2}/2}\,p(\omega)\,d\omega, \tag{4.5}
\]
where κ = a − b/2 and ω ∼ PG(b, 0). The introduction of the latent variables (ω_1, ..., ω_n) later helps us derive conjugate posterior distributions. The R package BayesLogit provides an efficient algorithm, proposed by Windle et al., 2014[126], for sampling from the Pólya-gamma distribution.

4.4.2 Selection using Bayesian Group Lasso

As discussed in Section 2.2, Meier et al.[87] developed the group lasso for logistic regression in a frequentist setup. In our model, we have p functional predictors (x_{i1}(t), ..., x_{ip}(t)) with binary response y_i ∈ {0, 1}, and each group has q levels. We can write the model as
\[
\log\left\{ \frac{P(y_i = 1 \mid x_{i1}(t), \dots, x_{ip}(t))}{1 - P(y_i = 1 \mid x_{i1}(t), \dots, x_{ip}(t))} \right\} = \sum_{j=1}^{p} c_{ij}'\,\beta_j = \eta_\beta(c_i).
\]
Following Meier et al.'s [87] method, the logistic group lasso estimator based on the basis spline coefficients is
\[
\hat\beta_{GL} = \min_{\beta}\left\{ -l(\beta) + \lambda\sum_{j=1}^{p}\sqrt{q}\,\|\beta_j\|_2 \right\},
\]
where $l(\beta) = \sum_{i=1}^{n}\left( y_i\,\eta_\beta(c_i) - \log\left(1 + \exp\{\eta_\beta(c_i)\}\right) \right)$ is the log-likelihood function. Before moving on to our proposed Bayesian method, we mention a similar model presented by Zhu et al. [136]: they used latent variables for Bayesian logistic regression, with FPC scores representing the functional predictors, and they proposed a normal prior for the concatenated coefficients, which play the same role as our coefficients β_j. Now, motivated by Polson et al.'s [97] integral result, we construct a Bayesian prior formulation targeted at binary logistic regression. Equation (4.4) has a Bernoulli likelihood function with a logit link.
We propose a spike-and-slab prior motivated by Xu and Ghosh, 2015[128], with a zero-inflated mixture prior that helps us select the important group coefficients. As previously described, we introduce latent variables (ω_1, ..., ω_n) to take advantage of the integral identity in equation (4.5). Our prior setup is
\[
y_i \mid c_i, \beta \sim \mathrm{Bernoulli}\!\left( \frac{\exp(c_i^T\beta)}{1 + \exp(c_i^T\beta)} \right), \quad i = 1, \dots, n, \qquad \omega_i \sim PG(1, 0), \quad i = 1, \dots, n,
\]
\[
\beta_j \mid \tau_j^2, \pi_0 \sim (1 - \pi_0)\,N_q(0, \tau_j^2 I_q) + \pi_0\,\delta_0(\beta_j), \quad j = 1, \dots, p,
\]
\[
\tau_j^2 \mid \lambda^2 \sim \mathrm{Gamma}\!\left( \frac{q + 1}{2}, \frac{\lambda^2}{2} \right), \quad j = 1, \dots, p, \qquad \pi_0 \sim \mathrm{Beta}(a, b). \tag{4.6}
\]

Gibbs sampler: the likelihood for the ith observation is
\[
L_i(\beta) = \frac{(e^{c_i^T\beta})^{y_i}}{1 + e^{c_i^T\beta}}
\propto \exp\{\kappa_i c_i^T\beta\}\int_0^{\infty}\exp\left\{ -\frac{\omega_i (c_i^T\beta)^2}{2} \right\} p(\omega_i)\,d\omega_i \quad \text{(from equation (4.5))},
\]
where κ_i = y_i − 0.5 and ω_i ∼ PG(1, 0). Considering all n independent observations, given the ω_i we can write the joint conditional likelihood as
\[
\prod_{i=1}^{n} L_i(\beta \mid \omega_i)
= \exp\left\{ \sum_{i=1}^{n}\left( \kappa_i c_i^T\beta - \frac{\omega_i(c_i^T\beta)^2}{2} \right) \right\}
\propto \exp\left\{ -\frac{1}{2}\sum_{i=1}^{n}\omega_i\left( c_i^T\beta - \frac{\kappa_i}{\omega_i} \right)^2 \right\}
= \exp\left\{ -\frac{1}{2}(z - C\beta)^T\Omega(z - C\beta) \right\},
\]
where z = (κ_1/ω_1, ..., κ_n/ω_n)^T and Ω = diag(ω_1, ..., ω_n).

Next, we combine the likelihood function with the β prior, given ω = (ω_1, ..., ω_n):
\[
p(\beta, \tau^2, \pi_0 \mid Y, C, \omega) \propto \exp\left\{ -\frac{1}{2}(z - C\beta)^T\Omega(z - C\beta) \right\}
\times \prod_{j=1}^{p}\left[ (1 - \pi_0)(\tau_j^2)^{-\frac{q}{2}}\exp\left\{ -\frac{1}{2\tau_j^2}\beta_j^T\beta_j \right\} I(\beta_j \neq 0) + \pi_0\,\delta_0(\beta_j) \right]
\]
\[
\times \prod_{j=1}^{p}(\lambda^2)^{\frac{q+1}{2}}(\tau_j^2)^{\frac{q+1}{2}-1} e^{-\frac{\lambda^2\tau_j^2}{2}}
\times \pi_0^{a-1}(1 - \pi_0)^{b-1}.
\]
Due to the Pólya-gamma augmentation, we can derive a block Gibbs sampler with standard posterior distributions for the β_j's. The same derivation appears in Xu and Ghosh, 2015[128] for continuous Y in the linear model setup; the block Gibbs sampler was introduced by Hobert et al.[52]. To build this sampler, we start with some notation. Let β_(j) denote the β vector without the jth group, β_(j) = (β_1^T, ..., β_{j−1}^T, β_{j+1}^T, ..., β_p^T)^T, with corresponding design matrix C_(j) = (C_1, ..., C_{j−1}, C_{j+1}, ..., C_p), where C_j is the design matrix for β_j. When β_j ≠ 0,
\[
p(\beta_j \mid \text{rest}) \propto \exp\left\{ -\frac{1}{2}(z - C_{(j)}\beta_{(j)} - C_j\beta_j)^T\Omega(z - C_{(j)}\beta_{(j)} - C_j\beta_j) \right\}\exp\left\{ -\frac{1}{2\tau_j^2}\beta_j^T\beta_j \right\}
\]
\[
\propto \exp\left\{ -\frac{1}{2}\left[ \beta_j^T\left( C_j^T\Omega C_j + \frac{1}{\tau_j^2}I_q \right)\beta_j - 2(z - C_{(j)}\beta_{(j)})^T\Omega C_j\beta_j \right] \right\}
\propto \exp\left\{ -\frac{1}{2}(\beta_j - A_j)^T B_j(\beta_j - A_j) \right\},
\]
where $B_j = C_j^T\Omega C_j + \frac{1}{\tau_j^2}I_q$ and $A_j = B_j^{-1}C_j^T\Omega(z - C_{(j)}\beta_{(j)})$. Hence the posterior full conditional of β_j is a spike-and-slab distribution,
\[
(\beta_j \mid \text{rest}) \sim (1 - l_j)\,N_q(A_j, B_j^{-1}) + l_j\,\delta_0(\beta_j), \quad j = 1, \dots, p, \tag{4.7}
\]
where l_j = p(β_j = 0 | rest).
Now we find the probability l_j:
\[
l_j = p(\beta_j = 0 \mid \text{rest})
= \frac{p(\beta_j = 0, y \mid C, \omega, \tau_j^2, \pi_0)}{p(\beta_j = 0, y \mid C, \omega, \tau_j^2, \pi_0) + \int_{\beta_j \neq 0} p(\beta_j, y \mid C, \omega, \tau_j^2, \pi_0)\, d\beta_j}
\]
\[
= \frac{p(y \mid \beta_j = 0, C, \omega, \tau_j^2, \pi_0)\,p(\beta_j = 0 \mid \tau_j^2, \pi_0)}
{p(y \mid \beta_j = 0, C, \omega, \tau_j^2, \pi_0)\,p(\beta_j = 0 \mid \tau_j^2, \pi_0) + \int_{\beta_j \neq 0} p(y \mid \beta_j \neq 0, C, \omega, \tau_j^2, \pi_0)\,p(\beta_j \neq 0 \mid \tau_j^2, \pi_0)\, d\beta_j}
= \frac{M\pi_0}{M\pi_0 + N(1 - \pi_0)},
\]
where $\pi_0 = p(\beta_j = 0 \mid \tau_j^2, \pi_0)$,
\[
M = p(y \mid \beta_j = 0, C, \omega, \tau_j^2, \pi_0) = \exp\left\{ -\frac{1}{2}(z - C_{(j)}\beta_{(j)})^T\Omega(z - C_{(j)}\beta_{(j)}) \right\},
\]
and
\[
N = \int_{\beta_j \neq 0} p(y \mid \beta_j \neq 0, C, \omega, \tau_j^2, \pi_0)\, d\beta_j
= \int_{\beta_j \neq 0} \exp\left\{ -\frac{1}{2}(z - C\beta)^T\Omega(z - C\beta) \right\} (2\pi\tau_j^2)^{-\frac{q}{2}}\exp\left\{ -\frac{\beta_j^T\beta_j}{2\tau_j^2} \right\} d\beta_j
\]
\[
= M \times (\tau_j^2)^{-\frac{q}{2}}\exp\left\{ \frac{1}{2}A_j^T B_j A_j \right\}\int_{\beta_j \neq 0} (2\pi)^{-\frac{q}{2}}\exp\left\{ -\frac{1}{2}(\beta_j - A_j)^T B_j(\beta_j - A_j) \right\} d\beta_j
= M \times (\tau_j^2)^{-\frac{q}{2}}\,|B_j|^{-\frac{1}{2}}\exp\left\{ \frac{1}{2}A_j^T B_j A_j \right\}.
\]
Hence,
\[
l_j = \frac{\pi_0}{\pi_0 + (1 - \pi_0)\,(\tau_j^2)^{-q/2}\,|B_j|^{-1/2}\exp\left\{ \frac{1}{2}A_j^T B_j A_j \right\}}. \tag{4.8}
\]

The posterior full conditional distributions of the other parameters are stated below; the derivations are given in the appendix.
\[
\left( \frac{1}{\tau_j^2} \,\Big|\, \text{rest} \right) \sim
\begin{cases}
\text{Inverse-Gamma}\!\left( \dfrac{q+1}{2}, \dfrac{\lambda^2}{2} \right), & \text{if } \beta_j = 0,\\[6pt]
\text{Inverse-Gaussian}\!\left( \dfrac{\lambda}{\|\beta_j\|_2}, \lambda^2 \right), & \text{if } \beta_j \neq 0,
\end{cases}
\qquad j = 1, \dots, p. \tag{4.9}
\]
Let G_j indicate whether a given group is selected,
\[
G_j = \begin{cases} 1, & \text{if } \beta_j \neq 0,\\ 0, & \text{if } \beta_j = 0,\end{cases}
\qquad
(\pi_0 \mid \text{rest}) \sim \mathrm{Beta}\!\left( p - \sum_{j=1}^{p} G_j + a,\; \sum_{j=1}^{p} G_j + b \right). \tag{4.10}
\]
We sample the augmented variables ω = (ω_1, ..., ω_n) using the posterior samples of β:
\[
(\omega_i \mid \beta) \sim PG(1, c_i'\beta), \quad i = 1, \dots, n. \tag{4.11}
\]

Finally, we are left with λ. λ is the most crucial parameter of our model and should be treated carefully: a large λ shrinks most of the group coefficients toward zero and produces biased estimates. In our real data analysis, we control λ by considering a range of candidate values. Xu and Ghosh, 2015[128] proposed a Monte Carlo EM algorithm for estimating λ; the kth EM update from their paper is
\[
\lambda^{(k)} = \sqrt{ \frac{p(q + 1)}{\sum_{j=1}^{p} E_{\lambda^{(k-1)}}\!\left[ \tau_j^2 \mid y \right]} }.
\]
For binary response y, this expected value of τ_j² is intractable in closed form; it can instead be approximated by the mean of the posterior samples of τ_j².
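To illustrate how these full conditionals fit together, the following is a minimal R sketch (not the authors' implementation) of one way to code the block Gibbs sampler in (4.6)-(4.11) with λ held fixed. It assumes the rpg function from the BayesLogit package for the Pólya-gamma draws and rinvgauss from the statmod package for the inverse-Gaussian draws, and that the columns of C are arranged in p consecutive groups of size q.

library(BayesLogit)   # rpg(): Polya-gamma draws
library(statmod)      # rinvgauss(): inverse-Gaussian draws

ssgl_pg_gibbs <- function(y, C, p, q, lambda = 1, a = 1, b = 1, n_iter = 20000) {
  n <- length(y)
  idx <- split(seq_len(p * q), rep(seq_len(p), each = q))   # column indices of each group
  beta <- rep(0, p * q); tau2 <- rep(1, p); pi0 <- 0.5
  kappa <- y - 0.5                                          # kappa_i = y_i - 1/2, so Omega z = kappa
  keep <- matrix(NA_real_, n_iter, p * q)
  for (it in seq_len(n_iter)) {
    omega <- rpg(n, h = 1, z = as.vector(C %*% beta))       # eq (4.11)
    for (j in seq_len(p)) {
      Cj <- C[, idx[[j]], drop = FALSE]
      r  <- kappa - omega * as.vector(C %*% beta - Cj %*% beta[idx[[j]]])  # Omega (z - C_(j) beta_(j))
      Bj <- crossprod(Cj * omega, Cj) + diag(1 / tau2[j], q)               # B_j
      Bj_inv <- chol2inv(chol(Bj))
      Aj <- Bj_inv %*% crossprod(Cj, r)                                    # A_j
      log_slab <- log(1 - pi0) - log(pi0) - (q / 2) * log(tau2[j]) -
        0.5 * as.numeric(determinant(Bj)$modulus) + 0.5 * drop(t(Aj) %*% Bj %*% Aj)
      lj <- 1 / (1 + exp(log_slab))                                        # eq (4.8)
      if (runif(1) < lj) {
        beta[idx[[j]]] <- 0                                                # spike
      } else {
        beta[idx[[j]]] <- as.vector(Aj + t(chol(Bj_inv)) %*% rnorm(q))     # slab: N_q(A_j, B_j^{-1})
      }
      if (all(beta[idx[[j]]] == 0)) {                                      # eq (4.9)
        tau2[j] <- rgamma(1, shape = (q + 1) / 2, rate = lambda^2 / 2)
      } else {
        tau2[j] <- 1 / rinvgauss(1, mean = lambda / sqrt(sum(beta[idx[[j]]]^2)), shape = lambda^2)
      }
    }
    G <- vapply(idx, function(k) any(beta[k] != 0), logical(1))            # selected groups
    pi0 <- rbeta(1, p - sum(G) + a, sum(G) + b)                            # eq (4.10)
    keep[it, ] <- beta
  }
  keep
}

Median thresholding then reduces to taking the element-wise posterior median of the retained samples after burn-in: a group that is set to the spike in more than half of the iterations receives an exactly zero estimate.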
4.5 Median thresholding and theoretical properties

4.5.1 Marginal prior for β_j

We first study the marginal priors of the β_j's to examine the theoretical properties of the Bayesian group lasso estimators and to establish the connection between the group priors on β_j and existing group lasso penalization methods. We integrate τ_j² out of the β_j priors. The marginal priors for the β_j's are calculated following Xu and Ghosh, 2015[128], extended to a binary rather than continuous response. For β_j ≠ 0,
\[
p(\beta_j \mid \pi_0) \propto \int p(\beta_j \mid \tau_j^2, \pi_0)\,p(\tau_j^2)\,d\tau_j^2
\propto (1 - \pi_0)\,(\lambda^2)^{\frac{q+1}{2}}\int_0^{\infty}(\tau_j^2)^{-\frac{q}{2}}\exp\left\{ -\frac{\beta_j^T\beta_j}{2\tau_j^2} \right\}(\tau_j^2)^{\frac{q+1}{2}-1}\exp\left\{ -\frac{\lambda^2\tau_j^2}{2} \right\} d\tau_j^2
\]
\[
\propto (1 - \pi_0)\,(\lambda^2)^{\frac{q+1}{2}}\exp\{-\lambda\|\beta_j\|_2\}\int_0^{\infty}(\alpha_j^2)^{-\frac{3}{2}}\exp\left\{ -\frac{1}{2}\left[ \frac{\lambda}{\alpha_j} - \alpha_j\|\beta_j\|_2 \right]^2 \right\} d\alpha_j^2
\propto (1 - \pi_0)\left( \frac{\lambda}{2} \right)^{q}\exp\{-\lambda\|\beta_j\|_2\},
\]
where α_j² = 1/τ_j². The marginal prior for the β_j's is therefore also a spike-and-slab prior, with a point mass at 0 and a slab given by a multivariate Laplace (M-Laplace) distribution, the same as the one considered in the Bayesian group lasso (Kyung et al., 2010[69]) and matching the penalization in the Bayesian adaptive lasso (Leng et al., 2014[73]):
\[
\beta_j \mid \pi_0 \sim (1 - \pi_0)\,\text{M-Laplace}\!\left( 0, \frac{1}{\lambda} \right) + \pi_0\,\delta_0(\beta_j). \tag{4.12}
\]
Together, the spike and slab components facilitate variable selection at the group level and shrink the coefficients of the selected groups.

4.5.2 Median thresholding as posterior estimates

We previously discussed obtaining the estimates of the selected group coefficients through median thresholding of the MCMC samples. Xu and Ghosh, 2015[128] generalized the median thresholding proposed by Johnstone et al., 2004[64] to the multivariate spike-and-slab prior; Johnstone et al., 2004[64] showed that median thresholding under a spike-and-slab prior for normal means has several desirable properties. In this section, we generalize this idea to a binary classification problem and show that the posterior median estimator performs group variable selection by returning zero coefficients for the redundant groups. We further show that the posterior median is a soft thresholding estimator that is consistent in model selection and has an optimal asymptotic estimation rate.

Focusing on a single group, Xu and Ghosh, 2015[128] proposed the following median thresholding result. Suppose
\[
Z_{m\times 1} \sim f(z - \mu), \qquad \mu \sim \pi_0\,\delta_0(\mu) + (1 - \pi_0)\,\gamma(\mu),
\]
where Z is an m-dimensional random variable, and γ(·) and f(·) are both density functions of m-dimensional random vectors, with f(t) maximized at t = 0. Let Med(µ_i | z) denote the marginal posterior median of µ_i given the data. By definition,
\[
c = \frac{\int f(-\nu)\,\gamma(\nu)\,d\nu}{f(0)} \le \frac{\int f(0)\,\gamma(\nu)\,d\nu}{f(0)} = 1.
\]

Theorem 1: Suppose π_0 > c/(1 + c). Then there exists a threshold t(π_0) > 0 such that, when ‖z‖_2 < t,
\[
\mathrm{Med}(\mu_i \mid z) = 0 \quad \text{for any } 1 \le i \le m.
\]

Next, we focus on our problem setup. If we assume that β_j follows a Gaussian prior, β_j ∼ N(0, B_j), and that the design matrix satisfies the condition C_j^TΩC_(j) = 0, then the posterior estimate of β_j is
\[
\hat\beta_j = \beta_j \mid \text{rest} \sim N(\mu_j, \Sigma_j), \qquad \Sigma_j = \left( C_j^T\Omega C_j + B_j^{-1} \right)^{-1}, \qquad \mu_j = \Sigma_j C_j^T\Omega z.
\]
According to Theorem 1, assuming π_0 > c/(1 + c), there exists t(π_0) > 0 such that the marginal posterior median of β_{jk} under prior (4.6) satisfies
\[
\mathrm{Med}(\beta_{jk} \mid \hat\beta_j) = 0 \quad \text{for any } 1 \le k \le q
\]
whenever ‖\hat\beta_j‖_2 < t. We interpret this result in the same way as Xu and Ghosh, 2015[128]: the median estimator of the jth group of regression coefficients is zero when the norm of the posterior estimate under any other prior distribution is below a certain threshold.

Posterior median as soft thresholding: We assume that C_j^TΩC_j = nI_q and that the C matrix is group-wise orthogonal, with C_j^TΩC_(j) = 0. We consider the model defined in (4.6) with fixed τ_{j,n}², which depends on n. In this setup, the posterior distribution of β_j is similar to the one derived in the previous section:
\[
\beta_j \mid C, y, \omega \sim (1 - l_j)\,N_q\!\left( \frac{1}{n}(1 - D_{j,n})\,C_j^T\Omega z,\; \frac{1}{n}(1 - D_{j,n})\,I_q \right) + l_j\,\delta_0(\beta_j),
\]
where $D_{j,n} = \frac{1}{1 + n\tau_{j,n}^2}$ and
\[
l_j = \frac{\pi_0}{\pi_0 + (1 - \pi_0)\,(1 + n\tau_{j,n}^2)^{-q/2}\exp\left\{ \frac{1}{2n}(1 - D_{j,n})\,\|C_j^T\Omega z\|_2^2 \right\}}.
\]
Then the marginal posterior distribution of β_{jk} (1 ≤ k ≤ q), conditional on the observed data, is a spike-and-slab distribution,
\[
\beta_{jk} \mid C, y, \omega \sim l_j\,\delta_0(\beta_{jk}) + (1 - l_j)\,N\!\left( \frac{1}{n}(1 - D_{j,n})\,C_{jk}^T\Omega z,\; \frac{1}{n}(1 - D_{j,n}) \right),
\]
where C_{jk} is the kth column of the group design matrix C_j.
The corresponding soft thresholding estimator is
\[
\hat\beta_{jk} = \mathrm{Med}(\beta_{jk} \mid C, y, \omega)
= \mathrm{sgn}\!\left( C_{jk}^T\Omega z \right)\left( \frac{1}{n}(1 - D_{j,n})\,\bigl|C_{jk}^T\Omega z\bigr| - \frac{1}{\sqrt{n}}\sqrt{1 - D_{j,n}}\;Q_j \right)_{+},
\]
where $z_+$ denotes the positive part of z and $Q_j = \Phi^{-1}\!\left( \frac{1}{2\,(1 - \min(\frac{1}{2}, l_j))} \right)$. Our result follows Xu and Ghosh, 2015[128]'s derivation of the soft thresholding form. Note in particular that the term D_{j,n} depends on τ_{j,n}², which controls the shrinkage factor.

Oracle property: Let β⁰, β_j⁰, and β⁰_{jk} be the true values of β, β_j, and β_{jk}, respectively. The index vector of the true model is $A = \left( I(\|\beta_j^0\|_2 \neq 0),\ j = 1, \dots, p \right)$, and the index vector of the model selected by a thresholding estimator $\hat\beta_j$ is $A_n = \left( I(\|\hat\beta_j\|_2 \neq 0),\ j = 1, \dots, p \right)$. Model selection consistency is attained if and only if $\lim_{n\to\infty} P(A_n = A) = 1$.

Theorem 2: Assume the design satisfies C_j^TΩC_j = nI_q. Suppose $\sqrt{n}\,\tau_{j,n}^2 \to \infty$ and $\log(\tau_{j,n}^2)/n \to 0$ as n → ∞, for j = 1, ..., p. Then the median thresholding estimator has the oracle property, that is, variable selection consistency:
\[
\lim_{n\to\infty} P\!\left( A_n^{\mathrm{Med}} = A \right) = 1.
\]
The proof follows the same steps as the proof of Theorem 4 in Xu and Ghosh, 2015[128].

4.5.3 Posterior consistency

In this section, we investigate the convergence of the group lasso estimator model to the true model. To show model consistency, we rely on the results and theorems in the paper "On the consistency of Bayesian variable selection for high dimensional binary regression and classification" by Jiang, 2006[62]. In that paper, the author sets up Bayesian variable selection in the manner of Smith et al., 1996[115] by introducing a selection indicator vector γ = (γ_1, ..., γ_p) with γ_i ∈ {0, 1}. The corresponding prior setup is
\[
y = X\beta + \epsilon, \qquad \beta_\gamma \sim N\!\left( 0,\; c\,\sigma^2\,(X_\gamma^T X_\gamma)^{-1} \right), \qquad \gamma_i \sim \mathrm{Bernoulli}(\pi), \quad i = 1, \dots, p, \qquad (\sigma^2 \mid \gamma) \sim 1/\sigma^2.
\]
We can establish a direct connection between our model and the above penalized regression. We re-parametrize the group coefficient vectors as β_j = γ_j b_j, where γ_j, j = 1, ..., p, is the 0/1-valued selection indicator. As shown in Section 4.5.1, the marginal prior of β_j follows a multivariate Laplace distribution, so we can place a Bernoulli prior on γ_j:
\[
b_j \mid \lambda \sim \text{Multi-Laplace}\!\left( 0, \frac{1}{\lambda} \right), \qquad \gamma_j \sim \mathrm{Bernoulli}(1 - \pi_0), \quad j = 1, \dots, p. \tag{4.13}
\]
The marginal prior distribution of β_j is the same as in equation (4.12).

Next, we study the asymptotic behavior as n → ∞. Let y be the binary response and $\vec c$ the corresponding basis coefficients for a given subject. Let the true model be of the form
\[
\mu_0(c) = \frac{ \exp\!\left( \sum_{j=1}^{p_n} c_j^T\beta_j \right) }{ 1 + \exp\!\left( \sum_{j=1}^{p_n} c_j^T\beta_j \right) } = \psi\!\left( \sum_{j=1}^{p_n} c_j^T\beta_j \right),
\]
where each β_j is a q × 1 vector and p_n (increasing with n) is the number of group vectors present in the model. As described by Jiang, 2006[62], we assume the data dimension satisfies 1 ≺ p_n and log(p_n) ≺ n, where a_n ≺ b_n means a_n = o(b_n), i.e. lim_{n→∞} a_n/b_n = 0. We assume sparsity of the regression coefficients at the group level, i.e. $\lim_{n\to\infty}\sum_{j=1}^{p_n}\|\beta_j\|_2 < \infty$, which implies that only a limited number of group coefficients are nonzero. We further assume ‖c_j‖_2 ≤ 1, j = 1, ..., p_n, for simplicity.

We assume n i.i.d. observations, $D_n = \left( \vec c_1, \dots, \vec c_{p_n}, y_i \right)_{i=1}^{n}$, and write $f_0 = \mu_0^{y}(1 - \mu_0)^{1-y}$. Before moving on to the results, we define the posterior estimator of the true density f_0 and the posterior estimate of µ_0 as
\[
\hat f_n(y, c) = \sum_{\gamma}\int_{\beta_\gamma} f(y, c \mid \gamma, \beta_\gamma)\,\pi_n(\beta_\gamma, \gamma \mid D_n)\,d\beta_\gamma,
\qquad
\hat\mu_n(c) = \sum_{\gamma}\int_{\beta_\gamma} \psi(c_\gamma^T\beta_\gamma)\,\pi_n(\beta_\gamma, \gamma \mid D_n)\,d\beta_\gamma.
\]
We define the classifier as $\hat C_n(c) = I[\hat\mu_n(c) > 0.5]$, and $\hat C_n(c)$ serves as the validation tool for our algorithm's performance. Next, we define consistency using Jiang's, 2006[62] description of the density function, measuring the distance between two density functions with the Hellinger distance
\[
d_H(f, f_0) = \sqrt{ \iint \left( \sqrt{f} - \sqrt{f_0} \right)^2 dx\,dy }.
\]
The definitions below are quoted from Jiang's, 2006[62] article.

Definition 1: "Suppose D_n is an i.i.d. sample based on density f_0. The posterior π_n(· | D_n) is asymptotically consistent for f_0 over Hellinger neighborhoods if for any ε > 0,
\[
\pi_n\!\left[ f : d_H(f, f_0) \le \epsilon \mid D_n \right] \xrightarrow{P} 1, \quad \text{as } n \to \infty \quad (\text{density consistency})."
\]
Next, we define consistency in classification, following Jiang, 2006[62], in terms of how the misclassification error $E_{D_n} P[\hat C_n(c) \neq y \mid D_n]$ approaches the minimal error $P[C_0(c) \neq y]$, where $C_0(c) = I[\mu_0(c) > 0.5]$.

Definition 2: "Let $\hat B_n(c)$ be a classification rule obtained based on the observed data D_n. If $\lim_{n\to\infty} E_{D_n} P[\hat B_n(c) \neq y \mid D_n] = P[C_0(c) \neq y]$, then $\hat B_n(c)$ is called a consistent classification rule."

Combining Propositions 1 and 3 from Jiang, 2006[62], under conditions I, S, and L, density consistency directly implies classification consistency. The proof proceeds by checking conditions I, S, and L from Jiang's (2006) paper[62], since our prior fits his prior setup. To obtain density consistency and classification consistency for the posterior estimates, we therefore need to verify that our prior setup satisfies Jiang's conditions. The motivation for the proof and the technique of checking the conditions were discussed in the theses of Atreyee Majumder, 2017[81] and Guiling Shi, 2017[111].

Condition I (on the inverse link function ψ): "Denote by w(u) the log odds function w(u) = log[ψ(u)/(1 − ψ(u))]. The derivative of the log odds, w′(u), is continuous and satisfies the following boundedness condition as the size of the domain increases: $\sup_{|u| \le C} |w'(u)| \le C^{q}$ for some q ≥ 0, for all large enough C."

Condition S (for the prior π_n on a small approximation set): "There exists a sequence r_n increasing to infinity as n → ∞ such that, for any η > 0 and $\sum_{j \notin \gamma(r_n)}\|\beta_j\|_2 \prec \epsilon_n^2$, we have $\pi_n[\gamma = \gamma(r_n)] > e^{-cn\epsilon_n^2}$ and $\pi_n[\beta_\gamma \in M(r_n, \eta) \mid \gamma = \gamma(r_n)] > e^{-cn\epsilon_n^2}$, for all large enough n."

Condition L (for the prior π_n outside a large region): "There exist some $\bar r_n = o(n/\ln p_n)$, $\bar r_n \in [1, p_n]$, and some $C_n$ satisfying $C_n^{-1} = o(1)$ and $\ln C_n = o(n/\bar r_n)$, such that for some c > 0, $\pi_n[|\gamma| > \bar r_n] \le \exp(-cn\epsilon_n^2)$, and $\pi_n\!\left[ \bigcup_{j:\gamma_j = 1}\left\{ \|\beta_j\|_2 > C_n \right\} \,\Big|\, \gamma \right] \le \exp(-cn\epsilon_n^2)$ for all $|\gamma| \le \bar r_n$, for all large enough n."

We have checked these conditions; the corresponding proofs are given in the appendix.

4.6 Simulation results

We assess the performance of our proposed simultaneous classification and selection methodology on simulated data sets and then apply the method to both simulated and real data. We compare the results from our Bayesian method with those from a frequentist group lasso selection method for binary response. To the best of our knowledge, no other Bayesian method reported in the literature is as convenient and efficient as the presently proposed method. The following section reports the method testing through three different examples with varying numbers of predictors. We generate a binary response with simulated functional predictors; a significant number of the predictors are inessential.
Example 1: We first generate functional predictors x_{ij}(t) using a 10-dimensional Fourier basis, φ_0(t) = 1 and φ_k(t) = √2 cos(kπt), k = 1, ..., 9, adding an error term. We work with a simulation setup similar to the one in Fan et al.[33], whose model is also based on functional predictors. We generate our predictors as follows:
\[
x_{ij}(t_k) = \varphi(t_k)^T\theta_{ij} + \epsilon_{ijk}, \qquad \epsilon_{ijk} \sim N(0, \sigma^2), \qquad \theta_{ij} \sim N_{10}(0, I),
\]
where φ(t_k) = (φ_0(t_k), φ_1(t_k), ..., φ_9(t_k))′. We take σ = 0.5 and generate 200 i.i.d. observations using 20 functional predictors. Each predictor is observed at 50 time points, equally spaced between 0 and 1, and θ_{ij} and ε_{ijk} are sampled independently. Notationally, i varies from 1 to 200, j from 1 to 20, and k from 1 to 50. We construct a cubic basis spline on (0 = t_1, ..., t_50 = 1) with four internal knots equally spaced at the 20%, 40%, 60%, and 80% quantiles. We use the R package splines and the bs function to construct the basis matrix Φ. With 4 internal knots, plus intercept and degree = 3, we end up with eight columns in the basis matrix for each predictor, i.e. q = 8. To validate classification and selection performance, we use 75% of the observations as training data and the remaining 25% for testing. We repeat this process 100 times to limit sampling bias and concatenate the results over the 100 repetitions.

Examples 2 and 3: In Example 2, we increase the number of predictors from 20 to 50 while maintaining 200 observations with 50 time points each. The functional predictor generation in Example 3 follows the same method as in Example 1, but with 500 observations, 100 functional predictors, and 20 time points per observation; we use three internal knots to smooth the predictors. In both cases, we choose the second and final predictors, i.e. β_2(t) and β_p(t), to be nonzero, and the rest of the coefficients are zero. We generate the binary response y ∈ {0, 1} from a Bernoulli distribution using the pre-assigned β. In all of the examples, 75% of the data is used for training, and 100 repetitions are used to limit sampling bias. We obtain 20,000 Gibbs samples, discarding the first one-third as a burn-in period; all parameter estimates are obtained from the remaining samples. As Xu and Ghosh, 2015[128] showed that median thresholding gives exactly zero estimates for the redundant group coefficients, we apply the posterior median to the posterior samples to obtain the β estimates. We choose a = 1, b = 1 as the initial parameter values for the prior distribution of π_0, and β = 0 is used as the starting value for the first iteration. Although we have p functional predictors, the number of coefficients to estimate is p·q. In Example 2, we have p = 50, and with four internal knots for each function we obtain q = 8; hence we need to estimate 400 coefficients using 200 observations. From this perspective, our algorithm is applicable to "large p, small n" conditions. The simulation results are presented below.

Example 1 results: We obtain a 100% true positive rate and a 0% false positive rate in terms of selection, i.e. the two nonzero coefficients are captured in all 100 iterations, and none of the predictors whose true coefficients are zero is selected. In terms of classification, our method shows 97% sensitivity, 93% specificity, 95% accuracy, and AUC = 0.99.
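For concreteness, the following is a minimal R sketch (illustrative only; the nonzero coefficient values are assumptions) of a data-generating process along the lines of Example 1, together with the B-spline basis coefficients that feed the classifier.

library(splines)

set.seed(1)
n <- 200; p <- 20; n_t <- 50; sigma <- 0.5
t_grid <- seq(0, 1, length.out = n_t)

# 10-dimensional Fourier basis: phi_0(t) = 1, phi_k(t) = sqrt(2) * cos(k * pi * t), k = 1,...,9
Phi_fourier <- cbind(1, sapply(1:9, function(k) sqrt(2) * cos(k * pi * t_grid)))

# x_ij(t_k) = phi(t_k)' theta_ij + eps_ijk
X_fun <- array(NA_real_, dim = c(n, p, n_t))
for (i in 1:n) for (j in 1:p) {
  X_fun[i, j, ] <- as.vector(Phi_fourier %*% rnorm(10)) + rnorm(n_t, 0, sigma)
}

# Cubic B-spline basis with four interior knots (q = 8 columns, intercept included)
B <- bs(t_grid, knots = quantile(t_grid, c(.2, .4, .6, .8)), degree = 3, intercept = TRUE)
q <- ncol(B)
proj <- solve(crossprod(B), t(B))            # (B'B)^{-1} B', the least squares smoother in (4.3)

# Basis coefficients c_ij stacked group-wise: row i holds (c_i1', ..., c_ip')
C <- matrix(NA_real_, n, p * q)
for (i in 1:n) for (j in 1:p) {
  C[i, ((j - 1) * q + 1):(j * q)] <- proj %*% X_fun[i, j, ]
}

# Binary response with only groups 2 and p truly nonzero (illustrative values)
beta_true <- rep(0, p * q)
beta_true[(q + 1):(2 * q)] <- 1
beta_true[((p - 1) * q + 1):(p * q)] <- -1
y <- rbinom(n, 1, plogis(as.vector(C %*% beta_true)))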
Below are the rejection probability plots for β_2(t) and β_1(t), of which the first is nonzero and the second is zero in the true model. In addition, we plot the posterior median estimates of the coefficient function against its true values. The ROC curve establishes the discriminating power of our method.

Example 2 and 3 results: In Example 2, we obtain a 100% true positive rate and a 0.73% false positive rate over 100 repetitions, with 97% sensitivity and 95% specificity. In Example 3, we achieve a 100% true positive rate and a 0% false positive rate, with 98% sensitivity and 97% specificity. We compare our simulation results with those of the frequentist group lasso for logistic regression under all the setups above. Our methodology yields the best results in terms of classifying subjects into the correct class, far exceeding the frequentist group lasso. Although the frequentist group lasso successfully identifies the truly significant predictors, it also selects many redundant functional predictors that have zero effect in the true model; its false selection rate is very high compared with that of our algorithm. Table 4.1 summarizes the numerical results of all three examples, with comparisons to the frequentist group lasso for logistic regression.

Figure 4.1: Plots based on Example 1.

Table 4.1: Classification and selection performance

Bayesian classification with Bayesian group lasso
                                 Sensitivity    Specificity    TPR     FPR             -2 Log likelihood
Example 1 (n=200, p=20, t=50)    0.97 (0.01)    0.93 (0.01)    1 (0)   0 (0)           3.65
Example 2 (n=200, p=50, t=50)    0.97 (0.01)    0.95 (0.01)    1 (0)   0.0073 (0.05)   0.219
Example 3 (n=500, p=100, t=20)   0.98 (0.001)   0.97 (0.01)    1 (0)   0 (0)           6.87

Logistic regression with frequentist group lasso
                                 Sensitivity    Specificity    TPR     FPR             -2 Log likelihood
Example 1 (n=200, p=20, t=50)    0.92 (0.01)    0.86 (0.01)    1 (0)   0.114 (0.04)    69.23
Example 2 (n=200, p=50, t=50)    0.81 (0.01)    0.88 (0.01)    1 (0)   0.34 (0.05)     53.66
Example 3 (n=500, p=100, t=20)   0.91 (0.001)   0.94 (0.001)   1 (0)   0.05 (0.02)     190.23

Notes: Simulation result comparisons between the Bayesian and frequentist methods.

4.7 Application to ADNI MRI data

This section reports the results of applying our proposed method to ADNI data. The MRI data used in all analyses were downloaded from the ADNI database (http://www.adni-info.org/). The fundamental goal of ADNI is to develop a large, standardized neuroimaging database with strong statistical power for research on potential biomarkers of AD incidence, diagnosis, and disease progression. ADNI data available at this time include three projects: ADNI-1, ADNI-GO, and ADNI-2. Starting in 2004, ADNI-1 collected prospective data on cognitive performance, brain structure, and biochemical changes every 6 months; participants included 200 CN, 200 MCI, and 400 AD patients. Then, starting in 2009, ADNI-GO continued the longitudinal study of the existing patients from ADNI-1 and established a new cohort of early MCI patients, enrolled to identify biomarkers manifesting at earlier stages of the disease. ADNI-GO and ADNI-2 together contain additional MRI sequences plus perfusion and diffusion tensor imaging. The volumetric estimation for our data set was performed with FreeSurfer by the UCSF/SF VA Medical Center.

Considerable research has been conducted to develop automatic approaches for patient classification into different clinical groups, with many ADNI studies identifying ROIs associated with different disease stages.
A support vector machine (SVM) is a primary tool in many studies: it learns the patterns in training data sets and creates classifiers to identify new patients. Fan et al.[34] used neuroimaging data to create a structural phenotypic score reflecting brain abnormalities associated with AD; in classifying AD vs CN, a positive score in their framework identified AD-like structural brain patterns. Their classifier obtained 94.3% accuracy for AD vs CN, although their approach used only left and right whole brain volumes as potential predictors. Some researchers have used Bayesian statistical methods to study Alzheimer's data. Shen et al.[110] employed a sparse Bayesian learning method, which they named automatic relevance determination (ARD) and predictive ARD, to classify AD patients; this method outperformed an SVM classifier. Yang et al.[130] proposed a data-driven approach to the automatic classification of MRI scans based on disease stages. Their methodology was broadly divided into two parts: first, they extracted potentially classifying features from normalized MRI scans using independent component analysis; next, the separated independent coefficients were used for SVM classification of patients. In contrast to this approach, our proposed method selects important components and classifies patients simultaneously. Moreover, we consider multiple brain sub-regions to identify those regions whose longitudinal trajectories are specifically related to AD. Another seminal paper, by Jack et al.[58], used MRI-based measurements of hippocampal volume to assess the future risk of conversion from MCI to AD; a bivariate model included hippocampal volume and other factors like age and APOE genotype, but only hippocampal volume was identified as significant. Wang et al.[123] employed a functional modeling approach using Haar wavelets and lasso regularization to find ROIs in voxel-level data; in that approach, large Haar wavelet coefficients were related to the most important features, with a sparse structure among redundant features. The majority of these methods are based on SVM classification, which often uses kernel-based methods for functional smoothing. Casanove et al.[19] utilized a penalized logistic regression approach and calculated estimates using coordinate-wise descent optimization techniques from the GLMNET library. Similarly, our method employs penalized logistic regression with a group lasso penalty; however, our approach differs in its use of functional predictors and a custom algorithm developed in-house.

We consider the longitudinal volume of various brain regions, such as the parahippocampal gyrus, cerebellar cortices, entorhinal cortex, fusiform gyrus, and precuneus, among many others. Although the accessed ADNI data set includes corresponding volume, surface area, and cortical thickness information, we work only with the volume information to maintain uniformity across longitudinal predictors. Because the brain is divided into right and left hemispheres, the data include sub-regional brain volumes for both hemispheres. Our main objective is to identify the brain sub-regions whose volumetric trajectories can differentiate AD patients from the normally aging control group. As mentioned in the introduction, dementia is associated with widespread brain atrophy, although the time course and magnitude of shrinkage vary across regions.
The initial sample includes data from 761 patients in the ADNI database, classified as AD, MCI, or CN throughout their visits for the study. We exclude all patients classified as MCI, as well as any AD or CN patients whose diagnostic status changed over time, because our model assumes that the response does not depend on time. Of the remaining patients, we include those with data from at least four longitudinal measurement occasions. This yields 296 patients who have at least four data points and unchanging diagnoses of either AD or CN. The final sample is composed of 174 AD patients and 122 normally aging controls.

Table 4.2: Patients Baseline Characteristics

                        AD              CN              p-value
n                       174             122
Age (Mean ± sd)         74.76 ± 7.23    75.61 ± 5.45    0.25
Gender (F/M)            69/105          55/67           0.35
MMSE (Mean ± sd)        25.43 ± 2.40    28.96 ± 1.17    <.0001
ADAS11 (Mean ± sd)      14.78 ± 5.44    5.95 ± 2.94     <.0001
APOE (+/-)              119/55          30/90           <.0001

Notes: Comparison of baseline age, gender ratio, MMSE score, ADAS11 score, and APOE ratio between the AD and CN groups.

All patients underwent a thorough initial clinical evaluation to measure baseline cognitive and medical scores, including MMSE, the 11-item Alzheimer Cognitive Subscale (ADAS11), and other standardized neuropsychological tests. In addition, APOE genotyping information was obtained from patients at baseline. Longitudinal structural MRI scans were parcellated into sub-regional brain volumetric measurements. Our initial model includes 49 sub-regional brain volumes chosen by Dr. Andrew Bender based on the extant literature regarding atrophy patterns in AD. Although these 49 sub-regions are not assumed to change with uniform magnitude, the direction of change over time is hypothesized to be consistent (i.e., shrinking). Thus, the model includes 49 longitudinal predictors that we treat as functional predictors. We assume that not all predictors are potential candidates for classifying patients, so the sparsity assumption is valid. However, because some patients' visits were irregular, we do not have an equal number of time points across patients. We start by comparing the baseline measurements between the AD and CN groups, as shown in Table 4.2.

In the next stage, we smooth the longitudinal trajectories of the observed volumes for all brain sub-regions. A simple least squares approximation is sufficient, as we assume that the residuals around the true curve are independently and identically distributed with mean 0 and constant variance. We use cubic B-spline basis functions for spline smoothing of the observed volumes; three internal knots are used for spline smoothing with an intercept, which gives us seven basis functions. We seek to ensure that the smoothed estimated curve is a good fit for each patient's observed curve. As we do not have a large number of data points for each patient, we do not consider controlling for potential overfitting of the estimated curve. Besides least squares smoothing, functional principal component scores could also be used for this analysis.

Prior to analysis, we scale the brain volumes to the corresponding patient's intracranial volume (ICV) by fitting a simple regression that adjusts the volume measurements for individual differences in overall brain size. The aim is to remove systematic variation in brain volumes due to differences in physical size. The formula we use is
\[
\mathrm{ROI}_{adj} = \mathrm{ROI}_{vol} - \beta_0\,(\mathrm{ICV} - \mathrm{ICV}_{mean}),
\]
where β_0 is the regression coefficient obtained by regressing ROI_vol on ICV [103][59].
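A minimal R sketch of this ICV adjustment (column names are hypothetical; the correction is applied separately within each gender group, as described next) is:

adjust_for_icv <- function(roi_vol, icv) {
  b0 <- coef(lm(roi_vol ~ icv))["icv"]            # slope from regressing ROI volume on ICV
  as.numeric(roi_vol - b0 * (icv - mean(icv)))    # ROI_adj = ROI_vol - b0 * (ICV - mean ICV)
}

# Hypothetical usage for one ROI, followed by rescaling to [0, 1]:
# dat$hipp_adj <- adjust_for_icv(dat$hipp_vol, dat$icv)
# dat$hipp_adj <- (dat$hipp_adj - min(dat$hipp_adj)) / diff(range(dat$hipp_adj))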
We adjust the volumes using the above method separately within each gender group, male and female. Next, we scale the corrected volumes between 0 and 1 to bring all brain regions onto the same scale. We then divide the data set into two parts: two-thirds of the patients are reserved for the training data set (n = 198), and the rest are kept for testing (n = 98). We gather the basis coefficients for each patient in the training data set and use them as predictors for classification. We initialize β at zero to start the iterations, and π_0 is given a Beta(a, b) distribution with a and b both set to 1. As a first step, we estimate λ on the training data using the Pólya-gamma-augmented model with the spike-and-slab penalty. After estimating λ, we run the remainder of the algorithm on the training data with 30,000 MCMC samples, discarding the first one-third as a burn-in period. The spike-and-slab prior on the β coefficients translates into posterior estimates of exactly zero for most of the functional predictors. We run our model 100 times with different training samples to reduce sampling bias in the training and test data. Across the 100 repetitions, the model does not select many of the brain sub-regions consistently or uniformly; we therefore retain the brain regions that frequently appear as significant. The median thresholding selects the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate gyrus with 100% probability. Other brain regions selected as important are the right parahippocampal gyrus, left caudate nucleus, left medial orbitofrontal cortex, left putamen, left superior temporal gyrus, left thalamus, right hippocampus, and right middle temporal gyrus. In Figure 4.2, we plot the brain volume changes of the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate gyrus over time; orange and green signify the normal aging and dementia groups, respectively, and the bold thick line represents the mean curve for the corresponding group. The plot shows clear volume differences between the groups, and our model identifies these regions as significant. In Figure 4.3, we plot the acceptance probability of the MCMC sample for the left hippocampus and left lateral orbitofrontal brain regions.

The method classifies patients into the correct group with 77% accuracy, achieving 72% sensitivity, 85% specificity, and a corresponding AUC of 0.87. We use the median predicted probability from the training sample as the threshold for classification validation. We also test the classification after adding clinical measurements such as ADAS11 (the 11-item Alzheimer Cognitive Subscale), MMSE, "CDRSB," "RAVLT immediate," and "RAVLT forgetting," measured over time. In this classification, we first select the longitudinal brain volumes that are significant and then add the clinical variables. We achieve very high classification measures of 97% accuracy, 97% sensitivity, and 98% specificity. If we drop the MMSE score and run the model with the remaining functional predictors, we observe similar classification accuracy. In all scenarios, we code diseased patients as 1 and CN as 0 for the interpretation of classification sensitivity and specificity.
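The validation metrics used throughout can be computed as in the following minimal R sketch (variable names are hypothetical; the AUC is computed via the rank-based Mann-Whitney formulation), where p_hat holds predicted probabilities and y the 0/1 labels with AD coded as 1:

classify_metrics <- function(p_hat, y, threshold = 0.5) {
  pred <- as.integer(p_hat > threshold)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)   # proportion of AD correctly identified
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)   # proportion of CN correctly identified
  acc  <- mean(pred == y)
  r    <- rank(p_hat)                             # AUC from the Wilcoxon / Mann-Whitney statistic
  n1   <- sum(y == 1); n0 <- sum(y == 0)
  auc  <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  c(sensitivity = sens, specificity = spec, accuracy = acc, AUC = auc)
}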
In addition to using functional models of longitudinal trajectories in sub-regional brain volumes to differentiate between the AD and normal groups, we also apply our method to MCI converters vs. MCI nonconverters. We select patients who entered the study as MCI, and we assign the label MCI nonconverter (MCI-nc) to those who did not transition to AD across all measurement occasions and the label MCI converter (MCI-c) to any who did transition to AD. The total subsample includes 163 patients who were either MCI-c or MCI-nc. We use three-quarters of the patients to train our model. We note the significant brain ROIs that are selected after 100 iterations. Among the selected ROIs that contribute to classification are the right posterior cingulate gyrus, right superior parietal cortex, right thalamus, right isthmus cingulate gyrus, right fusiform gyrus, left thalamus, and left precuneus. However, the classification performance is not as good as that of the previous model: 62% accuracy and 0.66 AUC. The biological explanation for this result is critical to acknowledge: the mean difference in the functional predictors between MCI-c and MCI-nc is not large enough to segment the patients. Moreover, we also neglect some time points' data for this set of patients.

Figure 4.2: Brain volume changes of the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate over time for normal and dementia patients

Figure 4.3: Acceptance probability of the MCMC sample for the left hippocampus and left lateral orbitofrontal brain regions

Figure 4.4: Pictorial representation of selected brain ROIs discriminating the diseased group from normal controls: (a) left hippocampus, (b) posterior cingulate, (c) middle temporal gyrus, (d) left lateral orbitofrontal cortex (plots obtained from on-line resources)

4.8 An alternate modeling proposal:

It is always important in penalization setups to protect the estimation of strong signals. We should control over-shrinkage of the selected group coefficients so that they incur less bias. Hence, we propose a horseshoe prior (Carvalho et al., 2009[17]) within the spike-and-slab structure. This setup provides a heavy-tailed distribution for the slab part. We propose a normal prior for the slab with two hyper-parameters controlling local and global shrinkage. We compare the corresponding penalization on the group coefficients with the penalization proposed in the previous section, and we contrast the results of applying both methods to the AD patients. Below, we report an alternative exploratory setup:

β_j | λ_{j1}, ..., λ_{jq}, τ_j, π_0 ∼ (1 − π_0) N_q(0, τ_j^2 D_{Λ_j}) + π_0 δ_0(β_j),   j = 1, ..., p,
λ_{jk} ∼ C+(0, 1),   k = 1, ..., q,      τ_j ∼ C+(0, 1),   j = 1, ..., p,        (4.14)

where D_{Λ_j} = diag(λ_{j1}^2, ..., λ_{jq}^2). We can also develop an efficient Gibbs sampler to update the model parameters from the corresponding full conditional distributions:

β_j | rest ∼ (1 − l_j) N_q(β̄_j, Σ_j) + l_j δ_0(β_j),   j = 1, ..., p,
Σ_j = ( C_j^T Ω C_j + (τ_j^2 D_{Λ_j})^{-1} )^{-1},
β̄_j = Σ_j C_j^T Ω ( z − C_{(j)} β_{(j)} ),
l_j = π_0 / [ π_0 + (1 − π_0) (τ_j^2)^{-q/2} |D_{Λ_j}|^{-1/2} |Σ_j^{-1}|^{-1/2} exp{ (1/2) β̄_j^T Σ_j^{-1} β̄_j } ].        (4.15)

It is also important to sample the global and local scale parameters λ_{jk} and τ_j efficiently. Polson et al., 2014[98] provided an efficient slice sampling scheme that can be adapted to our setup. We define α_{jk} = λ_{jk}^{-2}, sample u_{jk} | α_{jk} ∼ Uniform(0, 1/(1 + α_{jk})), and then sample α_{jk} | u_{jk} from an exponential distribution with mean 2τ_j^2/β_{jk}^2, truncated to the interval (0, (1 − u_{jk})/u_{jk}).
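Assuming the exponential above is parameterized by its mean, a small R sketch of this slice update for a single local scale is given below; β_jk and τ_j denote the current values in the Gibbs sweep, and the function names are illustrative.

```r
## Sketch of the slice-sampling update for one local scale lambda_jk in the
## horseshoe slab, working on alpha_jk = lambda_jk^(-2).
update_alpha <- function(alpha_jk, beta_jk, tau_j) {
  u_jk  <- runif(1, 0, 1 / (1 + alpha_jk))   # u_jk | alpha_jk ~ Uniform(0, 1/(1+alpha_jk))
  rate  <- beta_jk^2 / (2 * tau_j^2)         # exponential with mean 2*tau_j^2/beta_jk^2
  upper <- (1 - u_jk) / u_jk                 # truncation induced by the slice
  p_max <- 1 - exp(-rate * upper)            # inverse-CDF draw from the truncated exponential
  alpha_new <- -log(1 - runif(1) * p_max) / rate
  alpha_new                                   # lambda_jk can be recovered as alpha_new^(-1/2)
}
```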
4.9 Discussion:

This chapter discusses the use of Bayesian group lasso penalization combined with Pólya-Gamma augmentation to build a simultaneous classification and selection method. The Bayesian spike-and-slab prior helps identify functional parameters generated from longitudinal trajectories of multiple brain ROIs and discriminates the patient group from normal controls. The inclusion of Pólya-Gamma augmentation avoids the Metropolis-Hastings algorithm and other expensive sampling algorithms related to latent variables. We consider the longitudinal brain ROI volume measurements as functional predictors, and cubic basis splines smooth the curves over time. The smoothed functional predictors are then used as discriminating inputs under a sparsity assumption. Our method performs very well on simulated data sets, outperforming available frequentist methods. Furthermore, our method is applied to a data set that has a large number of predictors.

We have one more strong assumption for the functional predictors: namely, that they are independent of each other. As our prior setup does not account for any correlation structure among predictors, this method will not work well for highly correlated functional predictors. To address this limitation, we also proposed an alternative prior structure in Section 4.8, but we did not compare its results to those of our original method. This will serve as the basis for further research.

CHAPTER 5
BAYESIAN SPATIOTEMPORAL CLUSTERING MODEL FOR ANALYZING WHITE MATTER DATA

5.1 Spatiotemporal linear mixed effects modeling for the distribution of cerebral white matter on MRI scans

5.1.1 Introduction

Bernal-Rusiel et al., 2013[12] introduced a linear mixed effects (LME) model for mass-univariate analysis of longitudinal neuroimaging data. They exploited the spatial structure of cortical thickness in the brain's sub-regions using a spatiotemporal setup. Before that, Bernal-Rusiel et al., 2012[11] used a similar LME approach, but for purely longitudinal modeling; that work built the base for the extension to spatial information. We seek to develop a similar spatiotemporal model with a different research question and a completely distinct application. Spatiotemporal modeling has not been studied extensively in the literature on longitudinal white matter (WM) changes in healthy aging with diffusion tensor imaging (DTI). Bender et al., 2016[10] compared age-related changes across brain regions and identified the influences of age and baseline vascular risk on WM diffusion. They conducted a seven-year follow-up study to establish that diffusion properties in association WM tracts deteriorated with age. Due to the limited study of spatiotemporal associations with changes in WM integrity measured by fractional anisotropy (FA), we work with data from healthy middle-aged and older adults. Some authors have reported evidence of brain region shrinkage with aging (Raz and Rodrigue, 2006[104]; Raz and Kennedy, 2009[102]). In contrast, very little is known about age-related changes in the structural properties of cerebral WM, which vary at different rates in various parts of the brain. Diffusion tensor imaging provides measures derived from the diffusion tensor eigenvalues, one of which is fractional anisotropy.
According to Montag and Reuter, 2015[92], "Fractional anisotropy (FA) is a scalar value between zero and one that describes the degree of anisotropy of a diffusion process. A value of zero means that diffusion is isotropic, i.e. it is unrestricted (or equally restricted) in all directions. A value of one means that diffusion occurs only along one axis and is fully restricted along all other directions."

The literature on longitudinal neuroimaging analysis is vast; in the following, we mention some of the significant studies from the last few years. Asami et al., 2011[5] studied longitudinal loss of gray matter among schizophrenia patients and established a clinical correlation with gray matter decay. They used high-dimensional warping and individualized baseline-rescan processing to measure longitudinal volume changes within subjects, and they performed a comparison with time-varying manual ROI analysis on the same subjects. Chetelat et al., 2005[25] found the highest rates of atrophy in the anterior, inferior, middle, and medial parts of the temporal lobes in their study of the dynamics of gray matter changes for patients transitioning from mild cognitive impairment (MCI) to Alzheimer's disease (AD). On the other hand, Fjell et al., 2009[35] worked with healthy aging subjects, similar to those in our experiment. They found that changes in brain structure can occur within one year and that atrophy accelerates with advancing age. Although longitudinal studies have yielded some novel discoveries, they have not answered all questions regarding brain image data, and most existing statistical methods are not rich enough to capture the real underlying process. Bernal-Rusiel et al., 2012[11] implemented an LME model to analyze the cortical thickness of brain sub-regions in Alzheimer's patients. They performed hypothesis testing of the regression coefficients, calculated sample sizes, and derived statistical power. In longitudinal studies, researchers broadly focus on two methodologies: the first is repeated-measures analysis of variance, and the second is cross-sectional analysis of aggregated data. We previously conducted a separate study with longitudinal data by assuming brain volume changes to be a function of time, and we used the estimated functions as potential predictors for classification; the manuscript is under review.

The aim of this article is to apply an LME model to longitudinal data where spatial information is considered as an additional effect. The LME method properly captures the covariance structure of brain voxels for serial measurements. It also has the ability to impute missing responses for any time point. We work with the balanced data case, but the extension of the methodology to unbalanced data is straightforward. We obtain our data in the NIfTI image format, and each voxel of the image has location coordinates. We want to find regions in the brain that have homogeneous properties, which will help us to track region-wise longitudinal changes instead of performing voxel-wise analysis. Several previous studies have examined longitudinal changes for voxel-level data (Zipunnikov et al., 2011[137], Zhang et al., 2009[135], Skup et al., 2012[114], Shinohara et al., 2011[112]). However, most of these works employ voxel-based analysis and do not account for spatial information. Bernal-Rusiel et al., 2013[12] apply their work to a different problem with cortical thickness measurements, but their covariance structure works nicely for our WM DTI data as well.
As our motivation is to work with homogeneous brain regions and to track changes in fractional anisotropy WM measurements in relation to subjects' demographic and clinical factors, we work with the K-means algorithm. We consider each voxel's location, measurements, and predictive response values as clustering attributes for voxel-wise analysis. The homogeneous regions provide scope to build a spatiotemporal model separately for each cluster. This strategy has several advantages. Firstly, it is a highly parsimonious way of building models, such that the number of estimable parameters is much smaller than in voxel-wise models. Secondly, voxel-level models do not consider the spatial structure in the data, which implies that their estimators are less informative and have lower statistical power. Finally, we are able to identify brain regions that behave similarly in terms of WM atrophy as healthy aging people grow older.

This chapter is organized as follows. In part 1 of Section 2, we describe voxel-wise model building, estimation of the fixed effect parameters, and the temporal covariance structure with respect to subjects' time-variant random effects. We calculate residual values for each voxel, which are used as an attribute in the clustering algorithm. We describe the agglomeration method in part 2, and finally, we apply the spatiotemporal mixed effect model in part 3. We perform separate inference in part 3 for region-wise reasoning. Section 3 then describes the WM data and how we obtained it from subjects. Results are described in Section 4, while a final discussion is presented in Section 5.

5.1.2 Methods

5.1.2.1 Voxel-wise linear mixed effect model

We start modeling with a voxel-wise LME model fit to the WM MRI data. In a longitudinal study, the outcome variable (for instance, the fractional anisotropy of cerebral WM in our data) is measured repeatedly for the same individual a number of times. The aim is to measure changes in the patients' response over time and their association with demographic and biological factors, such as gender, hypertension, MMSE score, and age at entry to the study. If we fix a spatial location, i.e., a particular voxel, we can apply a single LME independently for that voxel. One advantage of this modeling is that it can assess within-subject changes and correlation across different time points. Voxel-wise LME models have been used in various studies. For example, Bowman and Kilts, 2003[16] implemented a mixed effect model to analyze PET data. They achieved improved estimation and inference by considering various covariance structures to handle correlation among individuals' repeated scans. Bernal-Rusiel et al., 2013[12] worked with a voxel-wise mixed effect model; they analyzed longitudinal brain MRI data from ADNI. Other significant papers on these methods include Delaloye et al., 2008[28], Lau et al., 2008[71], and Shaw et al., 2008[109]. A longitudinal study in a spatiotemporal setup offers a distinctive understanding; for example, it reflects the temporal trajectory of an underlying non-stationary continuous process. The correlation between pairs of repeated measures depends on the distance between consecutive time points, and between-subject variations are not constant, which gives us scope to assume a subject-specific covariance structure. In general, we observe three different sources of variability in the covariance structure for voxel-specific longitudinal measurements.
A particular voxel's WM measurements have (i) between-subject variability, (ii) within-subject changes over time, and, finally, (iii) measurement error on top of the other two sources of variation. The voxel-wise LME model imposes structure on the covariance matrix by introducing random effects. All of the above variations are handled by this structure except the measurement error, although we believe that measurement error has a direct influence on the covariance between longitudinal measurements within a subject. The mean trajectory of WM is modeled with subject-specific fixed effects and time-variant random effects. In this research article, we consider working with balanced, normally distributed data.

Now, let Y_i be the T × 1 vector of serial univariate measurements for a particular voxel for subject i, where T is the number of time points. X_i is the T × p subject-specific design matrix built with fixed effects (variables such as gender, hypertension indicator, MMSE score, baseline age, CESD score, etc.). Let Z_i be the T × q design matrix of random effects (e.g., systolic and diastolic pressure, measurement time, etc.). We consider time-variant predictors as the random effects and the intercept as a fixed effect. Hence, we can write the model as follows:

Y_i = X_i β + Z_i u_i + ε_i,   i = 1, ..., M,
u_i ∼ N_q(0, σ_r^2 D_r),   ε_i ∼ N_T(0, σ^2 I_T),        (5.1)

where

D_r =
| 1        ρ        ...  ρ^{q-1} |
| ρ        1        ...  ρ^{q-2} |
| ...      ...      ...  ...     |
| ρ^{q-1}  ρ^{q-2}  ...  1       |

and X_i = 1_T ⊗ x_i^T. The components of u_i control the deviation from the population mean for the i-th subject. As mentioned above, this model structure handles the variability present in longitudinal data, and ε_i captures the remaining measurement error. Finally, we assume that we have M subjects/patients. The subject-specific mean of Y_i given u_i is E(Y_i | u_i) = X_i β + Z_i u_i; thus, the vector u_i together with the fixed effects provides subject-specific coefficients. The β regression coefficients are the same for all individuals. The initial motivation for writing down the single-voxel longitudinal model was to find the temporal covariance matrix for the i-th subject:

Cov(Y_i) = Σ_i = σ_r^2 Z_i D_r Z_i^T + σ^2 I_T.        (5.2)

We can find unbiased estimates of the unknown parameters β, σ_r, ρ, σ by maximizing the (restricted) log-likelihood function.

Parameter estimation: In the following, we briefly describe the method used to estimate the unknown parameters mentioned above, as we use the voxel-wise estimates to segment the whole brain into homogeneous regions. We explain the segmentation method in the next section. Given the normality assumption on WM, the vector of measurements for a particular subject at a fixed voxel is

Y_i ∼ N(X_i β, Σ_i).

If we have the estimates σ̂_r, ρ̂, σ̂, the closed-form solution for β based on the MLE is

β̂ = ( Σ_{i=1}^M X_i^T Σ̂_i^{-1} X_i )^{-1} Σ_{i=1}^M X_i^T Σ̂_i^{-1} Y_i.

Once we maximize the restricted log-likelihood, we can obtain unbiased estimates σ̂_r, ρ̂, σ̂ (Verbeke and Molenberghs, 2000[122]):

l_ReML = − (1/2) Σ_{i=1}^M log|Σ_i| − (1/2) Σ_{i=1}^M (Y_i − X_i β̂)^T Σ_i^{-1} (Y_i − X_i β̂) − (1/2) log| Σ_{i=1}^M X_i^T Σ_i^{-1} X_i |.

The estimates can be obtained by using the expectation maximization (EM) algorithm, the Newton-Raphson algorithm, or Fisher's scoring algorithm. We employ Fisher's scoring algorithm to obtain the parameter estimates, and we need some initial estimates to start the algorithm.
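In practice, a voxel-wise fit of this kind can be approximated with standard mixed-model software. The sketch below uses nlme with an unstructured random-effects covariance instead of the structured D_r estimated by Fisher's scoring in the text, and the data frame and variable names are assumptions for illustration.

```r
## Sketch of a voxel-wise longitudinal fit in the spirit of model (5.1), using nlme.
## Columns assumed in voxel_df: fa (response), gender, htn, mmse, base_age, ed,
## smoker, sbp, dbp, and subject (identifier).
library(nlme)

fit_one_voxel <- function(voxel_df) {
  lme(fixed  = fa ~ gender + htn + mmse + base_age + sbp + dbp + ed + smoker,
      random = ~ sbp + dbp - 1 | subject,   # time-variant random effects, no random intercept
      data   = voxel_df,
      method = "REML")                      # restricted likelihood, as in the text
}

# residuals(fit_one_voxel(df_v), type = "response") then feeds the clustering step below
```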
5.1.2.2 Clustering voxels into homogeneous regions using the K-means algorithm

Our key aim is to build a spatiotemporal model for homogeneous regions in the brain. Homogeneous regions or clusters contain voxels that are similar in terms of cerebral WM, location in the brain, and quantitative relationship with the subjects' demographic/biological factors. Once we obtain homogeneous regions in the brain from the clustering algorithm, we can assume that the temporal variance is shared among all voxels present in a particular region. The K-means algorithm we implement is a general-purpose method and completely data-driven. Our goal is to partition the voxels into k uniform clusters C_1, C_2, ..., C_k such that (i) C_i ∩ C_j = φ for i ≠ j and (ii) the elements of C_i are homogeneous based on some conditions. We now list the attributes we use to run the K-means algorithm.

We first calculate the coefficient of variation of a particular voxel based on all the longitudinal measurements for that voxel. In our study, we have M × T cerebral WM observations for each voxel across all subjects, and we calculate CV_v = sd_v / mean_v for each voxel. We use the CV and the mean measurement for each voxel in the K-means algorithm. Next, we calculate the residuals for each voxel by fitting the mixed-effect model described in the last section. Once we estimate the parameters of the temporal covariance matrix and the fixed effect regression coefficients, it is easy to compute residuals for each voxel. In addition, we have one more piece of directional information for the FA measurements over voxels. This information helps us to collate voxels whose WM changes are in a uniform direction. Finally, we calculate the polar coordinates of the three-dimensional voxel locations from their Cartesian coordinates using the following formulas:

r = sqrt(x^2 + y^2 + z^2),   θ = tan^{-1}(y/x),   φ = cos^{-1}(z/r).

The K-means algorithm starts by defining a dissimilarity matrix as an input. In our data, we have measurements x_{ij} for i = 1, 2, ..., N and j = 1, 2, ..., m, where N is the number of active voxels in the brain with positive WM measurements and m is the number of attributes considered for a particular voxel. We use five attributes to cluster the voxels. We define the dissimilarity between voxels i and i' as

D(x_i, x_{i'}) = Σ_{j=1}^m d_j(x_{ij}, x_{i'j}).

The most common choices are the squared Euclidean distance, d_j(x_{ij}, x_{i'j}) = (x_{ij} − x_{i'j})^2, the Manhattan distance, |x_{ij} − x_{i'j}|, or a correlation-based measure:

ρ(x_i, x_{i'}) = Σ_j (x_{ij} − x̄_i)(x_{i'j} − x̄_{i'}) / sqrt( Σ_j (x_{ij} − x̄_i)^2 Σ_j (x_{i'j} − x̄_{i'})^2 ).

As shown in Hastie et al., 2009[47], if the attributes are standardized, then Σ_j (x_{ij} − x_{i'j})^2 ∝ 2(1 − ρ(x_i, x_{i'})). Hence, we use Euclidean distance to implement K-means clustering.

5.1.2.3 Cluster-based spatiotemporal linear mixed effect models

Spatiotemporal models that consider a temporal covariance matrix shared among spatial locations have been studied in the functional neuroimaging literature for a long time. Bowman, 2007[14] described a framework for spatiotemporal modeling by defining a functional distance metric in the model to perform voxel-wise inference. The distance metric reflected bonding between voxels based on functionality rather than proximity. The spatiotemporal model estimated temporal and spatial correlations in a given region of interest (ROI). Later, Bowman et al., 2008[15] proposed a spatial Bayesian hierarchical model for analyzing functional neuroimaging data.
They built the hierarchical model after an anatomical parcellation of the brain into 116 regions. Derado et al., 2012[29] also worked with a multivariate Bayesian hierarchical model, where a Gaussian prior was imposed on a voxel-specific population-level mean parameter separately for a baseline and a follow-up session. A Wishart prior on the inverse covariance matrix of the subject-specific measurements for a particular voxel then provides the estimators. Woolrich et al., 2004[127] presented a fully Bayesian modeling approach to handle spatiotemporal noise modeling and haemodynamic response function (HRF) modeling. They implemented a simultaneously specified autoregressive model in which the spatial noise structure remains the same over time. Although this method estimated the parameters precisely and captured the spatiotemporal dependency, it remained important to model the spatial covariance matrix among the locations of the voxels present in a particular cluster; the proposed voxel-wise model and prior did not account for inter-voxel correlation. The authors assumed that responsive voxels are spread over the whole brain and that the temporal covariance matrix is a scaled version of a global covariance structure matrix.

We aim to model our cerebral WM data under assumptions similar to those of Bernal-Rusiel et al., 2013[12], for two main reasons. First, we believe that the temporal covariance structure is likely to differ between different segments of the brain and therefore needs to be modeled separately for each cluster of voxels. Second, inter-voxel correlations depend on the locations of the voxels; for example, voxels in close proximity share similar behavior at different stages of various disease processes. The spatiotemporal covariance of the cerebral WM for a particular brain segment is a product of both the spatial covariance of the voxels present in that region and the temporal covariance of a particular patient, whose parameters differ between regions. The spatial structure presented in Bernal-Rusiel et al., 2013[12] is inspired by Bowman, 2007[14], who proposed the alternate strategy of using a spatiotemporal model to estimate spatial and temporal correlations in a given ROI. We assume a similar spatial covariance matrix, captured through a parametric matrix that models the dependency as an exponential transformation of the distance between voxels within a region/cluster.

Our assumption is that a temporal covariance structure is shared among voxels within a homogeneous region obtained from the K-means algorithm. The parametric covariance structure that controls the dependency between voxels in a region depends solely on the distance between those voxels. We consider a homogeneous parcellation of the brain consisting of regions g = 1, ..., G. Let g be the region/segment/cluster in which we are interested and v_g be the number of voxels present in the g-th cluster. Let Y_ig be the (T · v_g) × 1 vector of WM measurements for the i-th subject. As defined in Section (2.1), we have M patients, with each patient having T time points of observations (the balanced case). We can expand the Y_ig vector as

Y_ig = (Y_ig1^T, Y_ig2^T, ..., Y_igv_g^T)^T,

where Y_igv is the T × 1 vector of measurements of the v-th voxel for patient i. We assume a covariance matrix for Y_ig similar to that used by Bernal-Rusiel et al., 2013[12]:

cov(Y_ig) = W_ig = G_g ⊗ Σ_ig =
| G_g11 Σ_ig     G_g12 Σ_ig     ...  G_g1v_g Σ_ig   |
| G_g21 Σ_ig     G_g22 Σ_ig     ...  G_g2v_g Σ_ig   |
| ...            ...            ...  ...            |
| G_gv_g1 Σ_ig   G_gv_g2 Σ_ig   ...  G_gv_gv_g Σ_ig |
where ⊗ is the Kronecker product, and Σ_ig = σ_rg^2 Z_i D_rg Z_i^T + σ_g^2 I_T from equation (2) is the temporal covariance matrix for the i-th patient in the g-th cluster. G_g is the v_g × v_g spatial covariance matrix that accounts for the correlation between voxels present in the g-th cluster. Similarly, we use the spatial structure proposed in Bernal-Rusiel et al., 2013[12], which is empirically useful:

G_g =
| 1                                   exp(−a_g d_12 − b_g d_12^2)      ...  exp(−a_g d_1v_g − b_g d_1v_g^2) |
| exp(−a_g d_21 − b_g d_21^2)         1                                ...  exp(−a_g d_2v_g − b_g d_2v_g^2) |
| ...                                 ...                              ...  ...                             |
| exp(−a_g d_v_g1 − b_g d_v_g1^2)     exp(−a_g d_v_g2 − b_g d_v_g2^2)  ...  1                               |

where a_g, b_g ≥ 0 are unknown parameters and d_jk is the Euclidean distance between voxels j and k in the g-th cluster. Hence, the joint distribution of the Y_ig vector of WM measurements in the g-th cluster is

Y_ig ∼ N(X_ig β_g, W_ig),        (5.3)

where X_ig = I_v_g ⊗ X_i is a (v_g T) × (v_g p) matrix and β_g = (β_g11, ..., β_g1p, β_g21, ..., β_gv_g1, ..., β_gv_gp)^T is a (v_g p) × 1 stacked vector of fixed effects, which gives separate fixed effect estimates for each voxel present in the g-th cluster. Finally, we work with the restricted log-likelihood and Fisher's scoring method to estimate the parameters associated with the covariance matrix W_ig, i.e., (σ_gr, ρ_g, σ_g, a_g, b_g). The restricted log-likelihood is

l = − (1/2) Σ_{i=1}^M log|W_ig| − (1/2) Σ_{i=1}^M (Y_ig − X_ig β̂_g)^T W_ig^{-1} (Y_ig − X_ig β̂_g) − (1/2) log| Σ_{i=1}^M X_ig^T W_ig^{-1} X_ig |,

where β̂_g = [ Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} X_ig ]^{-1} Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} Y_ig is the generalized least squares estimator. One more advantage of segmenting the brain, i.e., clustering homogeneous voxels together, is that it reduces the number of parameters to estimate. In voxel-wise model building, we are required to estimate covariance parameters for each voxel, whereas here we estimate (σ_gr, ρ_g, σ_g, a_g, b_g) only for the G regions. Fisher's scoring algorithm calculates the partial derivatives and the expected information matrix at each iteration to update the parameter estimates:

θ^(k+1) = θ^(k) + I^{-1}(θ^(k)) (∂l/∂θ) |_{θ = θ^(k)}.

Let us define θ_g = (σ_gr, ρ_g, σ_g) and s_g = (a_g, b_g).
Next, define

M_igj = ∂Σ_ig/∂θ_gj,   H_g = Σ_{i=1}^M X_i^T Σ_ig^{-1} X_i,   F_gk = ∂G_g/∂s_gk,
Q_gjk = Σ_{i=1}^M X_i^T Σ_ig^{-1} M_igj Σ_ig^{-1} M_igk Σ_ig^{-1} X_i,   P_gj = Σ_{i=1}^M X_i^T Σ_ig^{-1} M_igj Σ_ig^{-1} X_i,

and let r_ig = Y_ig − X_ig β̂_g denote the residual vector. The individual entries of Fisher's information matrix for the g-th cluster are

I_g(θ_gj, θ_gk) = (v_g/2) Σ_{i=1}^M tr(Σ_ig^{-1} M_igj Σ_ig^{-1} M_igk) − (1/2) tr( H_g^{-1} (2 Q_gjk − P_gj H_g^{-1} P_gk) ),
I_g(θ_gj, s_gk) = (1/2) tr(F_gk G_g^{-1}) [ Σ_{i=1}^M tr(Σ_ig^{-1} M_igj) + tr(H_g^{-1} P_gj) ],
I_g(s_gj, s_gk) = ((M T − p)/2) tr(F_gj G_g^{-1} F_gk G_g^{-1}).

The partial derivatives of the restricted log-likelihood are

∂l/∂θ_gj = − (v_g/2) Σ_{i=1}^M tr(M_igj Σ_ig^{-1}) + (1/2) tr( H_g^{-1} P_gj ) + (1/2) Σ_{i=1}^M r_ig^T (G_g^{-1} ⊗ Σ_ig^{-1} M_igj Σ_ig^{-1}) r_ig,
∂l/∂s_gk = − ((M T − p)/2) tr(F_gk G_g^{-1}) + (1/2) Σ_{i=1}^M r_ig^T (G_g^{-1} F_gk G_g^{-1} ⊗ Σ_ig^{-1}) r_ig.

After we supply all the necessary terms, the iterations converge after the sixth or seventh step.

Table 5.1: Subject Characteristics

  Baseline variables    Women, Mean (sd)   Men, Mean (sd)   p-value
  Age                   62.60 (8.51)       63.62 (9.61)     0.825
  MMSE                  28.50 (0.92)       29.12 (1.35)     0.30
  % Hypertension        37.5%              25%              1
  Systolic BP           133.75 (9.4)       128.81 (7.8)     0.27
  Diastolic BP          78.65 (7.5)        76.93 (8.5)      0.67
  ED                    15.5 (3.11)        17.87 (2.8)      0.13
  % Smoker              0%                 12.5%            1

Notes: p-values computed from two-sample t-tests or tests of association; sd = standard deviation.

5.1.3 WM Data

The subjects in this study were from the Midwestern part of the United States. They were no younger than 50 at their first visit. All subjects had three longitudinal measurements, simplifying the problem to a balanced case. In total, 16 subjects participated. Subjects were screened for a history of psychiatric and neurological disorders, heart disease, and hypertension. They were also screened for cognitive impairment using the Mini Mental Status Examination (MMSE) and for symptoms of depression. The sample comprised 16 healthy subjects (50% women) who were between 50 and 80 years old (mean = 62.6, sd = 8.51 years). The men and women did not differ in age, MMSE, hypertension, or systolic and diastolic pressure. We present the demographic and clinical characteristics in Table 5.1; the corresponding p-values confirm that there was no gender bias in the sample. Moreover, the proportions of smoking, regular exercise, and diagnosed hypertension did not differ by gender.

MRI scans were obtained at three separate time points (T1, T2, and T3), and the DTI sequence was initiated on the first occasion (T1). The mean differences between the time points are: mean T1-T2 = 15.56, SD = 1.03 months; mean T2-T3 = 15.06, SD = 2.8 months. The subjects were scanned using a 1.5T Siemens Magnetom Sonata scanner. Each image has a voxel size of 1.8 × 1.8 × 3 mm^3. In total we have 48 NIfTI volumes (number of patients × time points), each of dimension 182 × 218 × 182. The data contain many extraneous slices outside the skull and extra-cranial tissues, and in order to process the data we need to remove those empty slices. We used the neurobase::dropEmptyImageDimensions function to reduce the dimensions to 121 × 157 × 100. The voxel-wise measurements change across the z-axis, and Figure 5.1 shows how the image changes as we move from head to toe.

Figure 5.1: Change of WM across the z-axis, from top-left to bottom-right: z-slices 15, 30, 45, 60
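The image preprocessing just described can be sketched in R as follows; the file name is a placeholder, and only readnii and dropEmptyImageDimensions from the neurobase package are assumed.

```r
## Sketch of the NIfTI preprocessing and masking steps (placeholder file name).
library(neurobase)

img  <- readnii("fa_image.nii.gz")         # 182 x 218 x 182 NIfTI volume (placeholder path)
img  <- dropEmptyImageDimensions(img)      # removes empty slices, e.g. down to 121 x 157 x 100
mask <- img > 0.5                           # voxels with WM/FA intensity above 0.5
vals <- img[img > 0.2 & img <= 1]           # masked intensities used for the density plot
# hist(vals, freq = FALSE); lines(density(vals))
```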
Figure 5.3 shows the orthographic view of the reduced NIfTI image in three planes for a single time point, together with the voxels whose WM measurements are greater than 0.5. This gives us scope to investigate the brighter regions separately from voxels with lower WM intensity. We are primarily interested in the WM analysis and in studying the WM distribution. After masking the first time point's image, we examine the density plot of the data. Figure 5.2 shows that the WM measurements approximately follow a bell-shaped distribution, with values between 0.2 and 1. This finding supports assuming a Gaussian distribution for the response variable. We also compare the image files for different time points and calculate the correlation between consecutive occasions. Figure 5.5 presents a two-dimensional distribution of the T1 and T2 imaging sequences against each other. We also have another NIfTI object, which contains the directional information about the FA measurement changes. The directional values range from -1 to 1, as plotted in Figure 5.4. Voxels with close directional values have similar properties regarding change in WM/FA values.

Figure 5.2: White matter histogram with density plot
Figure 5.3: White matter measurements greater than 0.5
Figure 5.4: Voxels' directional density
Figure 5.5: Two-dimensional distribution of the T1 and T2 imaging sequences

5.1.4 Results

We start our analysis with the voxel-wise model fit. We work with a linear mixed effect model, where we use eight factors as fixed effects and two time-variant variables as random effects. The following fixed effects are considered: gender, hypertension indicator, MMSE score, baseline age, systolic pressure, diastolic pressure, ED, and smoking indicator. We have few longitudinal observations for each patient, which limits us to two time-variant random effects, namely systolic and diastolic pressure. We consider the intercept as a fixed effect. Coefficients are estimated using Fisher's scoring method, where each voxel has separate estimates of (β, σ_r^2, σ^2, ρ) and a best predictor (BP) for û_i. Our key aim is to calculate residuals for each voxel to use as attributes for the clustering algorithm. We check the randomness of the residuals with respect to the voxel-wise predicted response for each patient and each time point. The plot in Figure 5.6 shows no pattern in the residual values.

Figure 5.6: Residuals vs. predicted plot
Figure 5.7: Within sum of squares by number of clusters

We start our clustering algorithm by calculating each voxel's coefficient of variation, mean, residuals, and directional values, and the polar coordinates of the three-dimensional voxel location. We then use the K-means algorithm to segment the brain into 3,000 clusters with a maximum cluster size of 258 voxels. We check the within sum of squares (WSS) over the number of clusters in Figure 5.7; 3,000 clusters is the optimal point that keeps the WSS near its minimum while retaining a significant number of voxels per cluster. The average cluster size is around 105 voxels. Once we obtain each voxel's cluster number, i.e., the cluster to which it belongs, we attach that label to the voxel as an input for the next stage. We visually examine how voxels are clustered together in each hemisphere. Note that voxels in the same region are similar in four major attributes: mean WM, variation in WM over time, residual values, and voxel location.

Figure 5.8: Segmentation of the brain into similar regions
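A minimal sketch of this clustering step is given below; the voxel table layout (columns x, y, z, mean_wm, sd_wm, resid, direction) is an assumption for illustration, and the attributes are standardized before running K-means.

```r
## Sketch of the voxel clustering step with base R kmeans().
cluster_voxels <- function(vox, k = 3000) {
  r     <- sqrt(vox$x^2 + vox$y^2 + vox$z^2)
  theta <- atan2(vox$y, vox$x)               # quadrant-aware version of tan^(-1)(y/x)
  phi   <- acos(vox$z / r)
  cv    <- vox$sd_wm / vox$mean_wm           # coefficient of variation per voxel
  A <- scale(cbind(cv, vox$mean_wm, vox$resid, vox$direction, r, theta, phi))
  kmeans(A, centers = k, iter.max = 100, nstart = 1)
}

# fit <- cluster_voxels(vox, k = 3000); fit$cluster gives each voxel's region label
```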
We plot the same brain slices in Figure 5.8 as in Figure 5.1 (z-axis slices 15, 30, 45, 60), with voxels labeled by their cluster numbers. Based on this visual representation, we can claim that voxels with similar WM values and nearby locations belong to the same region.

In the next step, we fit the spatiotemporal linear mixed effects model with the same set of predictors used in the voxel-wise longitudinal models, but this time considering all the voxels present in a cluster. Firstly, we estimate the parameters (σ_r^2, ρ, σ^2, a, b) related to the random effects, the measurement error, and the spatial covariance matrix, which together determine the spatiotemporal covariance matrix. We use Fisher's scoring algorithm to estimate these parameters, and all parameters converge within 20 iterations. We inspect the convergence of the parameters for cluster 1 in Figure 5.9. Our motivation behind segmenting the brain is to find regions whose voxels are similar to each other while the clusters differ from one another. To study how much variability we observe from one cluster to another, we plot the estimated parameters for all 3,000 clusters in Figure 5.10. Visually, we can identify that the correlation estimates differ from one cluster to another.

Figure 5.9: Fisher's scoring method convergence
Figure 5.10: Estimated parameters for all clusters

Finally, we discuss our pivotal interest in estimating the β components. As mentioned before, Y_ig ∼ N(X_ig β_g, W_ig) and β̂_g = [ Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} X_ig ]^{-1} Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} Y_ig is the generalized least squares estimator, where β_g is a (v_g p) × 1 vector. This implies that for a particular cluster we obtain v_g sets of β (p × 1) estimates, where the first component is the intercept and the rest of the components belong to the corresponding fixed effects. Note that v_g is the number of voxels present in the g-th cluster. Similar to the previous approach, we study the histograms of the beta components over different clusters. We examine two separate figures, one for the intercept and another for the gender coefficient, in Figures 5.11 and 5.12, respectively. We choose clusters 1, 45, 250, and 500 to inspect the variability. From the plots, it is clear that the voxel-wise estimates differ among clusters, implying voxel variability over regions.

Figure 5.11: Histogram of intercepts
Figure 5.12: Histogram of gender coefficients

5.1.5 Discussion

The linear mixed effect model is a powerful tool that has been used here to capture the spatiotemporal structure hidden in voxel-level data. Bernal-Rusiel et al., 2013[12] proposed this method for analyzing cortical thickness in brain sub-regions. The LME provides a flexible approach to capture longitudinal changes and spatial dependence. This method works better than voxel-wise longitudinal analysis, and the statistical power increases. To the best of our knowledge, this method has not previously been used to analyze longitudinal WM changes in healthy aging subjects. The implications of the whole setup can be divided into three parts. i) First, we find the longitudinal changes of WM in each voxel, and we identify how subjects' age, gender, hypertension, MMSE score, and systolic and diastolic pressure relate to these repeated measures. ii) In the next stage, we find the brain regions that behave similarly with respect to the WM measurements, voxel locations, and voxel-wise model coefficients. This gives us homogeneous regions in the brain on which to build more parsimonious models.
iii) Finally, we develop an individual model for each of these regions, where the covariance matrix is a product of the longitudinal and spatial covariance matrices. We observe that the fixed-effects regression coefficients, the estimated variance, and the covariance components of the random effects vary from one cluster to another. This gives us scope to study longitudinal changes in these homogeneous regions rather than in a voxel-wise study.

The main motivation for building this linear mixed effects spatiotemporal model was to move toward Bayesian model-based clustering with a stochastic objective function. In Section 5.1, we used the K-means algorithm to segment the brain, but we were dissatisfied with the limitations of this method. One such limitation is that we needed to fix the number of clusters a priori, while this is most often unknown. Researchers study the within sum of squares to choose an optimal number of clusters, but it does not have any probabilistic interpretation. An advantage of a stochastic objective function is that it incorporates dependence among voxels through spatial and temporal structures within a cluster. We formulate the method in detail in Section 5.2.

5.2 Bayesian model-based clustering in application to spatiotemporal data

5.2.1 Introduction

Cluster analysis is a modern data mining tool used to group or segment objects into clusters based on similarity measurements. Many available algorithms are built around the criterion that objects within a cluster are more similar to each other than to objects belonging to other clusters. Some popular clustering methods are K-means, agglomerative clustering, hierarchical clustering, and model-based clustering. We focus on model-based clustering in this chapter. A disadvantage of hierarchical clustering and K-means is that they are model-free heuristic methods. Model-based clustering is an alternative that provides the option of formulating a model while grouping objects. Model-based clustering was first introduced by Banfield and Raftery, 1993[6], who assumed that observations come from a multivariate normal distribution and that maximum likelihood yields the estimates. Fraley and Raftery, 2002[37] introduced finite mixture models as a formal setting for model-based clustering. Let y_1, ..., y_n be independently distributed p-dimensional observations from a K-component mixture distribution:

f(y; τ, θ) = Π_{i=1}^n Σ_{k=1}^K τ_k f_k(y_i | θ_k).        (5.4)

Here, τ_k is the probability that a particular object belongs to the k-th cluster, and θ_k parametrizes the density f_k. In general, f_k is a multivariate normal density, f_k ≡ MVN(µ_k, Σ_k). The EM algorithm helps us obtain the estimates of the unknown parameters. Melnykov and Maitra, 2010[88] provide a thorough reference for model-based clustering. One point of subjectivity that is always attached to cluster analysis is how to decide the number of clusters. Similar to K-means, model-based clustering needs a specified number of clusters at the start of the algorithm. Fraley and Raftery, 1998[36] tried to address this question by using model selection criteria; they suggested examining BIC over all candidate cluster numbers and choosing the best value of the criterion. In another notable study, Yeung et al., 2001[131] applied model-based clustering to group gene expression data. Over time, researchers have included other factors in the model-based approach, such as variable selection [101] and penalization of the parameter space θ [95].
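As a toy illustration of the finite-mixture formulation (5.4), the mclust package, which implements the Fraley and Raftery approach discussed above, can be used as follows on simulated data (not the WM data).

```r
## Toy illustration of model-based clustering via finite Gaussian mixtures.
library(mclust)

set.seed(1)
y <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),   # two simulated groups
           matrix(rnorm(200, mean = 3), ncol = 2))
fit <- Mclust(y, G = 1:6)     # EM over K = 1..6 components, model selected by BIC
summary(fit)                   # fit$G is the chosen number of clusters
# plot(fit, what = "BIC")     # BIC profile over candidate numbers of clusters
```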
All the aforementioned studies were performed in a conventional statistical framework, without using any prior information on the cluster structure, the number of clusters, or the unknown parameters (τ, θ). As an exception, Bayesian model selection has been used to select an optimal number of clusters. Handcock et al., 2007[46] described two methods to estimate the unknown parameters: they first used a two-stage maximum likelihood method, and then they proposed a Bayesian estimation method. In the Bayesian approach, the parameters have a multivariate normal prior, the mixture probability τ has a Dirichlet prior, and estimation is by MCMC. Moreover, Fraley and Raftery, 2007[38] discussed the problems with using the EM algorithm and the situations in which it can fail to converge. Instead of the MLE, they preferred to work with a MAP estimator with conventional priors on the mean and covariance parameters, such as a multivariate normal prior on the mean and an Inverse-Wishart distribution on the covariance matrix. Medvedovic et al., 2004[86] proposed a contrasting formulation of a Bayesian mixture-based clustering algorithm for grouping gene expression data with replicates. Interestingly, the Poisson-Dirichlet process is the most logical and convenient way of assigning priors to the number of clusters and the mixture probabilities if one does not want to limit the number of clusters at the start of the algorithm [70].

The methods and literature described above have various advantages and can be applied to the different complicated problems with which researchers deal. Here, we should reiterate the problem we are addressing. We aim to build a Bayesian model-based clustering method with the same WM data used for spatiotemporal modeling in Section 5.1. We have longitudinal voxel-wise WM measurements for healthy aging subjects, and we want to group voxels into homogeneous regions considering both longitudinal and spatial information. To this end, we review some of the literature on spatiotemporal clustering methods. The field of computer science has worked on this topic extensively. Kalnis et al., 2005[65] investigated the highly complicated problem of detecting clusters among moving objects that change location over time; for example, they clustered trajectories and mined movement patterns for a group of migrating animals or a convoy of cars moving in a city. The complexity of this problem is manifold, but we simplify our situation by assuming that the spatial locations of voxels are fixed over time. Kisilevich et al., 2009[67] presented a detailed study of spatiotemporal clustering of trajectories and provided in-depth research developments on this topic.

In Section 5.1.2.2, we applied K-means clustering to group voxels into homogeneous regions without any model assumption. Advancing on that, we seek to build a clustering algorithm that satisfies the clustering criteria and simultaneously helps us to estimate the regression coefficients and variance components for each cluster. Section 5.1 presented a two-stage procedure of clustering objects and then fitting a spatiotemporal model for each cluster. To blend these two processes together and work closely with a Bayesian clustering structure, we explore the method described by Booth et al., 2008[13]. They derived an objective function to cluster objects and a stochastic search process to find the posterior distribution of the number of clusters. We also take advantage of the parsimonious representation of the mean via regression.
Similar ideas were explored in studies published before Booth et al., 2008 [51][48]. In Section 5.2.2, we discuss the general algorithm of our proposed methodology and the model building with posterior derivations. Section 5.2.3 concerns the stochastic search method used to optimize the objective function.

5.2.2 Methodology

5.2.2.1 The Ewens-Pitman prior on cluster partitions

The basic clustering objective is easy to convey. Let us assume we have n voxels in the brain observed for M subjects over T time points. We want to group these voxels into clusters in a way that is independent of subject variation, and the algorithm should consider the spatial and temporal covariance structures together. Voxels belonging to the same cluster should be similar to each other, whereas they should differ widely from members of other clusters. As stated in Booth et al., 2008[13], we want to find an objective function π : P_n → R+, where P_n is the set of all possible partitions of the n objects. Partitions must consist of non-empty sets that do not overlap. π assigns a score to each partition, measuring how well it meets the overall objective. Equivalently, we need to optimize the objective function π so that we obtain the partition with the highest score; that partition will be the best partition under all the stated considerations. In addition, we use the same linear mixed effect model as the one described in Section 5.1.2.3, which accounts for the spatiotemporal dependence among voxels along with the other fixed and random effects.

Instead of assuming that objects come from equation (5.4) with a fixed number of groups K, it is more realistic to assume that there is an unknown partition ω with c = c(ω) clusters. We can denote the clusters as C_1, ..., C_c, and equation (5.4) can be written as

f(y | θ_1, ..., θ_c, ω) = Π_{k=1}^{c(ω)} Π_{i ∈ C_k} f(y_i | θ_k),        (5.5)

with C_i ∩ C_j = Φ for i ≠ j. The introduction of ω leads to uncertainty in the number of clusters, i.e., a probabilistic component attached to each partition. Once we find the estimate ω̂, we can fix the cluster partition with full confidence and find the estimate θ̂ for each cluster. Equation (5.5) draws a general picture of the clustering proposal. Spatiotemporal data need a separate covariance structure; we work with the covariance structure of Bernal-Rusiel et al., 2013[12], which we used for the WM data in Section 5.1. We believe that this structure accurately captures the dependence between voxels, as well as the temporal changes across time points, and models the spatiotemporal data in a non-functional way. The objective function that finalizes the cluster partition is the posterior distribution π(ω|y); hence we need a prior distribution on ω. The optimization of π(ω|y) is computationally very challenging, especially for spatiotemporal data. As stated in Booth et al., 2008[13], if there are n data points to cluster, the total number of possible partitions is the Bell number[116], denoted by B(n) = #P_n. B(n) increases rapidly with n; the table below gives an idea of the relationship between n and B(n).

  n      1   2   3   4    5    6     7     8      9       10       40           100
  B(n)   1   2   5   15   52   203   877   4140   21147   115975   1.6 × 10^35   4.8 × 10^115

We have around 316,000 voxels to cluster, which is computationally challenging, and we also need a stochastic search algorithm to optimize the objective function. We are still working on building this algorithm. We now focus on the prior distribution of ω.
The posterior distribution π(ω|y) is calculated from π(ω) and π(θ|ω) by integrating the nuisance parameter θ out of the likelihood. It should be noted that the estimates of θ are cluster specific and should depend on ω. We explore the same prior used by Booth et al., 2008[13] for ω ∈ P_n:

π(ω) = [ Γ(m) m^{c(ω)} / Γ(n + m) ] Π_{k=1}^{c(ω)} Γ(n_k),   c(ω) = 1, ..., n,   ω ∈ P_n,        (5.6)

where n_k = #(C_k) is the number of objects in cluster C_k, n = Σ_{k=1}^{c(ω)} n_k is the total number of objects in the study, and m > 0 is a parameter. This distribution was first used by Crowley, 1997[27]. This prior is also known as "the Ewens-Pitman prior"[22], and McCullagh and Yang, 2006[83] presented a detailed mathematical derivation of it. Moreover, when c(ω) is left unbounded, this prior is also called the Chinese restaurant process. The parameter m controls the number of clusters to be formed, and the expected number of clusters increases with m. When m → 0, we obtain a single cluster with all objects gathered into it. Similarly, for m → ∞, the number of clusters approaches n, the number of objects. As mentioned in Booth et al., 2008[13], the expected number of clusters formed when ω ∼ π(ω) is

E[c(ω)] = m Σ_{i=0}^{n−1} 1/(m + i).

This prior has two other interesting properties:

• Exchangeability: If two partitions (∈ P_n) have the same c(ω) and the same cluster sizes n_1, n_2, ..., n_{c(ω)}, then they have exactly the same probability under π(ω).

• Consistency: The prior probability of a partition of n objects (ω* ∈ P_n) does not depend on whether an (n + 1)-th element enters the study; i.e., for ω* ∈ P_n and the set S ⊂ P_{n+1} of partitions that reduce to ω* after removing the (n + 1)-th element, Σ_{ω ∈ S} π_{n+1}(ω) = π_n(ω*).

Next, we define our assumed model and the prior π(θ|ω). We will calculate the posterior of ω as

π(ω|y) ∝ ∫_θ f(y | θ, ω) π(θ|ω) π(ω) dθ.

5.2.2.2 Spatiotemporal linear mixed effects model:

In this part, we focus on the same linear mixed effect model as the one used in Section 5.1.2.3. Let us assume that y_ij is the i-th voxel's measurement on the j-th subject, observed over T time points, and that for a fixed ω the i-th voxel belongs to the k-th cluster C_k containing n_k voxels. Then

y_ij = X_j β_k + Z_j u_i + ε_ij,   for i = 1, ..., n_k and j = 1, ..., M,
u_i ∼ N_q(0, λ σ_k^2 I_q),   ε_ij ∼ N_T(0, σ_k^2 I_T),        (5.7)

where X_j (T × p) is the j-th subject's fixed effect design matrix, Z_j (T × q) is the j-th subject's time-variant random effect matrix, β_k is a p-dimensional coefficient vector for the k-th cluster, u_i is the i-th voxel's random effect, and ε_ij is the measurement error. Once we aggregate a single voxel's measurements over all M subjects, we obtain

y_i = X β_k + Z u_i + ε_i,   i = 1, ..., n_k,
cov(y_i) = σ_k^2 ( λ Z Z^T + I_MT ) = σ_k^2 Σ.        (5.8)

Σ contains the temporal covariance matrix, and we can assume that Z is block-wise orthogonal, such that each patient has an independent random effect. Once we consider that all n_k voxels belong to the k-th cluster, we obtain the (n_k M T) × 1 vector y_k = (y_1^T, ..., y_{n_k}^T)^T, and we write its covariance matrix as a product of spatial and temporal covariance matrices:

cov(y_k) = σ_k^2 W_k = σ_k^2 (G_k ⊗ Σ) = σ_k^2
| G_k11 Σ     G_k12 Σ     ...  G_k1n_k Σ   |
| G_k21 Σ     G_k22 Σ     ...  G_k2n_k Σ   |
| ...         ...         ...  ...         |
| G_kn_k1 Σ   G_kn_k2 Σ   ...  G_kn_kn_k Σ |

where ⊗ is the Kronecker product, and Σ = λ Z Z^T + I_MT from equation (5.8) is the temporal covariance matrix for all patients' measurements at the voxels present in the k-th cluster. G_k is the n_k × n_k spatial covariance matrix that accounts for the correlation between the voxels present in the k-th cluster. The spatial structure below, on which we also relied in the previous sections, was proposed by Bernal-Rusiel et al., 2013[12].
It is empirically useful:

G_k =
| 1                                   exp(−a_k d_12 − b_k d_12^2)      ...  exp(−a_k d_1n_k − b_k d_1n_k^2) |
| exp(−a_k d_21 − b_k d_21^2)         1                                ...  exp(−a_k d_2n_k − b_k d_2n_k^2) |
| ...                                 ...                              ...  ...                             |
| exp(−a_k d_n_k1 − b_k d_n_k1^2)     exp(−a_k d_n_k2 − b_k d_n_k2^2)  ...  1                               |

where a_k, b_k ≥ 0 are unknown parameters and d_ij is the Euclidean distance between voxels i and j in the k-th cluster. Hence, the joint distribution of the y_k vector of WM measurements in the k-th cluster is

y_k ∼ N(X* β_k, σ_k^2 W_k),        (5.9)

where X* = 1_{n_k} ⊗ X is an (n_k M T) × p matrix. For the time being, we assume that W_k is known. W_k has three unknown parameters λ, a_k, b_k; (a_k, b_k) can be plugged in from our previous study, and we estimate λ by a data-driven method. Our primary focus is on the parameters θ_k = (β_k, σ_k^2), and a convenient non-informative prior on (β_k, log(σ_k)) is

π(β_k, σ_k^2 | ω) ∝ 1/σ_k^2.

5.2.2.3 Objective function derivation:

Once we decide on the prior distributions, it is easy to derive the posterior distributions. We first determine the posterior distribution π(β_k | σ_k^2, y_k, ω), conditional on σ_k^2 and ω. Next, we ascertain the marginal posterior distribution π(σ_k^2 | y_k, ω):

(β_k | σ_k^2, y_k, ω) ∼ N(β̂_k, σ_k^2 V_βk),   where   β̂_k = (X*^T W_k^{-1} X*)^{-1} X*^T W_k^{-1} y_k,   V_βk = (X*^T W_k^{-1} X*)^{-1};        (5.10)

(σ_k^2 | y_k, ω) ∼ Inv-χ^2( (n_k M T) − p, s_k^2 ),        (5.11)

where

s_k^2 = (y_k − X* β̂_k)^T W_k^{-1} (y_k − X* β̂_k) / ((n_k M T) − p).        (5.12)

To find π(ω|y), we need to integrate out β_k and σ_k^2 so that we can find

π(y_k | ω) ∝ ∫∫ f(y_k | β_k, σ_k^2, ω) π(β_k, σ_k^2 | ω) dβ_k dσ_k^2
           ∝ Γ( ((n_k M T) − p)/2 ) [ (1/2)(y_k − X* β̂_k)^T W_k^{-1} (y_k − X* β̂_k) ]^{ −((n_k M T) − p)/2 }.

Finally, we take the product of the above function over all clusters and multiply it by the prior distribution on cluster partitions, π(ω). This provides our long-desired objective function:

π(ω|y) ∝ π(ω) Π_{k=1}^{c(ω)} π(y_k | ω)
        ∝ [ Γ(m) m^{c(ω)} / Γ(n + m) ] Π_{k=1}^{c(ω)} Γ(n_k) Γ( ((n_k M T) − p)/2 ) [ (1/2)(y_k − X* β̂_k)^T W_k^{-1} (y_k − X* β̂_k) ]^{ −((n_k M T) − p)/2 }.        (5.13)

The posterior function π(ω|y) is logically reasonable. The residual quadratic form calculates the within-cluster sum of squares and prefers partitions with lower WSS. Furthermore, through the Γ(n_k) terms, this posterior objective function prefers large homogeneous regions. It also measures lack of fit from the linear mixed effect model we have assumed. Optimizing the objective function is challenging work.

CHAPTER 6
FUTURE WORK

This dissertation focused on Bayesian variable selection and its application to brain image data. We worked extensively with Bayesian model-based clustering in order to segment the brain into similar regions. Optimization of the objective function remains future work. As discussed earlier, the total number of possible partitions of roughly 316 thousand voxels is astronomically large, so it is implausible to optimize the function over the full space. In order to utilize this objective function, we decided to reduce our sample space. We will generate a large number of candidate partitions using the K-means procedure described in Section 5.1.2.2 with different cluster numbers. Once we finalize these partitions, the objective function derived above will serve as the criterion for picking the best partition among the selected candidates, as sketched below.
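A sketch of evaluating the log of the objective function (5.13) for one candidate partition is given below; the per-cluster quantities (y_k, X*_k, W_k^{-1}) and the Ewens-Pitman parameter m are assumed to be precomputed, and the function returns the log posterior only up to an additive constant.

```r
## Sketch: log objective function (5.13) for a candidate partition, up to a constant.
## y_list, Xstar_list, W_inv_list hold per-cluster y_k, X*_k and W_k^{-1}; MT = M * T.
log_objective <- function(y_list, Xstar_list, W_inv_list, m, MT, p) {
  c_omega <- length(y_list)
  n       <- sum(sapply(y_list, length)) / MT                 # total number of voxels
  lp      <- lgamma(m) + c_omega * log(m) - lgamma(n + m)     # Ewens-Pitman prior term
  for (k in seq_len(c_omega)) {
    yk <- y_list[[k]]; Xk <- Xstar_list[[k]]; Wk_inv <- W_inv_list[[k]]
    nk   <- length(yk) / MT
    bhat <- solve(t(Xk) %*% Wk_inv %*% Xk, t(Xk) %*% Wk_inv %*% yk)
    rss  <- as.numeric(t(yk - Xk %*% bhat) %*% Wk_inv %*% (yk - Xk %*% bhat))
    lp   <- lp + lgamma(nk) + lgamma((nk * MT - p) / 2) -
            ((nk * MT - p) / 2) * log(rss / 2)
  }
  lp
}
```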
We focus on the Metropolis-Hastings algorithm based on the biased random walk described in Booth et al., 2008[13]. Each partition is treated as a node of an undirected graph, and edges are created based on the connections from one node to another. As the authors describe, if we move one element from a cluster of a particular partition, place it in another cluster, and thereby obtain a new partition, then these two partitions share an edge. Following this logic, however, we will not obtain a symmetric candidate density for the Metropolis-Hastings algorithm. Our future work will be to propose an alternative graph construction that provides a symmetric candidate proposal. There is also a potential problem with our reduced sample space approach. As mentioned, we will restrict our space to a large (but finite) number of partitions, and it is possible that two of these partitions will share no edge, i.e., have no connection in terms of their voxel allocations. We need further work to handle this situation. All of these computational problems will be addressed in our future work.

APPENDIX

A.1 Chapter 3: Tables for classification performance using combinations of ROIs

Table A1: Combination of two ROIs

  ROIs (Mean ± SD)   sensitivity    specificity    accuracy       AUC
  H + WB             0.87 ± 0.02    0.85 ± 0.01    0.86 ± 0.01    0.92 ± 0.02
  H + FG             0.86 ± 0.01    0.88 ± 0.01    0.87 ± 0.01    0.93 ± 0.02
  H + EC             0.86 ± 0.02    0.88 ± 0.02    0.87 ± 0.001   0.94 ± 0.02
  H + MTC            0.84 ± 0.02    0.90 ± 0.01    0.88 ± 0.01    0.93 ± 0.02
  WB + EC            0.81 ± 0.02    0.90 ± 0.01    0.87 ± 0.01    0.92 ± 0.03
  WB + FG            0.77 ± 0.03    0.84 ± 0.01    0.81 ± 0.01    0.89 ± 0.03
  WB + MTC           0.78 ± 0.01    0.88 ± 0.01    0.84 ± 0.01    0.90 ± 0.02
  EC + FG            0.84 ± 0.02    0.89 ± 0.001   0.87 ± 0.01    0.92 ± 0.03
  EC + MTC           0.82 ± 0.04    0.89 ± 0.01    0.86 ± 0.01    0.92 ± 0.02
  FG + MTC           0.81 ± 0.03    0.84 ± 0.01    0.83 ± 0.01    0.91 ± 0.03

Notes: Sensitivity is the proportion of correct AD predictions, specificity is the proportion of correct CN predictions, and AUC is the area under the ROC curve. The mean and standard deviation are based on 100 repeated results on test data sets. Probability threshold is 0.5.

Table A2: Combination of three ROIs

  ROIs (Mean ± sd)   sensitivity    specificity    accuracy        AUC
  H+WB+EC            0.86 ± 0.02    0.86 ± 0.02    0.86 ± 0.01     0.91 ± 0.05
  H+WB+FG            0.91 ± 0.02    0.80 ± 0.02    0.84 ± 0.01     0.91 ± 0.10
  H+WB+MTC           0.87 ± 0.05    0.86 ± 0.03    0.87 ± 0.02     0.93 ± 0.07
  H+EC+FG            0.85 ± 0.03    0.87 ± 0.02    0.86 ± 0.001    0.93 ± 0.03
  H+EC+MTC           0.83 ± 0.02    0.90 ± 0.01    0.87 ± 0.01     0.93 ± 0.03
  H+FG+MTC           0.86 ± 0.02    0.87 ± 0.01    0.86 ± 0.01     0.93 ± 0.02
  WB+EC+FG           0.84 ± 0.04    0.86 ± 0.01    0.85 ± 0.02     0.91 ± 0.03
  WB+EC+MTC          0.85 ± 0.02    0.89 ± 0.01    0.88 ± 0.01     0.94 ± 0.02
  WB+FG+MTC          0.82 ± 0.02    0.86 ± 0.03    0.84 ± 0.02     0.91 ± 0.03
  EC+FG+MTC          0.86 ± 0.02    0.85 ± 0.01    0.85 ± 0.01     0.92 ± 0.03

Notes: Sensitivity is the proportion of correct AD predictions, specificity is the proportion of correct CN predictions, and AUC is the area under the ROC curve. The mean and standard deviation are based on 100 repeated results on test data sets. Probability threshold is 0.5.
Table A3: Classification performance for combinations of four and five ROIs

ROI combination        Sensitivity    Specificity    Accuracy       AUC
H+WB+FG+MTC            0.91 ± 0.01    0.79 ± 0.04    0.84 ± 0.02    0.92 ± 0.06
H+WB+EC+MTC            0.77 ± 0.04    0.89 ± 0.03    0.84 ± 0.02    0.89 ± 0.10
H+WB+EC+FG             0.74 ± 0.08    0.90 ± 0.02    0.84 ± 0.03    0.85 ± 0.14
H+EC+FG+MTC            0.85 ± 0.02    0.84 ± 0.02    0.84 ± 0.02    0.91 ± 0.05
WB+EC+FG+MTC           0.85 ± 0.03    0.86 ± 0.02    0.86 ± 0.01    0.92 ± 0.03
H+WB+EC+FG+MTC*        0.81 ± 0.01    0.77 ± 0.01    0.79 ± 0.01    0.87 ± 0.02

Notes: Entries are means ± standard deviations (SD) over 100 repeated results on test data sets. Sensitivity is the proportion of correct AD predictions, specificity is the proportion of correct CN predictions, and AUC is the area under the ROC curve. The probability threshold is 0.5. * Normalized volumes were used for this model run.

A.2 Chapter 4: Posterior consistency proofs

Let us define $B(r_n) = \sup_{\gamma=\gamma(r_n)} Ch(G_{r_n}^{-1})$ and $\bar{B}(r_n) = \sup_{\gamma=\gamma(r_n)} Ch(G_{r_n})$, where $G_{r_n} = \mathrm{diag}(\tau^*_1 I_q, \ldots, \tau^*_{r_n} I_q)$, so that $\bar{B}(r_n)$ bounds the largest eigenvalue of $G_{r_n}$, and let $D(R) = 1 + R \cdot \sup_{|h| \leq R}|a'(h)| \sup_{|h| \leq R}|\psi(h)|$. Let $\epsilon_n \in (0,1]$ with $n\epsilon_n^2 \gg 1$, and assume the conditions below hold, which come from the Jiang (2007) [63] paper.

Conditions:
(i) $p_n \log(1/\epsilon_n^2) \prec n\epsilon_n^2$
(ii) $p_n \log(p_n) \prec n\epsilon_n^2$
(iii) $p_n \log\!\big(D(p_n \bar{B}(r_n)\, n\, \lambda_n)\big) \prec n\epsilon_n^2$
(iv) $r_n \prec p_n$
(v) $r_n \log(\bar{B}(r_n)\, n) \prec n\epsilon_n^2$ and $\Delta(r_n) \prec n\epsilon_n^2$
(vi) $p_n \log(r_n/p_n) \leq -4 n\epsilon_n^2$

Proof of Condition I: If $\psi(u) = e^u/(1+e^u)$, then $w(u) = \log[\psi(u)/(1-\psi(u))] = u$, so $w'(u) = 1$ and hence $|w'(u)| \leq C_q$.

Proof of Condition S: The proof starts by defining the sets and notation used in condition S. Let $r_n$ be a large integer $> 0$ and let $\eta > 0$ be small; then

$$S(r_n, \eta) = \left\{(\gamma, \beta_\gamma) : \gamma = \gamma(r_n),\ \beta_\gamma \in M(r_n, \epsilon_n)\right\}, \qquad M(r_n, \epsilon_n) = \left\{(b_1, \ldots, b_{r_n})^T : b_j \in \beta_j \pm \frac{\epsilon_n^2}{r_n},\ j = 1, \ldots, r_n\right\}.$$

Here $r_n$ is the model size and $\gamma(r_n) = (1, 1, \ldots, 1, 0, \ldots)$ is the inclusion vector whose first $r_n$ components take value 1. Let $1 \prec r_n \prec \min(p_n, n/\log(p_n))$ and $\sum_{j=1}^{\infty} \|\beta_j\|_2 < \infty$. Then

$$\pi_n\!\left[\beta_\gamma \in \beta_j \pm \frac{\epsilon_n^2}{r_n} \,\middle|\, \gamma = \gamma(r_n)\right] \geq \prod_{j=1}^{r_n} \frac{(\lambda_n^2)^{q/2}}{(2\pi)^{(q-1)/2}} \exp\!\left(-\lambda_n \sqrt{\bar\beta_j^T \bar\beta_j}\right) \left(\frac{\epsilon_n^2}{r_n}\right),$$

where $\bar\beta_j$ is some intermediate value which achieves the minimum density over $(\beta_j \pm \epsilon_n^2/r_n)_{j \in \gamma(r_n)}$. Then

$$\sum_{j=1}^{r_n} \lambda_n \sqrt{\bar\beta_j^T \bar\beta_j} \leq C_1 B(r_n),$$

as $\sum_{j=1}^{r_n} \sqrt{\bar\beta_j^T \bar\beta_j} \leq \sum_{j=1}^{r_n} \sqrt{\beta_j^T \beta_j} + \frac{\epsilon_n^2}{r_n} \leq \lim_{n\to\infty} \sum_{j=1}^{p_n} \|\beta_j\|_2$ is bounded. In addition, we can show that

$$\prod_{j=1}^{r_n} \frac{(\lambda_n^2)^{q/2}}{(2\pi)^{(q-1)/2}} \geq \exp\!\left(-C_2 r_n - C_3 r_n \log(\bar{B}(r_n))\right),$$

where $\bar{B}(r_n) = \sup_{\gamma = \gamma(r_n)} Ch(G_{r_n})$. Therefore,

$$\pi_n\!\left[\beta_\gamma \in \beta_j \pm \frac{\epsilon_n^2}{r_n} \,\middle|\, \gamma = \gamma(r_n)\right] \geq \exp\!\left(-C_2 r_n - C_3 r_n \log(\bar{B}(r_n)) - C_1 B(r_n) - r_n \log\frac{r_n}{\epsilon_n^2}\right) \geq \exp(-c\, n\epsilon_n^2).$$

To prove the prior condition: let $\bar{r}_n$ be such that $r_n < \bar{r}_n \leq p_n$ and $\bar{r}_n \prec n/\ln(p_n)$. For our model we have placed a Beta distribution on $\pi_{0,n}$, which is an equivalent way of proposing a Bernoulli distribution on $\gamma = \gamma(r_n)$, where $\gamma(r_n) \sim \text{Bernoulli}(\pi_{0,n})$ [115]. Now,

$$\ln \pi_n = r_n \ln \pi_{0,n} + (p_n - r_n) \ln(1 - \pi_{0,n}).$$

If $r_n \approx p_n \pi_{0,n}$, so that $\pi_{0,n} = r_n/p_n$ is small, and $1 \prec r_n \prec \min(p_n, n/\ln p_n)$, then

$$\ln \pi_n \geq -r_n \ln p_n > -c\, n\epsilon_n^2 \implies \pi_n[\gamma = \gamma(r_n)] > \exp(-c\, n\epsilon_n^2) \ \text{for large } n,$$

satisfying condition (S).
Proof of Condition L: Let us assume $D(R) = 1 + R \cdot \sup_{|h| \leq R}|a'(h)| \sup_{|h| \leq R}|\psi(h)|$ and that there exists some $C_n$ such that

$$\bar{r}_n \ln(1/\epsilon_n^2) \prec n\epsilon_n^2, \qquad \bar{r}_n \ln(p_n) \prec n\epsilon_n^2, \qquad \bar{r}_n \ln D(\bar{r}_n C_n) \prec n\epsilon_n^2.$$

Then,

$$\pi_n(|\gamma| > \bar{r}_n) = \pi_n(|\gamma| = p_n) = \left(\frac{r_n}{p_n}\right)^{p_n} \implies \ln \pi_n(|\gamma| > \bar{r}_n) = p_n \ln\!\left(\frac{r_n}{p_n}\right) \leq -c\, n\epsilon_n^2 \implies \pi_n(|\gamma| > \bar{r}_n) \leq e^{-c\, n\epsilon_n^2}.$$

Next,

$$\pi_n(\|\beta_j\|_2 > t \mid \gamma) \propto \int_t^{\infty} e^{-\lambda_n \sqrt{\beta_j^T \beta_j}}\, d\beta_j \leq \frac{1}{\lambda_n} e^{-\lambda_n t}.$$

If $t = C_n = \frac{c\, n\epsilon_n^2}{\lambda_n}$ and $n\epsilon_n^2 \gg 1$, then

$$\frac{1}{\lambda_n} e^{-\lambda_n t} \leq e^{-c\, n\epsilon_n^2}, \quad \text{as } \lambda_n \geq 1,$$

satisfying condition (L).

BIBLIOGRAPHY

[1] Stanisław Adaszewski, Juergen Dukart, Ferath Kherif, Richard Frackowiak, and Bogdan Draganski. How early can we predict alzheimer’s disease using computational anatomy? Neurobiology of aging, 34(12):2815–2826, 2013.

[2] James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association, 88(422):669–679, 1993.

[3] DR Anderson and K Burnham. Model selection and multi-model inference. Second. NY: Springer-Verlag, page 63, 2004.

[4] Sönke Arlt, Ralph Buchert, Lothar Spies, Martin Eichenlaub, Jan T Lehmbeck, and Holger Jahn. Association between fully automated mri-based volumetry of different brain regions and neuropsychological test performance in patients with amnestic mild cognitive impairment and alzheimer’s disease. European archives of psychiatry and clinical neuroscience, 263(4):335–344, 2013.

[5] Takeshi Asami, Sylvain Bouix, Thomas J Whitford, Martha E Shenton, Dean F Salisbury, and Robert W McCarley. Longitudinal loss of gray matter volume in patients with first-episode schizophrenia: Dartel automated analysis and roi validation. Neuroimage, 59(2):986–996, 2012.

[6] Jeffrey D Banfield and Adrian E Raftery. Model-based gaussian and non-gaussian clustering. Biometrics, pages 803–821, 1993.

[7] J Barnes, RI Scahill, C Frost, JM Schott, MN Rossor, and NC Fox. Increased hippocampal atrophy rates in ad over 6 months using serial mr imaging. Neurobiology of aging, 29(8):1199–1203, 2008.

[8] Douglas M. Bates and Donald G. Watts. Nonlinear Regression Analysis and Its Applications. 1988.

[9] Sam Behseta, Robert E Kass, and Garrick L Wallstrom. Hierarchical models for assessing variability among functions. Biometrika, 92(2):419–434, 2005.

[10] Andrew R Bender, Manuel C Völkle, and Naftali Raz. Differential aging of cerebral white matter in middle-aged and older adults: a seven-year follow-up. Neuroimage, 125:74–83, 2016.

[11] Jorge L Bernal-Rusiel, Douglas N Greve, Martin Reuter, Bruce Fischl, Mert R Sabuncu, Alzheimer’s Disease Neuroimaging Initiative, et al. Statistical analysis of longitudinal neuroimage data with linear mixed effects models. Neuroimage, 66:249–260, 2013.

[12] Jorge L Bernal-Rusiel, Martin Reuter, Douglas N Greve, Bruce Fischl, Mert R Sabuncu, Alzheimer’s Disease Neuroimaging Initiative, et al. Spatiotemporal linear mixed effects modeling for the mass-univariate analysis of longitudinal neuroimage data. Neuroimage, 81:358–370, 2013.

[13] James G Booth, George Casella, and James P Hobert. Clustering using objective functions and stochastic search. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):119–139, 2008.

[14] F Dubois Bowman. Spatiotemporal models for region of interest analyses of functional neuroimaging data. Journal of the American Statistical Association, 102(478):442–453, 2007.

[15] F DuBois Bowman, Brian Caffo, Susan Spear Bassett, and Clinton Kilts. A bayesian hierarchical framework for spatial modeling of fmri data.
NeuroImage, 39(1):146–156, 2008. [16] F DuBois Bowman and Clinton Kilts. Modeling intra-subject correlation among repeated scans in positron emission tomography (pet) neuroimaging data. Human brain mapping, 20(2):59–70, 2003. [17] Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009. [18] Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. [19] Ramon Casanova, Benjamin Wagner, Christopher T Whitlow, Jeff D Williamson, Sally A Shumaker, Joseph A Maldjian, and Mark A Espeland. High dimensional classification of structural mri alzheimer’s disease data based on large scale regularization. Frontiers in neuroinformatics, 5:22, 2011. [20] George Casella. Empirical bayes gibbs sampling. Biostatistics, 2(4):485–500, 2001. [21] George Casella and Elias Moreno. Objective bayesian variable selection. Journal of the American Statistical Association, 101(473):157–167, 2006. [22] George Casella, Elías Moreno, F Javier Girón, et al. Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, 9(3):613–658, 2014. [23] Arindam Chatterjee and Soumendra Nath Lahiri. Bootstrapping lasso estimators. Journal of the American Statistical Association, 106(494):608–625, 2011. [24] Ray-Bing Chen, Chi Hsiang Chu, Shinsheng Yuan, and Ying Nian Wu. Bayesian sparse group selection. Journal of Computational and Graphical Statistics, 25(3):665–683, 2016. [25] G Chetelat, B Landeau, F Eustache, F Mezenge, F Viader, V De La Sayette, B Desgranges, and J-C Baron. Using voxel-based morphometry to map the structural changes associated with rapid conversion in mci: a longitudinal mri study. Neuroimage, 27(4):934–946, 2005. [26] Hee Min Choi, James P Hobert, et al. The polya-gamma gibbs sampler for bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7:2054–2064, 2013. 135 [27] Evelyn M. Crowley. Product partition models for normal means. Journal of the American Statistical Association, 92(437):192–198, 1997. [28] Christophe Delaloye, Guenael Moy, Fabienne De Bilbao, K Weber, S Baudois, Sven Haller, Aikaterini Xekardaki, Alessandra Canuto, U Giardini, K-O Lövblad, et al. Longitudinal analysis of cognitive performances and structural brain changes in late-life bipolar disorder. International journal of geriatric psychiatry, 26(12):1309–1318, 2011. [29] Gordana Derado, F DuBois Bowman, Lijun Zhang, and Alzheimer’s Disease Neuroimaging Initiative. Predicting brain activity using a bayesian spatial model. Statistical methods in medical research, 22(4):382–397, 2013. [30] Vasilis K Dertimanis, Minas D Spiridonakos, and Eleni N Chatzi. Data-driven uncer- tainty quantification of structural systems via b-spline expansion. Computers & Structures, 207:245–257, 2018. [31] [32] Ilaria Dimatteo, Christopher R. Genovese, and Robert E. Kass. Bayesian curve-fitting with free-knot splines. Biometrika, 88(4):1055–1071, 2001. Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. [33] Yingying Fan, Gareth M James, Peter Radchenko, et al. Functional additive regression. The Annals of Statistics, 43(5):2296–2325, 2015. [34] Yong Fan, Nematollah Batmanghelich, Chris M Clark, Christos Davatzikos, Alzheimer’s Disease Neuroimaging Initiative, et al. 
Spatial patterns of brain atrophy in mci patients, identified via high-dimensional pattern classification, predict subsequent cognitive decline. Neuroimage, 39(4):1731–1743, 2008. [35] Anders M Fjell, Kristine B Walhovd, Christine Fennema-Notestine, Linda K McEvoy, Donald J Hagler, Dominic Holland, James B Brewer, and Anders M Dale. One-year brain atrophy evident in healthy aging. Journal of Neuroscience, 29(48):15223–15231, 2009. [36] Chris Fraley and Adrian E Raftery. How many clusters? which clustering method? answers via model-based cluster analysis. The computer journal, 41(8):578–588, 1998. [37] Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American statistical Association, 97(458):611–631, 2002. [38] Chris Fraley and Adrian E Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of classification, 24(2):155–181, 2007. Jerome H. Friedman and Bernard W. Silverman. Flexible parsimonious smoothing and additive modeling. Technometrics, 31(1):3–21, 1989. [39] 136 [40] Thomas Frodl, Annette Schaub, Sandra Banac, Marketa Charypar, Markus Jäger, Petra Kümmler, Ronald Bottlender, Thomas Zetzsche, Christine Born, Gerda Leinsinger, et al. Reduced hippocampal volume correlates with executive dysfunctioning in major depression. Journal of Psychiatry and Neuroscience, 31(5):316, 2006. [41] Sylvia Frühwirth-Schnatter and Rudolf Frühwirth. Auxiliary mixture sampling with appli- cations to logistic models. Computational Statistics & Data Analysis, 51(7):3509–3528, 2007. [42] Alexander Genkin, David D Lewis, and David Madigan. Large-scale bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007. [43] Edward I George and Robert E McCulloch. Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993. [44] Edward I George and Robert E McCulloch. Approaches for bayesian variable selection. Statistica sinica, pages 339–373, 1997. [45] Robert B Gramacy, Nicholas G Polson, et al. Simulation-based regularized logistic regres- sion. Bayesian Analysis, 7(3):567–590, 2012. [46] Mark S Handcock, Adrian E Raftery, and Jeremy M Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301–354, 2007. [47] Trevor J. Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. 2009. [48] Nicholas A Heard, Christopher C Holmes, and David A Stephens. A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of bayesian hierarchical clustering of curves. Journal of the American Statistical Association, 101(473):18–29, 2006. [49] WJP Henneman, JD Sluimer, J Barnes, WM Van Der Flier, IC Sluimer, NC Fox, Ph Scheltens, H Vrenken, and F Barkhof. Hippocampal atrophy rates in alzheimer disease added value over whole brain volume measures. Neurology, 72(11):999–1007, 2009. [50] Chris Hinrichs, Vikas Singh, Lopamudra Mukherjee, Guofan Xu, Moo K Chung, Sterling C Johnson, Alzheimer’s Disease Neuroimaging Initiative, et al. Spatially augmented lpboosting for ad classification with evaluations on the adni dataset. Neuroimage, 48(1):138–149, 2009. [51] David B Hitchcock, George Casella, and James G Booth. Improved estimation of dissim- ilarities by presmoothing functional data. Journal of the American Statistical Association, 101(473):211–222, 2006. 
[52] James P Hobert and Charles J Geyer. Geometric ergodicity of gibbs and block gibbs samplers for a hierarchical random effects model. Journal of Multivariate Analysis, 67(2):414–430, 1998. 137 [53] Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volinsky. Bayesian model averaging: a tutorial. Statistical science, pages 382–401, 1999. [54] Chris C Holmes, Leonhard Held, et al. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian analysis, 1(1):145–168, 2006. [55] Xue Hua, Suh Lee, Igor Yanovsky, Alex D Leow, Yi-Yu Chou, April J Ho, Boris Gutman, Arthur W Toga, Clifford R Jack Jr, Matt A Bernstein, et al. Optimizing power to track brain degeneration in alzheimer’s disease and mild cognitive impairment with tensor-based morphometry: an adni study of 515 subjects. Neuroimage, 48(4):668–681, 2009. Jian Huang, Joel L. Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. Annals of Statistics, 38(4):2282–2313, 2010. [56] [57] Hemant Ishwaran, J Sunil Rao, et al. Spike and slab variable selection: frequentist and bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005. [58] Clifford R Jack, Ronald C Petersen, Yue Cheng Xu, Peter C O’Brien, Glenn E Smith, Robert J Ivnik, Bradley F Boeve, Stephen C Waring, Eric G Tangalos, and Emre Kokmen. Predic- tion of ad with mri-based hippocampal volume in mild cognitive impairment. Neurology, 52(7):1397–1397, 1999. [59] Clifford R Jack, Ronald C Petersen, Yuecheng Xu, Peter C O’Brien, Glenn E Smith, Robert J Ivnik, Eric G Tangalos, and Emre Kokmen. Rate of medial temporal lobe atrophy in typical aging and alzheimer’s disease. Neurology, 51(4):993–999, 1998. [60] Clifford R Jack, Ronald Carl Petersen, Y Xu, PC O’brien, Glenn E Smith, Robert J Ivnik, Bradley F Boeve, Eric George Tangalos, and E Kokmen. Rates of hippocampal atrophy correlate with change in clinical status in aging and ad. Neurology, 55(4):484–490, 2000. [61] Gareth M James. Generalized linear models with functional predictors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):411–432, 2002. [62] Wenxin Jiang. On the consistency of bayesian variable selection for high dimensional binary regression and classification. Neural computation, 18(11):2762–2776, 2006. [63] Wenxin Jiang et al. Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities. The Annals of Statistics, 35(4):1487–1511, 2007. Iain M Johnstone, Bernard W Silverman, et al. Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649, 2004. [64] [65] Panos Kalnis, Nikos Mamoulis, and Spiridon Bakiras. On discovering moving clusters in spatio-temporal data. In International Symposium on Spatial and Temporal Databases, pages 364–381. Springer, 2005. [66] Yuwon Kim, Jinseog Kim, and Yongdai Kim. Blockwise sparse regression. Statistica Sinica, pages 375–390, 2006. 138 [67] Slava Kisilevich, Florian Mansmann, Mirco Nanni, and Salvatore Rinzivillo. Spatio- temporal clustering. In Data mining and knowledge discovery handbook, pages 855–874. Springer, 2009. [68] Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of statistics, pages 1356–1378, 2000. [69] Minjung Kyung, Jeff Gill, Malay Ghosh, George Casella, et al. Penalized regression, standard [70] [71] errors, and bayesian lassos. Bayesian Analysis, 5(2):369–411, 2010. John W Lau and Peter J Green. 
Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16(3):526–558, 2007. Jonathan C Lau, Jason P Lerch, John G Sled, R Mark Henkelman, Alan C Evans, and Barry J Bedell. Longitudinal neuroanatomical changes determined by deformation-based morphometry in a mouse model of alzheimer’s disease. Neuroimage, 42(1):19–27, 2008. [72] Sang Han Lee, Alvin H Bachman, Donghyeon Yu, Johan Lim, Babak A Ardekani, Alzheimer’s Disease Neuroimaging Initiative, et al. Predicting progression from mild cog- nitive impairment to alzheimer’s disease using longitudinal callosal atrophy. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 2:68–74, 2016. [73] Chenlei Leng, Minh-Ngoc Tran, and David Nott. Bayesian adaptive lasso. Annals of the Institute of Statistical Mathematics, 66(2):221–244, 2014. [74] Kelvin K Leung, Josephine Barnes, Gerard R Ridgway, Jonathan W Bartlett, Matthew J Clarkson, Kate Macdonald, Norbert Schuff, Nick C Fox, Sebastien Ourselin, Alzheimer’s Disease Neuroimaging Initiative, et al. Automated cross-sectional and longitudinal hip- pocampal volume measurement in mild cognitive impairment and alzheimer’s disease. Neu- roimage, 51(4):1345–1359, 2010. Jiahan Li, Zhong Wang, Runze Li, and Rongling Wu. Bayesian group lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies. The annals of applied statistics, 9(2):640, 2015. [75] [76] Qing Li, Nan Lin, et al. The bayesian elastic net. Bayesian analysis, 5(1):151–170, 2010. [77] Yingjie Li. High Dimensional Classification for Spatially Dependent Data with Application to Neuroimaging. Michigan State University, 2018. [78] Feng Liang, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger. Mixtures of g priors for bayesian variable selection. Journal of the American Statistical Association, 103(481):410–423, 2008. [79] Anastasia Lykou and Ioannis Ntzoufras. On bayesian lasso variable selection and the speci- fication of the shrinkage parameter. Statistics and Computing, 23(3):361–390, 2013. 139 [80] Benoît Magnin, Lilia Mesrob, Serge Kinkingnéhun, Mélanie Pélégrini-Issac, Olivier Col- liot, Marie Sarazin, Bruno Dubois, Stéphane Lehéricy, and Habib Benali. Support vector machine-based classification of alzheimer’s disease from whole-brain anatomical mri. Neu- roradiology, 51(2):73–83, 2009. [81] Atreyee Majumder. Variable Selection in High-Dimensional Setup: A Detailed Illustration Through Marketing and MRI Data. Michigan State University, 2017. [82] Gertraud Malsiner-Walli and Helga Wagner. Comparing spike and slab priors for bayesian variable selection. arXiv preprint arXiv:1812.07259, 2018. [83] Peter McCullagh and Jie Yang. Stochastic classification models. In International Congress of Mathematicians, volume 3, page 72, 2006. [84] CR McDonald, LK McEvoy, L Gharapetian, C Fennema-Notestine, DJ Hagler, D Holland, A Koyama, JB Brewer, AM Dale, Alzheimer’s Disease Neuroimaging Initiative, et al. Re- gional rates of neocortical atrophy from normal aging to early alzheimer disease. Neurology, 73(6):457–465, 2009. [85] Mathew W. McLean, Fabian Scheipl, Giles Hooker, Sonja Greven, and David Ruppert. Bayesian functional generalized additive models with sparsely observed covariates. arXiv preprint arXiv:1305.3585, 2013. [86] Mario Medvedovic, Ka Yee Yeung, and Roger Eugene Bumgarner. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics, 20(8):1222–1232, 2004. 
[87] Lukas Meier, Sara Van De Geer, and Peter Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008. [88] Volodymyr Melnykov, Ranjan Maitra, et al. Finite mixture models and model-based cluster- ing. Statistics Surveys, 4:80–116, 2010. [89] Alan Miller. Subset selection in regression. Chapman and Hall/CRC, 2002. [90] Chandan Misra, Yong Fan, and Christos Davatzikos. Baseline and longitudinal patterns of brain atrophy in mci patients, and their use in prediction of short-term conversion to ad: results from adni. Neuroimage, 44(4):1415–1422, 2009. [91] Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988. [92] Christian Montag and Martin Reuter. Internet addiction: Neuroscientific approaches and therapeutical interventions. 2015. [93] HANS-GEORG MÜLLER. Functional modelling and classification of longitudinal data. Scandinavian Journal of Statistics, 32(2):223–240, 2005. [94] Naveen Naidu Narisetty, Xuming He, et al. Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817, 2014. 140 [95] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8(May):1145–1164, 2007. [96] Trevor Park and George Casella. The bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. [97] Nicholas G Polson, James G Scott, and Jesse Windle. Bayesian inference for logistic models using pólya–gamma latent variables. Journal of the American statistical Association, 108(504):1339–1349, 2013. [98] Nicholas G Polson, James G Scott, and Jesse Windle. The bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):713–733, 2014. [99] Chengxuan Qiu, Miia Kivipelto, and Eva von Strauss. Epidemiology of alzheimer’s dis- ease: occurrence, determinants, and strategies toward intervention. Dialogues in clinical neuroscience, 11(2):111, 2009. [100] Olivier Querbes, Florent Aubry, Jérémie Pariente, Jean-Albert Lotterie, Jean-François Dé- monet, Véronique Duret, Michèle Puel, Isabelle Berry, Jean-Claude Fort, Pierre Celsis, et al. Early diagnosis of alzheimer’s disease using cortical thickness: impact of cognitive reserve. Brain, 132(8):2036–2047, 2009. [101] Adrian E Raftery and Nema Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006. [102] Naftali Raz and Kristen M Kennedy. A systems approach to the aging brain: Neuroanatomic changes, their modifiers, and cognitive correlates. In Imaging the aging brain, pages 43–70. Oxford University Press, 2009. [103] Naftali Raz, Ulman Lindenberger, Karen M Rodrigue, Kristen M Kennedy, Denise Head, Adrienne Williamson, Cheryl Dahle, Denis Gerstorf, and James D Acker. Regional brain changes in aging healthy adults: general trends, individual differences and modifiers. Cere- bral cortex, 15(11):1676–1689, 2005. [104] Naftali Raz and Karen M Rodrigue. Differential aging of the brain: patterns, cognitive correlates and modifiers. Neuroscience & Biobehavioral Reviews, 30(6):730–748, 2006. [105] H Rusinek, Y Endo, S De Santi, D Frid, W-H Tsui, S Segal, A Convit, and MJ de Leon. Atrophy rate in medial temporal lobe during progression of alzheimer disease. Neurology, 63(12):2354–2359, 2004. 
[106] Fabian Scheipl, Ludwig Fahrmeir, and Thomas Kneib. Spike-and-slab priors for function selection in structured additive regression models. Journal of the American Statistical Association, 107(500):1518–1532, 2012. [107] Fabian Scheipl, Thomas Kneib, and Ludwig Fahrmeir. Penalized likelihood and bayesian function selection in regression models. AStA Advances in Statistical Analysis, 97(4):349– 385, 2013. 141 [108] Flávio Luiz Seixas, Bianca Zadrozny, and Débora Christina Muchaluat Saade. A bayesian network decision model for supporting the di- agnosis of dementia, alzheimer’s disease and mild cognitive impairment. Computers in biology and medicine, 51:140–158, 2014. Jerson Laks, Aura Conci, [109] Philip Shaw, Noor J Kabani, Jason P Lerch, Kristen Eckstrand, Rhoshel Lenroot, Nitin Gog- tay, Deanna Greenstein, Liv Clasen, Alan Evans, Judith L Rapoport, et al. Neurodevelopmen- tal trajectories of the human cerebral cortex. Journal of Neuroscience, 28(14):3586–3594, 2008. [110] Li Shen, Yuan Qi, Sungeun Kim, Kwangsik Nho, Jing Wan, Shannon L Risacher, Andrew J Saykin, et al. Sparse bayesian learning for identifying imaging biomarkers in ad prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 611–618. Springer, 2010. [111] Guiling Shi. Bayesian Variable Selection: Extensions of Nonlocal Priors. Michigan State University. Statistics, 2017. [112] Russell T Shinohara, Ciprian M Crainiceanu, Brian S Caffo, and Daniel S Reich. Longi- tudinal analysis of spatiotemporal processes: a case study of dynamic contrast-enhanced magnetic resonance imaging in multiple sclerosis. 2011. [113] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013. [114] Martha Skup, Hongtu Zhu, and Heping Zhang. Multiscale adaptive marginal analysis of longitudinal neuroimaging data with time-varying covariates. Biometrics, 68(4):1083–1092, 2012. [115] Michael Smith and Robert Kohn. Nonparametric regression using bayesian variable selec- tion. Journal of Econometrics, 75(2):317–343, 1996. [116] Richard P. Stanley. Enumerative Combinatorics: Volume 1. 1986. [117] Adam J Suarez, Subhashis Ghosal, et al. Bayesian estimation of principal components for functional data. Bayesian Analysis, 12(2):311–333, 2017. [118] Tero Tapiola, Corina Pennanen, Mia Tapiola, Susanna Tervo, Miia Kivipelto, Tuomo Hän- ninen, Maija Pihlajamäki, Mikko P Laakso, Merja Hallikainen, Anne Hämäläinen, et al. Mri of hippocampus and entorhinal cortex in mild cognitive impairment: a follow-up study. Neurobiology of aging, 29(1):31–38, 2008. [119] Paul M Thompson, Kiralee M Hayashi, Greig I De Zubicaray, Andrew L Janke, Stephen E Rose, James Semple, Michael S Hong, David H Herman, David Gravano, David M Doddrell, et al. Mapping hippocampal and ventricular change in alzheimer disease. Neuroimage, 22(4):1754–1766, 2004. [120] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996. 142 [121] Angelika Van Der Linde. A bayesian latent variable approach to functional principal compo- nents analysis with binary and count data. AStA Advances in Statistical Analysis, 93(3):307– 333, 2009. [122] Geert Verbeke and Geert Molenberghs. A model for longitudinal data. Linear mixed models for longitudinal data, pages 19–29, 2000. [123] Xuejing Wang, Bin Nan, Ji Zhu, and Robert Koeppe. 
Regularized 3d functional regression for brain image data via haar wavelets. The annals of applied statistics, 8(2):1045, 2014. [124] Jennifer L Whitwell, Scott A Przybelski, Stephen D Weigand, David S Knopman, Bradley F Boeve, Ronald C Petersen, and Clifford R Jack Jr. 3d maps from multiple mri illustrate changing atrophy patterns as subjects progress from mild cognitive impairment to alzheimer’s disease. Brain, 130(7):1777–1786, 2007. [125] Jennifer L Whitwell, Maria M Shiung, SA Przybelski, Stephen D Weigand, David S Knop- man, Bradley F Boeve, Ronald C Petersen, and CR Jack. Mri patterns of atrophy associated with progression to ad in amnestic mild cognitive impairment. Neurology, 70(7):512–520, 2008. [126] Jesse Windle, Nicholas G Polson, and James G Scott. Sampling polya-gamma random variates: alternate and approximate techniques. arXiv preprint arXiv:1405.0506, 2014. [127] Mark William Woolrich, Mark Jenkinson, J Michael Brady, and Stephen M Smith. Fully bayesian spatio-temporal modeling of fmri data. IEEE transactions on medical imaging, 23(2):213–231, 2004. [128] Xiaofan Xu, Malay Ghosh, et al. Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936, 2015. [129] Lan Xue. Consistent variable selection in additive models. 2009. [130] Wenlu Yang, Xinyun Chen, Hong Xie, and Xudong Huang. Ica-based automatic classification In Life System Modeling and Intelligent of magnetic resonance images from adni data. Computing, pages 340–347. Springer, 2010. [131] Ka Yee Yeung, Chris Fraley, Alejandro Murua, Adrian E. Raftery, and Walter L. Ruzzo. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10):977–987, 2001. [132] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006. [133] Daoqiang Zhang, Dinggang Shen, Alzheimer’s Disease Neuroimaging Initiative, et al. Pre- dicting future clinical changes of mci patients using longitudinal and multimodal biomarkers. PloS one, 7(3):e33182, 2012. 143 [134] Lin Zhang, Veerabhadran Baladandayuthapani, Bani K Mallick, Ganiraju C Manyam, Pa- tricia A Thompson, Melissa L Bondy, and Kim-Anh Do. Bayesian hierarchical structured variable selection methods with application to molecular inversion probe studies in breast cancer. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(4):595–620, 2014. [135] Xiaoxi Zhang, Tim Johnson, Rod Little, and Yue Cao. Longitudinal image analysis of tumor/brain change in contrast uptake induced by radiation. 2009. [136] Hongxiao Zhu, Marina Vannucci, and Dennis D Cox. A bayesian hierarchical model for classification with selection of functional predictors. Biometrics, 66(2):463–473, 2010. [137] Vadim Zipunnikov, Sonja Greven, Haochang Shou, Brian S. Caffo, Daniel S. Reich, and Ciprian M. Crainiceanu. Longitudinal high-dimensional principal components analysis with application to diffusion tensor imaging of multiple sclerosis. Ann. Appl. Stat., 8(4):2175– 2202, 12 2014. [138] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. 144