BAYESIAN VARIABLE SELECTION AND FUNCTIONAL DATA ANALYSIS: APPLICATION TO BRAIN IMAGING

By

Asish Kumar Banik

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2019

ABSTRACT

BAYESIAN VARIABLE SELECTION AND FUNCTIONAL DATA ANALYSIS: APPLICATION TO BRAIN IMAGING

By

Asish Kumar Banik

High-dimensional statistics is one of the most studied topics in the field of statistics. The most interesting problem to arise in the last 15 years is variable selection or subset selection. Variable selection is a strong statistical tool that can be explored in functional data analysis. In the first part of this thesis, we implement a Bayesian variable selection method for automatic knot selection. We propose a spike-and-slab prior on knots and formulate a conjugate stochastic search variable selection for significant knots. The computation is substantially faster than existing knot selection methods, as we use Metropolis-Hastings algorithms and a Gibbs sampler for estimation. This work focuses on a single nonlinear covariate, modeled as regression splines. In the next stage, we study Bayesian variable selection in additive models with high-dimensional predictors. The selection of nonlinear functions in models is highly important in recent research, and the Bayesian method of selection has more advantages than contemporary frequentist methods. Chapter 2 examines Bayesian sparse group lasso theory based on spike-and-slab priors to determine its applicability for variable selection and function estimation in nonparametric additive models.

The primary objective of Chapter 3 is to build a classification method using longitudinal volumetric magnetic resonance imaging (MRI) data from five regions of interest (ROIs). A functional data analysis method is used to handle the longitudinal measurement of ROIs, and the functional coefficients are later used in the classification models. We propose a Pólya-gamma augmentation method to classify normal controls and diseased patients based on functional MRI measurements. We obtain fast posterior sampling by avoiding the slow and complicated Metropolis-Hastings algorithm. Our main motivation is to determine the important ROIs that have the highest separating power to classify our dichotomous response. We compare the sensitivity, specificity, and accuracy of the classification based on single ROIs and with various combinations of them. We obtain a sensitivity of over 85% and a specificity of around 90% for most of the combinations.

Next, we work with Bayesian classification and selection methodology. The main goal of Chapter 4 is to employ longitudinal trajectories in a significant number of sub-regional brain volumetric MRI data as statistical predictors for Alzheimer's disease (AD) classification. We use logistic regression in a Bayesian framework that includes many functional predictors. The direct sampling of regression coefficients from the Bayesian logistic model is difficult due to its complicated likelihood function. In high-dimensional scenarios, the selection of predictors is paramount with the introduction of either spike-and-slab priors, non-local priors, or Horseshoe priors. We seek to avoid the complicated Metropolis-Hastings approach and to develop an easily implementable Gibbs sampler. In addition, the Bayesian estimation provides proper estimates of the model parameters, which are also useful for building inference.
Another advantage of working with logistic regression is that it calculates the log odds of relative risk for AD compared to normal control based on the selected longitudinal predictors, rather than simply classifying patients based on cross-sectional estimates. Ultimately, however, we combine approaches and use a probability threshold to classify individual patients. We employ 49 functional predictors consisting of volumetric estimates of brain sub-regions, chosen for their established clinical significance. Moreover, the use of spike-and-slab priors ensures that many redundant predictors are dropped from the model.

Finally, we present a new approach of Bayesian model-based clustering for spatiotemporal data in Chapter 5. A simple linear mixed effects model (LME) derived from a functional model is used to model spatiotemporal cerebral white matter data extracted from healthy aging individuals. The LME provides us with prior information for the spatial covariance structure and brain segmentation based on white matter intensity. This motivates us to build stochastic model-based clustering to group voxels considering their longitudinal and location information. The cluster-specific random effect causes correlation among repeated measures. The problem of finding partitions is dealt with by imposing a prior structure on cluster partitions in order to derive a stochastic objective function.

Copyright by ASISH KUMAR BANIK 2019

I dedicate this dissertation to my late grandparents for their endless love, support and blessings.

ACKNOWLEDGEMENTS

I would like to first thank my supervisor Dr. Tapabrata Maiti for his continuous support, encouragement and assistance. His expertise and experience helped me to formulate the research problems and kept me motivated all the time. I am also grateful to the Department of Statistics and Probability for providing me the opportunity to grow my knowledge over the years.

Next, I would like to thank Dr. Alla Sikorskii. Dr. Sikorskii supported me as a Research Assistant with the College of Nursing for over four years. I have received a great deal of support from her to build my expertise around clinical trials research. I would also like to thank Professor Dr. Andrew Bender for providing us with the brain image data and helping me to understand the nitty-gritty details of human brain mechanisms. His research questions motivated me to build various statistical methods for this dissertation.

My sincerest gratitude to the other committee members, Dr. R.V. Ramamoorthi and Dr. Seungik Baek, for providing important suggestions and invaluable advice. I also would like to thank all Department of Statistics professors for their valuable lessons through various courses.

Last but not least, I would like to devote my deepest gratitude to my parents. They believed in me like nobody else. I would like to thank my brother and sister-in-law for their encouragement throughout the writing of this dissertation. My final gratitude is for my friends and peers. I always received much-needed important suggestions from them throughout my research journey.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Variable selection
  1.1.1 Bayesian Variable Selection and Spike-and-Slab Prior
  1.1.2 Penalization Methods
  1.1.3 Bayesian Group Lasso
1.2 Bayesian logistic regression
1.3 Functional data smoothing
1.4 Bayesian model based clustering
1.5 ADNI Longitudinal Data
CHAPTER 2 NONPARAMETRIC FUNCTION ESTIMATION USING BAYESIAN VARIABLE SELECTION
2.1 Univariate function estimation
  2.1.1 Introduction
  2.1.2 Variable selection using Spike-Slab Prior
    2.1.2.1 Model description and Prior
    2.1.2.2 The Gibbs Sampler
    2.1.2.3 Updating β and σ²
  2.1.3 Regression Splines
    2.1.3.1 Simulated example
    2.1.3.2 Chloride concentration Data
    2.1.3.3 Discussion
2.2 Functional additive model estimation with bi-level selection
  2.2.1 Introduction
  2.2.2 Bayesian Sparse group Lasso in nonparametric additive models
  2.2.3 Simulated Example
  2.2.4 Conclusion
CHAPTER 3 BAYESIAN CLASSIFICATION OF ALZHEIMER'S DISEASE STAGES FROM LONGITUDINAL VOLUMETRIC MRI DATA
3.1 Introduction
3.2 Methodology
  3.2.1 Smoothing functional data
  3.2.2 Classification using Pólya-Gamma Augmentation
3.3 Data Description
3.4 Application results
  3.4.1 Classification using Hippocampus
  3.4.2 Classification with Single ROI
  3.4.3 Classification using combination of ROIs
  3.4.4 Conclusion
3.5 Discussion
CHAPTER 4 BAYESIAN PENALIZED MODEL FOR CLASSIFICATION AND SELECTION OF FUNCTIONAL PREDICTORS USING LONGITUDINAL MRI DATA FROM ADNI
4.1 Introduction
4.2 Bayesian Variable selection
  4.2.1 Spike-Slab prior
  4.2.2 Bayesian Group lasso
4.3 Functional smoothing for longitudinal data
4.4 Simultaneous Classification of binary response with selection of functional predictors
  4.4.1 Classification using Pólya-Gamma Augmentation
  4.4.2 Selection using Bayesian Group Lasso
4.5 Median thresholding and Theoretical properties
  4.5.1 Marginal Prior for β_j
  4.5.2 Median thresholding as posterior estimates
  4.5.3 Posterior Consistency
4.6 Simulation results
4.7 Application on ADNI MRI data
4.8 An alternate modeling Proposal
4.9 Discussion
CHAPTER 5 BAYESIAN SPATIOTEMPORAL CLUSTERING MODEL FOR ANALYZING WHITE MATTER DATA
5.1 Spatiotemporal Linear mixed effects modeling for the distribution of cerebral white matter on MRI scans
  5.1.1 Introduction
  5.1.2 Methods
    5.1.2.1 Voxel-wise linear mixed effect model
    5.1.2.2 Clustering voxels into homogeneous regions using K-means algorithm
    5.1.2.3 Cluster based Spatiotemporal linear mixed effect models
  5.1.3 WM Data
  5.1.4 Results
  5.1.5 Discussion
5.2 Bayesian Model based clustering in application to Spatiotemporal data
  5.2.1 Introduction
  5.2.2 Methodology
    5.2.2.1 The Ewens-Pitman Prior on Cluster partitions
    5.2.2.2 Spatiotemporal Linear Mixed Effects model
    5.2.2.3 Objective function derivation
CHAPTER 6 FUTURE WORK
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Log(ISE) comparison with Kernel methods of approximation
Table 3.1: Patients Baseline Characteristics
Table 3.2: Data Characteristics
Table 3.3: Classification performance using single ROI
Table 4.1: Classification and selection performance Table
Table 4.2: Patients Baseline Characteristics
Table 5.1: Subject Characteristics
Table A1: Combination two ROIs
Table A2: Combination three ROIs
Table A3: Combination four ROIs

LIST OF FIGURES

Figure 1.1: Spike-and-slab prior
Figure 1.2: Horseshoe prior
Figure 1.3: Lasso vs Ridge solution path
Figure 1.4: Threshold functions for orthonormal design
Figure 1.5: Comparison between Lasso, Bayesian Lasso and Ridge regression solution paths
Figure 1.6: B-spline smoothing under varying orders and knots
Figure 2.1: Nonparametric function estimation comparison between variable selection, linear regression, and fixed knot approximation
Figure 2.2: Example 1 simulation plot
Figure 2.3: Example 1 simulation plot
Figure 2.4: Example 2 simulation plot
Figure 2.5: Example 2 simulation plot
Figure 3.1: Longitudinal volume of ROIs for dementia patients and normal controls. The first five plots (plots in the first column and the top two plots in the second column) are scaled ROIs of interest. The last plot has MMSE scores of patients. On the X-axis, we have the number of visits for patients. Blue lines are for the dementia group and orange lines are for the normal group. Thin lines represent each patient's data, and thick lines are the pooled mean for the AD and CN groups.
Figure 3.2: In (a), the colored points represent hippocampus values for each time point, and the black dot signifies the mean value of the smoothed curve for that time point. Orange is for CN and blue is for AD. In (b), the points are the predicted class probability for patients in the test data set. (c) presents the spread of the estimated regression coefficients from the Bayesian logistic model in 100 repeated runs. (d) shows the ROC curve after classifying patients using the hippocampus only.
Figure 4.1: Plots based on Example 1
Figure 4.2: Brain volume changes of Left Hippocampus, Left Lateral Orbitofrontal cortex, and Left Posterior Cingulate over time for Normal and Dementia patients
Figure 4.3: Acceptance probability of MCMC sample for Left Hippocampus and Left Lateral Orbitofrontal brain regions
Figure 4.4: Pictorial representation of selected brain ROIs discriminating diseased group from normal control
Figure 5.1: Change of WM across Z-axis from top-left to bottom-right, Z-slices 15, 30, 45, 60
Figure 5.2: White matter histogram with density plot
Figure 5.3: White matter measurement greater than 0.5
Figure 5.4: Voxels' directional density
Figure 5.5: 2-dimensional distributions of T1 & T2 imaging sequence
Figure 5.6: Residual vs predicted plot
Figure 5.7: Within sum of squares by number of clusters
Figure 5.8: Segmentation of brain into similar regions
Figure 5.9: Fisher's scoring method convergence
Figure 5.10: Estimated parameters plot for all clusters
Figure 5.11: Histogram of Intercepts
Figure 5.12: Histogram of Gender coefficients

CHAPTER 1

INTRODUCTION

1.1 Variable selection

Statistical modeling with large numbers of predictors is one of the most studied topics in the field of statistics. The most interesting problem to arise in the last 15 years is variable selection or subset selection. Statisticians are excited about variable selection for various reasons. Most often, when a model contains a large number of predictors, we are not satisfied with prediction accuracy, interpretation, over-fitting, estimation, or testing of regression coefficients, among others. Numerous variable selection methods have been proposed in the Bayesian and non-Bayesian literature, such as forward and backward stepwise selection, shrinkage methods, principal components regression, and partial least squares. The problem of choosing an optimal model from a subset of all possible models has led to a variety of algorithms and methodology building. The following discusses Bayesian variable selection and its application to various nonparametric regression problems.

1.1.1 Bayesian Variable Selection and Spike-and-Slab Prior

We start by defining the setup,

    y = Σ_{i=1}^p X_i β_i + ε

where p ≫ n and y is an n × 1 vector. It is mathematically intuitive that the parameter vector β can only be estimated from n observations. If β is sparse in terms of the l0 norm, i.e. ||β||_0 = |{i : β_i ≠ 0}| < n, then we can achieve reasonably good estimates. Various authors have worked on this problem. Miller, 2002[89] used a spike-and-slab prior with a spike on the zero coefficients and a slab part on the estimable coefficients. A spike-and-slab prior was introduced by [43], where β_i follows a normal mixture distribution:

    β_i | γ_i ∼ (1 − γ_i) N(0, τ_i²) + γ_i N(0, c_i² τ_i²)

where γ_i is a latent variable that controls the number of significant coefficients or model size.

Figure 1.1: Spike-and-slab prior (plot obtained from on-line resources)
Figure 1.2: Horseshoe prior (plot obtained from on-line resources)

The motivation for the frequently used Dirac-distributed spike-and-slab prior came from the above structure. We observe a point mass on the spike part when τ_i² tends to 0. Model complexity is a required property. The degree of sparseness can be set based on an optimality criterion.
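A tiny simulation, purely for intuition, showing how the two mixture components of this prior separate negligible and estimable coefficients; the values of τ_i, c_i, and the inclusion probability are arbitrary choices of ours, not values used in the dissertation.

    import numpy as np

    rng = np.random.default_rng(6)
    p, tau, c = 1000, 0.01, 100.0                        # illustrative spike sd and slab inflation factor

    gamma = rng.binomial(1, 0.5, size=p)                 # latent inclusion indicators gamma_i
    beta = np.where(gamma == 1,
                    rng.normal(0.0, c * tau, size=p),    # slab: N(0, c^2 tau^2)
                    rng.normal(0.0, tau, size=p))        # spike: N(0, tau^2)

    # Spike draws are tightly concentrated near zero; slab draws are far more spread out.
    print("spike sd:", beta[gamma == 0].std().round(4), " slab sd:", beta[gamma == 1].std().round(4))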
Burnham and Anderson, 2004[3] mentioned the best predictive ability as an optimal criterion in their book, "Model Selection and Multi-model Inference." As we have noted, P(γ_i = 1) controls the sparsity of the model; George & McCulloch, 1993[43] proposed to use a Bernoulli distribution with 0.5 probability. Bayesian variable selection is not limited to the spike-and-slab prior. For example, Liang et al., 2008 [78] used a mixture of g-priors for variable selection. Carvalho et al., 2010[18] introduced the horseshoe prior as a global-local shrinkage prior. The horseshoe is another popular prior choice for variable selection, where a half-Cauchy prior is used to estimate significant predictors. We have seen that this method penalizes the strong signals less, and parameters are estimated with fewer constraints. However, it comes with a computational cost, as in most cases it is difficult to determine a proper posterior distribution. Casella and Moreno, 2006[21] presented an objective Bayesian criterion for variable selection. In general, the literature on variable selection is vast, but we focused on the spike-and-slab prior in this study.

The application of a spike-and-slab prior can be extended to various problems, such as nonparametric function estimation or curve estimation. Smith and Kohn, 1996[115] applied a similar approach for univariate nonparametric regression. We closely studied their work for function estimation. Although their prior structure does not directly resemble a spike-and-slab setup, they also used an indicator variable γ, such that

    β_γ ∼ N(0, c σ² (X_γ' X_γ)⁻¹),   p(γ_i) = π_i

where X_γ is the design matrix with the columns whose regression coefficients are nonzero. Once we find p(γ | y), it is easy to sample using Gibbs sampling. Smith and Kohn, 1996[115] used this approach to select significant knots for an unknown function f(x) approximated by a cubic spline. Later, Ishwaran and Rao, 2005[57] started reshaping the spike-and-slab prior. They described the following setup in their paper:

    (Y_i | x_i, β, σ²) ind∼ N(x_i' β, σ²), (i = 1, ..., n)
    (β | γ) ∼ N(0, Γ),
    γ ∼ π(dγ),
    σ² ∼ µ(dσ²)

The above formulation is a general setup, where Γ is a diagonal matrix. The authors developed continuous bimodal priors which have separate spike and slab parts. Malsiner-Walli and Wagner[82] performed a thorough review of spike-and-slab priors. They argued that the selected indicator variables are independent conditional on the prior inclusion probability, but that marginal dependence is not a logical condition. Instead, they used an individual inclusion probability for each regression coefficient.

1.1.2 Penalization Methods

Variable selection has been extensively studied in the non-Bayesian literature too. Lasso (Tibshirani, 1996 [120]) is the most popular variable selection technique, initially developed for linear regression models. If p > n, the ordinary least squares estimator is not unique, and Tibshirani introduced the idea of l1 penalization on the regression coefficients, as formulated below:

    β̂_lasso(λ) = arg min_β { (1/n) ||y − Xβ||_2² + λ ||β||_1 }

where λ ≥ 0 is the penalty parameter. This estimator can drag many coefficients towards zero, depending on the sparsity of the design matrix and the extent of the penalization. The solution is a convex optimization problem and easy to implement. Lasso can perform variable selection in the sense that it can provide an exact zero solution for some components.
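To see that exact zeros really occur, here is a short sketch fitting the lasso on simulated sparse data with scikit-learn; the penalty value and problem dimensions are arbitrary illustrative choices, not settings from the dissertation.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    n, p = 100, 20
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [2.0, -1.5, 1.0]                     # only three active coefficients
    y = X @ beta_true + 0.5 * rng.standard_normal(n)

    # The l1 penalty sets many estimated coefficients exactly to zero.
    fit = Lasso(alpha=0.1).fit(X, y)
    print("indices of nonzero coefficients:", np.flatnonzero(fit.coef_))

Only a handful of coefficients survive the penalty, which is exactly the behavior exploited for variable selection.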
On the other hand, another old and popular penalization method, ridge regression, shrinks regression coefficients by penalizing their size. The ridge coefficients minimize a penalized residual sum of squares,

    β̂_ridge(λ) = arg min_β { (1/n) ||y − Xβ||_2² + λ ||β||_2² }

Similarly, λ, the complexity parameter, controls the amount of shrinkage. Ridge regression performs proportional shrinkage, and lasso truncates coefficients at zero. To understand the relationship, we use the famous picture of the lasso and ridge regression solution paths for two parameters in Figure 1.3. The solution path has elliptical contours centered around the OLS estimates. The constraint region for ridge is the disk, while for lasso it is the diamond. The solution is obtained when the elliptical contour first hits the constraint region. As the diamond has corners, lasso can have one solution exactly equal to zero. For p > 2, we can have more solutions exactly equal to zero. In the case of an orthonormal design matrix, we can have explicit solutions for lasso and ridge: ridge shrinks proportionally, while lasso uses soft-thresholding, as depicted in Figure 1.4.

Figure 1.3: Lasso vs Ridge solution path (plot obtained from on-line resources)
Figure 1.4: Threshold functions for orthonormal design (plot obtained from on-line resources)

Lasso and ridge regression estimates can both be viewed as Bayes estimates under different priors. Tibshirani, 1996 [120] suggested that lasso estimates can be obtained if the regression parameters have Laplace or double-exponential priors. Park and Casella, 2008[96] provided an easy-to-compute Gibbs sampler for the Bayesian lasso by introducing a scale mixture of normal priors on the regression coefficients:

    β | σ², τ_1², ..., τ_p² ∼ N_p(0, σ² D_τ),   D_τ = diag(τ_1², ..., τ_p²)
    τ_1², ..., τ_p² ∼ ∏_{i=1}^p (λ²/2) exp(−λ² τ_i²/2) dτ_i²
    π(σ²) ∝ 1/σ²

Although this method helps us computationally, we do not obtain exact zero estimates from the posterior median solution. Figure 1.5 comes from Park and Casella, 2008's paper, where it is clearly visible that by using lasso we can have a zero solution, in contrast to Bayesian lasso or ridge. The Bayesian lasso provides very small estimates for insignificant predictors, and one needs to impose one more level of thresholding to drop redundant variables from the model. This problem can be solved using a point-mass spike-and-slab prior on the regression coefficients. Xu and Ghosh, 2015[128] used this phenomenon for the Bayesian group lasso, which we discuss in the next section.

Figure 1.5: Comparison between Lasso, Bayesian Lasso and Ridge regression solution paths (plot obtained from on-line resources)

1.1.3 Bayesian Group Lasso

Grouping structures among predictors occur in many situations, such as when a model has categorical variables or functional predictors. The categorical predictors are represented by a group of dummy variables and, in the case of functional predictors, sometimes a group of basis functions is used in the model. Gene expression data forms groups in the form of gene pathways, and these groups have a natural correlation structure. The lasso solution is not entirely satisfactory in such scenarios, as it only selects individual predictors instead of the whole group. It also depends on how the dummy variables are orthogonalized. To deal with this problem, Yuan et al. [132] introduced the 'group lasso', which can select important grouped variables as a whole. Consider this model:

    Y_{n×1} = Σ_{g=1}^G X_g β_g + ε

where ε follows a normal distribution and β_g is the g-th grouped predictor among G total group variables. The group size can vary. Yuan et al.
[132] proposed group lasso, which is an extension of lasso, to handle grouped variables. The group lasso solution for linear regression is defined as

    min_{β_g} ||Y − Σ_{g=1}^G X_g β_g||_2² + λ Σ_{g=1}^G ||β_g||_2

In this dissertation, we deal with a generalized linear model with binary response. Hence, we are interested in how lasso and group lasso have been developed for logistic regression. The lasso penalty can be applied to logistic regression or multinomial logistic regression. Genkin et al. [42] applied the l1 penalty to their sparse model and used a Laplace prior to avoid overfitting. Instead of the residual sum of squares, they used a negative log-likelihood criterion and MAP estimates of the regression parameters. The first introduction of the group lasso penalty for the logistic regression model was observed in Kim et al. [66], where they proposed a gradient descent algorithm for estimation. Meier et al. [87] presented group lasso for a "large p" setting, which minimizes the negative log-likelihood function with a group penalty. The group lasso solution in their paper is defined as follows:

    min_{β_g} { − Σ_{i=1}^n [ y_i Σ_{g=1}^G X_{i,g} β_g − log(1 + exp(Σ_{g=1}^G X_{i,g} β_g)) ] + λ Σ_{g=1}^G √m_g ||β_g||_2 }

where m_g is the size of β_g. They propose a block coordinate gradient descent algorithm to find the solution to this minimization problem.

Despite the more common use of frequentist group lasso methods, Bayesian approaches may be more appropriate for the same problem. This is because standard lasso estimators do not provide meaningful standard errors that can be used for hypothesis testing or constructing confidence intervals. The lasso estimator has a complex limiting distribution, and inference based on it is complicated (Knight et al. [68], Chatterjee et al. [23]). On the other hand, a Bayesian prior-based formulation of lasso can provide reliable standard errors of estimates by obtaining the MAP estimators. In his seminal lasso paper, Tibshirani[120] stated that the posterior mode with an independent double-exponential prior for the regression coefficients is the same as a lasso estimator. Later, Park and Casella[96] developed a highly efficient Gibbs sampler for the Bayesian lasso by introducing scale mixture priors for the regression coefficients. The regression coefficients follow a normal prior, and the variance components of the normal distribution follow a Gamma prior, giving us a highly efficient way of posterior sampling. Based on Park and Casella's[96] work, Kyung et al.[69] provided a Bayesian hierarchical model for the Bayesian group lasso. As in penalized linear regression with grouped variables, the conditional prior of β given σ² can be written as

    π(β | σ²) ∝ exp( −(λ/σ) Σ_{g=1}^G ||β_g||_2 )

Kyung et al.[69] expressed this same prior as

    β_g | τ_g², σ² ∼ N_{m_g}(0, τ_g² σ² I_{m_g})
    τ_g² ∼ Gamma((m_g + 1)/2, λ²/2),   g = 1, ..., G

Li et al., 2015[75] worked with a high-dimensional varying-coefficient model equipped with Bayesian group lasso. They obtained group lasso estimators as posterior mode estimates under multivariate i.i.d. Laplace priors on the regression parameters. Although it is highly computationally efficient, a significant disadvantage of Park and Casella's[96] normal mixture prior setup is that it does not ensure an exact zero solution for the regression coefficients. This is also the case if we obtain the posterior mean or median from the prior setup by Kyung et al.[69] – it will not provide exact 0 estimates for the β_g's. Therefore, to impose sparsity at the group level, Xu et al.
[128] proposed a multivariate zero-inflated mixture prior for each β_g. The following is a hierarchical structure with an independent spike-and-slab prior for each β_g:

    Y | X, β, σ² ∼ N(Xβ, σ² I)
    β_g | τ_g², σ² ∼ (1 − π_0) N_{m_g}(0, σ² τ_g² I_{m_g}) + π_0 δ_0(β_g),   g = 1, ..., G
    τ_g² ∼ Gamma((m_g + 1)/2, λ²/2),   g = 1, ..., G
    σ² ∼ IG(α, γ)
    π_0 ∼ Beta(a, b)

where δ_0(β_g) denotes a point mass at 0. The mixing probability π_0 can be defined as a function of the number of predictors to impose more sparsity as the feature size increases. The choice of λ is very critical to control sparsity. Large values of λ will produce biased estimates, and very small λ values will impose a diffuse distribution for the slab part. Xu et al. [128] mentioned an empirical Bayes approach to estimate λ. Due to the intractability of the marginal likelihood, they proposed a Monte Carlo EM (Expectation Maximization) algorithm for the estimation of λ. Moreover, they have shown theoretically and numerically that median thresholding of the posterior β_g samples provides exact zero estimates for insignificant group predictors.

The class of spike-and-slab zero-inflated mixture priors was first introduced by Mitchell et al.[91]. It is highly useful for variable selection. In the zero-inflated mixture prior, the slab part assumes some known distribution at nonzero coefficients, and the spike part has point mass at zero coefficients. Later, George et al. ([43],[44]) used a zero-inflated normal mixture prior to build Gibbs sampling for variable selection. Narisetty et al.[94] worked with shrinking and diffusing priors. The predictors with zero or very small coefficients have variance tending to 0, as those coefficients reach a point mass at zero (spike), and the active predictors' variance tends to infinity as a diffuse prior (slab). On a different note, Lykou et al.[79] worked with Bayesian lasso variable selection; they assumed independent double-exponential distributions for the regression coefficients and concentrated on the shrinkage parameter λ. Bayes factor criteria are used to choose the selection vector and the shrinkage parameter. Zhang et al.[134] generalized this prior using a Dirac spike and slab where each coefficient group follows a normal distribution, as follows:

    β_g | γ_g, σ², τ_g² ∼ γ_g N(0, σ² D_{τ_g}) + (1 − γ_g) δ_0(β_g)

where D_{τ_g} = diag(τ_{g1}², ..., τ_{gm_g}²) and γ_g follows a Bernoulli distribution. The more interesting advancement in their work is the simultaneous selection of groups and the members within those groups. In addition, they incorporated group serial correlations with a Bayesian fused lasso technique for within-group selection. Xu et al. [128] also proposed an algorithm for bi-level selection in group lasso with a two-way sparsity assumption: among grouped predictors and within selected groups.
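Before turning to applications, a small numerical illustration of the median-thresholding property mentioned above: the sketch simulates posterior draws of one grouped coefficient under a zero-inflated mixture and reports the element-wise posterior median. All names and values here are our own illustrative choices, not the dissertation's code.

    import numpy as np

    rng = np.random.default_rng(0)
    n_draws, m_g = 2000, 3           # number of MCMC draws and group size (illustrative)
    pi0 = 0.7                        # posterior probability that the whole group is zero

    # Simulate posterior draws of beta_g: with probability pi0 the draw is exactly 0 (spike),
    # otherwise it comes from a normal slab.
    spike = rng.random(n_draws) < pi0
    slab_draws = rng.normal(loc=0.5, scale=0.2, size=(n_draws, m_g))
    beta_g_draws = np.where(spike[:, None], 0.0, slab_draws)

    # The posterior mean is never exactly 0, but the element-wise posterior median is
    # exactly 0 whenever more than half of the draws fall in the spike.
    print("posterior mean  :", beta_g_draws.mean(axis=0))
    print("posterior median:", np.median(beta_g_draws, axis=0))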
The number of predictors required for estimation often exceeds the number of observations, which introduces the problem of dimensionality. A regression framework is often the most suitable to model all possible longitudinal effects across ROIs, where the proposed method will select the important predictors. Moreover, many biomedical studies have shown that a limited number of specific brain regions or ROIs are essential for AD classification. Thus, dimension reduction tech- niques can be applied, and classification can be limited to the reduced feature set. Zhu et al. [136] proposed a method for classification and selection of functional predictors: first, functional prin- ciple component scores are calculated for each functional predictor; then, the functional principle component scores are used for the classification of each individual observation. They proposed using Gaussian priors for selection and developed a hybrid Metropolis-Hastings/Gibbs sampler algorithm. Although the method reported in the present study was inspired by this method, we developed a simple Gibbs sampler where MCMC samples are drawn from standard distributions. We also focused on applying penalized regression for dimension reduction. In the Bayesian variable selection literature, the spike-and-slab prior has widespread applications due to its superior selec- tion power. George et al. ([43],[44]) initially proposed that each coefficient β can be modeled either from the “spike” distribution, where most of its mass is concentrated around zero, or from the “slab” distribution, which resembles a diffuse distribution. Instead of imposing the spike-and-slab prior directly on regression coefficients, Ishwaran et al. [57] introduced a method in which they place 10 a spike-and-slab prior on the variance of Gaussian priors. The Bayesian variable selection meth- ods also include different Bayesian regularization methods, such as Bayesian lasso [96], Bayesian Group Lasso, and Bayesian elastic net [76]. We opted for spike-slab Bayesian group lasso which is extensively discussed in Xu et al. [128]. The group structure among the coefficients in our model comes from functional smoothing of the coefficients, and group lasso facilitates the selection of the important functional predictors. Thus, our proposed method uses the idea of Bayesian variable selection in a generalized functional linear model with binary responses. We model binary response yi ∈ {0,1} (i = 1, .., n) and predictors with logistic regression. We know that posterior sampling of logistic regression coefficients is difficult due to the model’s com- plicated likelihood function. In addition, we assume a Gaussian prior for regression coefficients, which makes the likelihood function analytically inconvenient. Full posterior sampling of param- eters requires candidate density following the Metropolis-Hastings algorithm. Bayesian inference for a probit model is comparatively easier [2]. Different sampling algorithms motivated by Albert et al.’s work [2] have been proposed. For instance, Holmes et al. [54] presented an indirect sampling method by introducing auxiliary variables for binary and multinomial regression. Later, more methods based on latent variables for logistic regression were advanced by Frühwirth-Schnatter et al. [41], Gramacy et al. [45] and Polson et al. [97]. Among all these works, Polson et al.’s algorithm is the most interesting to us due to its ease of computational implementation and its sampling efficiency. 
Our aim is to avoid the complex Metropolis algorithm while sampling from the posterior distributions of the regression coefficients. Holmes et al. [54] introduced an auxiliary variable to avoid the conditional non-conjugacy when updating β. The prior structure they used is

    y_i = 1 if z_i > 0, and 0 otherwise
    z_i = x_i β + ε_i
    ε_i ∼ N(0, λ_i)
    λ_i = (2ψ_i)²
    ψ_i ∼ Kolmogorov–Smirnov distribution
    β ∼ normal distribution

The above prior is interesting because the marginal likelihood L(β | data) is the same as the likelihood of the logit model. One can achieve a conjugate full conditional distribution of β given the data, but the conditional distribution π(λ_i | z_i, β) does not have any standard form. One has to use a complicated rejection sampling method to sample the conditional λ_i. Hence, adding an auxiliary variable does not give us a significant computational improvement compared to using the Metropolis-Hastings algorithm. Frühwirth-Schnatter et al. [41] addressed the problem with the same approach, but instead of using a single auxiliary variable they used a two-stage augmentation method. In the first stage, they assumed the existence of a latent variable, where the binary response variable is conditional on the sign of the auxiliary variable. They assumed the error part of the model to follow a type I extreme value distribution, which is a non-normal density. Then, in the second stage of data augmentation, this non-normal error distribution was approximated by a mixture of normal distributions. Finally, they obtained a multivariate normal distribution for the posterior of β. Among all the popular methods, Pólya-gamma augmentation is the most interesting to us, and it is easy to apply due to the availability of an R package.

1.3 Functional data smoothing

Here we discuss some of the key features of functional data or functional observations in the regression context. Throughout this thesis, we use linear combinations of basis functions to represent functions. The pivotal philosophy of functional data analysis is to consider observed data functions as single observations. Measurement error is a frequent problem in functional observations. In the following, we discuss the simplest case of the nonparametric regression problem: we observe (x_i, y_i), i = 1, ..., n, with

    y_i = f(x_i) + σ ε_i

where the x_i's follow a Uniform(0,1) distribution (i.i.d.). To estimate f̂(x), we must minimize E[ ∫_0^1 (f̂(x) − f(x))² dx ]. We need basis functions to estimate the unknown function. Basis functions have mathematical properties that help statisticians to approximate any function by taking a linear combination. Some popular basis functions are the Haar basis, Fourier basis, spline basis, wavelets, and polynomial bases, among others. In general, basis functions have the property of orthonormality. Now, if we have a set of basis functions φ_j in L²(0,1), then we can define f_J(x) = Σ_{j=0}^J θ_j φ_j(x), where θ_j = ∫_0^1 f(x) φ_j(x) dx. We plug in the estimator θ̂_j = (1/n) Σ_{i=1}^n y_i φ_j(x_i), such that E(θ̂_j) = θ_j. The choice of basis function is crucial in functional data analysis. We focus on B-spline basis functions in this thesis, and we discuss their properties in the following.

Spline functions are most often used to approximate non-periodic functional data. Splines are easy to handle computationally and provide better results than polynomial approximations. One can approximate most functions using a moderate number of basis functions.
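A small numerical sketch of this series estimator, using a cosine basis on (0,1) as the orthonormal system; the basis choice and the true function are illustrative choices of ours (the thesis itself works with B-splines).

    import numpy as np

    rng = np.random.default_rng(1)
    n, J = 500, 8
    x = rng.uniform(0, 1, n)
    f = lambda t: np.sin(2 * np.pi * t) + 0.5 * t            # true function (illustrative)
    y = f(x) + 0.2 * rng.standard_normal(n)                  # y_i = f(x_i) + sigma * eps_i

    # Orthonormal cosine basis on (0,1): phi_0 = 1, phi_j(x) = sqrt(2) cos(j*pi*x).
    def phi(j, t):
        return np.ones_like(t) if j == 0 else np.sqrt(2) * np.cos(j * np.pi * t)

    # theta_hat_j = (1/n) sum_i y_i phi_j(x_i), the plug-in estimator described above.
    theta_hat = np.array([np.mean(y * phi(j, x)) for j in range(J + 1)])

    grid = np.linspace(0, 1, 200)
    f_hat = sum(theta_hat[j] * phi(j, grid) for j in range(J + 1))   # reconstructed f_J
    print("max absolute error on the grid:", np.max(np.abs(f_hat - f(grid))))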
The first step in constructing splines is to divide the input space using breakpoints, which are called knots. Let ξ_0 and ξ_{K+1} be two boundary knots, which define the range of the domain over which we approximate our function. Now, we define:

• τ_1 ≤ τ_2 ≤ · · · ≤ τ_M ≤ ξ_0
• τ_{j+M} = ξ_j, j = 1, ..., K
• ξ_{K+1} ≤ τ_{K+M+1} ≤ · · · ≤ τ_{K+2M}

Although the additional knots are defined outside of the boundary points, in practice they are all set equal to ξ_0 and ξ_{K+1}, respectively. We define B_{i,k}(x) as the i-th B-spline basis function of order k with knot sequence τ, and they are constructed as follows:

    B_{i,1}(x) = 1 if τ_i ≤ x < τ_{i+1}, and 0 otherwise,   for i = 1, ..., K + 2M − 1

    B_{i,k}(x) = ((x − τ_i)/(τ_{i+k−1} − τ_i)) B_{i,k−1}(x) + ((τ_{i+k} − x)/(τ_{i+k} − τ_{i+1})) B_{i+1,k−1}(x),   for i = 1, ..., K + 2M − k

To understand the construction, see Figure 1.6, which we obtained from a paper by Dertimanis et al., 2018[30].

Figure 1.6: B-spline smoothing under varying orders and knots (plot obtained from online resources)

Now we can replace the general φ_j(x) basis functions with the B_{i,k}(x) basis functions, and a function f(x) can be approximated as f(x) = Σ_{j=1}^K β_j B_j(x). The least squares method can be used to estimate the unknown βs.

Knot selection is another domain of research that statisticians have already explored. Automatic knot selection can be achieved using model selection criteria such as cross-validation and Mallows' Cp in a linear regression framework. The aim is to choose a set of knots from a large number of candidate knots such that an optimization criterion is satisfied. The number of all possible models increases with the cardinality of the candidate knots. Some works to mention on this topic are Smith and Kohn, 1996[115], Denison et al., 1997[24], and DiMatteo et al., 2001[31]. Smith and Kohn, 1996[115] and Denison et al., 1997[24] utilized Bayesian variable selection to build MCMC methods for function estimation.

The problem of univariate function estimation can be extended to function selection in additive models. Many authors are interested in selecting significant functions for

    y_i = Σ_{j=1}^p f_j(X_{ij}) + ε_i

when p ≫ n. Scheipl et al., 2012[106] proposed a normal-mixture-of-inverse-gamma prior by introducing an indicator variable similar to the one used in the spike-and-slab prior. Their setup is for a generalized linear model, with the continuous response replaced by link functions. Scheipl et al., 2013[107] mentioned more than one method in their paper, including penalized likelihood with smoothness priors, and an indicator selection approach after transforming functions into a linear combination of centered B-spline expansions. Lan Xue[129] proposed a penalized polynomial spline method for simultaneous model estimation and variable selection in additive models, exploring the spline SCAD (Smoothly Clipped Absolute Deviation) penalty as a regularization method. Although Fan et al., 2015[33] did not engage in variable selection in their work, they did utilize a penalized least squares optimization technique to efficiently deal with high-dimensional functional predictors. They minimized a loss function with a penalized l2 norm of functions to find the estimated functions. McLean et al., 2013[85] explored both MCMC and variational Bayes algorithms to fit a functional generalized additive model. Huang et al.[56] tackled the same problem in a conventional non-Bayesian variable selection approach by applying adaptive group lasso. To start, they also decomposed functional predictors as linear combinations of B-spline basis functions and later penalized the unknown coefficients as groups. The Bayesian group lasso we built is inspired by their work, but we deal with a binary response, which leads to a generalized linear model.
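The B-spline recursion given above translates directly into code. The sketch below builds an order-k B-spline basis matrix by the Cox–de Boor recursion and fits a curve by ordinary least squares; it is a minimal illustration with our own knot choices and data, not the dissertation's implementation.

    import numpy as np

    def bspline_basis(x, knots, k):
        """Evaluate all order-k B-spline basis functions at the points x.

        knots is the full (augmented) non-decreasing knot sequence tau."""
        x = np.asarray(x)
        # Order 1: indicator of the knot interval, B_{i,1}(x).
        B = np.zeros((len(x), len(knots) - 1))
        for i in range(len(knots) - 1):
            B[:, i] = (knots[i] <= x) & (x < knots[i + 1])
        # Cox-de Boor recursion up to order k (terms with zero denominator are dropped).
        for order in range(2, k + 1):
            B_new = np.zeros((len(x), len(knots) - order))
            for i in range(len(knots) - order):
                left_den = knots[i + order - 1] - knots[i]
                right_den = knots[i + order] - knots[i + 1]
                left = (x - knots[i]) / left_den * B[:, i] if left_den > 0 else 0.0
                right = (knots[i + order] - x) / right_den * B[:, i + 1] if right_den > 0 else 0.0
                B_new[:, i] = left + right
            B = B_new
        return B

    # Cubic (order 4) basis with interior knots on (0,1) and repeated boundary knots.
    interior = np.linspace(0.1, 0.9, 9)
    knots = np.concatenate([[0.0] * 4, interior, [1.0] * 4])
    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0, 1, 200))
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
    X = bspline_basis(x, knots, 4)                           # n x (number of basis functions)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)         # least squares spline fit
    print("residual sd:", (y - X @ beta_hat).std().round(3))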
To start, they also decomposed functional predictors as a linear combination of B-spline basis functions and later penalized the unknown coefficients as groups. The Bayesian group lasso we built is inspired by their work, but we deal with binary response that is equivalent to a generalized linear model. 15 1.4 Bayesian model based clustering Cluster analysis is the modern data mining tool used to group or segment objects into clusters based on similarity measurements. Many available algorithms have been built based on a particular criterion that objects within a cluster are more similar to each other than to objects that belong to other clusters. Some popular clustering methods are K-means, agglomerative clustering, hierar- chical clustering, and model-based clustering. We focus more on model-based clustering in this chapter. A disadvantage of hierarchical clustering or K-means methods is that they are model-free heuristic methods. Model-based clustering is an alternative to provide an option to formulate a model while grouping objects. Model-based clustering was first introduced Banfield and Raftery, 1993[6], who assumed that observations come from multivariate normal distribution and that max- imum likelihood yields the estimates . Fraley and Raftery, 2002[37] introduced the finite mixture models as a formal setting for model based clustering. Let y1, .., yn are independently distributed p-dimensional observations from a K-component mixture distribution- f(y; τ, θ) = τk fk(yi|θk) (1.1) n K i=1 k=1 Here τk is the probability that a particular object belongs to kth cluster and θk parametrizes the density fk. In general, fk is multivariate normal density where fk ≡ MV N(µk, Σk). The EM algorithm helps us to get the estimates of unknown parameters. In a later study, Melnykov and Maitra, 2010[88] provided a thorough reference for model-based clustering. A point of subjectivity that is always attached to cluster analysis is choosing the number of clusters. Similar to K-means, model-based clustering needs a specified number of clusters at the start of the algorithm. Fraley and Raftery, 1998[36] tried to answer this question using model selection criteria. They suggested studying BIC over all possible cluster numbers and choosing the one with lowest BIC. Another study to mention is that of Yeung et al., 2001 [131], who applied model-based clustering to group gene expression data. Over time, researchers included other factors inside the model-based approach, such as variable selection[101] and penalizing the parameter space θ[95]. All the aforementioned studies have been done in a conventional statistical framework without using any prior information 16 on the cluster structure, the number of clusters, or unknown parameters (τ, θ). Only Bayesian model selection has been used to select optimal number of clusters. Handcock et al., 2007[46] described two methods to estimate unknown parameters: first, a two-stage maximum likelihood method, and second, a Bayesian estimation method. In the Bayesian part, regression parameters have a multivariate normal prior, and the mixture probability τ has a Dirichlet prior with MCMC estimation. Furthermore, Fraley and Raftery, 2007[38] discussed the problems of using the EM algorithm and where it can fail to converge. Instead of MLE, they preferred to work with the MAP estimator with conventional priors on mean and covariance parameters, such as MVN on mean and an Inverse-Wishart distribution on a covariance matrix. 
Medvedovic et al., 2004 [86] proposed a contrasting formulation of a Bayesian mixture-based clustering algorithm to group gene expression data with replicates. Interestingly, the Poisson-Dirichlet process is the most logical and convenient way of assuming priors on the number of clusters and the mixture probability if one does not want to limit the number of clusters when starting the algorithm [70]. The methods and literature described above have various advantages and can be applied to different complicated problems faced by researchers. Here we should reiterate the problem with which we are dealing: we aim to build a Bayesian model-based clustering method with the same white matter data used for the spatiotemporal modeling discussed in Section 5.1. We have longitudinal voxel-wise white matter measurements for healthy aging subjects, and we would like to group voxels into homogeneous regions considering the longitudinal and spatial information. To this end, we review the literature on spatiotemporal clustering methods. Authors in the computer science literature have examined this topic extensively. Kalnis et al., 2005[65] dealt with a highly complicated problem of detecting clusters among moving objects that change locations over time. For instance, they clustered trajectories and mined movement patterns for a group of migrating animals or a convoy of cars moving in a city. The complexity of this problem is manifold; we simplify our situation by assuming that the spatial locations of voxels are fixed over time. Kisilevich et al., 2009[67] described a detailed study of spatiotemporal clustering on trajectories and provided in-depth research development on this topic.

1.5 ADNI Longitudinal Data

Alzheimer's disease (AD) is the most common form of age-related neurodegeneration and dementia. More than 25 million people in the world are currently affected by dementia, with most suffering from AD, and around 5 million new cases occur every year. The number of people with dementia is anticipated to double every 20 years [99]. Some suggest that earlier diagnosis and intervention may offer the greatest potential for treatment, making the early detection of AD of the utmost clinical importance. AD symptoms are diagnosed via clinical neuropsychological and cognitive measures, including the Clinical Dementia Rating (CDR) scale. In addition to other standardized clinical neuropsychological measures, such as the Rey Auditory Verbal Learning Task or the Mini Mental Status Examination (MMSE), biomarkers from MRI and PET imaging are also used in diagnostic classification. Based in part on these scores, patients are diagnosed or classified into one of three primary categories: cognitively normal (CN), mild cognitive impairment (MCI), and AD. In this dissertation, we demonstrate a method to differentiate AD patients from CN persons via the differential nonlinear longitudinal trajectories in regional brain volumes. Various high-dimensional classification and regression methods have been proposed for biostatistical applications to neurological diagnostic methods for early-stage AD detection. The present study focuses on applying volumetric MRI data at the region of interest (ROI) level, and limiting the included features to more clinically established sub-regions of brain volumes in AD and aging. Unlike commonly used Bayesian classification methods, we emphasize the selection of the longitudinal trajectories of brain volumes as predictors for classification of patients as AD vs. CN. MRI is a valuable tool for in vivo assessment of brain biomarkers related to disease progression. Longitudinal patterns of brain atrophy follow a common pattern, with prominent volumetric loss in established and canonical regions, such as the hippocampus or the entorhinal cortex, as well as ventricular and sulcal expansion. The currently available, fully processed ADNI longitudinal MRI data includes cortical and white matter (WM) parcellation, surface area, and cortical thickness of
MRI is a valuable tool for in vivo assessment of brain biomarkers related to disease progression. Longitudinal patterns of brain atrophy follow a common pattern, with prominent volumetric loss in established and canonical regions, such as the hippocampus or the entorhinal cortex, as well as ventricular and sulcal expansion. The currently available, fully processed ADNI longitudinal MRI data includes cortical and white matter (WM) parcellation, surface area, and cortical thickness of 18 different brain regions. Instead of using all possible regional measurements, we focus on volumetric longitudinal changes in a more theoretically guided and limited number of brain sub-regions to classify patients. We obtained nearly 115 different sub-regions’ volume measurements from ADNI 1, ADNI 2, & ADNI GO. Leung et al. [74] compared MCI converters and MCI nonconverters considering longitudinal hippocampal volumes, and showed that MCI converters have higher rates of hippocampal atrophy. Reduced hippocampal volume has been proposed to diagnose AD earlier than clinical diagnosis [40]. We believe that measuring volumetric change over multiple occasions rather than at a single time-point will provide a more reliable estimate of quantitative change. In this dissertation, we concentrate on the selection of key brain regions that differ in longitudinal change between AD patients and CN older adults using a Bayesian variable selection method. In addition to prior findings related to the selection of volumetric MRI measurements, there are also numerous studies involving the selection of functional predictors. The key advantage of using longitudinal MRI data is that it captures the atrophy rates with time as patients’ progress through multiple disease stages. FreeSurfer software is able to measure accurate brain volumes and cortical thick- ness, and the ADNI website has 1.5T structural MRI data with parcellation performed by FreeSurfer. The pivotal contribution of this dissertation is that it explores functional data analysis with Bayesian variable selection. We extensively refer to the methodologies described in this introduction in the development of later chapters. Chapter 2 introduces the applicability of spike-and-slab priors for knot selection in function estimation. The research interest in function estimation is still prevalent in the statistics literature. Spike-and-slab group lasso, which was discussed earlier, works well for group selection and outperforms other methodologies. Chapter 4 presents extensive application of Bayesian group lasso to function selection. Moreover, it helps to establish some statistical properties. In addition, Bayesian classification is another field that is discussed later in this dissertation. The aim of this introduction was to help the reader better understand the perspective of this dissertation. 19 CHAPTER 2 NONPARAMETRIC FUNCTION ESTIMATION USING BAYESIAN VARIABLE SELECTION Bayesian variable selection for nonparametric function estimation is a challenging research field. We have tried to explore it with a large number of functional predictors in an additive model. We started with the knot selection problem for a single function estimation. 2.1 Univariate function estimation The literature on function estimation is quite old and has already established excellent mathe- matical properties. Bayesian methods have worked well for estimating functions with a penalized spline. However, we propose a new approach to function estimation by assuming uncertainty with knots. 
We describe our methodology in the following.

2.1.1 Introduction

Consider a very simple regression problem with data observed as (y_i, X_i), i = 1, ..., n, where X is a single covariate and Y is an n × 1 response variable. We regard a nonlinear model as a potential setup, such that the mean E(y|X) varies nonlinearly with the predictor. Then, we can replace the conventional regression problem of y_i = X_i β + ε_i with

    y_i = f(X_i) + ε_i, i = 1, ..., n

where f(.) belongs to some class of nonlinear functions. There are numerous methods available to estimate the unknown function, one of which is basis function expansion. This single covariate has a nonlinear component with respect to the response, which is modeled with a regression spline basis and can be written as

    f(x) = Σ_{j=1}^K β_j φ_j(x)   (2.1)

where the φ_j(x) are some orthonormal basis functions. Now, if we limit our study only to spline functions, the value K depends on the number of internal knots we are using. Let us assume that we have τ_1, ..., τ_K internal knots to approximate the function. Then, equation (2.1) can be written with a pth-degree spline model as

    f(x) = Σ_{j=1}^{K+p+1} β_j φ_j(x) = β_0 + β_1 X + · · · + β_p X^p + Σ_{j=1}^K β_{p+j} (X − τ_j)_+^p

Here (X − τ_j)_+ = max(0, X − τ_j), and as (X − τ_j)_+^p has p − 1 continuous derivatives, the smoothness of the spline function increases with the value of p. We work with cubic spline bases here.

To start the statistical problem, we assume that we have a large number of candidate knots, and we would like to build a data-driven algorithm that automatically selects important knots to accurately approximate a given function. The knots' locations are another point of interest. Various model selection criteria have been used before, such as cross-validation and Mallows' Cp. The number of possible models increases as the cardinality of candidate knots expands. There is the potential to over-fit the data if we use a large number of knots. One advantage of using spline bases is that the design matrix for the linear model is sparse, and we can exploit variable selection methods in the computation. We examine Bayesian variable selection as a potential method for knot selection in this problem. Our method implicitly assumes that f is well approximated by a linear combination of spline basis functions with some number of knots. In practice, we assume that f can be represented by a linear combination of basis functions. This class of basis splines is large and approximates any locally smooth function arbitrarily well. We note that our approach determines which knots are effective in estimating the spline function, and we closely follow the work on variable selection in linear regression models by George and McCulloch, 1993, 1997[43][44]. These authors first introduced the spike-and-slab prior with the following setup:

    β_i | γ_i ∼ (1 − γ_i) N(0, τ_i²) + γ_i N(0, c_i² τ_i²)

where γ_i is a latent variable that controls the number of significant coefficients or model size. Smith and Kohn, 1996[115] implemented another method of g-priors on the regression coefficients and applied this prior to univariate function estimation. On the other hand, DiMatteo et al., 2001[31] quantified two unknown quantities, the number of knots and the locations of knots, with a Poisson prior and a Dirichlet prior, respectively. We focus on George and McCulloch, 1993, 1997[43][44]'s approach in this section, as the conventional spike-and-slab prior has not been considered for function estimation.
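To make the setup concrete, the sketch below assembles the cubic truncated power basis design matrix for a dense grid of candidate knots; which knot columns survive is exactly what the spike-and-slab selection developed below decides. The knot placement, data, and names are illustrative choices of ours.

    import numpy as np

    def truncated_power_design(x, candidate_knots, degree=3):
        """Design matrix [1, x, ..., x^p, (x - tau_1)_+^p, ..., (x - tau_K)_+^p]."""
        x = np.asarray(x)
        poly = np.vander(x, degree + 1, increasing=True)                       # 1, x, ..., x^p
        trunc = np.maximum(x[:, None] - np.asarray(candidate_knots)[None, :], 0.0) ** degree
        return np.hstack([poly, trunc])

    rng = np.random.default_rng(5)
    x = np.sort(rng.uniform(0, 1, 150))
    y = np.sin(4 * np.pi * x) + 0.2 * rng.standard_normal(150)                 # y_i = f(X_i) + eps_i
    candidate_knots = np.linspace(0.05, 0.95, 30)                              # many candidate knots
    X = truncated_power_design(x, candidate_knots)                             # n x (p + 1 + K) matrix
    print(X.shape)   # (150, 34): 4 polynomial columns plus 30 knot columns to select from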
The Gibbs sampler is computationally feasible and easy to implement when the model has a large number of predictors. Once we integrate out the corresponding variance and regression coefficients, the resulting Bayesian variable selection converges much faster than alternative approaches.

2.1.2 Variable selection using the Spike-and-Slab Prior

A spike-and-slab model is often represented by a Bayesian hierarchical model. The most conventional definition is given below:

(Y_i | x_i, \beta, \sigma^2) \stackrel{ind}{\sim} N(x_i' \beta, \sigma^2), (i = 1, ..., n),
(\beta | \gamma) \sim N(0, \Gamma),
\gamma \sim \pi(d\gamma),
\sigma^2 \sim \mu(d\sigma^2),

where 0 is a K-dimensional zero vector, Γ is the K × K diagonal matrix diag(γ_1, ..., γ_K), π is the prior measure for γ = (γ_1, ..., γ_K)' and μ is the prior measure for σ². George and McCulloch (1993) [43] first developed the most widely used version of the spike-and-slab model. It identifies zero and nonzero β_i's by using zero-one indicator variables γ_i, assuming a scale mixture of two normal distributions:

(\beta_i | \gamma_i) \stackrel{ind}{\sim} (1 - \gamma_i) N(0, \tau_i^2) + \gamma_i N(0, c_i^2 \tau_i^2), \quad i = 1, .., K,

where τ_i² > 0 is some suitably small value, while c_i > 0 is some suitably large value. γ_i = 1 marks the β_i's that are significant; the variances of these coefficients are large, yielding larger posterior β_i values. The opposite occurs when γ_i = 0. The prior hierarchy for β is completed by assuming a prior for γ_i. As τ_i² tends to zero, more mass is placed on 0, which acts as the prior for the insignificant βs. The prior distribution for the regression coefficients can then be written as

(\beta_i | \gamma_i) \stackrel{ind}{\sim} (1 - \gamma_i) I_0 + \gamma_i N(0, \nu^2), \quad (*)

with I_0 a point mass at 0, where ν² is the limit of c_i² τ_i² when τ_i² tends to zero with a large enough c_i². We use prior (*) to select significant knots from the candidate knots.

2.1.2.1 Model description and Prior

We start with the simple linear model

y = X\beta + \varepsilon,

where y is the n × 1 vector of observations, X is the n × r design matrix, ε ∼ N(0, σ² I_n) is the error vector, and β = (β_1, ..., β_r)' is the r × 1 vector of regression coefficients. Let γ be the r × 1 vector of indicator variables with i-th element γ_i such that

γ_i = 1 when β_i ≠ 0, and γ_i = 0 when β_i = 0.

Given γ, let β_γ consist of all the nonzero coefficients of β and let X_γ be the columns of X corresponding to the components of γ that are equal to one. We then make the following prior assumptions:

1. Given γ and σ², the prior for β_γ is N(0, c σ² (X_γ' X_γ)^{-1}), where c is a positive scale factor. For our simulation study we used c = 100. This formulation of the βs is identical to the spike-and-slab prior defined in equation (*). For large values of c, the prior for β_γ contains very little information about β_γ compared to the likelihood. Taking the prior variance of β_γ proportional to σ² (X_γ' X_γ)^{-1} makes the Gibbs sampler very fast, and we believe this prior also carries the associative relationships among the βs.

2. The γ_i are assumed to be a priori independent with P(γ_i = 1) = 1 − P(γ_i = 0) = p_i (i = 1, ..., r), so that p(γ | Y) ∝ p(Y | γ) p(γ) contains the information relevant to variable selection. We started with p_i = 1/2, which implies no bias towards any model size.

3. Finally, for σ² we use the inverse gamma conjugate prior p(σ² | γ) ∼ IG(ν_γ/2, ν_γ λ_γ/2). We note that ν_γ and λ_γ may depend on γ to incorporate dependence between β and σ². For our practical example, we choose ν_γ ≡ 0 and λ_γ ≡ 0, which gives the commonly used prior for σ²: p(σ² | γ) ∝ 1/σ².
2.1.2.2 The Gibbs Sampler

The primary advantage of applying a conjugate hierarchical setup is that it enables analytical marginalization of β_γ and σ² from

p(β_γ, σ², γ | Y) ∝ p(Y | β_γ, σ²) p(β_γ | σ², γ) p(σ²) p(γ).

For a given γ, let q_γ = \sum_{i=1}^{r} γ_i be the number of nonzero elements of β and

S(γ) = y'y − \frac{c}{1 + c} y' X_γ (X_γ' X_γ)^{-1} X_γ' y.

Then

p(y | γ) ∝ \int \left\{ \int p(y | β_γ, σ²) p(β_γ | σ²) dβ_γ \right\} p(σ²) dσ² ∝ (1 + c)^{-q_γ/2} S(γ)^{-n/2};

β_γ is integrated out as a normal integral and σ² is integrated out as an inverse gamma integral. The posterior distribution of γ is

p(γ | y) ∝ p(y | γ) p(γ) ∝ (1 + c)^{-q_γ/2} S(γ)^{-n/2} \prod_{i=1}^{r} p_i^{γ_i} (1 − p_i)^{1−γ_i} ≡ g(γ).

The normalizing constant could be obtained by computing g(γ) for all values of γ, but to draw posterior samples we do not actually need it.

Gibbs sampler: We start with an initial vector γ^0 = (γ_1^0, .., γ_r^0) and then draw each member of γ from a Bernoulli distribution with probability

p(γ_i = 1 | y, γ_{j≠i}) = \frac{1}{1 + h}, where h = \frac{1 − π_i}{π_i} (1 + c)^{1/2} \left( \frac{S(γ^1)}{S(γ^0)} \right)^{n/2},

with γ^1 = (γ_1, .., γ_{i−1}, γ_i = 1, .., γ_r) and γ^0 = (γ_1, .., γ_{i−1}, γ_i = 0, .., γ_r). We repeat this step over a warm-up period and a sampling period until the sequence converges.

For large r, the above method takes a long time to execute, and the computational complexity is O(r³). However, an MCMC chain can still be used to find high-probability γ values. We can construct simple MCMC algorithms to simulate a Markov chain γ^1, ...., γ^K that converges in distribution to p(γ | y).

Metropolis-Hastings algorithms: Another way of sampling from g(γ) is to use Metropolis-Hastings (MH) algorithms. We start with a proposal density q(γ^0, γ^1), also called the candidate density, which for each γ^0 value is a probability distribution over γ^1 values. Once we determine the candidate proposal density q(γ^0, γ^1), the algorithm below populates the posterior sample of γ:

1. Generate γ^new with probability q(γ^0, γ^1).
2. Set γ^{(j+1)} = γ^new with probability

α_MH(γ^{(j)}, γ^new) = min\left\{ \frac{q(γ^new, γ^{(j)})}{q(γ^{(j)}, γ^new)} \frac{g(γ^new)}{g(γ^{(j)})}, 1 \right\},

and γ^{(j+1)} = γ^{(j)} otherwise.

Under weak conditions on q(γ^0, γ^1), the sequence obtained by this algorithm will be a Markov chain that converges to p(γ | y). In the case of a symmetric proposal density, i.e. if q(γ^0, γ^1) = q(γ^1, γ^0), Metropolis-Hastings becomes the simple Metropolis algorithm, and α_MH simplifies to

α_M(γ^{(j)}, γ^new) = min\left\{ \frac{g(γ^new)}{g(γ^{(j)})}, 1 \right\}.

One simple proposal density is

q(γ^0, γ^1) = \frac{1}{r}, if \sum_{i=1}^{r} |γ_i^1 − γ_i^0| = 1,

which alters the above MH algorithm to:

1. Generate γ^new by changing one element of γ^{(j)}.
2. Set γ^{(j+1)} = γ^new with probability α_M(γ^{(j)}, γ^new); otherwise, γ^{(j+1)} = γ^{(j)}.

This algorithm was proposed in Raftery, Madigan and Hoeting (1993) [53]. From the above algorithm, we estimate the γ with the highest probability of occurrence. We can use the estimated γ̂ to obtain the estimates of β and σ²: if we plug this γ̂ into the model and select only those covariates for which γ_i = 1, estimation of β and σ² becomes much easier. This is a huge advantage of using the spike-and-slab prior, and it is the main motivation behind using it.
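To make the sampler concrete, the following is a minimal R sketch (our own illustration, not the exact code used in this dissertation) of the simple Metropolis algorithm over the knot-inclusion indicators, using the marginalized quantity g(γ) ∝ (1 + c)^{-q_γ/2} S(γ)^{-n/2} ∏ p_i^{γ_i}(1 − p_i)^{1−γ_i} derived above. The response y and the spline design matrix X are assumed to be given, and the names log_g and ssvs_metropolis are ours.

```r
# Minimal sketch: simple Metropolis sampler over the knot-inclusion
# indicators gamma, using the marginal quantity g(gamma) derived above.
log_g <- function(gamma, y, X, c = 100, p_incl = 0.5) {
  q_g <- sum(gamma)
  if (q_g == 0) {
    S <- sum(y^2)                        # S(gamma) = y'y when no column is included
  } else {
    Xg  <- X[, gamma == 1, drop = FALSE]
    Xgy <- crossprod(Xg, y)              # X_g' y
    S   <- sum(y^2) - (c / (1 + c)) *
      drop(t(Xgy) %*% solve(crossprod(Xg), Xgy))
  }
  -0.5 * q_g * log(1 + c) - 0.5 * length(y) * log(S) +
    q_g * log(p_incl) + (length(gamma) - q_g) * log(1 - p_incl)
}

ssvs_metropolis <- function(y, X, n_iter = 1600, c = 100) {
  r <- ncol(X)
  gamma <- rep(1, r)                     # start with all candidate knots included
  draws <- matrix(NA, n_iter, r)
  cur <- log_g(gamma, y, X, c)
  for (it in seq_len(n_iter)) {
    i <- sample.int(r, 1)                # flip one randomly chosen indicator
    prop <- gamma
    prop[i] <- 1 - prop[i]
    new_lg <- log_g(prop, y, X, c)
    if (log(runif(1)) < new_lg - cur) {  # Metropolis acceptance step
      gamma <- prop
      cur <- new_lg
    }
    draws[it, ] <- gamma
  }
  draws
}
```

The most frequently visited configuration of γ across the sampled iterations can then serve as the estimate γ̂ that is plugged into the conditional updates of β and σ² described next.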
2.1.2.3 Updating β and σ²

We use George and McCulloch's (1993) [43] SSVS procedure to simulate a full parameter sequence. This method is based on the Gibbs sampler. At each step, we use the full conditional distributions of β and σ² to generate new samples. Let γ̂ be the final posterior estimate of γ:

p(y | β_γ̂, σ²) p(β_γ̂ | σ²) p(σ²) ∝ \left( \frac{1}{2πσ²} \right)^{n/2} \exp\left( −\frac{1}{2σ²} (y − X_γ̂ β_γ̂)'(y − X_γ̂ β_γ̂) \right) × \left( \frac{1}{σ²} \right)^{q_γ̂/2} \exp\left( −\frac{1}{2cσ²} (X_γ̂ β_γ̂)'(X_γ̂ β_γ̂) \right) × \frac{1}{σ²}.

The full conditional for β_γ̂ is multivariate normal with mean A^{-1} X_γ̂' y and variance σ² A^{-1}, where A = (1 + 1/c) X_γ̂' X_γ̂. Symbolically,

p(β_γ̂ | y, σ²) ∼ N(A^{-1} X_γ̂' y, σ² A^{-1}).

The full conditional distribution of σ² is inverse gamma. The terms in the joint distribution involving σ² are

(σ²)^{−(n/2 + q_γ̂/2 + 1)} \exp\left( −\frac{1}{2σ²} \left[ (y − X_γ̂ β_γ̂)'(y − X_γ̂ β_γ̂) + \frac{1}{c} (X_γ̂ β_γ̂)'(X_γ̂ β_γ̂) \right] \right),

so σ² is conditionally inverse gamma with shape parameter n/2 + q_γ̂/2 and scale parameter (1/2)[(y − X_γ̂ β_γ̂)'(y − X_γ̂ β_γ̂) + (1/c)(X_γ̂ β_γ̂)'(X_γ̂ β_γ̂)]. The Gibbs sampler cyclically samples from the distributions of β and σ², each conditional on the current value of the other parameter. Finally, we obtain the estimates of β and σ² from the posterior median.

2.1.3 Regression Splines

Let us assume

y_i = f(x_i) + ε_i, i = 1, ..., n,

where y_i is the i-th response, the error term ε_i follows N(0, σ²), and f(x) is the unknown smooth function that needs to be estimated. Under suitable smoothness assumptions, we propose to approximate f(x) using a normalized B-spline basis. One of the research aims here is to quantify the location and the number of knots. Badly placed knots will miss the function's features and provide poor approximations. On the other hand, if we use a large number of knots, we will have low bias but very high variance. The most convenient way to approach this problem is to introduce variable selection: one can assume that there are initially a large number of candidate knots and then select a small set of significant knots using variable selection (e.g., Friedman and Silverman, 1989 [39]). The problem can therefore be viewed as a variable selection problem, since the columns of the design matrix correspond to individual knots. This is what guides us to use the spike-and-slab prior. In the univariate case, we place knots at roughly a quarter of the observation positions, depending on the number of observations, but we try to stay below 40 knots. We want to ensure that the placed knots capture the curvature of the function. For our simulation study, we place knots after every fourth sorted observation.

2.1.3.1 Simulated example

In this section we study the performance of the Bayesian nonparametric estimators using simulated data. We used the following functions in the simulation:

1. f_1(x) = sin(2x) + 2 exp(−16x²)
2. f_2(x) = (φ(x, 0.15, 0.05) + φ(x, 0.6, 0.2))/4,

both of which are nonlinear functions. A well-tuned nonparametric kernel-based method can estimate these functions; we approximate them with a cubic B-spline basis. One hundred response observations are generated as y_i = f(x_i) + e_i with x_i ∼ U(0, 1) and e_i ∼ N(0, σ²). The choice of σ is made such that the signal-to-noise ratio sd(f)/sd(e) stays around 3. The number of initial knots is directly proportional to the number of observations, so that when n increases, so does the number of candidate knots. For our simulation, we start with n/4 candidate knots, which implies that as n increases, the selected set becomes relatively sparser. We place knots at every fourth position, which generates a total of r = 28 columns in the design matrix X with a B-spline of degree 3. The Gibbs sampler is run for 1,600 iterations, with sampling starting after the 100th iteration.
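As a sketch of the simulation setup just described (with our own choices for details the text leaves open, such as the exact knot positions), the R code below generates data from f_1, places candidate knots at every fourth sorted observation, and builds the cubic B-spline design matrix whose columns the knot-selection sampler then screens.

```r
library(splines)

set.seed(1)
n <- 100
x <- runif(n)                                   # x_i ~ U(0, 1)
f1 <- function(x) sin(2 * x) + 2 * exp(-16 * x^2)
sigma <- sd(f1(x)) / 3                          # signal-to-noise ratio around 3
y <- f1(x) + rnorm(n, sd = sigma)

# Candidate knots at every fourth sorted observation (roughly n/4 knots)
knots <- sort(x)[seq(4, n - 4, by = 4)]

# Cubic B-spline design matrix; its columns are the candidate basis functions
X <- bs(x, knots = knots, degree = 3, intercept = TRUE)
ncol(X)   # 24 interior knots + degree + intercept = 28 columns, as in the text
```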
If f̂(x) is the estimated unknown function, then we need a statistic to measure the accuracy of estimation. The integrated squared error (ISE) = ∫ {f(x) − f̂(x)}² dx over the unit interval is the statistic on which we focus. We created 400 equally spaced grid points z_i = i/400, i = 1, ..., 400, and compute

ISE = \frac{1}{400} \sum_{i=1}^{400} \left( f(z_i) − f̂(z_i) \right)^2.

We repeat the simulation 100 times. The ISE results are shown in the table below and compared to the kernel method of estimation. The simulation results with Bayesian variable selection show better performance than kernel estimation. We use a Gaussian kernel with bandwidth 0.10.

Table 2.1: Log(ISE) comparison with kernel methods of approximation

        r    Bayes mode log(ISE), Mean (sd)    Kernel log(ISE), Mean (sd)
f1(x)   28   -2.7 (0.05)                       -1.28 (0.03)
f2(x)   28   -0.92 (0.02)                      -0.49 (0.01)

Next, we illustrate the above method using the 'Chloride' data from Bates and Watts, 1988 [8].

2.1.3.2 Chloride concentration data

The Chloride data contain 54 observations of chloride concentration taken over an interval of time. We implement the same variable selection algorithm for knot selection discussed above, and we compare it to a fixed-knot basis approximation and linear regression estimates, as shown in Figure 2.1. We can state that the B-spline basis approximation with four internal knots works better than the Bayesian variable selection method. We obtain a mean log(ISE) of 0.23, and the posterior mode selects 11 of the 17 basis columns.

2.1.3.3 Discussion

The main drawback of this method is its sensitivity to the initial choice of basis. We need to account for the uncertainty in selection based on the properties of the corresponding basis functions. Here we work with cubic B-spline basis functions, which are assumed to be smooth. However, this assumption is not always valid when dealing with spike functions. In these situations, one requires prior information or needs to work with the Bayesian penalized spline method. Hence, prior information should be given on both the basis choice and the corresponding basis coefficients.

Figure 2.1: Nonparametric function estimation comparison between variable selection, linear regression, and fixed knot approximation

2.2 Functional additive model estimation with bi-level selection

We examine a different problem in this section: we seek to explore how Bayesian variable selection can perform function selection in additive models.

2.2.1 Introduction

Modern life sciences raise applied questions, driven by technological innovation, about the selection of influential components in regression models. The statistics literature is full of feature selection methods and their corresponding theories for linear models in which the number of predictors is very large compared to the sample size. The literature on variable selection is also rich for generalized linear models, hazard rate models, and so forth. On the other hand, the selection of significant functions in additive models is very recent. Consider the nonparametric additive model

Y_i = μ + \sum_{j=1}^{p} f_j(X_{ij}) + ε_i,

where μ is the intercept term, X_{ij} is the ith observation of the jth covariate X_j, the f_j's are unknown functions, and ε_i is an unobserved random variable with mean 0 and variance σ². The questions we aim to answer are: Are all f_j(X_j)'s significant? And based on the significant ones, can we develop an algorithm that handles the feature selection and estimates the true functions simultaneously?
We assume that not all f_j(X_j)'s are important and that we are dealing with a much larger number of functions than the sample size, i.e., p is larger than n. We further assume that the number of true functions in the model is still less than n. The estimation of the true functions, whether they are linear or nonlinear, is an important component of this chapter.

Both frequentist and Bayesian statisticians deal with model selection or function selection in additive models. Most of the ideas have come from variable selection in regression models with a large number of predictors. The most popular methods of variable selection are the lasso [Tibshirani, 1996][120], the SCAD penalty [Fan and Li, 2001][32], the adaptive lasso [Zou, 2006][138], and the group lasso [Yuan and Lin, 2006][132]. Bayesian variable selection methods, particularly for linear models, can be divided into two segments: one based on the spike-and-slab prior, and another that introduces selection indicators with the regression coefficients. George and McCulloch [1993, 1997][43][44] proposed the basic idea that each coefficient β_j can be modeled either from the "spike" distribution, where most of the mass is concentrated around zero, or from the "slab" distribution, which is a diffuse distribution. The introduction of selection indicators for variable selection is a closely related idea, where we decide whether a coefficient is in the model or not by using selection indicators. Smith and Kohn [1996][115] developed this notion by attaching selection indicators to the columns of the design matrix, which eventually selects the significant covariates. Furthermore, unlike George and McCulloch's spike-and-slab prior, which focuses on the β coefficients, Ishwaran and Rao [2005][57] placed a spike-and-slab prior on the variances of the Gaussian priors. These two ideas of Bayesian variable selection also extend to group selection. Ghosh and Xu [2015][128] described how spike-and-slab priors can be used to select groups of covariates as a form of Bayesian group lasso. Our method of Bayesian function selection implements Ghosh and Xu's [2015][128] ideas. First, we represent each function in our model by B-spline basis function coefficients, with a correspondingly large number of knots placed in each function. Next, we introduce two-stage spike-and-slab priors to find the significant functions and the significant knots within the true functions. In both situations, we assume that the number of functions in the model is large and that they are sparse (large p, small n), and that the knots introduced to estimate the functions are also sparse. Chen et al. [2016][24] published a paper on Bayesian sparse group selection using selection indicators, which is an extension of variable selection to grouping structures.

In the next section, we describe the construction of the functions in terms of B-spline basis functions and the application of the Bayesian sparse group lasso with a spike-and-slab prior for selecting significant functions and estimating them in "large p, small n" settings. Posterior median thresholding, as shown in Ghosh and Xu [2015][128], estimates both the null functions and their members as zero. As we use a large number of internal knots to estimate the functions, the spike-and-slab prior with median thresholding estimates the insignificant knots as zero based on the structure of the true functions. Section 2.2.3 presents simulation studies to evaluate the performance of our method.
2.2.2 Bayesian Sparse Group Lasso in nonparametric additive models

Suppose we have data (Y, X_j), j = 1, .., p, with X_j ∈ [a, b] and a, b < ∞. For the identifiability of the functions f_j, we assume that the f_j's are centered around zero, i.e., E f_j(X_j) = 0, 1 ≤ j ≤ p. For the model Y_i = μ + \sum_{j=1}^{p} f_j(X_{ij}) + ε_i, after centering the covariates and responses, the intercept term can be dropped for simplicity. The corresponding nonlinear components can be transformed into a linear regression model with spline coefficients. Thus, the problem of detecting a significant function in the additive model is equivalent to selecting groups of variables in a linear regression setup with a predefined grouping structure. We hence define each function f_j through basis functions.

We define the construction of B-splines as follows. Let d denote the degree of the B-spline, which implies an order of d + 1. Our next assumption is that a sequence of knots is placed in each function, where the number of knots is large but does not exceed the sample size; as the sample size increases, so does the number of knots. To define the sequence of knots, let a = ξ_0 < ξ_1 < · · · < ξ_K < ξ_{K+1} = b be a partition of [a, b] into K + 1 subintervals. In addition, define d knots ξ_{−d} = ξ_{−d+1} = .. = ξ_{−1} = ξ_0 and another set of d knots ξ_{K+1} = ξ_{K+2} = .. = ξ_{K+d+1}. The B-spline basis functions are defined as

B_{i,1}(x) = 1 if ξ_i ≤ x < ξ_{i+1}, and 0 otherwise,

B_{i,d+1}(x) = \frac{x − ξ_i}{ξ_{i+d} − ξ_i} B_{i,d}(x) + \frac{ξ_{i+d+1} − x}{ξ_{i+d+1} − ξ_{i+1}} B_{i+1,d}(x), for i = −d, .., K.

With the use of the additional knots we get precisely K + d + 1 basis functions. Thus any function f_j, defined on the function space F_j, can be written as

f_j(x) = \sum_{k=1}^{m_n} β_{jk} B_{jk}(x),

and the vector f_j of function values can be expressed as f_j = X_j β_j, with the basis function values B_{jk}(X_{ij}) as elements of the design matrix X_j. Finally, the model can be written as

Y = \sum_{j=1}^{p} X_j β_j + ε,

where the β_j's are the grouped coefficients corresponding to the basis functions in X_j. The error ε has mean 0 and variance σ².
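To illustrate the construction above, the following R sketch (our own, with hypothetical helper names) centers a B-spline basis for each covariate and stacks the p group design matrices X_1, .., X_p side by side, recording which columns belong to which function f_j; selecting the jth group of columns then corresponds to selecting f_j.

```r
library(splines)

# Build the grouped design for an additive model: one centered B-spline block
# per covariate, with group labels recording which columns belong to f_j.
make_group_design <- function(X, n_knots, degree = 3) {
  p <- ncol(X)
  blocks <- vector("list", p)
  for (j in seq_len(p)) {
    knots <- quantile(X[, j], probs = seq_len(n_knots) / (n_knots + 1))
    Bj <- bs(X[, j], knots = knots, degree = degree)
    blocks[[j]] <- scale(Bj, center = TRUE, scale = FALSE)  # enforce E f_j(X_j) = 0
  }
  list(design = do.call(cbind, blocks),
       group  = rep(seq_len(p), vapply(blocks, ncol, integer(1))))
}

# Example: p = 20 covariates, n = 100 observations, n/4 internal knots each
X  <- matrix(runif(100 * 20), 100, 20)
gd <- make_group_design(X, n_knots = 25)
dim(gd$design)   # n rows; n_knots + degree columns per function
table(gd$group)  # number of basis columns assigned to each f_j
```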
We now introduce the method of Bayesian sparse group selection with a spike-and-slab prior. The importance of this method is that we assume sparsity in the number of functions and also in the number of knots within each function. Simon et al. [2013][113] proposed the sparse group lasso to produce exact 0 coefficients at the group level and within a group. The sparse group lasso estimator of β is given by

\min_β \left\{ ||Y − \sum_{g=1}^{G} X_g β_g||_2^2 + λ_1 ||β||_1 + λ_2 \sum_{g=1}^{G} ||β_g||_2 \right\}.

A corresponding prior can be constructed as

π(β) ∝ \exp\left\{ − λ_1 ||β||_1 − λ_2 \sum_{g=1}^{G} ||β_g||_2 \right\},

which can be expressed as a scale mixture of normals. However, to select variables both at the group level and within a group, Ghosh and Xu [2015][128] developed a hierarchical spike-and-slab prior structure that shrinks coefficients to exactly 0 through posterior median thresholding. Two sets of spike-and-slab distributions, one at the group level and another at the individual level, combined with a posterior median estimator, have excellent variable selection and prediction performance. We describe the model specification exactly as Ghosh and Xu [2015][128] did in their paper. The coefficients β_j are reparameterized to handle two different degrees of sparsity: β_j = V_j^{1/2} b_j, where V_j^{1/2} = diag(τ_{j1}, .., τ_{jm_j}), τ_{jk} ≥ 0, j = 1, .., p; k = 1, .., m_j. To select variables at the group stage, the first spike-and-slab prior is introduced on b_j:

b_j \stackrel{ind}{\sim} (1 − π_0) N_{m_j}(0, I_{m_j}) + π_0 δ_0(b_j), j = 1, .., p.

When τ_{jk} = 0, β_{jk} drops from the model even though b_{jk} ≠ 0; τ_{jk} controls the magnitude of the elements of β_j. Hence, to choose variables within a group, Ghosh and Xu [2015][128] placed another spike-and-slab prior on the τ_{jk}'s:

τ_{jk} \stackrel{ind}{\sim} (1 − π_1) N^+(0, s²) + π_1 δ_0(τ_{jk}), j = 1, .., p; k = 1, .., m_j,

where N^+(0, s²) is the normal distribution N(0, s²) truncated below at 0, with mean s\sqrt{2/π} and variance s². The error variance σ² follows an inverse gamma distribution with shape α and scale γ:

σ² ∼ InverseGamma(α, γ).

Ghosh and Xu [2015][128] set up a hierarchical structure to decide the values of the hyperparameters π_0 and π_1:

π_0 ∼ Beta(a_1, a_2), π_1 ∼ Beta(c_1, c_2).

For s², a conjugate inverse gamma prior is placed, s² ∼ InverseGamma(1, t). We can update t with a Monte Carlo EM algorithm (Casella, 2001 [20]; Park and Casella, 2008 [96]). For the kth EM update,

t^{(k)} = \frac{1}{E_{t^{(k−1)}}[1/s² | Y]},

where the posterior mean of 1/s² can be obtained from the mean of the Gibbs samples of the (k − 1)th iteration.

Gibbs Sampler: Let β_{(j)} denote the β vector without the jth group, β_{(j)} = (β_1^T, .., β_{j−1}^T, β_{j+1}^T, .., β_p^T)^T, and let X_{(j)} be the design matrix corresponding to β_{(j)}. Similarly, let β_{(jk)} denote the whole coefficient vector without the kth element of the jth group:

β_{(jk)} = (β_{11}, .., β_{1m_n}, .., β_{j1}, .., β_{j,k−1}, β_{j,k+1}, .., β_{jm_n}, .., β_{p1}, .., β_{pm_n}).

The corresponding full conditional posterior distributions are:

b_j | rest ∼ l_j δ_0(b_j) + (1 − l_j) N_{m_j}(μ_j, Σ_j),

where l_j = P(b_j = 0 | rest),

l_j = \frac{π_0}{π_0 + (1 − π_0) |Σ_j|^{1/2} \exp\left\{ \frac{1}{2σ^4} ||Σ_j^{1/2} V_j^{1/2} X_j^T (Y − X_{(j)} V_{(j)}^{1/2} b_{(j)})||_2^2 \right\}},

with

μ_j = \frac{1}{σ²} Σ_j V_j^{1/2} X_j^T (Y − X_{(j)} V_{(j)}^{1/2} b_{(j)}), and Σ_j = \left( I_{m_n} + \frac{1}{σ²} V_j^{1/2} X_j^T X_j V_j^{1/2} \right)^{-1}.

The full conditional posterior of τ_{jk} is a spike-and-slab distribution:

τ_{jk} | rest ∼ q_{jk} δ_0(τ_{jk}) + (1 − q_{jk}) N^+(u_{jk}, ν_{jk}²), j = 1, .., p; k = 1, .., m_n,

where

u_{jk} = \frac{1}{σ²} ν_{jk}² (Y − X_{(jk)} β_{(jk)})^T X_{jk} b_{jk}, ν_{jk}² = \left( \frac{1}{s²} + \frac{1}{σ²} X_{jk}^T X_{jk} b_{jk}² \right)^{-1},

and q_{jk} = P(τ_{jk} = 0 | rest),

q_{jk} = \frac{π_1}{π_1 + 2(1 − π_1)(s²)^{-1/2} (ν_{jk}²)^{1/2} \exp\left\{ \frac{u_{jk}²}{2ν_{jk}²} \right\} Φ(u_{jk}/ν_{jk})}.

The error variance is updated as

σ² | rest ∼ Inverse Gamma\left( \frac{n}{2} + α, \frac{1}{2} ||Y − Xβ||_2^2 + γ \right).

With the conjugate beta and inverse gamma priors, the remaining posteriors are:

π_0 | rest ∼ Beta(#(b_j = 0) + a_1, #(b_j ≠ 0) + a_2),
π_1 | rest ∼ Beta(#(τ_{jk} = 0) + c_1, #(τ_{jk} ≠ 0) + c_2),
s² | rest ∼ Inverse Gamma\left( 1 + \frac{1}{2} #(τ_{jk} ≠ 0), t + \frac{1}{2} \sum_{j,k} τ_{jk}² \right).
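The practical output of this sampler is bi-level selection through the posterior median: a function f_j is estimated as null when the posterior medians of all of its coefficients β_jk = τ_jk b_jk are zero, and individual basis terms within a retained function are dropped when their own medians are zero. A minimal R sketch of this summary step, assuming the Gibbs draws of β have already been collected (the names are ours):

```r
# beta_draws: (number of Gibbs samples) x (total number of basis coefficients)
# group:      vector mapping each column of beta_draws to its function index j
# Both are assumed to come from a run of the bi-level Gibbs sampler above.
summarize_selection <- function(beta_draws, group) {
  beta_med <- apply(beta_draws, 2, median)               # posterior median thresholding
  selected_terms <- beta_med != 0                        # within-group (knot) selection
  selected_funcs <- tapply(selected_terms, group, any)   # group (function) selection
  list(beta_hat       = beta_med,
       functions_kept = which(selected_funcs),
       terms_kept     = which(selected_terms))
}
```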
2.2.3 Simulated Example

We apply Ghosh and Xu's [2015] bi-level Bayesian sparse group selection method to function selection and estimation in two simulated examples.

Example 1. We define four functions:

• f_1(x) = 5x
• f_2(x) = 3(5x − 1)²
• f_3(x) = 10 sin(2πx)/(2 − sin(2πx))
• f_4(x) = 6(0.1 sin(2πx) + 0.2 cos(2πx) + 0.3 sin(2πx)² + 0.4 cos(2πx)³ + 0.5 sin(2πx)³)

This generating model is the same as Example 1 of Huang et al. [2010]. We use p = 20 and n = 100. We employ n/4 internal knots for each function, so that in total we have p·n/4 coefficients. Again, our assumption is that those n/4 internal knots are sparse, and we will select the important knots from them. We use B-spline basis functions of degree 3 with n/4 internal knots. We set f_5(x) = f_6(x) = .. = f_20(x) = 0 as null functions. We generate the covariates as x_j = (w_j + t·u)/(1 + t) for j = 1, 2, 3, 4 and x_j = (w_j + t·v)/(1 + t) for j = 5, .., 20, where (w_j, u, v) ~ind N^+(0, 1) and t controls the amount of correlation. The covariates from the zero components and the non-zero components are independent. We work with t = 0 and t = 0.7, and a signal-to-noise ratio of 9.

Figure 2.2: Example 1 simulation plot

We run this simulation setup 100 times and examine the selection performance of the method, computing the number of true positives and true negatives over the 100 repetitions. In this model we have four true functions and the rest are null functions. Our first function is detected 20% of the time, and the other three functions are correctly detected 100% of the time. In contrast, the null functions are not selected in the model a single time in our 100 repetitions. We compute the ISE for each true function and plot them in a boxplot for the 100 repetitions, with the error computed for the overall response as well. A few other plots are included, such as actual vs. predicted plots for the response and plots of rejection probabilities for the true and null functions. All the results above are for the setup in which the predictors are not correlated.

In our next step, we run the same simulation setup with correlated predictors. The response is generated by additive models with the same four functions mentioned above and the same signal-to-noise ratio, but we set the correlation between covariates to 0.7. We run this setup for 100 repetitions and study the results. Out of 100 repetitions, the selection of the first function is very low at 6%, while the other three true functions are selected in 100% of the iterations. On the other hand, one null function comes up as significant once, i.e., the selection percentage for that null function is 1%. Below are the same plots that were given for the uncorrelated setup.

Figure 2.3: Example 1 simulation plot

Example 2. In the second example we take four different functions, defined as:

• f_1(x) = 3x
• f_2(x) = x + (2x − 2)²/5.5
• f_3(x) = −x + π sin(πx)
• f_4(x) = 0.5x + 15φ(2(x − 0.2)) − φ(x + 0.4),

where φ(·) is the standard normal density function. Again, in this example we use p = 20 and n = 100. We utilize n/4 internal knots for each function, so that in total we have p·n/4 coefficients, and again our assumption is that those n/4 internal knots are sparse for function estimation. We use B-spline basis functions constructed from the predictors. The predictors are generated independently from Uniform(−2, 2), and the signal-to-noise ratio is kept at 5. The simulation setup runs for 100 repetitions. Out of 100 repetitions, the four true functions are selected 75%, 28%, 85%, and 92% of the iterations, respectively, whereas the null functions are not selected a single time in the 100 repetitions. A few plots are included below to provide a better picture.

Figure 2.4: Example 2 simulation plot

In the next step, we generate the covariates from an AR(1) process with ρ = 0.7. In the case of correlated covariates, the relationship between predictors and response is still of theoretical interest. We repeat the simulation for the same set of functions mentioned above and gather the outputs. We run this setup for 100 repetitions and study the results. Out of 100 repetitions, the selection percentages of the true functions are 97%, 50%, 13%, and 97%, respectively. The selection of false positives is essentially zero: only one of the null functions appears as significant, in 1% of the repetitions. Below are the same plots as in the uncorrelated scenario.
Figure 2.5: Example 2 simulation plot

2.2.4 Conclusion

We explored an application of Bayesian variable selection for a sparse additive model. The motivation behind applying bi-level selection for function selection and estimation is to find the true nonzero functions and estimate them through knot selection. Our idea worked for a small number of functions: if we have 20 functions and 4 of them are significant, the proposed algorithm performs satisfactorily. However, the initial problem was to apply this to a large number of functional spaces. Therefore, more research is needed to draw a final conclusion.

CHAPTER 3
BAYESIAN CLASSIFICATION OF ALZHEIMER'S DISEASE STAGES FROM LONGITUDINAL VOLUMETRIC MRI DATA

3.1 Introduction

Alzheimer's disease (AD) is the most frequent neurodegenerative and age-related form of dementia. AD patients suffer loss of memory and difficulty with speech, and over time become unable to perform daily tasks such as bathing, dressing, eating, and using the bathroom. AD is becoming a significant societal and financial burden among elderly people. As many treatments are being developed and evaluated, it is important to be able to determine early and accurately which individuals are relatively more likely to progress clinically. Based on signs and symptoms, physicians usually track AD using the Clinical Dementia Rating (CDR), and subjects are classified into three states: cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer's disease (AD). AD accounts for between 60% and 80% of dementia cases. Brain MRI scans provide useful information regarding dementia and its progression. Many group studies based on volumetric regions of interest (ROI) ([105],[118]), voxel-based morphometry ([124],[125]), or group comparisons of cortical thickness ([55],[84]) have shown that many brain regions, such as the entorhinal cortex, the hippocampus, lateral and inferior temporal structures, and the anterior and posterior cingulate, are involved in brain atrophy in AD. The atrophy patterns vary across disease stages and across regions. According to Leung et al. [74], MCI converters have higher rates of brain volume decline than nonconverters do. Hence, hippocampal atrophy measured by MRI is a marker of AD pathology.

MRI provides scientifically accurate and easy-to-collect data on brain sub-regional volumes, surface area, and cortical thickness. Besides measuring the cross-sectional volume of the hippocampus or structural changes in other regions, there is great interest in longitudinal brain volume changes. The greatest advantage of using longitudinal data is that it can capture the high correlation between observations and the time trends, and it can track the rate of change over time. Significant hippocampal atrophy over time has been observed in AD groups ([7],[49],[60],[119]), whereas healthy aging control groups do not show such high rates of atrophy. In this chapter, we focus on brain regional volumetric changes over time for five ROIs measured by MRI scans: the hippocampus (H), entorhinal cortex (EC), middle temporal cortex (MTC), fusiform gyrus (FG), and the whole brain (WB). The key aim of this chapter is to apply a novel Bayesian classification method for classifying patients into disease groups by observing their longitudinal brain regional changes. Throughout this chapter, we work with dichotomous target variables: AD and CN.
A number of high-dimensional classification and regression methods have been developed for classifying patients at different stages of the disease, and also for predicting future clinical changes in MCI patients. Several machine learning algorithms are used to deal with high-dimensional data, such as support vector machines (SVMs) and linear discriminant analysis. Different approaches have been proposed in various papers ([50],[80],[90],[100]) to separate patients with dementia from those at other stages of the disease. Zhang et al. [133] applied a multi-kernel SVM for classification of patients using a longitudinal feature selection method; for classification between MCI and AD patients, they achieved 78.4% accuracy, 79% sensitivity, and 78% specificity. Lee et al. 2016 [72] used logistic regression with fused lasso regularization to predict the conversion from MCI to Alzheimer's. Seixas et al. [108] developed a Bayesian network model that combined expert knowledge and data-oriented modeling; the authors assumed a known, structured Bayesian network model and estimated its parameters using the EM algorithm.

In this chapter, we develop a Bayesian classification method using longitudinal volumetric data. The data were obtained from the ADNI database (http://www.loni.ucla.edu/ADNI). Longitudinal MRI data have variability in the number of observations for each subject. However, we assume that the longitudinal volumetric measurements are functional observations, and we consider patients with at least three data points. The longitudinal observations of each subject are smoothed using basis splines even if only a few observations are available. Finally, a Bayesian classification method is employed to classify individuals between AD and CN. Our classification method is unique and easy to implement, and it works with functional predictors. In the Bayesian literature, very few classification methods have been developed using functional predictors. In this context, we mention Zhu et al. [136], who worked with functional predictors for classification. They developed a selection method to obtain important functional predictors for a binary model and proposed two MCMC algorithms for posterior sampling, which are Metropolis-Hastings/Gibbs sampler hybrids. We propose the Bayesian Pólya-gamma augmentation method for classification; it is easy to implement using a Gibbs sampler algorithm. Functional predictors are smoothed by cubic basis splines, and the basis coefficients are used as predictors in the classification model. We obtain very good classification results in terms of sensitivity, specificity, and accuracy.

This chapter is divided into five sections. Section 3.2 discusses the functional smoothing and the classification method developed for the analysis. In Section 3.3, we explain the data in detail. Section 3.4 presents the numerical results and some of the key points we extract from our analysis. Section 3.5 concludes with a short discussion.

3.2 Methodology

In this section, we develop a supervised classification technique to handle functional predictors.

3.2.1 Smoothing functional data

Classification with functional data is a challenge. Functional predictors have high correlation between observations from the same individual, and the sole purpose of using functional predictors is to capture the time trend present in the data. Therefore, we take the approach of smoothing the observed curves with a cubic basis spline.
After we smooth a particular longitudinal functional curve, the resulting coefficient vector can be used as a predictor in the next stage of classification. To keep the problem tractable, we use data from patients who have observations from at least three time periods, so that the smoothed curves are comparable. [61] applied a similar approach to obtain the estimates of a generalized linear model with functional predictors; however, his approach was frequentist, whereas ours is Bayesian.

Let us assume that we observe n patients with their functional observations, where each patient has p functions (covariates). Let x_{ij}(t) be the j-th function observed at time point t from the i-th patient. Let T be the compact domain of x_{ij}(t) and x_{ij}(t) ∈ L²[T]. With the functional predictors x_{ij}(t) we assume that we have a binary response variable y_i that takes the values 0 and 1. Therefore, a logistic regression equation would look like

\log\left\{ \frac{P(y_i = 1 | x_{i1}, .., x_{ip})}{1 − P(y_i = 1 | x_{i1}, .., x_{ip})} \right\} = \sum_{j=1}^{p} \int_T x_{ij}(t) β_j(t) dt. (3.1)

To fit the discrete observations x_{ij}(t), we assume that at any given time t the observed value x_{ij}(t) is a noisy version of an underlying smooth curve X_{ij}(t),

x_{ij}(t) = X_{ij}(t) + e(t),

where e(t) is a zero-mean Gaussian process. We use a basis function expansion for X_{ij}(t) of the form

X_{ij}(t) = \sum_{k=1}^{q} c_{ijk} φ_{jk}(t) = c_{ij}' φ_j(t),

where φ_j(t) is the q-dimensional spline basis at time t for the jth function, and c_{ij} are the q-dimensional spline coefficients for the jth predictor from the ith patient. We use ordinary least squares estimates for the spline coefficients. A simple linear smoother is obtained by minimizing the least squares criterion ||x_{ij} − Φ c_{ij}||² as

\hat{c}_{ij} = (Φ'Φ)^{-1} Φ' x_{ij}. (3.2)

Once the basis coefficients have been estimated, we can combine (3.1) and (3.2) by plugging x̂_{ij}(t) into (3.1), which gives

\log\left\{ \frac{P(y_i = 1 | x_{i1}, .., x_{ip})}{1 − P(y_i = 1 | x_{i1}, .., x_{ip})} \right\} = \sum_{j=1}^{p} \int_T \hat{c}_{ij}' φ_j(t) β_j(t) dt = \sum_{j=1}^{p} \hat{c}_{ij}' β_j = c_i' β, (3.3)

where β_j = \int_T β_j(t) φ_j(t) dt is the coefficient vector for the jth functional predictor. Here c_i is the vector whose first element is 1, followed by the spline coefficients for the ith patient, and β contains the intercept of the model as its first element.
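A short R sketch of the smoothing step in equation (3.2): one patient's irregular longitudinal measurements of a single ROI are regressed on a cubic B-spline basis evaluated at the visit times, and the fitted coefficients ĉ_ij become that patient's predictor vector in (3.3). The function and variable names are ours, and the choice of four internal knots with an intercept (eight basis functions) matches the application in Section 3.4.

```r
library(splines)

# Least-squares smoothing of one patient's longitudinal ROI curve (eq. 3.2):
# regress the observed values on a cubic B-spline basis evaluated at the visit
# times and keep the fitted basis coefficients as that patient's predictors.
smooth_curve <- function(times, values, n_knots = 4) {
  knots <- quantile(times, probs = seq_len(n_knots) / (n_knots + 1))
  Phi <- bs(times, knots = knots, degree = 3, intercept = TRUE)
  drop(solve(crossprod(Phi), crossprod(Phi, values)))   # (Phi'Phi)^{-1} Phi' x_ij
}

# One hypothetical patient with ten visits (months) and made-up volumes;
# with very few visits the least-squares system can be rank deficient, so in
# practice the number of visits should not be far below the basis dimension.
times  <- c(0, 6, 12, 18, 24, 30, 36, 48, 60, 72)
volume <- 7200 - 5 * times + rnorm(10, sd = 20)
c_hat  <- smooth_curve(times, volume)
length(c_hat)   # n_knots + 4 = 8 basis coefficients per ROI per patient
```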
Functional principal component (FPC) analysis is also a popular method that could be applied here. Instead of least squares basis estimates, one can work with FPC scores for classification. [93] worked on functional modeling that extends the applicability of FPC analysis to longitudinal data; in particular, FPC scores can be used when we have few, irregularly observed, repeated data points. In our functional smoothing method, we expand the functional observation with spline basis functions and use the basis coefficients for classification; the same intuition applies to FPC scores. For functional principal component analysis, we assume that the longitudinal observations come from a smooth random function X(t) with mean function μ(t) = E X(t) and covariance function G(s, t) = cov(X(s), X(t)). The covariance function can be represented as

G(s, t) = \sum_{k=1}^{∞} λ_k φ_k(s) φ_k(t),

where the φ_k's are eigenfunctions and the λ_k's are eigenvalues. Then the underlying process can be written as

X(t) = μ(t) + \sum_{k=1}^{∞} ξ_k φ_k(t),

where the ξ_k's are frequently referred to as FPC scores. These scores can be used later in the classification model. We do not work with an infinite number of scores; instead, the above sum is approximated with a finite K that explains the majority of the variance in the functional observations. In most cases, the first two FPC scores are enough to build a good classification model. Zhu et al. [136] also used FPC scores in their classification model; they chose functional predictors with significant FPC scores. There are relatively few counterparts to functional PCA in the Bayesian literature. Behseta et al. [9] proposed Bayesian FPCA using two methods, one with a random-coefficient model and another with a hierarchical Gaussian process model, and found better performance for the hierarchical model. Later, in 2008, van der Linde [121] introduced a variational algorithm for one-parameter exponential families to obtain approximate Bayesian inference in functional PCA. Recently, Suarez and Ghosal [117] proposed a prior structure on the covariance function of functional observations; their model simultaneously smooths functional observations while estimating principal components. In this chapter, we work with the basis spline smoothing method due to its ease of implementation in statistical software. In R, the splines package fits cubic basis splines to longitudinal data with equally placed knots. We do not investigate any findings using FPC scores instead of basis spline coefficients, as our main focus is on the classification algorithm and the basis spline coefficients work very well for our classification model.

3.2.2 Classification using Pólya-Gamma Augmentation

In this section, we present the relationship between predictors and outcome using a logistic regression model. The binary data are denoted y_i ∈ {0, 1} (i = 1, .., n). Posterior sampling of logistic regression coefficients is difficult due to the model's complicated likelihood function. The assumption of a Gaussian prior for the regression coefficients is highly important, but the full posterior distribution of the regression coefficients then becomes analytically inconvenient. In comparison to the logistic model, it is computationally easier to carry out Bayesian inference using a probit model [2], and different sampling algorithms have been proposed [2]. Holmes et al. [54] developed an indirect sampling method by introducing auxiliary variables for binary and multinomial regression. Later, more methods based on latent variables for logistic regression were reported by Frühwirth-Schnatter et al. [41], Gramacy et al. [45] and Polson et al. [97]. Among all these works, Polson et al.'s algorithm is most interesting to us due to its ease of computational implementation as well as its sampling efficiency. Our aim is to avoid the complex Metropolis algorithm while sampling from the posterior distributions of the regression coefficients.

A vast body of literature is concerned with the analysis of Bayesian logistic models. As mentioned earlier, Holmes et al. [54] proposed using an auxiliary variable to avoid the conditional non-conjugacy when updating β. The prior structure they used was:

y_i = 1 if z_i > 0, and 0 otherwise,
z_i = x_i β + ε_i,
ε_i ∼ N(0, λ_i),
λ_i = (2ψ_i)²,
ψ_i ∼ Kolmogorov-Smirnov distribution,
β ∼ normal distribution.

The above prior is interesting because the marginal likelihood L(β | data) is the same as the likelihood for the logit model.
However, the main disadvantage of using this prior structure is that although we obtain a conjugate full conditional distribution of β given the data, the conditional distribution π(λ_i | z_i, β) does not have any standard form. One must use a complicated rejection sampling method to sample the conditional λ_i. Hence, adding an auxiliary variable does not give us a significant computational improvement compared to using the Metropolis-Hastings algorithm. Frühwirth-Schnatter et al. [41] addressed this problem with the same approach, but instead of a single auxiliary variable they used a two-stage augmentation method. In the first stage, they assumed the existence of a latent variable, with the binary response variable determined by the sign of the auxiliary variable. The error part of the model was assumed to follow a type I extreme value distribution, which has a non-normal density. Then, in the second stage of data augmentation, they approximated this non-normal error distribution using a mixture of normal distributions. Finally, they obtained a multivariate normal distribution for the posterior of β.

Among all popular methods, Pólya-gamma augmentation is the most interesting to us, and it is easy to apply due to the availability of an R package. We now discuss Polson et al.'s [97] algorithm in detail. They showed how a Gaussian variance mixture distribution with a Pólya-gamma mixing density can represent the logit likelihood. We start by defining the Pólya-gamma density. A random variable X follows PG(b, c), a Pólya-gamma distribution with parameters b > 0 and c ∈ ℝ, if

X \stackrel{d}{=} \frac{1}{2π²} \sum_{k=1}^{∞} \frac{g_k}{(k − 1/2)² + c²/(4π²)},

where the g_k ∼ Gamma(b, 1) are independent gamma random variables and \stackrel{d}{=} indicates equality in distribution. In their work, Polson et al. [97] showed that Bernoulli likelihoods parametrized by log-odds can be represented as mixtures of Gaussians with respect to a Pólya-gamma distribution. Assume we have latent variables ω such that ω ∼ PG(b, 0), which is an infinite sum of gammas:

ω \stackrel{d}{=} \frac{1}{2π²} \sum_{k=1}^{∞} \frac{g_k}{(k − 1/2)²}.

Now if ω ∼ PG(b, 0), the following identity can be proved:

\frac{(e^ψ)^a}{(1 + e^ψ)^b} = 2^{-b} e^{κψ} \int_0^∞ e^{−ωψ²/2} p(ω) dω, (3.4)

where κ = a − b/2.

Motivated by Polson et al.'s [97] integral result, we construct a Bayesian prior formulation targeted at binary logistic regression. Equation (3.3) has a Bernoulli likelihood function with a logit link, and we proposed a normal prior for the β vector in equation (3.3). To derive our Gibbs sampler, we introduce a latent variable ω. Our prior setup is

y_i | c_i, β ∼ Bernoulli\left( \frac{e^{c_i'β}}{1 + e^{c_i'β}} \right), i = 1, .., n,
ω_i ∼ PG(1, 0), i = 1, .., n,
β ∼ Normal(b, B). (3.5)

Gibbs Sampler: The likelihood for the ith observation is

L_i(β) = \frac{(e^{c_i'β})^{y_i}}{1 + e^{c_i'β}} ∝ e^{κ_i c_i'β} \int_0^∞ \exp\left\{ −\frac{ω_i (c_i'β)²}{2} \right\} p(ω_i) dω_i,

where κ_i = y_i − 1/2. We combine the likelihood function with the β prior, given ω = (ω_1, .., ω_n):

p(β | ω, y, c) ∝ p(β) \prod_{i=1}^{n} \exp\left\{ κ_i c_i'β − \frac{ω_i (c_i'β)²}{2} \right\}
∝ \exp\left\{ −\frac{1}{2}(z − Cβ)^T Ω (z − Cβ) \right\} × \exp\left\{ −\frac{1}{2}(β − b)^T B^{-1} (β − b) \right\}
∝ \exp\left\{ −\frac{1}{2}\left(β − C_ω(C^T Ω z + B^{-1} b)\right)^T C_ω^{-1} \left(β − C_ω(C^T Ω z + B^{-1} b)\right) \right\},

where z = (κ_1/ω_1, .., κ_n/ω_n), Ω = diag(ω_1, .., ω_n) and C_ω = (C^T Ω C + B^{-1})^{-1}. Finally, posterior samples for the regression coefficients can be computed with Pólya-gamma augmentation. The steps are

(ω_i | β) ∼ PG(1, c_i^T β),
(β | y, ω) ∼ N(C_ω(C^T Ω z + B^{-1} b), C_ω). (3.6)
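The two conditional draws in (3.6) are short enough to write out directly. Below is a minimal R sketch using the rpg function from the BayesLogit package for the Pólya-gamma draws; C is the n × q matrix holding the intercept and stacked basis coefficients, y the binary response, and (b, B) the normal prior on β. This is an illustration under our own naming choices rather than the exact code used for the analysis.

```r
library(BayesLogit)   # provides rpg() for Polya-gamma draws

# Gibbs sampler for Bayesian logistic regression via Polya-gamma augmentation
# (equation 3.6).  C: n x q design matrix; y: binary response in {0, 1}.
pg_logit_gibbs <- function(y, C, n_iter = 10000, burn = 5000,
                           b = rep(0, ncol(C)), B = diag(100, ncol(C))) {
  n <- nrow(C); q <- ncol(C)
  B_inv <- solve(B)
  kappa <- y - 0.5                                   # kappa_i = y_i - 1/2
  beta  <- rep(0, q)
  draws <- matrix(NA_real_, n_iter, q)
  for (it in seq_len(n_iter)) {
    omega <- rpg(n, h = 1, z = drop(C %*% beta))     # omega_i | beta ~ PG(1, c_i'beta)
    V <- solve(crossprod(C * omega, C) + B_inv)      # C_omega = (C'Omega C + B^-1)^-1
    m <- V %*% (crossprod(C, kappa) + B_inv %*% b)   # note C'Omega z = C'kappa
    beta <- drop(m + t(chol(V)) %*% rnorm(q))        # beta | y, omega ~ N(m, V)
    draws[it, ] <- beta
  }
  draws[(burn + 1):n_iter, , drop = FALSE]
}

# Usage sketch: posterior median of beta, then predicted probabilities
# beta_hat <- apply(pg_logit_gibbs(y, C), 2, median)
# p_hat    <- plogis(C_test %*% beta_hat)   # classify as AD when p_hat > 0.5
```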
Once we obtain the posterior samples of β, we can calculate the predicted probability for each patient. We use the posterior median estimate for β; that is, we calculate the median of the posterior samples of β and use those estimates for validation in the test data set. Finally, we need a threshold to classify each patient into either the AD or the CN group; we use a 0.5 probability threshold to convert the predicted probabilities into the dichotomous {1, 0} prediction. The classification method is implemented in the R package BayesLogit, whose sampler is highly efficient for drawing from the Pólya-gamma distribution with the first parameter equal to one. In 2013, Choi et al. [26] published a paper showing that "the Pólya-gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic". This result is crucial, as it guarantees that Monte Carlo averages of posterior samples obey a central limit theorem, and it is a strong reason to adopt this method for our analysis.

3.3 Data Description

The MRI data used in the analysis of this thesis were collected from the Alzheimer's Disease Neuroimaging Initiative (ADNI) website. We consider five main longitudinal predictors as potential markers of differences between AD patients and normally aging persons. Both AD patients and normally aging persons show brain region atrophy, and our aim is to find the essential markers of the atrophy that leads to dementia after a certain age. We have longitudinal volumetric measurements of the hippocampus (H), entorhinal cortex (EC), middle temporal cortex (MTC), fusiform gyrus (FG), and the whole brain (WB). We believe that these are the potential predictors that can distinguish AD patients from normally aging patients.

From the ADNI database, we have information on a total of 1,279 patients diagnosed as either AD or CN over their visits for tests. We discard five patients who changed their disease status from CN to AD over their longitudinal measurements. At the start of our analysis, we decided to keep patients who have at least three data points in their longitudinal measurements. After filtering by number of data points and excluding patients whose predictor-related information is missing, 528 patients are included in the study, divided into two groups: AD (n = 206) and CN (n = 322). The rest of the patients have the status of mild cognitive impairment (MCI). All patients went through an initial clinical evaluation to measure their baseline scores for different tests, such as the Mini-Mental State Examination (MMSE) and the 11-item Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS11). In addition, patients underwent a baseline structural MRI scan, giving us baseline measurements of all the predictors of interest, and at the start of the study, subjects provided apolipoprotein E (APOE) genotyping information. Data were collected from patients' visits at specific time points, such as after 6, 12, 18, 24, and 36 months. Patients' visits were irregular, and we do not have uniform gaps between consecutive measurements.

We start by comparing the baseline measurements between the AD and the CN group, as shown in Table 3.1. Although the age distribution and gender ratio are not significantly different between these two groups, the other three features, namely baseline MMSE, ADAS11 score, and APOE proportion, show significant differences between the two disease classes at the 5% level of significance. This implies that these three features can be used as strong predictors for classifying patients.
Furthermore, we examine the patterns and spread of the number of visits per patient. For all the patients we consider, the minimum observed time span is 10 months (with only three observations collected), while the longest is 10 years. In terms of the number of visits or data points, there is a minimum of three data points due to our threshold, and the maximum is 11; the median number of data points is around 4. Table 3.2 provides a detailed breakdown of this information for the AD and the CN group. In the spaghetti plot in Figure 3.1, we plot the predictor variables for each subject, segmented by disease status, along with another panel for patients' MMSE scores. The mean values for each group are highlighted in the plots. In the first five plots, we can see that the hippocampus and the other important variables have much higher values for the CN group than for dementia patients. We can also visually observe the MMSE score patterns, and Table 3.1 likewise shows much higher MMSE scores for normal patients. The goal is to predict the probability of a patient belonging to either of these groups from the measured MRI scans.

Table 3.1: Patients' Baseline Characteristics

                      AD              CN              p-value
n                     206             322
Age (Mean ± sd)       73.53 ± 7.67    74.15 ± 5.67    0.325
Gender (F/M)          97/109          155/167         0.814
MMSE (Mean ± sd)      24.38 ± 2.5     29.07 ± 1.13    <0.0001
ADAS11 (Mean ± sd)    16.8 ± 6.42     5.85 ± 2.87     <0.0001
APOE (+/-)            143/63          85/237          <0.0001

Notes: Comparison of baseline age, gender ratio, MMSE score, ADAS11 score and APOE ratio between the AD and CN groups.

Table 3.2: Data Characteristics

            Span (Months)            #Data Points (Nobs)
      n     Max    Min    Median     Max    Min    Median
AD    206   71     10     24         7      3      4
CN    322   120    11     36         11     3      4

* Duration of patients' visits and number of visits available for each group.

3.4 Application results

This section presents the numerical results of applying the method proposed in Section 3.2 to the ADNI data. We first deal with a single potential covariate, the hippocampus, as a representation of a single ROI's volume. Section 3.4.1 lists the results we obtain using the hippocampus as a single predictor. We repeat the same procedure in Section 3.4.2, listing all classification results using single ROIs. Section 3.4.3 presents the results for combinations of the ROIs, to determine the best model with the greatest classification power. We work with five potential predictors and a significant number of combinations of the five ROIs.

3.4.1 Classification using the Hippocampus

In this section, we consider the volumetric MRI data of the hippocampus to demonstrate the applicability of the proposed method with a single predictor.

Figure 3.1: Longitudinal volume of ROIs for dementia patients and normal controls. The first five plots (plots in the first column and the top two plots in the second column) are scaled ROIs of interest. The last plot has MMSE scores of patients. On the X-axis, we have the number of visits for patients. Blue lines are for the dementia group and orange lines are for the normal group. Thin lines represent each patient's data, and thick lines are the pooled mean for the AD and CN groups.

We start our analysis by smoothing the longitudinally observed values of the hippocampus. A simple least squares approximation is sufficient, as we assume that the residuals of the true curve are independently and identically distributed with mean 0 and constant variance. We use cubic B-spline basis functions for spline smoothing of the observed hippocampus volume.
Four internal knots are used for spline smoothing with an intercept, which gives us eight basis functions. We want to ensure that the smoothed estimated curve yields a good fit to each patient's observed curve. As we do not have a large number of data points for each patient, we do not attempt to control overfitting of the estimated curve. Figure 3.2(a) shows the observed hippocampus volume of patients in both groups for all visits. The black points represent the mean values of the smoothed curves for all patients at each time point or visit. The smoothed curve appears to track the variation in the volumetric measurements accurately. Besides least squares smoothing, functional principal component scores could also be used for this analysis.

At the start of the analysis, we divide the data set into two parts: three-quarters of the patients (457) are reserved as a training data set, and the rest (152) are kept for testing. For each patient in the training data set, we gather the basis coefficients used as predictors for classification. For classification, the β prior is a multivariate normal distribution with mean vector 0 and a diagonal covariance matrix with diagonal elements equal to 100. The large diagonal entries of the covariance matrix represent the absence of prior knowledge about β and correspond to a non-informative prior. The Pólya-gamma augmentation method is run for 10,000 iterations, and the initial 5,000 iterations are treated as the burn-in period for the Markov chain Monte Carlo. Figure 3.2(b) shows the predicted probability plot, with a red horizontal line at a probability value of 0.5. A probability threshold of 0.5 is used to classify patients as either AD or CN: patients with more than 0.5 probability are tagged as AD patients. Figure 3.2(c) presents a boxplot of the β coefficients estimated from Pólya-gamma augmentation. The coefficients are very small and mostly close to 0, except for the intercept, which is not shown in the plot.

We apply the method described in Section 3.2 to the training data and obtain the estimated β coefficients. We then apply the functional smoothing to the hippocampus volumes in the test data set. Using both the estimated coefficients and the basis matrices for this data set, we obtain the predicted probabilities. To check the robustness of our classification method, we repeat the procedure 100 times and examine the corresponding sensitivity, specificity, accuracy, and area under the ROC curve (AUC). Table 3.3 lists the results for all the important predictors we consider when used as single ROIs in the model, including the sensitivity, specificity, accuracy, and AUC means with standard deviations. We obtain 82% sensitivity, 88% specificity, and 86% accuracy for the classification using the hippocampus, which gives us strong confidence in the validity of our method. The corresponding ROC curve is shown in Figure 3.2(d).

3.4.2 Classification with a Single ROI

In the previous section, we showed the applicability of our proposed method using the hippocampus as a single ROI. We now evaluate the classification performance when we consider each of the five ROIs: H, WB, EC, FG, and MTC. We repeat our proposed procedure 100 times and calculate sensitivity, specificity, accuracy, and AUC on the one-quarter test data set, setting the threshold probability to 0.5 in all cases. The means and standard deviations of sensitivity, specificity, accuracy, and AUC are listed in Table 3.3.
Table 3.3: Classification performance using single ROIs

              Hippocampus    Whole Brain    Entorhinal Cortex   Fusiform Gyrus   Middle Temporal Cortex
              (Mean ± sd)    (Mean ± sd)    (Mean ± sd)         (Mean ± sd)      (Mean ± sd)
sensitivity   0.82 ± 0.01    0.75 ± 0.01    0.80 ± 0.01         0.73 ± 0.01      0.75 ± 0.01
specificity   0.88 ± 0.01    0.74 ± 0.01    0.90 ± 0.001        0.87 ± 0.01      0.87 ± 0.01
accuracy      0.86 ± 0.001   0.75 ± 0.001   0.86 ± 0.01         0.81 ± 0.01      0.83 ± 0.01
AUC           0.92 ± 0.02    0.83 ± 0.03    0.92 ± 0.02         0.89 ± 0.03      0.89 ± 0.03

Notes: Sensitivity is the proportion of correct AD predictions; specificity is the proportion of correct CN predictions; AUC is the area under the ROC curve. The means and standard deviations are based on 100 repeated results on test data sets. The probability threshold is 0.5.

It is evident from Table 3.3 that single ROIs are critical for distinguishing dementia patients from normally aging people. We obtain an accuracy rate of around 0.80 for single ROIs, and the AUC is around 0.9, which are excellent statistics. We also examine the sensitivity and specificity using single ROIs. A high sensitivity of 0.80 gives us confidence in our proposed method and implies that volumetric MRI data from a single ROI does have strong classification power.

Figure 3.2: (a) Mean smoothed hippocampus scores; (b) predicted class probability; (c) box plot of posterior median estimates of model coefficients using the hippocampus; (d) ROC curve for the hippocampus model. In (a), the colored points represent hippocampus values for each time point, and the black dot signifies the mean value of the smoothed curve for that time point. Orange is for CN and blue is for AD. In (b), the points are the predicted class probability for patients in the test data set. (c) presents the spread of the estimated regression coefficients from the Bayesian logistic model in 100 repeated runs. (d) shows the ROC curve after classifying patients using the hippocampus only.

3.4.3 Classification using combinations of ROIs

In the final stage of our analysis, we consider all combinations of ROIs. We first check the classification performance of our model using combinations of the five ROIs, and we later include the baseline MMSE, ADAS11 score, and APOE status. Table 3.1 showed that patients had significant differences in MMSE and ADAS11 scores at the beginning of the study; we include these variables to make our model stronger in terms of classification power. We describe our analysis in the following.

Before applying our method directly to the data, we make one change to the volumetric MRI ROIs: we normalize all the volumetric ROIs by dividing them by the intracranial volume (ICV), to bring each patient's ROI measurements to the same level. That is, we divide each patient's ROI volume at each time point by the corresponding ICV at that time point. We then calculate the basis coefficients from the longitudinal measurements of the normalized volumetric ROIs and use those coefficients, together with baseline MMSE, ADAS11, and APOE, in the Bayesian logistic model for each subject. The same Pólya-gamma augmentation is used to avoid the complicated Metropolis-Hastings algorithm. The following is the model for classification, with p_i = P(y_i = 1 | x_{ij}(t)), using the combination of H, WB, EC, FG and MTC:

logit(p_i) = β_0 + β_1 c_i^H + β_2 c_i^WB + β_3 c_i^EC + β_4 c_i^FG + β_5 c_i^MTC + β_6 MMSE_i + β_7 ADAS11_i + β_8 APOE_i, i = 1, ..., n. (3.7)

We use this model with our normalized data to classify patients into two groups: AD and CN.
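For completeness, a small R helper (our own illustration) that converts the predicted probabilities from the fitted model into the sensitivity, specificity, and accuracy figures reported in the tables, using the 0.5 probability threshold described above.

```r
# Sensitivity, specificity and accuracy at a 0.5 probability threshold.
# y_true: 1 = AD, 0 = CN;  p_hat: predicted probability of AD.
classification_metrics <- function(y_true, p_hat, threshold = 0.5) {
  y_pred <- as.integer(p_hat > threshold)
  tp <- sum(y_pred == 1 & y_true == 1)   # correctly identified AD patients
  tn <- sum(y_pred == 0 & y_true == 0)   # correctly identified CN subjects
  fp <- sum(y_pred == 1 & y_true == 0)
  fn <- sum(y_pred == 0 & y_true == 1)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(y_true))
}
```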
We obtain markedly better results than those observed using single ROIs; the classification power of the model is very high, with sensitivity reaching 93%.

3.4.4 Conclusion

We collected results for 26 different combinations of ROIs, excluding the results for single ROIs. The important points and conclusions based on those results are listed below:

• Overall, we obtain around 80% sensitivity, 90% specificity, and 85% accuracy using single ROIs as potential predictors. This implies that each individual ROI has strong classification power in differentiating AD patients from normally aging people. Among the five potential volumetric MRI measures, the whole brain yields the lowest sensitivity, specificity, and accuracy, while the hippocampus performs best.

• To compare different model results using combinations of ROIs, we start with the findings obtained from combinations of two ROIs. The overall sensitivity moves to around 80% and the specificity to around 90%. The hippocampus and fusiform gyrus together have the best performance. We do not use the normalized volumes of ROIs for these models.

• In comparing results between different combinations, we mostly emphasize sensitivity, as it indicates the percentage of patients correctly identified as Alzheimer's patients. We achieve more than 80% sensitivity for most of the combinations, and more than 85% sensitivity for the majority of combinations of three ROIs. Specifically, H+WB+FG has 91% sensitivity and 84% accuracy.

• Next, among combinations of four ROIs, H+WB+EC+FG has 0.91 sensitivity and 0.79 specificity, whereas H+WB+EC+MTC has 0.77 sensitivity and 0.89 specificity, indicating that including the middle temporal cortex (MTC) and excluding the fusiform gyrus (FG) decreases sensitivity and increases specificity.

• We obtain the best result when using all five ROIs together with baseline MMSE, ADAS11, and APOE in the model. This model achieves 96% specificity and 95% accuracy, which is high compared to the other combinations. The addition of the MMSE and ADAS11 scores clearly improves the performance of the method. In contrast, the five potential ROIs alone yield 81% sensitivity, 77% specificity, and 79% accuracy.

3.5 Discussion

We have proposed a Bayesian classification method motivated by a practical problem in the domain of Alzheimer's disease. The differentiation of Alzheimer's patients based on their MRI inputs is crucial in medical science. The importance of this problem is two-fold: first, doctors need to know the main factors involved with dementia in aging patients; and second, a computationally feasible and less time-consuming classification method is proposed here. Bayesian logistic regression with functional predictors is itself a challenge due to its analytically complicated likelihood function. The Metropolis-Hastings algorithm is a popular tool used by many researchers to tackle this problem, but the choice of candidate density is highly subjective and convergence can be slow. In this situation, Pólya-gamma augmentation provides sampling efficiency and is easily integrated into hierarchical Bayesian modeling. One more challenging question arises when working with functional predictors: how should the predictors themselves be selected? In our study, we work with five important ROIs and a few baseline score variables. The problem can be extended to a situation in which one has many functional predictors, some of them redundant.
It is crucial in modern Bayesian methodology to find a proper variable selection method for classification that also works with functional predictors. In the context of Alzheimer's disease, not all MRI measurements over time are important for predicting AD patients. In our next study, we build a data-driven method to provide a feasible solution to this problem.

In this chapter, we applied a classification method to distinguish AD patients from healthy controls. However, in the Alzheimer's disease domain, the differentiation of MCI converters (MCI-c) from MCI nonconverters (MCI-nc) is even more interesting. The main problem we faced in dealing with this situation was the low ratio of MCI-c to MCI-nc. The data set was sparse, and we did not find a sufficient number of disease cases in the test data on which to validate our method. In addition, most of the baseline MCI patients had very few data points in their longitudinal measurements.

Aside from the points mentioned above, our method performs very well for the given data. The best classification model is H+WB+EC+FG+MTC+MMSE_bl+ADAS11_bl+APOE_bl, with sensitivity = 0.93, specificity = 0.96, accuracy = 0.95, and AUC = 0.98; this is almost an oracle classifier. Our Bayesian logistic model uses very few functional predictor variables and performs much better than other existing methods. Our method also favors a cost-effective and less time-consuming data collection process, as fewer MRI measurements are needed for the analysis. In summary, the method proposed in this chapter shows good performance in distinguishing Alzheimer's patients from normally aging controls.

Acknowledgment
We would like to thank Yingjie Li [77] (liyingj1@stt.msu.edu) for providing us with the data used in this study.

CHAPTER 4
BAYESIAN PENALIZED MODEL FOR CLASSIFICATION AND SELECTION OF FUNCTIONAL PREDICTORS USING LONGITUDINAL MRI DATA FROM ADNI

4.1 Introduction

The research literature on applied mathematical approaches and classification methods using longitudinal MRI data has seen massive growth over the past decade. Among the broad range of methods applied with varying degrees of success, several warrant mention. Misra et al. [90] implemented a high-dimensional pattern recognition method on baseline and longitudinal MRI scans to predict conversion from MCI to AD over a 15-month period. Zhang et al. [133] used a multi-kernel SVM for classification of patients between MCI and AD, achieving 78.4% accuracy, 79% sensitivity, and 78% specificity. Lee et al. [72] applied logistic regression to predict conversion from MCI to Alzheimer's, using fused lasso regularization to select important features. Seixas et al. [108] proposed a Bayesian network decision model for detecting AD and MCI that accounted for the uncertainty and causality underlying different disease stages; their network combined expert knowledge and data-driven modeling, and the parameters were estimated with an EM algorithm. Adaszewski et al. [1] employed classical group analyses and automated SVM classification of longitudinal MRI data at the voxel level. Arlt et al. [4] studied the correlation between test scores over time and fully automated MRI-based volume estimates at baseline. However, few studies to date have developed methods that raise the sensitivity, accuracy, and specificity of classification of AD diagnosis or progression above 80%. Classification using longitudinal data can be a challenge with a large number of predictors.
The first significant step in handling longitudinal predictors is to consider each multiple-occasion observation as a single function observed over a time interval. Functional predictors are highly correlated with adjacent measurements, and the observational space is high-dimensional. The number of predictors required for estimation often exceeds the number of observations, which introduces the problem of dimensionality. A regression framework is often the most suitable way to model all possible longitudinal effects across ROIs, with the proposed method selecting the important predictors. Moreover, many biomedical studies have shown that a limited number of specific brain regions or ROIs are essential for AD classification. Thus, dimension reduction techniques can be applied, and classification can be limited to the reduced feature set.

Zhu et al. [136] advanced a method for classification and selection of functional predictors that entails calculating functional principal component scores for each functional predictor and then using these scores to classify each individual observation. They proposed Gaussian priors for selection and created a hybrid Metropolis-Hastings/Gibbs sampler algorithm. Although the method reported in the present study is inspired by this approach, we develop a simple Gibbs sampler in which MCMC samples are drawn from standard distributions. We also focus on applying penalized regression for dimension reduction. In the Bayesian variable selection literature, the spike-and-slab prior has widespread applications due to its strong selection power. George et al. ([43],[44]) initially proposed that each coefficient β can be modeled either from the "spike" distribution, where most of the mass is concentrated around zero, or from the "slab" distribution, which resembles a diffuse distribution. Instead of imposing the spike-and-slab prior directly on the regression coefficients, Ishwaran et al. [57] introduced a method in which they placed a spike-and-slab prior on the variance of Gaussian priors. Bayesian variable selection methods also include various Bayesian regularization methods, such as the Bayesian lasso [96], the Bayesian group lasso, and the Bayesian elastic net [76]. We employ a Bayesian group lasso algorithm blended with a spike-and-slab prior, following Xu and Ghosh, 2015[128]. The group structure among the coefficients in our model arises from the functional smoothing of the coefficients, and the group lasso facilitates the selection of the important functional predictors. Thus, our proposed method carries the idea of Bayesian variable selection over to a generalized functional linear model with binary responses.

The fundamental challenge of this work is to perform logistic regression in a Bayesian framework while using a large number of functional predictors. Direct sampling of the regression coefficients from the Bayesian logistic model is difficult due to its complicated likelihood function. In high-dimensional scenarios, the selection of predictors becomes crucial with the introduction of either a spike-and-slab prior, non-local priors, or horseshoe priors. For all such priors, the full posterior distribution of the regression coefficients is analytically inconvenient. We combine the Pólya-gamma augmentation method with the priors proposed by Xu and Ghosh, 2015[128], which yields full conditionals that can be sampled from standard distributions. Our aim is to avoid the complications of Metropolis-Hastings and to develop an easily implementable Gibbs sampler.
In addition, Bayesian estimation provides proper estimates of the model parameters, which are also useful for building inference. A key advantage of this method is that it calculates the log odds of AD with respect to CN based on the selected longitudinal predictors. Moreover, we use a probability threshold for classifying individual patients to validate the modeling performance. We obtained the data used in this dissertation from the ADNI server. The volumetric MRI brain data include parcellated sub-regions of the whole brain, with separate subdivisions for the left and right hemispheres. Volumetric measurements of brain sub-regions over multiple occasions demonstrate differential patterns of brain atrophy between AD patients and normally aging people. Because not all brain regions are equally related to AD, redundant features derived from unrelated brain regions can be removed by limiting the selection to brain sub-regions important for classification. The problem of identifying important brain sub-regions from a large number of functional predictors or longitudinal measurements is far from simple. Various variable selection methods have been designed for single-time-point data with corresponding target variables. We apply a Bayesian variable selection method to select longitudinal features or functional predictors for our data set. We work with 49 functional predictors consisting of longitudinal volumetric measurements in different sub-regional brain ROIs. The use of the spike-and-slab prior ensures that a large number of redundant predictors are dropped from the model. The ROI sub-regions selected by our method will be helpful for future studies aimed at detecting the progression of dementia.

The chapter is organized as follows. In Section 2, we introduce Bayesian variable selection with a spike-and-slab prior. Section 3 discusses functional smoothing of the longitudinal predictors. In Section 4, we introduce our methodology and algorithm for simultaneous selection and classification. Theoretical properties and consistency results are shown in Section 5. We then discuss the application results with simulated data and real data in Sections 6 and 7. Finally, Section 8 examines another potential modeling approach, and Section 9 covers the overall development and limitations of the methodology.

4.2 Bayesian Variable selection

We briefly discuss Bayesian variable selection below.

4.2.1 Spike-and-slab prior

A Bayesian model with a spike-and-slab prior can be constructed as follows:
\[
(Y_i \mid x_i, \beta, \sigma^2) \stackrel{\text{ind}}{\sim} N(x_i'\beta, \sigma^2), \quad i = 1, \dots, n,
\]
\[
(\beta \mid \gamma) \sim N(0, \Gamma), \qquad \gamma \sim \pi(d\gamma), \qquad \sigma^2 \sim \mu(d\sigma^2),
\]
where 0 is a p-dimensional zero vector, Γ is the p × p diagonal matrix diag(γ_1, ..., γ_p), π is the prior measure for γ = (γ_1, ..., γ_p)ᵗ, and µ is the prior measure for σ². Ishwaran et al. [57] proposed this setup and developed optimality properties based on the prior choice for (β | γ). A popular version of the spike-and-slab model, introduced by George et al. ([43],[44]), identifies zero and non-zero β_i's by using zero-one indicator variables γ_i and assuming a scale mixture of two normal distributions:
\[
(\beta_i \mid \gamma_i) \stackrel{\text{ind}}{\sim} (1 - \gamma_i)\,N(0, \tau_i^2) + \gamma_i\,N(0, c_i^2\tau_i^2), \quad i = 1, \dots, p.
\]
The value τ_i² > 0 is chosen suitably small, while c_i > 0 is chosen suitably large. γ_i = 1 corresponds to the β_i's that are significant; these coefficients have large posterior hypervariances and large posterior values of β_i. The opposite occurs when γ_i = 0.
The prior hierarchy for β is completed by assuming a prior for γ_i. As τ_i² tends to zero, more mass is placed at 0, which serves as the prior for insignificant β's. The prior distribution for the regression coefficients can then be written as
\[
(\beta_i \mid \gamma_i) \stackrel{\text{ind}}{\sim} (1 - \gamma_i)\,I_0 + \gamma_i\,N(0, \nu^2),
\]
with I_0 a point mass at 0, and ν² the limit of c_i²τ_i² as τ_i² tends to zero with c_i² sufficiently large.

4.2.2 Bayesian Group Lasso

We discussed the Bayesian group lasso extensively in the introduction. The form of Bayesian group lasso we work with originates in Xu and Ghosh, 2015[128]. A multivariate zero-inflated mixture prior induces sparsity at the group level, as discussed in detail in Xu and Ghosh, 2015[128]. They consider the following hierarchical structure, with an independent spike-and-slab prior for each β_g:
\[
Y \mid X, \beta, \sigma^2 \sim N(X\beta, \sigma^2 I),
\]
\[
\beta_g \mid \tau_g^2, \sigma^2 \sim (1 - \pi_0)\,N_{m_g}(0, \sigma^2\tau_g^2 I_{m_g}) + \pi_0\,\delta_0(\beta_g), \quad g = 1, \dots, G,
\]
\[
\tau_g^2 \sim \mathrm{Gamma}\!\left(\frac{m_g + 1}{2}, \frac{\lambda^2}{2}\right), \quad g = 1, \dots, G,
\]
\[
\sigma^2 \sim \mathrm{IG}(\alpha, \gamma), \qquad \pi_0 \sim \mathrm{Beta}(a, b),
\]
where δ_0(β_g) denotes a point mass at 0. The mixing probability π_0 can be defined as a function of the number of predictors to impose more sparsity as the feature size increases. The choice of λ is critical for Xu and Ghosh's prior setup: large values of λ produce biased estimates, while very small values of λ make the slab distribution too diffuse. Xu and Ghosh, 2015[128] mentioned an empirical Bayes approach to estimating λ; due to the intractability of the marginal likelihood, they proposed a Monte Carlo EM algorithm for its estimation. Moreover, they showed theoretically and numerically that median thresholding of the posterior β_g samples provides exactly zero estimates for insignificant group predictors.

4.3 Functional smoothing for longitudinal data

Classification with the selection of significant functional predictors is challenging. Researchers commonly observe high correlations between functional predictors. In this dissertation, we work under the assumption of independence between predictors; hence, we later propose a corresponding prior on the coefficient space. The main advantage of using functional predictors is that they allow us to capture the time trends present in the data. We start our methodology by smoothing the functional observations using cubic basis splines. We restrict our data set to patients with at least four time points of observation, so that the smoothed curves are comparable. Gareth M. James [61] used a similar approach to obtain the estimates of a generalized linear model with functional predictors. Assume that we observe n patients with their functional observations and that each patient has p functions; we assume that not all p functional observations are important. Let x_{ij}(t) be the jth function observed for the ith patient, let T be the compact domain of x_{ij}(t), and let x_{ij}(t) ∈ L²[T]. With the functional predictors (x_{i1}(t), ..., x_{ip}(t)), we assume a binary response variable y_i taking values 0 and 1. We also assume that the predictors have been centered, so that we can ignore the intercept term.
Therefore, we have the following logistic regression equation:
\[
\log\left\{ \frac{P(y_i = 1 \mid x_{i1}, \dots, x_{ip})}{1 - P(y_i = 1 \mid x_{i1}, \dots, x_{ip})} \right\} = \sum_{j=1}^{p} \int_{T} x_{ij}(t)\,\beta_j(t)\,dt. \tag{4.1}
\]
Next, we construct an orthonormal basis φ_k(t) that can be used to decompose the functional predictors and the corresponding logistic regression coefficients:
\[
x_{ij}(t) = \sum_{k=1}^{q} c_{ijk}\,\varphi_k(t), \qquad \beta_j(t) = \sum_{k=1}^{q} \beta_{jk}\,\varphi_k(t),
\]
where c_{ijk} and β_{jk} are the coefficients of x_{ij}(t) and β_j(t) with respect to the kth orthonormal basis function φ_k(t). For notational convenience, we denote the basis coefficients by β_{jk}; these are distinct from the functional coefficients β_j(t). We use cubic basis splines as the orthonormal basis for our simulation examples and real data applications, so the choice of q depends entirely on the number of internal knots used in the basis spline construction. The jth component in equation (4.1) can thus be written as
\[
\int_{T} x_{ij}(t)\,\beta_j(t)\,dt = \sum_{k=1}^{q} c_{ijk}\,\beta_{jk} = c_{ij}'\,\beta_j. \tag{4.2}
\]
To fit the discrete observations, we assume that at any given time t the observed value x_{ij}(t) is a noisy version of an underlying smooth curve X_{ij}(t):
\[
x_{ij}(t) = X_{ij}(t) + e(t),
\]
where e(t) is a zero-mean Gaussian process. We use the same basis function expansion for X_{ij}(t),
\[
X_{ij}(t) = \sum_{k=1}^{q} c_{ijk}\,\varphi_k(t) = c_{ij}'\,\varphi(t),
\]
where φ(t) is the q-dimensional spline basis at time t and c_{ij} is the q-dimensional vector of spline coefficients for the jth predictor of the ith patient. We use ordinary least squares to estimate the spline coefficients; a simple linear smoother is obtained by minimizing the least squares criterion ‖x_{ij} − Φc_{ij}‖² as
\[
\hat c_{ij} = (\Phi'\Phi)^{-1}\Phi' x_{ij}. \tag{4.3}
\]
Once the orthonormal basis coefficients have been estimated, we can combine (4.1), (4.2), and (4.3) by plugging $\hat x_{ij}(t)$ into (4.1), which yields
\[
\log\left\{ \frac{P(y_i = 1 \mid x_{i1}, \dots, x_{ip})}{1 - P(y_i = 1 \mid x_{i1}, \dots, x_{ip})} \right\}
= \sum_{j=1}^{p} \int_{T} \hat c_{ij}'\,\varphi(t)\,\beta_j(t)\,dt
= \sum_{j=1}^{p} \hat c_{ij}'\,\beta_j
= c_i'\,\beta, \tag{4.4}
\]
where $\beta_j^T = \int_T \beta_j(t)\,\varphi^T(t)\,dt$ is the coefficient vector for the jth functional predictor. Here the vector c_i has 1 as its first element, followed by the spline coefficients for the ith patient, and β contains the model intercept as its first element. For our real data and simulation applications we use the no-intercept form, in which $c_i' = (\hat c_{i1}', \dots, \hat c_{ip}')$ does not have 1 as its first element and $\beta_{pq\times 1} = (\beta_1^T, \dots, \beta_p^T)^T$ has a group structure with each group of size q. Our selection method drops the redundant β's and selects the important coefficient groups.

Functional principal component (FPC) analysis is another popular method that can be applied here: instead of least squares basis estimates, one can work with FPC scores for classification. Zhu et al. [136] also used FPC scores in their classification model and selected the functional predictors whose FPC scores were significant. Müller [93] extended the applicability of FPC analysis to modeling longitudinal data; in particular, FPC scores can be used when we have few, irregularly observed, repeated data points. In our functional smoothing method, we expanded the functional observations in spline basis functions and used the basis coefficients for classification; the same intuition also applies to FPC scores. For functional principal component analysis, we assume that the longitudinal observations come from a smooth random function X(t) with mean function µ(t) = E X(t) and covariance function G(s, t) = cov(X(s), X(t)).
The covariance function can be represented as $G(s, t) = \sum_{k=1}^{\infty} \lambda_k\,\varphi_k(s)\,\varphi_k(t)$, where the φ_k's are eigenfunctions and the λ_k's are eigenvalues. The underlying process can then be written as
\[
X(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_k\,\varphi_k(t),
\]
where the ξ_k's are commonly referred to as FPC scores. These scores can be used later in the classification model. We do not work with an infinite number of scores; instead, the above sum is approximated with a finite K that explains the majority of the variance in the functional observations. In most cases, the first two FPC scores are enough to build a good classification model. In this chapter, we work with the basis spline smoothing method due to its ease of implementation in statistical software: in R, the splines package fits cubic basis splines to longitudinal data with equally placed knots. We do not investigate any findings using FPC scores instead of basis spline coefficients, as our main focus is the classification algorithm, and the basis spline coefficients work very well for our classification model.

4.4 Simultaneous classification of binary response with selection of functional predictors

4.4.1 Classification using Pólya-Gamma Augmentation

In the following, we discuss Polson et al.'s [97] algorithm; these authors showed how a Gaussian variance mixture with a Pólya-gamma mixing density represents the logistic likelihood. We start by defining the Pólya-gamma density. A random variable X follows a Pólya-gamma distribution with parameters b > 0 and c ∈ ℝ, written X ∼ PG(b, c), if
\[
X \stackrel{d}{=} \frac{1}{2\pi^2}\sum_{k=1}^{\infty} \frac{g_k}{(k - \tfrac{1}{2})^2 + \tfrac{c^2}{4\pi^2}},
\]
where the g_k ∼ Gamma(b, 1) are independent gamma random variables and $\stackrel{d}{=}$ denotes equality in distribution. Polson et al.'s [97] main result parametrizes the log odds of the logistic likelihood as a mixture of Gaussians with respect to the Pólya-gamma distribution. The fundamental integral identity, which integrates naturally into a Gaussian prior hierarchy, is that for b > 0,
\[
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = 2^{-b} e^{\kappa\psi} \int_{0}^{\infty} e^{-\omega\psi^{2}/2}\,p(\omega)\,d\omega, \tag{4.5}
\]
where κ = a − b/2 and ω ∼ PG(b, 0). The introduction of the latent variables (ω_1, ..., ω_n) later helps us derive conjugate posterior distributions. The R package BayesLogit provides an efficient algorithm, proposed by Windle et al., 2014[126], for sampling from the Pólya-gamma distribution.

4.4.2 Selection using Bayesian Group Lasso

As discussed in Section 2.2, Meier et al.[87] developed the group lasso for logistic regression in a frequentist setup. In our model, we have p functional predictors (x_{i1}(t), ..., x_{ip}(t)) with binary response y_i ∈ {0, 1}, and each group has q levels. We can write the model as
\[
\log\left\{ \frac{P(y_i = 1 \mid x_{i1}(t), \dots, x_{ip}(t))}{1 - P(y_i = 1 \mid x_{i1}(t), \dots, x_{ip}(t))} \right\} = \sum_{j=1}^{p} c_{ij}'\,\beta_j = \eta_\beta(c_i).
\]
Following Meier et al.'s [87] method, the logistic group lasso estimator based on the basis spline coefficients is
\[
\hat\beta_{GL} = \min_{\beta}\left\{ -l(\beta) + \lambda\sum_{j=1}^{p}\sqrt{q}\,\|\beta_j\|_2 \right\},
\]
where $l(\beta) = \sum_{i=1}^{n}\left( y_i\,\eta_\beta(c_i) - \log\left(1 + \exp\{\eta_\beta(c_i)\}\right) \right)$ is the log-likelihood function. Before moving on to our proposed Bayesian method, we mention a similar model presented by Zhu et al. [136]: they used latent variables for Bayesian logistic regression, with FPC scores representing the functional predictors, and they proposed a normal prior for the concatenated coefficients, which play the same role as our coefficients β_j. Now, motivated by Polson et al.'s [97] integral result, we construct a Bayesian prior formulation targeted at binary logistic regression. Equation (4.4) has a Bernoulli likelihood function with a logit link.
We propose a spike-and-slab prior motivated by Xu and Ghosh, 2015[128], with a zero-inflated mixture prior that helps us select the important group coefficients. As previously described, we introduce latent variables (ω_1, ..., ω_n) to take advantage of the integral identity in equation (4.5). Our prior setup is
\[
y_i \mid c_i, \beta \sim \mathrm{Bernoulli}\!\left( \frac{\exp(c_i^T\beta)}{1 + \exp(c_i^T\beta)} \right), \quad i = 1, \dots, n, \qquad \omega_i \sim PG(1, 0), \quad i = 1, \dots, n,
\]
\[
\beta_j \mid \tau_j^2, \pi_0 \sim (1 - \pi_0)\,N_q(0, \tau_j^2 I_q) + \pi_0\,\delta_0(\beta_j), \quad j = 1, \dots, p,
\]
\[
\tau_j^2 \mid \lambda^2 \sim \mathrm{Gamma}\!\left( \frac{q + 1}{2}, \frac{\lambda^2}{2} \right), \quad j = 1, \dots, p, \qquad \pi_0 \sim \mathrm{Beta}(a, b). \tag{4.6}
\]

Gibbs sampler: the likelihood for the ith observation is
\[
L_i(\beta) = \frac{(e^{c_i^T\beta})^{y_i}}{1 + e^{c_i^T\beta}}
\propto \exp\{\kappa_i c_i^T\beta\}\int_0^{\infty}\exp\left\{ -\frac{\omega_i (c_i^T\beta)^2}{2} \right\} p(\omega_i)\,d\omega_i \quad \text{(from equation (4.5))},
\]
where κ_i = y_i − 0.5 and ω_i ∼ PG(1, 0). Considering all n independent observations, given the ω_i we can write the joint conditional likelihood as
\[
\prod_{i=1}^{n} L_i(\beta \mid \omega_i)
= \exp\left\{ \sum_{i=1}^{n}\left( \kappa_i c_i^T\beta - \frac{\omega_i(c_i^T\beta)^2}{2} \right) \right\}
\propto \exp\left\{ -\frac{1}{2}\sum_{i=1}^{n}\omega_i\left( c_i^T\beta - \frac{\kappa_i}{\omega_i} \right)^2 \right\}
= \exp\left\{ -\frac{1}{2}(z - C\beta)^T\Omega(z - C\beta) \right\},
\]
where z = (κ_1/ω_1, ..., κ_n/ω_n)^T and Ω = diag(ω_1, ..., ω_n).

Next, we combine the likelihood function with the β prior, given ω = (ω_1, ..., ω_n):
\[
p(\beta, \tau^2, \pi_0 \mid Y, C, \omega) \propto \exp\left\{ -\frac{1}{2}(z - C\beta)^T\Omega(z - C\beta) \right\}
\times \prod_{j=1}^{p}\left[ (1 - \pi_0)(\tau_j^2)^{-\frac{q}{2}}\exp\left\{ -\frac{1}{2\tau_j^2}\beta_j^T\beta_j \right\} I(\beta_j \neq 0) + \pi_0\,\delta_0(\beta_j) \right]
\]
\[
\times \prod_{j=1}^{p}(\lambda^2)^{\frac{q+1}{2}}(\tau_j^2)^{\frac{q+1}{2}-1} e^{-\frac{\lambda^2\tau_j^2}{2}}
\times \pi_0^{a-1}(1 - \pi_0)^{b-1}.
\]
Due to the Pólya-gamma augmentation, we can derive a block Gibbs sampler with standard posterior distributions for the β_j's. The same derivation appears in Xu and Ghosh, 2015[128] for continuous Y in the linear model setup; the block Gibbs sampler was introduced by Hobert et al.[52]. To build this sampler, we start with some notation. Let β_(j) denote the β vector without the jth group, β_(j) = (β_1^T, ..., β_{j−1}^T, β_{j+1}^T, ..., β_p^T)^T, with corresponding design matrix C_(j) = (C_1, ..., C_{j−1}, C_{j+1}, ..., C_p), where C_j is the design matrix for β_j. When β_j ≠ 0,
\[
p(\beta_j \mid \text{rest}) \propto \exp\left\{ -\frac{1}{2}(z - C_{(j)}\beta_{(j)} - C_j\beta_j)^T\Omega(z - C_{(j)}\beta_{(j)} - C_j\beta_j) \right\}\exp\left\{ -\frac{1}{2\tau_j^2}\beta_j^T\beta_j \right\}
\]
\[
\propto \exp\left\{ -\frac{1}{2}\left[ \beta_j^T\left( C_j^T\Omega C_j + \frac{1}{\tau_j^2}I_q \right)\beta_j - 2(z - C_{(j)}\beta_{(j)})^T\Omega C_j\beta_j \right] \right\}
\propto \exp\left\{ -\frac{1}{2}(\beta_j - A_j)^T B_j(\beta_j - A_j) \right\},
\]
where $B_j = C_j^T\Omega C_j + \frac{1}{\tau_j^2}I_q$ and $A_j = B_j^{-1}C_j^T\Omega(z - C_{(j)}\beta_{(j)})$. Hence the posterior full conditional of β_j is a spike-and-slab distribution,
\[
(\beta_j \mid \text{rest}) \sim (1 - l_j)\,N_q(A_j, B_j^{-1}) + l_j\,\delta_0(\beta_j), \quad j = 1, \dots, p, \tag{4.7}
\]
where l_j = p(β_j = 0 | rest).
Now we find the probability l_j:
\[
l_j = p(\beta_j = 0 \mid \text{rest})
= \frac{p(\beta_j = 0, y \mid C, \omega, \tau_j^2, \pi_0)}{p(\beta_j = 0, y \mid C, \omega, \tau_j^2, \pi_0) + \int_{\beta_j \neq 0} p(\beta_j, y \mid C, \omega, \tau_j^2, \pi_0)\, d\beta_j}
\]
\[
= \frac{p(y \mid \beta_j = 0, C, \omega, \tau_j^2, \pi_0)\,p(\beta_j = 0 \mid \tau_j^2, \pi_0)}
{p(y \mid \beta_j = 0, C, \omega, \tau_j^2, \pi_0)\,p(\beta_j = 0 \mid \tau_j^2, \pi_0) + \int_{\beta_j \neq 0} p(y \mid \beta_j \neq 0, C, \omega, \tau_j^2, \pi_0)\,p(\beta_j \neq 0 \mid \tau_j^2, \pi_0)\, d\beta_j}
= \frac{M\pi_0}{M\pi_0 + N(1 - \pi_0)},
\]
where $\pi_0 = p(\beta_j = 0 \mid \tau_j^2, \pi_0)$,
\[
M = p(y \mid \beta_j = 0, C, \omega, \tau_j^2, \pi_0) = \exp\left\{ -\frac{1}{2}(z - C_{(j)}\beta_{(j)})^T\Omega(z - C_{(j)}\beta_{(j)}) \right\},
\]
and
\[
N = \int_{\beta_j \neq 0} p(y \mid \beta_j \neq 0, C, \omega, \tau_j^2, \pi_0)\, d\beta_j
= \int_{\beta_j \neq 0} \exp\left\{ -\frac{1}{2}(z - C\beta)^T\Omega(z - C\beta) \right\} (2\pi\tau_j^2)^{-\frac{q}{2}}\exp\left\{ -\frac{\beta_j^T\beta_j}{2\tau_j^2} \right\} d\beta_j
\]
\[
= M \times (\tau_j^2)^{-\frac{q}{2}}\exp\left\{ \frac{1}{2}A_j^T B_j A_j \right\}\int_{\beta_j \neq 0} (2\pi)^{-\frac{q}{2}}\exp\left\{ -\frac{1}{2}(\beta_j - A_j)^T B_j(\beta_j - A_j) \right\} d\beta_j
= M \times (\tau_j^2)^{-\frac{q}{2}}\,|B_j|^{-\frac{1}{2}}\exp\left\{ \frac{1}{2}A_j^T B_j A_j \right\}.
\]
Hence,
\[
l_j = \frac{\pi_0}{\pi_0 + (1 - \pi_0)\,(\tau_j^2)^{-q/2}\,|B_j|^{-1/2}\exp\left\{ \frac{1}{2}A_j^T B_j A_j \right\}}. \tag{4.8}
\]

The posterior full conditional distributions of the other parameters are stated below; the derivations are given in the appendix.
\[
\left( \frac{1}{\tau_j^2} \,\Big|\, \text{rest} \right) \sim
\begin{cases}
\text{Inverse-Gamma}\!\left( \dfrac{q+1}{2}, \dfrac{\lambda^2}{2} \right), & \text{if } \beta_j = 0,\\[6pt]
\text{Inverse-Gaussian}\!\left( \dfrac{\lambda}{\|\beta_j\|_2}, \lambda^2 \right), & \text{if } \beta_j \neq 0,
\end{cases}
\qquad j = 1, \dots, p. \tag{4.9}
\]
Let G_j indicate whether a given group is selected,
\[
G_j = \begin{cases} 1, & \text{if } \beta_j \neq 0,\\ 0, & \text{if } \beta_j = 0,\end{cases}
\qquad
(\pi_0 \mid \text{rest}) \sim \mathrm{Beta}\!\left( p - \sum_{j=1}^{p} G_j + a,\; \sum_{j=1}^{p} G_j + b \right). \tag{4.10}
\]
We sample the augmented variables ω = (ω_1, ..., ω_n) using the posterior samples of β:
\[
(\omega_i \mid \beta) \sim PG(1, c_i'\beta), \quad i = 1, \dots, n. \tag{4.11}
\]

Finally, we are left with λ. λ is the most crucial parameter of our model and should be treated carefully: a large λ shrinks most of the group coefficients toward zero and produces biased estimates. In our real data analysis, we control λ by considering a range of candidate values. Xu and Ghosh, 2015[128] proposed a Monte Carlo EM algorithm for estimating λ; the kth EM update from their paper is
\[
\lambda^{(k)} = \sqrt{ \frac{p(q + 1)}{\sum_{j=1}^{p} E_{\lambda^{(k-1)}}\!\left[ \tau_j^2 \mid y \right]} }.
\]
For binary response y, this expected value of τ_j² is intractable in closed form; it can instead be approximated by the mean of the posterior samples of τ_j².
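To illustrate how these full conditionals fit together, the following is a minimal R sketch (not the authors' implementation) of one way to code the block Gibbs sampler in (4.6)-(4.11) with λ held fixed. It assumes the rpg function from the BayesLogit package for the Pólya-gamma draws and rinvgauss from the statmod package for the inverse-Gaussian draws, and that the columns of C are arranged in p consecutive groups of size q.

library(BayesLogit)   # rpg(): Polya-gamma draws
library(statmod)      # rinvgauss(): inverse-Gaussian draws

ssgl_pg_gibbs <- function(y, C, p, q, lambda = 1, a = 1, b = 1, n_iter = 20000) {
  n <- length(y)
  idx <- split(seq_len(p * q), rep(seq_len(p), each = q))   # column indices of each group
  beta <- rep(0, p * q); tau2 <- rep(1, p); pi0 <- 0.5
  kappa <- y - 0.5                                          # kappa_i = y_i - 1/2, so Omega z = kappa
  keep <- matrix(NA_real_, n_iter, p * q)
  for (it in seq_len(n_iter)) {
    omega <- rpg(n, h = 1, z = as.vector(C %*% beta))       # eq (4.11)
    for (j in seq_len(p)) {
      Cj <- C[, idx[[j]], drop = FALSE]
      r  <- kappa - omega * as.vector(C %*% beta - Cj %*% beta[idx[[j]]])  # Omega (z - C_(j) beta_(j))
      Bj <- crossprod(Cj * omega, Cj) + diag(1 / tau2[j], q)               # B_j
      Bj_inv <- chol2inv(chol(Bj))
      Aj <- Bj_inv %*% crossprod(Cj, r)                                    # A_j
      log_slab <- log(1 - pi0) - log(pi0) - (q / 2) * log(tau2[j]) -
        0.5 * as.numeric(determinant(Bj)$modulus) + 0.5 * drop(t(Aj) %*% Bj %*% Aj)
      lj <- 1 / (1 + exp(log_slab))                                        # eq (4.8)
      if (runif(1) < lj) {
        beta[idx[[j]]] <- 0                                                # spike
      } else {
        beta[idx[[j]]] <- as.vector(Aj + t(chol(Bj_inv)) %*% rnorm(q))     # slab: N_q(A_j, B_j^{-1})
      }
      if (all(beta[idx[[j]]] == 0)) {                                      # eq (4.9)
        tau2[j] <- rgamma(1, shape = (q + 1) / 2, rate = lambda^2 / 2)
      } else {
        tau2[j] <- 1 / rinvgauss(1, mean = lambda / sqrt(sum(beta[idx[[j]]]^2)), shape = lambda^2)
      }
    }
    G <- vapply(idx, function(k) any(beta[k] != 0), logical(1))            # selected groups
    pi0 <- rbeta(1, p - sum(G) + a, sum(G) + b)                            # eq (4.10)
    keep[it, ] <- beta
  }
  keep
}

Median thresholding then reduces to taking the element-wise posterior median of the retained samples after burn-in: a group that is set to the spike in more than half of the iterations receives an exactly zero estimate.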
4.5 Median thresholding and theoretical properties

4.5.1 Marginal prior for β_j

We first study the marginal priors of the β_j's to examine the theoretical properties of the Bayesian group lasso estimators and to establish the connection between the group priors on β_j and existing group lasso penalization methods. We integrate τ_j² out of the β_j priors. The marginal priors for the β_j's are calculated following Xu and Ghosh, 2015[128], extended to a binary rather than continuous response. For β_j ≠ 0,
\[
p(\beta_j \mid \pi_0) \propto \int p(\beta_j \mid \tau_j^2, \pi_0)\,p(\tau_j^2)\,d\tau_j^2
\propto (1 - \pi_0)\,(\lambda^2)^{\frac{q+1}{2}}\int_0^{\infty}(\tau_j^2)^{-\frac{q}{2}}\exp\left\{ -\frac{\beta_j^T\beta_j}{2\tau_j^2} \right\}(\tau_j^2)^{\frac{q+1}{2}-1}\exp\left\{ -\frac{\lambda^2\tau_j^2}{2} \right\} d\tau_j^2
\]
\[
\propto (1 - \pi_0)\,(\lambda^2)^{\frac{q+1}{2}}\exp\{-\lambda\|\beta_j\|_2\}\int_0^{\infty}(\alpha_j^2)^{-\frac{3}{2}}\exp\left\{ -\frac{1}{2}\left[ \frac{\lambda}{\alpha_j} - \alpha_j\|\beta_j\|_2 \right]^2 \right\} d\alpha_j^2
\propto (1 - \pi_0)\left( \frac{\lambda}{2} \right)^{q}\exp\{-\lambda\|\beta_j\|_2\},
\]
where α_j² = 1/τ_j². The marginal prior for the β_j's is therefore also a spike-and-slab prior, with a point mass at 0 and a slab given by a multivariate Laplace (M-Laplace) distribution, the same as the one considered in the Bayesian group lasso (Kyung et al., 2010[69]) and matching the penalization in the Bayesian adaptive lasso (Leng et al., 2014[73]):
\[
\beta_j \mid \pi_0 \sim (1 - \pi_0)\,\text{M-Laplace}\!\left( 0, \frac{1}{\lambda} \right) + \pi_0\,\delta_0(\beta_j). \tag{4.12}
\]
Together, the spike and slab components facilitate variable selection at the group level and shrink the coefficients of the selected groups.

4.5.2 Median thresholding as posterior estimates

We previously discussed obtaining the estimates of the selected group coefficients through median thresholding of the MCMC samples. Xu and Ghosh, 2015[128] generalized the median thresholding proposed by Johnstone et al., 2004[64] to the multivariate spike-and-slab prior; Johnstone et al., 2004[64] showed that median thresholding under a spike-and-slab prior for normal means has several desirable properties. In this section, we generalize this idea to a binary classification problem and show that the posterior median estimator performs group variable selection by returning zero coefficients for the redundant groups. We further show that the posterior median is a soft thresholding estimator that is consistent in model selection and has an optimal asymptotic estimation rate.

Focusing on a single group, Xu and Ghosh, 2015[128] proposed the following median thresholding result. Suppose
\[
Z_{m\times 1} \sim f(z - \mu), \qquad \mu \sim \pi_0\,\delta_0(\mu) + (1 - \pi_0)\,\gamma(\mu),
\]
where Z is an m-dimensional random variable, and γ(·) and f(·) are both density functions of m-dimensional random vectors, with f(t) maximized at t = 0. Let Med(µ_i | z) denote the marginal posterior median of µ_i given the data. By definition,
\[
c = \frac{\int f(-\nu)\,\gamma(\nu)\,d\nu}{f(0)} \le \frac{\int f(0)\,\gamma(\nu)\,d\nu}{f(0)} = 1.
\]

Theorem 1: Suppose π_0 > c/(1 + c). Then there exists a threshold t(π_0) > 0 such that, when ‖z‖_2 < t,
\[
\mathrm{Med}(\mu_i \mid z) = 0 \quad \text{for any } 1 \le i \le m.
\]

Next, we focus on our problem setup. If we assume that β_j follows a Gaussian prior, β_j ∼ N(0, B_j), and that the design matrix satisfies the condition C_j^TΩC_(j) = 0, then the posterior estimate of β_j is
\[
\hat\beta_j = \beta_j \mid \text{rest} \sim N(\mu_j, \Sigma_j), \qquad \Sigma_j = \left( C_j^T\Omega C_j + B_j^{-1} \right)^{-1}, \qquad \mu_j = \Sigma_j C_j^T\Omega z.
\]
According to Theorem 1, assuming π_0 > c/(1 + c), there exists t(π_0) > 0 such that the marginal posterior median of β_{jk} under prior (4.6) satisfies
\[
\mathrm{Med}(\beta_{jk} \mid \hat\beta_j) = 0 \quad \text{for any } 1 \le k \le q
\]
whenever ‖\hat\beta_j‖_2 < t. We interpret this result in the same way as Xu and Ghosh, 2015[128]: the median estimator of the jth group of regression coefficients is zero when the norm of the posterior estimate under any other prior distribution is below a certain threshold.

Posterior median as soft thresholding: We assume that C_j^TΩC_j = nI_q and that the C matrix is group-wise orthogonal, with C_j^TΩC_(j) = 0. We consider the model defined in (4.6) with fixed τ_{j,n}², which depends on n. In this setup, the posterior distribution of β_j is similar to the one derived in the previous section:
\[
\beta_j \mid C, y, \omega \sim (1 - l_j)\,N_q\!\left( \frac{1}{n}(1 - D_{j,n})\,C_j^T\Omega z,\; \frac{1}{n}(1 - D_{j,n})\,I_q \right) + l_j\,\delta_0(\beta_j),
\]
where $D_{j,n} = \frac{1}{1 + n\tau_{j,n}^2}$ and
\[
l_j = \frac{\pi_0}{\pi_0 + (1 - \pi_0)\,(1 + n\tau_{j,n}^2)^{-q/2}\exp\left\{ \frac{1}{2n}(1 - D_{j,n})\,\|C_j^T\Omega z\|_2^2 \right\}}.
\]
Then the marginal posterior distribution of β_{jk} (1 ≤ k ≤ q), conditional on the observed data, is a spike-and-slab distribution,
\[
\beta_{jk} \mid C, y, \omega \sim l_j\,\delta_0(\beta_{jk}) + (1 - l_j)\,N\!\left( \frac{1}{n}(1 - D_{j,n})\,C_{jk}^T\Omega z,\; \frac{1}{n}(1 - D_{j,n}) \right),
\]
where C_{jk} is the kth column of the group design matrix C_j.
The corresponding soft thresholding estimator is
\[
\hat\beta_{jk} = \mathrm{Med}(\beta_{jk} \mid C, y, \omega)
= \mathrm{sgn}\!\left( C_{jk}^T\Omega z \right)\left( \frac{1}{n}(1 - D_{j,n})\,\bigl|C_{jk}^T\Omega z\bigr| - \frac{1}{\sqrt{n}}\sqrt{1 - D_{j,n}}\;Q_j \right)_{+},
\]
where $z_+$ denotes the positive part of z and $Q_j = \Phi^{-1}\!\left( \frac{1}{2\,(1 - \min(\frac{1}{2}, l_j))} \right)$. Our result follows Xu and Ghosh, 2015[128]'s derivation of the soft thresholding form. Note in particular that the term D_{j,n} depends on τ_{j,n}², which controls the shrinkage factor.

Oracle property: Let β⁰, β_j⁰, and β⁰_{jk} be the true values of β, β_j, and β_{jk}, respectively. The index vector of the true model is $A = \left( I(\|\beta_j^0\|_2 \neq 0),\ j = 1, \dots, p \right)$, and the index vector of the model selected by a thresholding estimator $\hat\beta_j$ is $A_n = \left( I(\|\hat\beta_j\|_2 \neq 0),\ j = 1, \dots, p \right)$. Model selection consistency is attained if and only if $\lim_{n\to\infty} P(A_n = A) = 1$.

Theorem 2: Assume the design satisfies C_j^TΩC_j = nI_q. Suppose $\sqrt{n}\,\tau_{j,n}^2 \to \infty$ and $\log(\tau_{j,n}^2)/n \to 0$ as n → ∞, for j = 1, ..., p. Then the median thresholding estimator has the oracle property, that is, variable selection consistency:
\[
\lim_{n\to\infty} P\!\left( A_n^{\mathrm{Med}} = A \right) = 1.
\]
The proof follows the same steps as the proof of Theorem 4 in Xu and Ghosh, 2015[128].

4.5.3 Posterior consistency

In this section, we investigate the convergence of the group lasso estimator model to the true model. To show model consistency, we rely on the results and theorems in the paper "On the consistency of Bayesian variable selection for high dimensional binary regression and classification" by Jiang, 2006[62]. In that paper, the author sets up Bayesian variable selection in the manner of Smith et al., 1996[115] by introducing a selection indicator vector γ = (γ_1, ..., γ_p) with γ_i ∈ {0, 1}. The corresponding prior setup is
\[
y = X\beta + \epsilon, \qquad \beta_\gamma \sim N\!\left( 0,\; c\,\sigma^2\,(X_\gamma^T X_\gamma)^{-1} \right), \qquad \gamma_i \sim \mathrm{Bernoulli}(\pi), \quad i = 1, \dots, p, \qquad (\sigma^2 \mid \gamma) \sim 1/\sigma^2.
\]
We can establish a direct connection between our model and the above penalized regression. We re-parametrize the group coefficient vectors as β_j = γ_j b_j, where γ_j, j = 1, ..., p, is the 0/1-valued selection indicator. As shown in Section 4.5.1, the marginal prior of β_j follows a multivariate Laplace distribution, so we can place a Bernoulli prior on γ_j:
\[
b_j \mid \lambda \sim \text{Multi-Laplace}\!\left( 0, \frac{1}{\lambda} \right), \qquad \gamma_j \sim \mathrm{Bernoulli}(1 - \pi_0), \quad j = 1, \dots, p. \tag{4.13}
\]
The marginal prior distribution of β_j is the same as in equation (4.12).

Next, we study the asymptotic behavior as n → ∞. Let y be the binary response and $\vec c$ the corresponding basis coefficients for a given subject. Let the true model be of the form
\[
\mu_0(c) = \frac{ \exp\!\left( \sum_{j=1}^{p_n} c_j^T\beta_j \right) }{ 1 + \exp\!\left( \sum_{j=1}^{p_n} c_j^T\beta_j \right) } = \psi\!\left( \sum_{j=1}^{p_n} c_j^T\beta_j \right),
\]
where each β_j is a q × 1 vector and p_n (increasing with n) is the number of group vectors present in the model. As described by Jiang, 2006[62], we assume the data dimension satisfies 1 ≺ p_n and log(p_n) ≺ n, where a_n ≺ b_n means a_n = o(b_n), i.e. lim_{n→∞} a_n/b_n = 0. We assume sparsity of the regression coefficients at the group level, i.e. $\lim_{n\to\infty}\sum_{j=1}^{p_n}\|\beta_j\|_2 < \infty$, which implies that only a limited number of group coefficients are nonzero. We further assume ‖c_j‖_2 ≤ 1, j = 1, ..., p_n, for simplicity.

We assume n i.i.d. observations, $D_n = \left( \vec c_1, \dots, \vec c_{p_n}, y_i \right)_{i=1}^{n}$, and write $f_0 = \mu_0^{y}(1 - \mu_0)^{1-y}$. Before moving on to the results, we define the posterior estimator of the true density f_0 and the posterior estimate of µ_0 as
\[
\hat f_n(y, c) = \sum_{\gamma}\int_{\beta_\gamma} f(y, c \mid \gamma, \beta_\gamma)\,\pi_n(\beta_\gamma, \gamma \mid D_n)\,d\beta_\gamma,
\qquad
\hat\mu_n(c) = \sum_{\gamma}\int_{\beta_\gamma} \psi(c_\gamma^T\beta_\gamma)\,\pi_n(\beta_\gamma, \gamma \mid D_n)\,d\beta_\gamma.
\]
We define the classifier as $\hat C_n(c) = I[\hat\mu_n(c) > 0.5]$, and $\hat C_n(c)$ serves as the validation tool for our algorithm's performance. Next, we define consistency using Jiang's, 2006[62] description of the density function, measuring the distance between two density functions with the Hellinger distance
\[
d_H(f, f_0) = \sqrt{ \iint \left( \sqrt{f} - \sqrt{f_0} \right)^2 dx\,dy }.
\]
The definitions below are quoted from Jiang's, 2006[62] article.

Definition 1: "Suppose D_n is an i.i.d. sample based on density f_0. The posterior π_n(· | D_n) is asymptotically consistent for f_0 over Hellinger neighborhoods if for any ε > 0,
\[
\pi_n\!\left[ f : d_H(f, f_0) \le \epsilon \mid D_n \right] \xrightarrow{P} 1, \quad \text{as } n \to \infty \quad (\text{density consistency})."
\]
Next, we define consistency in classification, following Jiang, 2006[62], in terms of how the misclassification error $E_{D_n} P[\hat C_n(c) \neq y \mid D_n]$ approaches the minimal error $P[C_0(c) \neq y]$, where $C_0(c) = I[\mu_0(c) > 0.5]$.

Definition 2: "Let $\hat B_n(c)$ be a classification rule obtained based on the observed data D_n. If $\lim_{n\to\infty} E_{D_n} P[\hat B_n(c) \neq y \mid D_n] = P[C_0(c) \neq y]$, then $\hat B_n(c)$ is called a consistent classification rule."

Combining Propositions 1 and 3 from Jiang, 2006[62], under conditions I, S, and L, density consistency directly implies classification consistency. The proof proceeds by checking conditions I, S, and L from Jiang's (2006) paper[62], since our prior fits his prior setup. To obtain density consistency and classification consistency for the posterior estimates, we therefore need to verify that our prior setup satisfies Jiang's conditions. The motivation for the proof and the technique of checking the conditions were discussed in the theses of Atreyee Majumder, 2017[81] and Guiling Shi, 2017[111].

Condition I (on the inverse link function ψ): "Denote by w(u) the log odds function w(u) = log[ψ(u)/(1 − ψ(u))]. The derivative of the log odds, w′(u), is continuous and satisfies the following boundedness condition as the size of the domain increases: $\sup_{|u| \le C} |w'(u)| \le C^{q}$ for some q ≥ 0, for all large enough C."

Condition S (for the prior π_n on a small approximation set): "There exists a sequence r_n increasing to infinity as n → ∞ such that, for any η > 0 and $\sum_{j \notin \gamma(r_n)}\|\beta_j\|_2 \prec \epsilon_n^2$, we have $\pi_n[\gamma = \gamma(r_n)] > e^{-cn\epsilon_n^2}$ and $\pi_n[\beta_\gamma \in M(r_n, \eta) \mid \gamma = \gamma(r_n)] > e^{-cn\epsilon_n^2}$, for all large enough n."

Condition L (for the prior π_n outside a large region): "There exist some $\bar r_n = o(n/\ln p_n)$, $\bar r_n \in [1, p_n]$, and some $C_n$ satisfying $C_n^{-1} = o(1)$ and $\ln C_n = o(n/\bar r_n)$, such that for some c > 0, $\pi_n[|\gamma| > \bar r_n] \le \exp(-cn\epsilon_n^2)$, and $\pi_n\!\left[ \bigcup_{j:\gamma_j = 1}\left\{ \|\beta_j\|_2 > C_n \right\} \,\Big|\, \gamma \right] \le \exp(-cn\epsilon_n^2)$ for all $|\gamma| \le \bar r_n$, for all large enough n."

We have checked these conditions; the corresponding proofs are given in the appendix.

4.6 Simulation results

We assess the performance of our proposed simultaneous classification and selection methodology on simulated data sets and then apply the method to both simulated and real data. We compare the results from our Bayesian method with those from a frequentist group lasso selection method for binary response. To the best of our knowledge, no other Bayesian method reported in the literature is as convenient and efficient as the presently proposed method. The following section reports the method testing through three different examples with varying numbers of predictors. We generate a binary response with simulated functional predictors; a significant number of the predictors are inessential.
Example 1: We first generate functional predictors x_{ij}(t) using a 10-dimensional Fourier basis, φ_0(t) = 1 and φ_k(t) = √2 cos(kπt), k = 1, ..., 9, adding an error term. We work with a simulation setup similar to the one in Fan et al.[33], whose model is also based on functional predictors. We generate our predictors as follows:
\[
x_{ij}(t_k) = \varphi(t_k)^T\theta_{ij} + \epsilon_{ijk}, \qquad \epsilon_{ijk} \sim N(0, \sigma^2), \qquad \theta_{ij} \sim N_{10}(0, I),
\]
where φ(t_k) = (φ_0(t_k), φ_1(t_k), ..., φ_9(t_k))′. We take σ = 0.5 and generate 200 i.i.d. observations using 20 functional predictors. Each predictor is observed at 50 time points, equally spaced between 0 and 1, and θ_{ij} and ε_{ijk} are sampled independently. Notationally, i varies from 1 to 200, j from 1 to 20, and k from 1 to 50. We construct a cubic basis spline on (0 = t_1, ..., t_50 = 1) with four internal knots equally spaced at the 20%, 40%, 60%, and 80% quantiles. We use the R package splines and the bs function to construct the basis matrix Φ. With 4 internal knots, plus intercept and degree = 3, we end up with eight columns in the basis matrix for each predictor, i.e. q = 8. To validate classification and selection performance, we use 75% of the observations as training data and the remaining 25% for testing. We repeat this process 100 times to limit sampling bias and concatenate the results over the 100 repetitions.

Examples 2 and 3: In Example 2, we increase the number of predictors from 20 to 50 while maintaining 200 observations with 50 time points each. The functional predictor generation in Example 3 follows the same method as in Example 1, but with 500 observations, 100 functional predictors, and 20 time points per observation; we use three internal knots to smooth the predictors. In both cases, we choose the second and final predictors, i.e. β_2(t) and β_p(t), to be nonzero, and the rest of the coefficients are zero. We generate the binary response y ∈ {0, 1} from a Bernoulli distribution using the pre-assigned β. In all of the examples, 75% of the data is used for training, and 100 repetitions are used to limit sampling bias. We obtain 20,000 Gibbs samples, discarding the first one-third as a burn-in period; all parameter estimates are obtained from the remaining samples. As Xu and Ghosh, 2015[128] showed that median thresholding gives exactly zero estimates for the redundant group coefficients, we apply the posterior median to the posterior samples to obtain the β estimates. We choose a = 1, b = 1 as the initial parameter values for the prior distribution of π_0, and β = 0 is used as the starting value for the first iteration. Although we have p functional predictors, the number of coefficients to estimate is p·q. In Example 2, we have p = 50, and with four internal knots for each function we obtain q = 8; hence we need to estimate 400 coefficients using 200 observations. From this perspective, our algorithm is applicable to "large p, small n" conditions. The simulation results are presented below.

Example 1 results: We obtain a 100% true positive rate and a 0% false positive rate in terms of selection, i.e. the two nonzero coefficients are captured in all 100 iterations, and none of the predictors whose true coefficients are zero is selected. In terms of classification, our method shows 97% sensitivity, 93% specificity, 95% accuracy, and AUC = 0.99.
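For concreteness, the following is a minimal R sketch (illustrative only; the nonzero coefficient values are assumptions) of a data-generating process along the lines of Example 1, together with the B-spline basis coefficients that feed the classifier.

library(splines)

set.seed(1)
n <- 200; p <- 20; n_t <- 50; sigma <- 0.5
t_grid <- seq(0, 1, length.out = n_t)

# 10-dimensional Fourier basis: phi_0(t) = 1, phi_k(t) = sqrt(2) * cos(k * pi * t), k = 1,...,9
Phi_fourier <- cbind(1, sapply(1:9, function(k) sqrt(2) * cos(k * pi * t_grid)))

# x_ij(t_k) = phi(t_k)' theta_ij + eps_ijk
X_fun <- array(NA_real_, dim = c(n, p, n_t))
for (i in 1:n) for (j in 1:p) {
  X_fun[i, j, ] <- as.vector(Phi_fourier %*% rnorm(10)) + rnorm(n_t, 0, sigma)
}

# Cubic B-spline basis with four interior knots (q = 8 columns, intercept included)
B <- bs(t_grid, knots = quantile(t_grid, c(.2, .4, .6, .8)), degree = 3, intercept = TRUE)
q <- ncol(B)
proj <- solve(crossprod(B), t(B))            # (B'B)^{-1} B', the least squares smoother in (4.3)

# Basis coefficients c_ij stacked group-wise: row i holds (c_i1', ..., c_ip')
C <- matrix(NA_real_, n, p * q)
for (i in 1:n) for (j in 1:p) {
  C[i, ((j - 1) * q + 1):(j * q)] <- proj %*% X_fun[i, j, ]
}

# Binary response with only groups 2 and p truly nonzero (illustrative values)
beta_true <- rep(0, p * q)
beta_true[(q + 1):(2 * q)] <- 1
beta_true[((p - 1) * q + 1):(p * q)] <- -1
y <- rbinom(n, 1, plogis(as.vector(C %*% beta_true)))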
Below are the rejection probability plots for β_2(t) and β_1(t), of which the first is nonzero and the second is zero in the true model. In addition, we plot the posterior median estimates of the coefficient function against its true values. The ROC curve establishes the discriminating power of our method.

Example 2 and 3 results: In Example 2, we obtain a 100% true positive rate and a 0.73% false positive rate over 100 repetitions, with 97% sensitivity and 95% specificity. In Example 3, we achieve a 100% true positive rate and a 0% false positive rate, with 98% sensitivity and 97% specificity. We compare our simulation results with those of the frequentist group lasso for logistic regression under all the setups above. Our methodology yields the best results in terms of classifying subjects into the correct class, far exceeding the frequentist group lasso. Although the frequentist group lasso successfully identifies the truly significant predictors, it also selects many redundant functional predictors that have zero effect in the true model; its false selection rate is very high compared with that of our algorithm. Table 4.1 summarizes the numerical results of all three examples, with comparisons to the frequentist group lasso for logistic regression.

Figure 4.1: Plots based on Example 1.

Table 4.1: Classification and selection performance

Bayesian classification with Bayesian group lasso
                                 Sensitivity    Specificity    TPR     FPR             -2 Log likelihood
Example 1 (n=200, p=20, t=50)    0.97 (0.01)    0.93 (0.01)    1 (0)   0 (0)           3.65
Example 2 (n=200, p=50, t=50)    0.97 (0.01)    0.95 (0.01)    1 (0)   0.0073 (0.05)   0.219
Example 3 (n=500, p=100, t=20)   0.98 (0.001)   0.97 (0.01)    1 (0)   0 (0)           6.87

Logistic regression with frequentist group lasso
                                 Sensitivity    Specificity    TPR     FPR             -2 Log likelihood
Example 1 (n=200, p=20, t=50)    0.92 (0.01)    0.86 (0.01)    1 (0)   0.114 (0.04)    69.23
Example 2 (n=200, p=50, t=50)    0.81 (0.01)    0.88 (0.01)    1 (0)   0.34 (0.05)     53.66
Example 3 (n=500, p=100, t=20)   0.91 (0.001)   0.94 (0.001)   1 (0)   0.05 (0.02)     190.23

Notes: Simulation result comparisons between the Bayesian and frequentist methods.

4.7 Application to ADNI MRI data

This section reports the results of applying our proposed method to ADNI data. The MRI data used in all analyses were downloaded from the ADNI database (http://www.adni-info.org/). The fundamental goal of ADNI is to develop a large, standardized neuroimaging database with strong statistical power for research on potential biomarkers of AD incidence, diagnosis, and disease progression. ADNI data available at this time include three projects: ADNI-1, ADNI-GO, and ADNI-2. Starting in 2004, ADNI-1 collected prospective data on cognitive performance, brain structure, and biochemical changes every 6 months; participants included 200 CN, 200 MCI, and 400 AD patients. Then, starting in 2009, ADNI-GO continued the longitudinal study of the existing patients from ADNI-1 and established a new cohort of early MCI patients, enrolled to identify biomarkers manifesting at earlier stages of the disease. ADNI-GO and ADNI-2 together contain additional MRI sequences plus perfusion and diffusion tensor imaging. The volumetric estimation for our data set was performed with FreeSurfer by the UCSF/SF VA Medical Center.

Considerable research has been conducted to develop automatic approaches for patient classification into different clinical groups, with many ADNI studies identifying ROIs associated with different disease stages.
A support vector machine (SVM) is a primary tool in many studies: it learns the patterns in training data sets and creates classifiers to identify new patients. Fan et al.[34] used neuroimaging data to create a structural phenotypic score reflecting brain abnormalities associated with AD; in classifying AD vs CN, a positive score in their framework identified AD-like structural brain patterns. Their classifier obtained 94.3% accuracy for AD vs CN, although their approach used only left and right whole brain volumes as potential predictors. Some researchers have used Bayesian statistical methods to study Alzheimer's data. Shen et al.[110] employed a sparse Bayesian learning method, which they named automatic relevance determination (ARD) and predictive ARD, to classify AD patients; this method outperformed an SVM classifier. Yang et al.[130] proposed a data-driven approach to the automatic classification of MRI scans based on disease stages. Their methodology was broadly divided into two parts: first, they extracted potentially classifying features from normalized MRI scans using independent component analysis; next, the separated independent coefficients were used for SVM classification of patients. In contrast to this approach, our proposed method selects important components and classifies patients simultaneously. Moreover, we consider multiple brain sub-regions to identify those regions whose longitudinal trajectories are specifically related to AD. Another seminal paper, by Jack et al.[58], used MRI-based measurements of hippocampal volume to assess the future risk of conversion from MCI to AD; a bivariate model included hippocampal volume and other factors like age and APOE genotype, but only hippocampal volume was identified as significant. Wang et al.[123] employed a functional modeling approach using Haar wavelets and lasso regularization to find ROIs in voxel-level data; in that approach, large Haar wavelet coefficients were related to the most important features, with a sparse structure among redundant features. The majority of these methods are based on SVM classification, which often uses kernel-based methods for functional smoothing. Casanove et al.[19] utilized a penalized logistic regression approach and calculated estimates using coordinate-wise descent optimization techniques from the GLMNET library. Similarly, our method employs penalized logistic regression with a group lasso penalty; however, our approach differs in its use of functional predictors and a custom algorithm developed in-house.

We consider the longitudinal volume of various brain regions, such as the parahippocampal gyrus, cerebellar cortices, entorhinal cortex, fusiform gyrus, and precuneus, among many others. Although the accessed ADNI data set includes corresponding volume, surface area, and cortical thickness information, we work only with the volume information to maintain uniformity across longitudinal predictors. Because the brain is divided into right and left hemispheres, the data include sub-regional brain volumes for both hemispheres. Our main objective is to identify the brain sub-regions whose volumetric trajectories can differentiate AD patients from the normally aging control group. As mentioned in the introduction, dementia is associated with widespread brain atrophy, although the time course and magnitude of shrinkage vary across regions.
The initial sample includes data from 761 patients in the ADNI database, classified as AD, MCI, or CN throughout their visits for the study. We exclude all patients classified as MCI, as well as any AD or CN patients whose diagnostic status changed over time, because our model assumes that the response does not depend on time. Of the remaining patients, we include those with data from at least four longitudinal measurement occasions. This yields 296 patients who have at least four data points and unchanging diagnoses of either AD or CN. The final sample is composed of 174 AD patients and 122 normally aging controls.

Table 4.2: Patients Baseline Characteristics

                        AD              CN              p-value
n                       174             122
Age (Mean ± sd)         74.76 ± 7.23    75.61 ± 5.45    0.25
Gender (F/M)            69/105          55/67           0.35
MMSE (Mean ± sd)        25.43 ± 2.40    28.96 ± 1.17    <.0001
ADAS11 (Mean ± sd)      14.78 ± 5.44    5.95 ± 2.94     <.0001
APOE (+/-)              119/55          30/90           <.0001

Notes: Comparison of baseline age, gender ratio, MMSE score, ADAS11 score, and APOE ratio between the AD and CN groups.

All patients underwent a thorough initial clinical evaluation to measure baseline cognitive and medical scores, including MMSE, the 11-item Alzheimer Cognitive Subscale (ADAS11), and other standardized neuropsychological tests. In addition, APOE genotyping information was obtained from patients at baseline. Longitudinal structural MRI scans were parcellated into sub-regional brain volumetric measurements. Our initial model includes 49 sub-regional brain volumes chosen by Dr. Andrew Bender based on the extant literature regarding atrophy patterns in AD. Although these 49 sub-regions are not assumed to change with uniform magnitude, the direction of change over time is hypothesized to be consistent (i.e., shrinking). Thus, the model includes 49 longitudinal predictors that we treat as functional predictors. We assume that not all predictors are potential candidates for classifying patients, so the sparsity assumption is valid. However, because some patients' visits were irregular, we do not have an equal number of time points across patients. We start by comparing the baseline measurements between the AD and CN groups, as shown in Table 4.2.

In the next stage, we smooth the longitudinal trajectories of the observed volumes for all brain sub-regions. A simple least squares approximation is sufficient, as we assume that the residuals around the true curve are independently and identically distributed with mean 0 and constant variance. We use cubic B-spline basis functions for spline smoothing of the observed volumes; three internal knots are used for spline smoothing with an intercept, which gives us seven basis functions. We seek to ensure that the smoothed estimated curve is a good fit for each patient's observed curve. As we do not have a large number of data points for each patient, we do not consider controlling for potential overfitting of the estimated curve. Besides least squares smoothing, functional principal component scores could also be used for this analysis.

Prior to analysis, we scale the brain volumes to the corresponding patient's intracranial volume (ICV) by fitting a simple regression that adjusts the volume measurements for individual differences in overall brain size. The aim is to remove systematic variation in brain volumes due to differences in physical size. The formula we use is
\[
\mathrm{ROI}_{adj} = \mathrm{ROI}_{vol} - \beta_0\,(\mathrm{ICV} - \mathrm{ICV}_{mean}),
\]
where β_0 is the regression coefficient obtained by regressing ROI_vol on ICV [103][59].
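A minimal R sketch of this ICV adjustment (column names are hypothetical; the correction is applied separately within each gender group, as described next) is:

adjust_for_icv <- function(roi_vol, icv) {
  b0 <- coef(lm(roi_vol ~ icv))["icv"]            # slope from regressing ROI volume on ICV
  as.numeric(roi_vol - b0 * (icv - mean(icv)))    # ROI_adj = ROI_vol - b0 * (ICV - mean ICV)
}

# Hypothetical usage for one ROI, followed by rescaling to [0, 1]:
# dat$hipp_adj <- adjust_for_icv(dat$hipp_vol, dat$icv)
# dat$hipp_adj <- (dat$hipp_adj - min(dat$hipp_adj)) / diff(range(dat$hipp_adj))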
We adjust the volumes using the above method separately within each gender group, male and female. Next, we scale the corrected volumes between 0 and 1 to bring all brain regions onto the same scale. We then divide the data set into two parts: two-thirds of the patients are reserved for the training data set (n = 198), and the rest are kept for testing (n = 98). We gather the basis coefficients for each patient in the training data set and use them as predictors for classification. We initialize β at zero to start the iterations, and π_0 is given a Beta(a, b) distribution with a and b both set to 1. As a first step, we estimate λ on the training data using the Pólya-gamma-augmented model with the spike-and-slab penalty. After estimating λ, we run the remainder of the algorithm on the training data with 30,000 MCMC samples, discarding the first one-third as a burn-in period. The spike-and-slab prior on the β coefficients translates into posterior estimates of exactly zero for most of the functional predictors. We run our model 100 times with different training samples to reduce sampling bias in the training and test data. Across the 100 repetitions, the model does not select many of the brain sub-regions consistently or uniformly; we therefore retain the brain regions that frequently appear as significant. The median thresholding selects the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate gyrus with 100% probability. Other brain regions selected as important are the right parahippocampal gyrus, left caudate nucleus, left medial orbitofrontal cortex, left putamen, left superior temporal gyrus, left thalamus, right hippocampus, and right middle temporal gyrus. In Figure 4.2, we plot the brain volume changes of the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate gyrus over time; orange and green signify the normal aging and dementia groups, respectively, and the bold thick line represents the mean curve for the corresponding group. The plot shows clear volume differences between the groups, and our model identifies these regions as significant. In Figure 4.3, we plot the acceptance probability of the MCMC sample for the left hippocampus and left lateral orbitofrontal brain regions.

The method classifies patients into the correct group with 77% accuracy, achieving 72% sensitivity, 85% specificity, and a corresponding AUC of 0.87. We use the median predicted probability from the training sample as the threshold for classification validation. We also test the classification after adding clinical measurements such as ADAS11 (the 11-item Alzheimer Cognitive Subscale), MMSE, "CDRSB," "RAVLT immediate," and "RAVLT forgetting," measured over time. In this classification, we first select the longitudinal brain volumes that are significant and then add the clinical variables. We achieve very high classification measures of 97% accuracy, 97% sensitivity, and 98% specificity. If we drop the MMSE score and run the model with the remaining functional predictors, we observe similar classification accuracy. In all scenarios, we code diseased patients as 1 and CN as 0 for the interpretation of classification sensitivity and specificity.
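The validation metrics used throughout can be computed as in the following minimal R sketch (variable names are hypothetical; the AUC is computed via the rank-based Mann-Whitney formulation), where p_hat holds predicted probabilities and y the 0/1 labels with AD coded as 1:

classify_metrics <- function(p_hat, y, threshold = 0.5) {
  pred <- as.integer(p_hat > threshold)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)   # proportion of AD correctly identified
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)   # proportion of CN correctly identified
  acc  <- mean(pred == y)
  r    <- rank(p_hat)                             # AUC from the Wilcoxon / Mann-Whitney statistic
  n1   <- sum(y == 1); n0 <- sum(y == 0)
  auc  <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  c(sensitivity = sens, specificity = spec, accuracy = acc, AUC = auc)
}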
In addition to using functional models of longitudinal trajectories in sub-regional brain volumes to differentiate between the AD and normal groups, we also apply our method to MCI converters vs. MCI nonconverters. We select patients who entered the study as MCI, and we assign the label MCI nonconverter (MCI-nc) to those who did not transition to AD across all measurement occasions and the label MCI converter (MCI-c) to any who did transition to AD. The total subsample includes 163 patients who were either MCI-c or MCI-nc. We use three-quarters of the patients to train our model. We note the significant brain ROIs that are selected after 100 iterations. Among the selected ROIs that contribute to classification are the right posterior cingulate gyrus, right superior parietal cortex, right thalamus, right isthmus cingulate gyrus, right fusiform gyrus, left thalamus, and left precuneus. However, the classification performance is not as good as that of the previous model: 62% accuracy and 0.66 AUC. The biological explanation for this result is critical to acknowledge: the mean difference in the functional predictors between MCI-c and MCI-nc is not large enough to segment the patients. Moreover, we also neglect some time points' data for this set of patients.

Figure 4.2: Brain volume changes of the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate over time for normal and dementia patients

Figure 4.3: Acceptance probability of the MCMC sample for the left hippocampus and left lateral orbitofrontal brain regions

Figure 4.4: Pictorial representation of selected brain ROIs discriminating the diseased group from normal controls: (a) left hippocampus, (b) posterior cingulate, (c) middle temporal gyrus, (d) left lateral orbitofrontal cortex (plots obtained from on-line resources)

4.8 An alternate modeling proposal:

It is always important in penalization setups to protect the estimation of strong signals. We should control over-shrinkage of the selected group coefficients so that they incur less bias. Hence, we propose a horseshoe prior (Carvalho et al., 2009[17]) within the spike-and-slab structure. This setup provides a heavy-tailed distribution for the slab part. We propose a normal prior for the slab with two hyper-parameters controlling local and global shrinkage. We compare the corresponding penalization on the group coefficients with the penalization proposed in the previous section, and we contrast the results of applying both methods to the AD patients. Below, we report an alternative exploratory setup:

β_j | λ_{j1}, ..., λ_{jq}, τ_j, π_0 ∼ (1 − π_0) N_q(0, τ_j^2 D_{Λ_j}) + π_0 δ_0(β_j),   j = 1, ..., p,
λ_{jk} ∼ C+(0, 1),   k = 1, ..., q,      τ_j ∼ C+(0, 1),   j = 1, ..., p,        (4.14)

where D_{Λ_j} = diag(λ_{j1}^2, ..., λ_{jq}^2). We can also develop an efficient Gibbs sampler to update the model parameters from the corresponding full conditional distributions:

β_j | rest ∼ (1 − l_j) N_q(β̄_j, Σ_j) + l_j δ_0(β_j),   j = 1, ..., p,
Σ_j = ( C_j^T Ω C_j + (τ_j^2 D_{Λ_j})^{-1} )^{-1},
β̄_j = Σ_j C_j^T Ω ( z − C_{(j)} β_{(j)} ),
l_j = π_0 / [ π_0 + (1 − π_0) (τ_j^2)^{-q/2} |D_{Λ_j}|^{-1/2} |Σ_j^{-1}|^{-1/2} exp{ (1/2) β̄_j^T Σ_j^{-1} β̄_j } ].        (4.15)

It is also important to sample the global and local scale parameters λ_{jk} and τ_j efficiently. Polson et al., 2014[98] provided an efficient slice sampling scheme that can be adapted to our setup. We define α_{jk} = λ_{jk}^{-2}, sample u_{jk} | α_{jk} ∼ Uniform(0, 1/(1 + α_{jk})), and then sample α_{jk} | u_{jk} from an exponential distribution with mean 2τ_j^2/β_{jk}^2, truncated to the interval (0, (1 − u_{jk})/u_{jk}).
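Assuming the exponential above is parameterized by its mean, a small R sketch of this slice update for a single local scale is given below; β_jk and τ_j denote the current values in the Gibbs sweep, and the function names are illustrative.

```r
## Sketch of the slice-sampling update for one local scale lambda_jk in the
## horseshoe slab, working on alpha_jk = lambda_jk^(-2).
update_alpha <- function(alpha_jk, beta_jk, tau_j) {
  u_jk  <- runif(1, 0, 1 / (1 + alpha_jk))   # u_jk | alpha_jk ~ Uniform(0, 1/(1+alpha_jk))
  rate  <- beta_jk^2 / (2 * tau_j^2)         # exponential with mean 2*tau_j^2/beta_jk^2
  upper <- (1 - u_jk) / u_jk                 # truncation induced by the slice
  p_max <- 1 - exp(-rate * upper)            # inverse-CDF draw from the truncated exponential
  alpha_new <- -log(1 - runif(1) * p_max) / rate
  alpha_new                                   # lambda_jk can be recovered as alpha_new^(-1/2)
}
```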
4.9 Discussion:

This chapter discusses the use of Bayesian group lasso penalization combined with Pólya-Gamma augmentation to build a simultaneous classification and selection method. The Bayesian spike-and-slab prior helps identify functional parameters generated from longitudinal trajectories of multiple brain ROIs and discriminates the patient group from normal controls. The inclusion of Pólya-Gamma augmentation avoids the Metropolis-Hastings algorithm and other expensive sampling algorithms related to latent variables. We consider the longitudinal brain ROI volume measurements as functional predictors, and cubic basis splines smooth the curves over time. The smoothed functional predictors are then used as discriminating inputs under a sparsity assumption. Our method performs very well on simulated data sets, outperforming available frequentist methods. Furthermore, our method is applied to a data set that has a large number of predictors.

We have one more strong assumption for the functional predictors: namely, that they are independent of each other. As our prior setup does not account for any correlation structure among predictors, this method will not work well for highly correlated functional predictors. To address this limitation, we also proposed an alternative prior structure in Section 4.8, but we did not compare its results to those of our original method. This will serve as the basis for further research.

CHAPTER 5
BAYESIAN SPATIOTEMPORAL CLUSTERING MODEL FOR ANALYZING WHITE MATTER DATA

5.1 Spatiotemporal linear mixed effects modeling for the distribution of cerebral white matter on MRI scans

5.1.1 Introduction

Bernal-Rusiel et al., 2013[12] introduced a linear mixed effects (LME) model for mass-univariate analysis of longitudinal neuroimaging data. They exploited the spatial structure of cortical thickness in the brain's sub-regions using a spatiotemporal setup. Before that, Bernal-Rusiel et al., 2012[11] used a similar LME approach, but for purely longitudinal modeling; that work built the base for the extension to spatial information. We seek to develop a similar spatiotemporal model with a different research question and a completely distinct application. Spatiotemporal modeling has not been studied extensively in the literature on longitudinal white matter (WM) changes in healthy aging with diffusion tensor imaging (DTI). Bender et al., 2016[10] compared age-related changes across brain regions and identified the influences of age and baseline vascular risk on WM diffusion. They conducted a seven-year follow-up study to establish that diffusion properties in association WM tracts deteriorated with age. Due to the limited study of spatiotemporal associations with changes in WM integrity measured by fractional anisotropy (FA), we work with data from healthy middle-aged and older adults. Some authors have reported evidence of brain region shrinkage with aging (Raz and Rodrigue, 2006[104]; Raz and Kennedy, 2009[102]). In contrast, very little is known about age-related changes in the structural properties of cerebral WM, which vary at different rates in various parts of the brain. Diffusion tensor imaging provides measures derived from the diffusion tensor eigenvalues, one of which is fractional anisotropy.
According to Montag and Reuter, 2015[92], "Fractional anisotropy (FA) is a scalar value between zero and one that describes the degree of anisotropy of a diffusion process. A value of zero means that diffusion is isotropic, i.e. it is unrestricted (or equally restricted) in all directions. A value of one means that diffusion occurs only along one axis and is fully restricted along all other directions."

The literature on longitudinal neuroimaging analysis is vast; in the following, we mention some of the significant studies from the last few years. Asami et al., 2011[5] studied longitudinal loss of gray matter among schizophrenia patients and established a clinical correlation with gray matter decay. They used high-dimensional warping and individualized baseline-rescan processing to measure longitudinal volume changes within subjects, and they performed a comparison with time-varying manual ROI analysis on the same subjects. Chetelat et al., 2005[25] found the highest rates of atrophy in the anterior, inferior, middle, and medial parts of the temporal lobes in their study of the dynamics of gray matter changes for patients transitioning from mild cognitive impairment (MCI) to Alzheimer's disease (AD). On the other hand, Fjell et al., 2009[35] worked with healthy aging subjects, similar to those in our experiment. They found that changes in brain structure can occur within one year and that atrophy accelerates with advancing age. Although longitudinal studies have yielded some novel discoveries, they have not answered all questions regarding brain image data, and most existing statistical methods are not rich enough to capture the real underlying process. Bernal-Rusiel et al., 2012[11] implemented an LME model to analyze the cortical thickness of brain sub-regions in Alzheimer's patients. They performed hypothesis testing of the regression coefficients, calculated sample sizes, and derived statistical power. In longitudinal studies, researchers broadly focus on two methodologies: the first is repeated-measures analysis of variance, and the second is cross-sectional analysis of aggregated data. We previously conducted a separate study with longitudinal data by assuming brain volume changes to be a function of time, and we used the estimated functions as potential predictors for classification; the manuscript is under review.

The aim of this article is to apply an LME model to longitudinal data where spatial information is considered as an additional effect. The LME method properly captures the covariance structure of brain voxels for serial measurements. It also has the ability to impute missing responses for any time point. We work with the balanced data case, but the extension of the methodology to unbalanced data is straightforward. We obtain our data in the NIfTI image format, and each voxel of the image has location coordinates. We want to find regions in the brain that have homogeneous properties, which will help us to track region-wise longitudinal changes instead of performing voxel-wise analysis. Several previous studies have examined longitudinal changes for voxel-level data (Zipunnikov et al., 2011[137], Zhang et al., 2009[135], Skup et al., 2012[114], Shinohara et al., 2011[112]). However, most of these works employ voxel-based analysis and do not account for spatial information. Bernal-Rusiel et al., 2013[12] apply their work to a different problem with cortical thickness measurements, but their covariance structure works nicely for our WM DTI data as well.
As our motivation is to work with homogeneous brain regions and to track changes in fractional anisotropy WM measurements in relation to subjects' demographic and clinical factors, we work with the K-means algorithm. We consider each voxel's location, measurements, and predictive response values as clustering attributes for voxel-wise analysis. The homogeneous regions provide scope to build a spatiotemporal model separately for each cluster. This strategy has several advantages. Firstly, it is a highly parsimonious way of building models, such that the number of estimable parameters is much smaller than in voxel-wise models. Secondly, voxel-level models do not consider the spatial structure in the data, which implies that their estimators are less informative and have lower statistical power. Finally, we are able to identify brain regions that behave similarly in terms of WM atrophy as healthy aging people grow older.

This chapter is organized as follows. In part 1 of Section 2, we describe voxel-wise model building, estimation of the fixed effect parameters, and the temporal covariance structure with respect to subjects' time-variant random effects. We calculate residual values for each voxel, which are used as an attribute in the clustering algorithm. We describe the agglomeration method in part 2, and finally, we apply the spatiotemporal mixed effect model in part 3. We perform separate inference in part 3 for region-wise reasoning. Section 3 then describes the WM data and how we obtained it from subjects. Results are described in Section 4, while a final discussion is presented in Section 5.

5.1.2 Methods

5.1.2.1 Voxel-wise linear mixed effect model

We start modeling with a voxel-wise LME model fit to the WM MRI data. In a longitudinal study, the outcome variable (for instance, the fractional anisotropy of cerebral WM in our data) is measured repeatedly for the same individual a number of times. The aim is to measure changes in the patients' response over time and their association with demographic and biological factors, such as gender, hypertension, MMSE score, and age at entry to the study. If we fix a spatial location, i.e., a particular voxel, we can apply a single LME independently for that voxel. One advantage of this modeling is that it can assess within-subject changes and correlation across different time points. Voxel-wise LME models have been used in various studies. For example, Bowman and Kilts, 2003[16] implemented a mixed effect model to analyze PET data. They achieved improved estimation and inference by considering various covariance structures to handle correlation among individuals' repeated scans. Bernal-Rusiel et al., 2013[12] worked with a voxel-wise mixed effect model; they analyzed longitudinal brain MRI data from ADNI. Other significant papers on these methods include Delaloye et al., 2008[28], Lau et al., 2008[71], and Shaw et al., 2008[109]. A longitudinal study in a spatiotemporal setup offers a distinctive understanding; for example, it reflects the temporal trajectory of an underlying non-stationary continuous process. The correlation between pairs of repeated measures depends on the distance between consecutive time points, and between-subject variations are not constant, which gives us scope to assume a subject-specific covariance structure. In general, we observe three different sources of variability in the covariance structure for voxel-specific longitudinal measurements.
A particular voxel's WM measurements have (i) between-subject variability, (ii) within-subject changes over time, and, finally, (iii) measurement error on top of the other two sources of variation. The voxel-wise LME model imposes structure on the covariance matrix by introducing random effects. All of the above variations are handled by this structure except the measurement error, although we believe that measurement error has a direct influence on the covariance between longitudinal measurements within a subject. The mean trajectory of WM is modeled with subject-specific fixed effects and time-variant random effects. In this research article, we consider working with balanced, normally distributed data.

Now, let Y_i be the T × 1 vector of serial univariate measurements for a particular voxel for subject i, where T is the number of time points. X_i is the T × p subject-specific design matrix built with fixed effects (variables such as gender, hypertension indicator, MMSE score, baseline age, CESD score, etc.). Let Z_i be the T × q design matrix of random effects (e.g., systolic and diastolic pressure, measurement time, etc.). We consider time-variant predictors as the random effects and the intercept as a fixed effect. Hence, we can write the model as follows:

Y_i = X_i β + Z_i u_i + ε_i,   i = 1, ..., M,
u_i ∼ N_q(0, σ_r^2 D_r),   ε_i ∼ N_T(0, σ^2 I_T),        (5.1)

where

D_r =
| 1        ρ        ...  ρ^{q-1} |
| ρ        1        ...  ρ^{q-2} |
| ...      ...      ...  ...     |
| ρ^{q-1}  ρ^{q-2}  ...  1       |

and X_i = 1_T ⊗ x_i^T. The components of u_i control the deviation from the population mean for the i-th subject. As mentioned above, this model structure handles the variability present in longitudinal data, and ε_i captures the remaining measurement error. Finally, we assume that we have M subjects/patients. The subject-specific mean of Y_i given u_i is E(Y_i | u_i) = X_i β + Z_i u_i; thus, the vector u_i together with the fixed effects provides subject-specific coefficients. The β regression coefficients are the same for all individuals. The initial motivation for writing down the single-voxel longitudinal model was to find the temporal covariance matrix for the i-th subject:

Cov(Y_i) = Σ_i = σ_r^2 Z_i D_r Z_i^T + σ^2 I_T.        (5.2)

We can find unbiased estimates of the unknown parameters β, σ_r, ρ, σ by maximizing the (restricted) log-likelihood function.

Parameter estimation: In the following, we briefly describe the method used to estimate the unknown parameters mentioned above, as we use the voxel-wise estimates to segment the whole brain into homogeneous regions. We explain the segmentation method in the next section. Given the normality assumption on WM, the vector of measurements for a particular subject at a fixed voxel is

Y_i ∼ N(X_i β, Σ_i).

If we have the estimates σ̂_r, ρ̂, σ̂, the closed-form solution for β based on the MLE is

β̂ = ( Σ_{i=1}^M X_i^T Σ̂_i^{-1} X_i )^{-1} Σ_{i=1}^M X_i^T Σ̂_i^{-1} Y_i.

Once we maximize the restricted log-likelihood, we can obtain unbiased estimates σ̂_r, ρ̂, σ̂ (Verbeke and Molenberghs, 2000[122]):

l_ReML = − (1/2) Σ_{i=1}^M log|Σ_i| − (1/2) Σ_{i=1}^M (Y_i − X_i β̂)^T Σ_i^{-1} (Y_i − X_i β̂) − (1/2) log| Σ_{i=1}^M X_i^T Σ_i^{-1} X_i |.

The estimates can be obtained by using the expectation maximization (EM) algorithm, the Newton-Raphson algorithm, or Fisher's scoring algorithm. We employ Fisher's scoring algorithm to obtain the parameter estimates, and we need some initial estimates to start the algorithm.
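In practice, a voxel-wise fit of this kind can be approximated with standard mixed-model software. The sketch below uses nlme with an unstructured random-effects covariance instead of the structured D_r estimated by Fisher's scoring in the text, and the data frame and variable names are assumptions for illustration.

```r
## Sketch of a voxel-wise longitudinal fit in the spirit of model (5.1), using nlme.
## Columns assumed in voxel_df: fa (response), gender, htn, mmse, base_age, ed,
## smoker, sbp, dbp, and subject (identifier).
library(nlme)

fit_one_voxel <- function(voxel_df) {
  lme(fixed  = fa ~ gender + htn + mmse + base_age + sbp + dbp + ed + smoker,
      random = ~ sbp + dbp - 1 | subject,   # time-variant random effects, no random intercept
      data   = voxel_df,
      method = "REML")                      # restricted likelihood, as in the text
}

# residuals(fit_one_voxel(df_v), type = "response") then feeds the clustering step below
```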
5.1.2.2 Clustering voxels into homogeneous regions using the K-means algorithm

Our key aim is to build a spatiotemporal model for homogeneous regions in the brain. Homogeneous regions or clusters contain voxels that are similar in terms of cerebral WM, location in the brain, and quantitative relationship with the subjects' demographic/biological factors. Once we obtain homogeneous regions in the brain from the clustering algorithm, we can assume that the temporal variance is shared among all voxels present in a particular region. The K-means algorithm we implement is a general-purpose method and completely data-driven. Our goal is to partition the voxels into k uniform clusters C_1, C_2, ..., C_k such that (i) C_i ∩ C_j = φ for i ≠ j and (ii) the elements of C_i are homogeneous based on some conditions. We now list the attributes we use to run the K-means algorithm.

We first calculate the coefficient of variation of a particular voxel based on all the longitudinal measurements for that voxel. In our study, we have M × T cerebral WM observations for each voxel across all subjects, and we calculate CV_v = sd_v / mean_v for each voxel. We use the CV and the mean measurement for each voxel in the K-means algorithm. Next, we calculate the residuals for each voxel by fitting the mixed-effect model described in the last section. Once we estimate the parameters of the temporal covariance matrix and the fixed effect regression coefficients, it is easy to compute residuals for each voxel. In addition, we have one more piece of directional information for the FA measurements over voxels. This information helps us to collate voxels whose WM changes are in a uniform direction. Finally, we calculate the polar coordinates of the three-dimensional voxel locations from their Cartesian coordinates using the following formulas:

r = sqrt(x^2 + y^2 + z^2),   θ = tan^{-1}(y/x),   φ = cos^{-1}(z/r).

The K-means algorithm starts by defining a dissimilarity matrix as an input. In our data, we have measurements x_{ij} for i = 1, 2, ..., N and j = 1, 2, ..., m, where N is the number of active voxels in the brain with positive WM measurements and m is the number of attributes considered for a particular voxel. We use five attributes to cluster the voxels. We define the dissimilarity between voxels i and i' as

D(x_i, x_{i'}) = Σ_{j=1}^m d_j(x_{ij}, x_{i'j}).

The most common choices are the squared Euclidean distance, d_j(x_{ij}, x_{i'j}) = (x_{ij} − x_{i'j})^2, the Manhattan distance, |x_{ij} − x_{i'j}|, or a correlation-based measure:

ρ(x_i, x_{i'}) = Σ_j (x_{ij} − x̄_i)(x_{i'j} − x̄_{i'}) / sqrt( Σ_j (x_{ij} − x̄_i)^2 Σ_j (x_{i'j} − x̄_{i'})^2 ).

As shown in Hastie et al., 2009[47], if the attributes are standardized, then Σ_j (x_{ij} − x_{i'j})^2 ∝ 2(1 − ρ(x_i, x_{i'})). Hence, we use Euclidean distance to implement K-means clustering.

5.1.2.3 Cluster-based spatiotemporal linear mixed effect models

Spatiotemporal models that consider a temporal covariance matrix shared among spatial locations have been studied in the functional neuroimaging literature for a long time. Bowman, 2007[14] described a framework for spatiotemporal modeling by defining a functional distance metric in the model to perform voxel-wise inference. The distance metric reflected bonding between voxels based on functionality rather than proximity. The spatiotemporal model estimated temporal and spatial correlations in a given region of interest (ROI). Later, Bowman et al., 2008[15] proposed a spatial Bayesian hierarchical model for analyzing functional neuroimaging data.
They built the hierarchical model after an anatomical parcellation of the brain into 116 regions. Derado et al., 2012[29] also worked with a multivariate Bayesian hierarchical model, where a Gaussian prior was imposed on a voxel-specific population-level mean parameter separately for a baseline and a follow-up session. A Wishart prior on the inverse covariance matrix of the subject-specific measurements for a particular voxel then provides the estimators. Woolrich et al., 2004[127] presented a fully Bayesian modeling approach to handle spatiotemporal noise modeling and haemodynamic response function (HRF) modeling. They implemented a simultaneously specified autoregressive model in which the spatial noise structure remains the same over time. Although this method estimated the parameters precisely and captured the spatiotemporal dependency, it remained important to model the spatial covariance matrix among the locations of the voxels present in a particular cluster; the proposed voxel-wise model and prior did not account for inter-voxel correlation. The authors assumed that responsive voxels are spread over the whole brain and that the temporal covariance matrix is a scaled version of a global covariance structure matrix.

We aim to model our cerebral WM data under assumptions similar to those of Bernal-Rusiel et al., 2013[12], for two main reasons. First, we believe that the temporal covariance structure is likely to differ between different segments of the brain and therefore needs to be modeled separately for each cluster of voxels. Second, inter-voxel correlations depend on the locations of the voxels; for example, voxels in close proximity share similar behavior at different stages of various disease processes. The spatiotemporal covariance of the cerebral WM for a particular brain segment is a product of both the spatial covariance of the voxels present in that region and the temporal covariance of a particular patient, whose parameters differ between regions. The spatial structure presented in Bernal-Rusiel et al., 2013[12] is inspired by Bowman, 2007[14], who proposed the alternate strategy of using a spatiotemporal model to estimate spatial and temporal correlations in a given ROI. We assume a similar spatial covariance matrix, captured through a parametric matrix that models the dependency as an exponential transformation of the distance between voxels within a region/cluster.

Our assumption is that a temporal covariance structure is shared among voxels within a homogeneous region obtained from the K-means algorithm. The parametric covariance structure that controls the dependency between voxels in a region depends solely on the distance between those voxels. We consider a homogeneous parcellation of the brain consisting of regions g = 1, ..., G. Let g be the region/segment/cluster in which we are interested and v_g be the number of voxels present in the g-th cluster. Let Y_ig be the (T · v_g) × 1 vector of WM measurements for the i-th subject. As defined in Section (2.1), we have M patients, with each patient having T time points of observations (the balanced case). We can expand the Y_ig vector as

Y_ig = (Y_ig1^T, Y_ig2^T, ..., Y_igv_g^T)^T,

where Y_igv is the T × 1 vector of measurements of the v-th voxel for patient i. We assume a covariance matrix for Y_ig similar to that used by Bernal-Rusiel et al., 2013[12]:

cov(Y_ig) = W_ig = G_g ⊗ Σ_ig =
| G_g11 Σ_ig     G_g12 Σ_ig     ...  G_g1v_g Σ_ig   |
| G_g21 Σ_ig     G_g22 Σ_ig     ...  G_g2v_g Σ_ig   |
| ...            ...            ...  ...            |
| G_gv_g1 Σ_ig   G_gv_g2 Σ_ig   ...  G_gv_gv_g Σ_ig |
where ⊗ is the Kronecker product, and Σ_ig = σ_rg^2 Z_i D_rg Z_i^T + σ_g^2 I_T from equation (2) is the temporal covariance matrix for the i-th patient in the g-th cluster. G_g is the v_g × v_g spatial covariance matrix that accounts for the correlation between voxels present in the g-th cluster. Similarly, we use the spatial structure proposed in Bernal-Rusiel et al., 2013[12], which is empirically useful:

G_g =
| 1                                   exp(−a_g d_12 − b_g d_12^2)      ...  exp(−a_g d_1v_g − b_g d_1v_g^2) |
| exp(−a_g d_21 − b_g d_21^2)         1                                ...  exp(−a_g d_2v_g − b_g d_2v_g^2) |
| ...                                 ...                              ...  ...                             |
| exp(−a_g d_v_g1 − b_g d_v_g1^2)     exp(−a_g d_v_g2 − b_g d_v_g2^2)  ...  1                               |

where a_g, b_g ≥ 0 are unknown parameters and d_jk is the Euclidean distance between voxels j and k in the g-th cluster. Hence, the joint distribution of the Y_ig vector of WM measurements in the g-th cluster is

Y_ig ∼ N(X_ig β_g, W_ig),        (5.3)

where X_ig = I_v_g ⊗ X_i is a (v_g T) × (v_g p) matrix and β_g = (β_g11, ..., β_g1p, β_g21, ..., β_gv_g1, ..., β_gv_gp)^T is a (v_g p) × 1 stacked vector of fixed effects, which gives separate fixed effect estimates for each voxel present in the g-th cluster. Finally, we work with the restricted log-likelihood and Fisher's scoring method to estimate the parameters associated with the covariance matrix W_ig, i.e., (σ_gr, ρ_g, σ_g, a_g, b_g). The restricted log-likelihood is

l = − (1/2) Σ_{i=1}^M log|W_ig| − (1/2) Σ_{i=1}^M (Y_ig − X_ig β̂_g)^T W_ig^{-1} (Y_ig − X_ig β̂_g) − (1/2) log| Σ_{i=1}^M X_ig^T W_ig^{-1} X_ig |,

where β̂_g = [ Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} X_ig ]^{-1} Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} Y_ig is the generalized least squares estimator. One more advantage of segmenting the brain, i.e., clustering homogeneous voxels together, is that it reduces the number of parameters to estimate. In voxel-wise model building, we are required to estimate covariance parameters for each voxel, whereas here we estimate (σ_gr, ρ_g, σ_g, a_g, b_g) only for the G regions. Fisher's scoring algorithm calculates the partial derivatives and the expected information matrix at each iteration to update the parameter estimates:

θ^(k+1) = θ^(k) + I^{-1}(θ^(k)) (∂l/∂θ) |_{θ = θ^(k)}.

Let us define θ_g = (σ_gr, ρ_g, σ_g) and s_g = (a_g, b_g).
Next, define

M_igj = ∂Σ_ig/∂θ_gj,   H_g = Σ_{i=1}^M X_i^T Σ_ig^{-1} X_i,   F_gk = ∂G_g/∂s_gk,
Q_gjk = Σ_{i=1}^M X_i^T Σ_ig^{-1} M_igj Σ_ig^{-1} M_igk Σ_ig^{-1} X_i,   P_gj = Σ_{i=1}^M X_i^T Σ_ig^{-1} M_igj Σ_ig^{-1} X_i,

and let r_ig = Y_ig − X_ig β̂_g denote the residual vector. The individual entries of Fisher's information matrix for the g-th cluster are

I_g(θ_gj, θ_gk) = (v_g/2) Σ_{i=1}^M tr(Σ_ig^{-1} M_igj Σ_ig^{-1} M_igk) − (1/2) tr( H_g^{-1} (2 Q_gjk − P_gj H_g^{-1} P_gk) ),
I_g(θ_gj, s_gk) = (1/2) tr(F_gk G_g^{-1}) [ Σ_{i=1}^M tr(Σ_ig^{-1} M_igj) + tr(H_g^{-1} P_gj) ],
I_g(s_gj, s_gk) = ((M T − p)/2) tr(F_gj G_g^{-1} F_gk G_g^{-1}).

The partial derivatives of the restricted log-likelihood are

∂l/∂θ_gj = − (v_g/2) Σ_{i=1}^M tr(M_igj Σ_ig^{-1}) + (1/2) tr( H_g^{-1} P_gj ) + (1/2) Σ_{i=1}^M r_ig^T (G_g^{-1} ⊗ Σ_ig^{-1} M_igj Σ_ig^{-1}) r_ig,
∂l/∂s_gk = − ((M T − p)/2) tr(F_gk G_g^{-1}) + (1/2) Σ_{i=1}^M r_ig^T (G_g^{-1} F_gk G_g^{-1} ⊗ Σ_ig^{-1}) r_ig.

After we supply all the necessary terms, the iterations converge after the sixth or seventh step.

Table 5.1: Subject Characteristics

  Baseline variables    Women, Mean (sd)   Men, Mean (sd)   p-value
  Age                   62.60 (8.51)       63.62 (9.61)     0.825
  MMSE                  28.50 (0.92)       29.12 (1.35)     0.30
  % Hypertension        37.5%              25%              1
  Systolic BP           133.75 (9.4)       128.81 (7.8)     0.27
  Diastolic BP          78.65 (7.5)        76.93 (8.5)      0.67
  ED                    15.5 (3.11)        17.87 (2.8)      0.13
  % Smoker              0%                 12.5%            1

Notes: p-values computed from two-sample t-tests or tests of association; sd = standard deviation.

5.1.3 WM Data

The subjects in this study were from the Midwestern part of the United States. They were no younger than 50 at their first visit. All subjects had three longitudinal measurements, simplifying the problem to a balanced case. In total, 16 subjects participated. Subjects were screened for a history of psychiatric and neurological disorders, heart disease, and hypertension. They were also screened for cognitive impairment using the Mini Mental Status Examination (MMSE) and for symptoms of depression. The sample comprised 16 healthy subjects (50% women) who were between 50 and 80 years old (mean = 62.6, sd = 8.51 years). The men and women did not differ in age, MMSE, hypertension, or systolic and diastolic pressure. We present the demographic and clinical characteristics in Table 5.1; the corresponding p-values confirm that there was no gender bias in the sample. Moreover, the proportions of smoking, regular exercise, and diagnosed hypertension did not differ by gender.

MRI scans were obtained at three separate time points (T1, T2, and T3), and the DTI sequence was initiated on the first occasion (T1). The mean differences between the time points are: mean T1-T2 = 15.56, SD = 1.03 months; mean T2-T3 = 15.06, SD = 2.8 months. The subjects were scanned using a 1.5T Siemens Magnetom Sonata scanner. Each image has a voxel size of 1.8 × 1.8 × 3 mm^3. In total we have 48 NIfTI volumes (number of patients × time points), each of dimension 182 × 218 × 182. The data contain many extraneous slices outside the skull and extra-cranial tissues, and in order to process the data we need to remove those empty slices. We used the neurobase::dropEmptyImageDimensions function to reduce the dimensions to 121 × 157 × 100. The voxel-wise measurements change across the z-axis, and Figure 5.1 shows how the image changes as we move from head to toe.

Figure 5.1: Change of WM across the z-axis, from top-left to bottom-right: z-slices 15, 30, 45, 60
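The image preprocessing just described can be sketched in R as follows; the file name is a placeholder, and only readnii and dropEmptyImageDimensions from the neurobase package are assumed.

```r
## Sketch of the NIfTI preprocessing and masking steps (placeholder file name).
library(neurobase)

img  <- readnii("fa_image.nii.gz")         # 182 x 218 x 182 NIfTI volume (placeholder path)
img  <- dropEmptyImageDimensions(img)      # removes empty slices, e.g. down to 121 x 157 x 100
mask <- img > 0.5                           # voxels with WM/FA intensity above 0.5
vals <- img[img > 0.2 & img <= 1]           # masked intensities used for the density plot
# hist(vals, freq = FALSE); lines(density(vals))
```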
Figure 5.3 shows the orthographic view of the reduced NIfTI image in three planes for a single time point, together with the voxels whose WM measurements are greater than 0.5. This gives us scope to investigate the brighter regions separately from voxels with lower WM intensity. We are primarily interested in the WM analysis and in studying the WM distribution. After masking the first time point's image, we examine the density plot of the data. Figure 5.2 shows that the WM measurements approximately follow a bell-shaped distribution, with values between 0.2 and 1. This finding supports assuming a Gaussian distribution for the response variable. We also compare the image files for different time points and calculate the correlation between consecutive occasions. Figure 5.5 presents a two-dimensional distribution of the T1 and T2 imaging sequences against each other. We also have another NIfTI object, which contains the directional information about the FA measurement changes. The directional values range from -1 to 1, as plotted in Figure 5.4. Voxels with close directional values have similar properties regarding change in WM/FA values.

Figure 5.2: White matter histogram with density plot
Figure 5.3: White matter measurements greater than 0.5
Figure 5.4: Voxels' directional density
Figure 5.5: Two-dimensional distribution of the T1 and T2 imaging sequences

5.1.4 Results

We start our analysis with the voxel-wise model fit. We work with a linear mixed effect model, where we use eight factors as fixed effects and two time-variant variables as random effects. The following fixed effects are considered: gender, hypertension indicator, MMSE score, baseline age, systolic pressure, diastolic pressure, ED, and smoking indicator. We have few longitudinal observations for each patient, which limits us to two time-variant random effects, namely systolic and diastolic pressure. We consider the intercept as a fixed effect. Coefficients are estimated using Fisher's scoring method, where each voxel has separate estimates of (β, σ_r^2, σ^2, ρ) and a best predictor (BP) for û_i. Our key aim is to calculate residuals for each voxel to use as attributes for the clustering algorithm. We check the randomness of the residuals with respect to the voxel-wise predicted response for each patient and each time point. The plot in Figure 5.6 shows no pattern in the residual values.

Figure 5.6: Residuals vs. predicted plot
Figure 5.7: Within sum of squares by number of clusters

We start our clustering algorithm by calculating each voxel's coefficient of variation, mean, residuals, and directional values, and the polar coordinates of the three-dimensional voxel location. We then use the K-means algorithm to segment the brain into 3,000 clusters with a maximum cluster size of 258 voxels. We check the within sum of squares (WSS) over the number of clusters in Figure 5.7; 3,000 clusters is the optimal point that keeps the WSS near its minimum while retaining a significant number of voxels per cluster. The average cluster size is around 105 voxels. Once we obtain each voxel's cluster number, i.e., the cluster to which it belongs, we attach that label to the voxel as an input for the next stage. We visually examine how voxels are clustered together in each hemisphere. Note that voxels in the same region are similar in four major attributes: mean WM, variation in WM over time, residual values, and voxel location.

Figure 5.8: Segmentation of the brain into similar regions
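A minimal sketch of this clustering step is given below; the voxel table layout (columns x, y, z, mean_wm, sd_wm, resid, direction) is an assumption for illustration, and the attributes are standardized before running K-means.

```r
## Sketch of the voxel clustering step with base R kmeans().
cluster_voxels <- function(vox, k = 3000) {
  r     <- sqrt(vox$x^2 + vox$y^2 + vox$z^2)
  theta <- atan2(vox$y, vox$x)               # quadrant-aware version of tan^(-1)(y/x)
  phi   <- acos(vox$z / r)
  cv    <- vox$sd_wm / vox$mean_wm           # coefficient of variation per voxel
  A <- scale(cbind(cv, vox$mean_wm, vox$resid, vox$direction, r, theta, phi))
  kmeans(A, centers = k, iter.max = 100, nstart = 1)
}

# fit <- cluster_voxels(vox, k = 3000); fit$cluster gives each voxel's region label
```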
We plot the same brain slices in Figure 5.8 as in Figure 5.1 (z-axis slices 15, 30, 45, 60), with voxels labeled by their cluster numbers. Based on this visual representation, we can claim that voxels with similar WM values and nearby locations belong to the same region.

In the next step, we fit the spatiotemporal linear mixed effects model with the same set of predictors used in the voxel-wise longitudinal models, but this time considering all the voxels present in a cluster. Firstly, we estimate the parameters (σ_r^2, ρ, σ^2, a, b) related to the random effects, the measurement error, and the spatial covariance matrix, which together determine the spatiotemporal covariance matrix. We use Fisher's scoring algorithm to estimate these parameters, and all parameters converge within 20 iterations. We inspect the convergence of the parameters for cluster 1 in Figure 5.9. Our motivation behind segmenting the brain is to find regions whose voxels are similar to each other while the clusters differ from one another. To study how much variability we observe from one cluster to another, we plot the estimated parameters for all 3,000 clusters in Figure 5.10. Visually, we can identify that the correlation estimates differ from one cluster to another.

Figure 5.9: Fisher's scoring method convergence
Figure 5.10: Estimated parameters for all clusters

Finally, we discuss our pivotal interest in estimating the β components. As mentioned before, Y_ig ∼ N(X_ig β_g, W_ig) and β̂_g = [ Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} X_ig ]^{-1} Σ_{i=1}^M X_ig^T Ŵ_ig^{-1} Y_ig is the generalized least squares estimator, where β_g is a (v_g p) × 1 vector. This implies that for a particular cluster we obtain v_g sets of β (p × 1) estimates, where the first component is the intercept and the rest of the components belong to the corresponding fixed effects. Note that v_g is the number of voxels present in the g-th cluster. Similar to the previous approach, we study the histograms of the beta components over different clusters. We examine two separate figures, one for the intercept and another for the gender coefficient, in Figures 5.11 and 5.12, respectively. We choose clusters 1, 45, 250, and 500 to inspect the variability. From the plots, it is clear that the voxel-wise estimates differ among clusters, implying voxel variability over regions.

Figure 5.11: Histogram of intercepts
Figure 5.12: Histogram of gender coefficients

5.1.5 Discussion

The linear mixed effect model is a powerful tool that has been used here to capture the spatiotemporal structure hidden in voxel-level data. Bernal-Rusiel et al., 2013[12] proposed this method for analyzing cortical thickness in brain sub-regions. The LME provides a flexible approach to capture longitudinal changes and spatial dependence. This method works better than voxel-wise longitudinal analysis, and the statistical power increases. To the best of our knowledge, this method has not previously been used to analyze longitudinal WM changes in healthy aging subjects. The implications of the whole setup can be divided into three parts. i) First, we find the longitudinal changes of WM in each voxel, and we identify how subjects' age, gender, hypertension, MMSE score, and systolic and diastolic pressure relate to these repeated measures. ii) In the next stage, we find the brain regions that behave similarly with respect to the WM measurements, voxel locations, and voxel-wise model coefficients. This gives us homogeneous regions in the brain on which to build more parsimonious models.
iii) Finally, we develop an individual model for each of these regions, where the covariance matrix is a product of the longitudinal and spatial covariance matrices. We observe that the fixed-effects regression coefficients, the estimated variance, and the covariance components of the random effects vary from one cluster to another. This gives us scope to study longitudinal changes in these homogeneous regions rather than in a voxel-wise study.

The main motivation for building this linear mixed effects spatiotemporal model was to move toward Bayesian model-based clustering with a stochastic objective function. In Section 5.1, we used the K-means algorithm to segment the brain, but we were dissatisfied with the limitations of this method. One such limitation is that we needed to fix the number of clusters a priori, while this is most often unknown. Researchers study the within sum of squares to choose an optimal number of clusters, but it does not have any probabilistic interpretation. An advantage of a stochastic objective function is that it incorporates dependence among voxels through spatial and temporal structures within a cluster. We formulate the method in detail in Section 5.2.

5.2 Bayesian model-based clustering in application to spatiotemporal data

5.2.1 Introduction

Cluster analysis is a modern data mining tool used to group or segment objects into clusters based on similarity measurements. Many available algorithms are built around the criterion that objects within a cluster are more similar to each other than to objects belonging to other clusters. Some popular clustering methods are K-means, agglomerative clustering, hierarchical clustering, and model-based clustering. We focus on model-based clustering in this chapter. A disadvantage of hierarchical clustering and K-means is that they are model-free heuristic methods. Model-based clustering is an alternative that provides the option of formulating a model while grouping objects. Model-based clustering was first introduced by Banfield and Raftery, 1993[6], who assumed that observations come from a multivariate normal distribution and that maximum likelihood yields the estimates. Fraley and Raftery, 2002[37] introduced finite mixture models as a formal setting for model-based clustering. Let y_1, ..., y_n be independently distributed p-dimensional observations from a K-component mixture distribution:

f(y; τ, θ) = Π_{i=1}^n Σ_{k=1}^K τ_k f_k(y_i | θ_k).        (5.4)

Here, τ_k is the probability that a particular object belongs to the k-th cluster, and θ_k parametrizes the density f_k. In general, f_k is a multivariate normal density, f_k ≡ MVN(µ_k, Σ_k). The EM algorithm helps us obtain the estimates of the unknown parameters. Melnykov and Maitra, 2010[88] provide a thorough reference for model-based clustering. One point of subjectivity that is always attached to cluster analysis is how to decide the number of clusters. Similar to K-means, model-based clustering needs a specified number of clusters at the start of the algorithm. Fraley and Raftery, 1998[36] tried to address this question by using model selection criteria; they suggested examining BIC over all candidate cluster numbers and choosing the best value of the criterion. In another notable study, Yeung et al., 2001[131] applied model-based clustering to group gene expression data. Over time, researchers have included other factors in the model-based approach, such as variable selection [101] and penalization of the parameter space θ [95].
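As a toy illustration of the finite-mixture formulation (5.4), the mclust package, which implements the Fraley and Raftery approach discussed above, can be used as follows on simulated data (not the WM data).

```r
## Toy illustration of model-based clustering via finite Gaussian mixtures.
library(mclust)

set.seed(1)
y <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),   # two simulated groups
           matrix(rnorm(200, mean = 3), ncol = 2))
fit <- Mclust(y, G = 1:6)     # EM over K = 1..6 components, model selected by BIC
summary(fit)                   # fit$G is the chosen number of clusters
# plot(fit, what = "BIC")     # BIC profile over candidate numbers of clusters
```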
All the aforementioned studies were performed in a conventional statistical framework, without using any prior information on the cluster structure, the number of clusters, or the unknown parameters (τ, θ). As an exception, Bayesian model selection has been used to select an optimal number of clusters. Handcock et al., 2007[46] described two methods to estimate the unknown parameters: they first used a two-stage maximum likelihood method, and then they proposed a Bayesian estimation method. In the Bayesian approach, the parameters have a multivariate normal prior, the mixture probability τ has a Dirichlet prior, and estimation is by MCMC. Moreover, Fraley and Raftery, 2007[38] discussed the problems with using the EM algorithm and the situations in which it can fail to converge. Instead of the MLE, they preferred to work with a MAP estimator with conventional priors on the mean and covariance parameters, such as a multivariate normal prior on the mean and an Inverse-Wishart distribution on the covariance matrix. Medvedovic et al., 2004[86] proposed a contrasting formulation of a Bayesian mixture-based clustering algorithm for grouping gene expression data with replicates. Interestingly, the Poisson-Dirichlet process is the most logical and convenient way of assigning priors to the number of clusters and the mixture probabilities if one does not want to limit the number of clusters at the start of the algorithm [70].

The methods and literature described above have various advantages and can be applied to the different complicated problems with which researchers deal. Here, we should reiterate the problem we are addressing. We aim to build a Bayesian model-based clustering method with the same WM data used for spatiotemporal modeling in Section 5.1. We have longitudinal voxel-wise WM measurements for healthy aging subjects, and we want to group voxels into homogeneous regions considering both longitudinal and spatial information. To this end, we review some of the literature on spatiotemporal clustering methods. The field of computer science has worked on this topic extensively. Kalnis et al., 2005[65] investigated the highly complicated problem of detecting clusters among moving objects that change location over time; for example, they clustered trajectories and mined movement patterns for a group of migrating animals or a convoy of cars moving in a city. The complexity of this problem is manifold, but we simplify our situation by assuming that the spatial locations of voxels are fixed over time. Kisilevich et al., 2009[67] presented a detailed study of spatiotemporal clustering of trajectories and provided in-depth research developments on this topic.

In Section 5.1.2.2, we applied K-means clustering to group voxels into homogeneous regions without any model assumption. Advancing on that, we seek to build a clustering algorithm that satisfies the clustering criteria and simultaneously helps us to estimate the regression coefficients and variance components for each cluster. Section 5.1 presented a two-stage procedure of clustering objects and then fitting a spatiotemporal model for each cluster. To blend these two processes together and work closely with a Bayesian clustering structure, we explore the method described by Booth et al., 2008[13]. They derived an objective function to cluster objects and a stochastic search process to find the posterior distribution of the number of clusters. We also take advantage of the parsimonious representation of the mean via regression.
Similar ideas were explored in studies published before Booth et al., 2008 [51][48]. In Section 5.2.2, we discuss the general algorithm of our proposed methodology and the model building with posterior derivations. Section 5.2.3 concerns the stochastic search method used to optimize the objective function.

5.2.2 Methodology

5.2.2.1 The Ewens-Pitman prior on cluster partitions

The basic clustering objective is easy to convey. Let us assume we have n voxels in the brain observed for M subjects over T time points. We want to group these voxels into clusters in a way that is independent of subject variation, and the algorithm should consider the spatial and temporal covariance structures together. Voxels belonging to the same cluster should be similar to each other, whereas they should differ widely from members of other clusters. As stated in Booth et al., 2008[13], we want to find an objective function π : P_n → R+, where P_n is the set of all possible partitions of the n objects. Partitions must consist of non-empty sets that do not overlap. π assigns a score to each partition, measuring how well it meets the overall objective. Equivalently, we need to optimize the objective function π so that we obtain the partition with the highest score; that partition will be the best partition under all the stated considerations. In addition, we use the same linear mixed effect model as the one described in Section 5.1.2.3, which accounts for the spatiotemporal dependence among voxels along with the other fixed and random effects.

Instead of assuming that objects come from equation (5.4) with a fixed number of groups K, it is more realistic to assume that there is an unknown partition ω with c = c(ω) clusters. We can denote the clusters as C_1, ..., C_c, and equation (5.4) can be written as

f(y | θ_1, ..., θ_c, ω) = Π_{k=1}^{c(ω)} Π_{i ∈ C_k} f(y_i | θ_k),        (5.5)

with C_i ∩ C_j = Φ for i ≠ j. The introduction of ω leads to uncertainty in the number of clusters, i.e., a probabilistic component attached to each partition. Once we find the estimate ω̂, we can fix the cluster partition with full confidence and find the estimate θ̂ for each cluster. Equation (5.5) draws a general picture of the clustering proposal. Spatiotemporal data need a separate covariance structure; we work with the covariance structure of Bernal-Rusiel et al., 2013[12], which we used for the WM data in Section 5.1. We believe that this structure accurately captures the dependence between voxels, as well as the temporal changes across time points, and models the spatiotemporal data in a non-functional way. The objective function that finalizes the cluster partition is the posterior distribution π(ω|y); hence we need a prior distribution on ω. The optimization of π(ω|y) is computationally very challenging, especially for spatiotemporal data. As stated in Booth et al., 2008[13], if there are n data points to cluster, the total number of possible partitions is the Bell number[116], denoted by B(n) = #P_n. B(n) increases rapidly with n; the table below gives an idea of the relationship between n and B(n).

  n      1   2   3   4    5    6     7     8      9       10       40           100
  B(n)   1   2   5   15   52   203   877   4140   21147   115975   1.6 × 10^35   4.8 × 10^115

We have around 316,000 voxels to cluster, which is computationally challenging, and we also need a stochastic search algorithm to optimize the objective function. We are still working on building this algorithm. We now focus on the prior distribution of ω.
The posterior distribution π(ω|y) is calculated from π(ω) and π(θ|ω) by integrating the nuisance parameter θ out of the likelihood. It should be noted that the estimates of θ are cluster specific and should depend on ω. We explore the same prior used by Booth et al., 2008[13] for ω ∈ P_n:

π(ω) = [ Γ(m) m^{c(ω)} / Γ(n + m) ] Π_{k=1}^{c(ω)} Γ(n_k),   c(ω) = 1, ..., n,   ω ∈ P_n,        (5.6)

where n_k = #(C_k) is the number of objects in cluster C_k, n = Σ_{k=1}^{c(ω)} n_k is the total number of objects in the study, and m > 0 is a parameter. This distribution was first used by Crowley, 1997[27]. This prior is also known as "the Ewens-Pitman prior"[22], and McCullagh and Yang, 2006[83] presented a detailed mathematical derivation of it. Moreover, when c(ω) is left unbounded, this prior is also called the Chinese restaurant process. The parameter m controls the number of clusters to be formed, and the expected number of clusters increases with m. When m → 0, we obtain a single cluster with all objects gathered into it. Similarly, for m → ∞, the number of clusters approaches n, the number of objects. As mentioned in Booth et al., 2008[13], the expected number of clusters formed when ω ∼ π(ω) is

E[c(ω)] = m Σ_{i=0}^{n−1} 1/(m + i).

This prior has two other interesting properties:

• Exchangeability: If two partitions (∈ P_n) have the same c(ω) and the same cluster sizes n_1, n_2, ..., n_{c(ω)}, then they have exactly the same probability under π(ω).

• Consistency: The prior probability of a partition of n objects (ω* ∈ P_n) does not depend on whether an (n + 1)-th element enters the study; i.e., for ω* ∈ P_n and the set S ⊂ P_{n+1} of partitions that reduce to ω* after removing the (n + 1)-th element, Σ_{ω ∈ S} π_{n+1}(ω) = π_n(ω*).

Next, we define our assumed model and the prior π(θ|ω). We will calculate the posterior of ω as

π(ω|y) ∝ ∫_θ f(y | θ, ω) π(θ|ω) π(ω) dθ.

5.2.2.2 Spatiotemporal linear mixed effects model:

In this part, we focus on the same linear mixed effect model as the one used in Section 5.1.2.3. Let us assume that y_ij is the i-th voxel's measurement on the j-th subject, observed over T time points, and that for a fixed ω the i-th voxel belongs to the k-th cluster C_k containing n_k voxels. Then

y_ij = X_j β_k + Z_j u_i + ε_ij,   for i = 1, ..., n_k and j = 1, ..., M,
u_i ∼ N_q(0, λ σ_k^2 I_q),   ε_ij ∼ N_T(0, σ_k^2 I_T),        (5.7)

where X_j (T × p) is the j-th subject's fixed effect design matrix, Z_j (T × q) is the j-th subject's time-variant random effect matrix, β_k is a p-dimensional coefficient vector for the k-th cluster, u_i is the i-th voxel's random effect, and ε_ij is the measurement error. Once we aggregate a single voxel's measurements over all M subjects, we obtain

y_i = X β_k + Z u_i + ε_i,   i = 1, ..., n_k,
cov(y_i) = σ_k^2 ( λ Z Z^T + I_MT ) = σ_k^2 Σ.        (5.8)

Σ contains the temporal covariance matrix, and we can assume that Z is block-wise orthogonal, such that each patient has an independent random effect. Once we consider that all n_k voxels belong to the k-th cluster, we obtain the (n_k M T) × 1 vector y_k = (y_1^T, ..., y_{n_k}^T)^T, and we write its covariance matrix as a product of spatial and temporal covariance matrices:

cov(y_k) = σ_k^2 W_k = σ_k^2 (G_k ⊗ Σ) = σ_k^2
| G_k11 Σ     G_k12 Σ     ...  G_k1n_k Σ   |
| G_k21 Σ     G_k22 Σ     ...  G_k2n_k Σ   |
| ...         ...         ...  ...         |
| G_kn_k1 Σ   G_kn_k2 Σ   ...  G_kn_kn_k Σ |

where ⊗ is the Kronecker product, and Σ = λ Z Z^T + I_MT from equation (5.8) is the temporal covariance matrix for all patients' measurements at the voxels present in the k-th cluster. G_k is the n_k × n_k spatial covariance matrix that accounts for the correlation between the voxels present in the k-th cluster. The spatial structure below, on which we also relied in the previous sections, was proposed by Bernal-Rusiel et al., 2013[12].
It is empirically useful:

G_k =
| 1                                   exp(−a_k d_12 − b_k d_12^2)      ...  exp(−a_k d_1n_k − b_k d_1n_k^2) |
| exp(−a_k d_21 − b_k d_21^2)         1                                ...  exp(−a_k d_2n_k − b_k d_2n_k^2) |
| ...                                 ...                              ...  ...                             |
| exp(−a_k d_n_k1 − b_k d_n_k1^2)     exp(−a_k d_n_k2 − b_k d_n_k2^2)  ...  1                               |

where a_k, b_k ≥ 0 are unknown parameters and d_ij is the Euclidean distance between voxels i and j in the k-th cluster. Hence, the joint distribution of the y_k vector of WM measurements in the k-th cluster is

y_k ∼ N(X* β_k, σ_k^2 W_k),        (5.9)

where X* = 1_{n_k} ⊗ X is an (n_k M T) × p matrix. For the time being, we assume that W_k is known. W_k has three unknown parameters λ, a_k, b_k; (a_k, b_k) can be plugged in from our previous study, and we estimate λ by a data-driven method. Our primary focus is on the parameters θ_k = (β_k, σ_k^2), and a convenient non-informative prior on (β_k, log(σ_k)) is

π(β_k, σ_k^2 | ω) ∝ 1/σ_k^2.

5.2.2.3 Objective function derivation:

Once we decide on the prior distributions, it is easy to derive the posterior distributions. We first determine the posterior distribution π(β_k | σ_k^2, y_k, ω), conditional on σ_k^2 and ω. Next, we ascertain the marginal posterior distribution π(σ_k^2 | y_k, ω):

(β_k | σ_k^2, y_k, ω) ∼ N(β̂_k, σ_k^2 V_βk),   where   β̂_k = (X*^T W_k^{-1} X*)^{-1} X*^T W_k^{-1} y_k,   V_βk = (X*^T W_k^{-1} X*)^{-1};        (5.10)

(σ_k^2 | y_k, ω) ∼ Inv-χ^2( (n_k M T) − p, s_k^2 ),        (5.11)

where

s_k^2 = (y_k − X* β̂_k)^T W_k^{-1} (y_k − X* β̂_k) / ((n_k M T) − p).        (5.12)

To find π(ω|y), we need to integrate out β_k and σ_k^2 so that we can find

π(y_k | ω) ∝ ∫∫ f(y_k | β_k, σ_k^2, ω) π(β_k, σ_k^2 | ω) dβ_k dσ_k^2
           ∝ Γ( ((n_k M T) − p)/2 ) [ (1/2)(y_k − X* β̂_k)^T W_k^{-1} (y_k − X* β̂_k) ]^{ −((n_k M T) − p)/2 }.

Finally, we take the product of the above function over all clusters and multiply it by the prior distribution on cluster partitions, π(ω). This provides our long-desired objective function:

π(ω|y) ∝ π(ω) Π_{k=1}^{c(ω)} π(y_k | ω)
        ∝ [ Γ(m) m^{c(ω)} / Γ(n + m) ] Π_{k=1}^{c(ω)} Γ(n_k) Γ( ((n_k M T) − p)/2 ) [ (1/2)(y_k − X* β̂_k)^T W_k^{-1} (y_k − X* β̂_k) ]^{ −((n_k M T) − p)/2 }.        (5.13)

The posterior function π(ω|y) is logically reasonable. The residual quadratic form calculates the within-cluster sum of squares and prefers partitions with lower WSS. Furthermore, through the Γ(n_k) terms, this posterior objective function prefers large homogeneous regions. It also measures lack of fit from the linear mixed effect model we have assumed. Optimizing the objective function is challenging work.

CHAPTER 6
FUTURE WORK

This dissertation focused on Bayesian variable selection and its application to brain image data. We worked extensively with Bayesian model-based clustering in order to segment the brain into similar regions. Optimization of the objective function remains future work. As discussed earlier, the total number of possible partitions of roughly 316 thousand voxels is astronomically large, so it is implausible to optimize the function over the full space. In order to utilize this objective function, we decided to reduce our sample space. We will generate a large number of candidate partitions using the K-means procedure described in Section 5.1.2.2 with different cluster numbers. Once we finalize these partitions, the objective function derived above will serve as the criterion for picking the best partition among the selected candidates, as sketched below.
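A sketch of evaluating the log of the objective function (5.13) for one candidate partition is given below; the per-cluster quantities (y_k, X*_k, W_k^{-1}) and the Ewens-Pitman parameter m are assumed to be precomputed, and the function returns the log posterior only up to an additive constant.

```r
## Sketch: log objective function (5.13) for a candidate partition, up to a constant.
## y_list, Xstar_list, W_inv_list hold per-cluster y_k, X*_k and W_k^{-1}; MT = M * T.
log_objective <- function(y_list, Xstar_list, W_inv_list, m, MT, p) {
  c_omega <- length(y_list)
  n       <- sum(sapply(y_list, length)) / MT                 # total number of voxels
  lp      <- lgamma(m) + c_omega * log(m) - lgamma(n + m)     # Ewens-Pitman prior term
  for (k in seq_len(c_omega)) {
    yk <- y_list[[k]]; Xk <- Xstar_list[[k]]; Wk_inv <- W_inv_list[[k]]
    nk   <- length(yk) / MT
    bhat <- solve(t(Xk) %*% Wk_inv %*% Xk, t(Xk) %*% Wk_inv %*% yk)
    rss  <- as.numeric(t(yk - Xk %*% bhat) %*% Wk_inv %*% (yk - Xk %*% bhat))
    lp   <- lp + lgamma(nk) + lgamma((nk * MT - p) / 2) -
            ((nk * MT - p) / 2) * log(rss / 2)
  }
  lp
}
```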
We focus on the Metropolis-Hastings algorithm based on the biased random walk described in Booth et al., 2008[13]. Each partition is treated as a node of an undirected graph, and edges are created based on the connections from one node to another. As the authors describe, if we move one element from a cluster of a particular partition, place it in another cluster, and thereby obtain a new partition, then these two partitions share an edge. Following this logic, however, we will not obtain a symmetric candidate density for the Metropolis-Hastings algorithm. Our future work will be to propose an alternative graph construction that provides a symmetric candidate proposal. There is also a potential problem with our reduced sample space approach. As mentioned, we will restrict our space to a large (but finite) number of partitions, and it is possible that two of these partitions will share no edge, i.e., have no connection in terms of their voxel allocations. We need further work to handle this situation. All of these computational problems will be addressed in our future work.

APPENDIX

A.1 Chapter 3: Tables for classification performance using combinations of ROIs

Table A1: Combination of two ROIs

  ROIs (Mean ± SD)   sensitivity    specificity    accuracy       AUC
  H + WB             0.87 ± 0.02    0.85 ± 0.01    0.86 ± 0.01    0.92 ± 0.02
  H + FG             0.86 ± 0.01    0.88 ± 0.01    0.87 ± 0.01    0.93 ± 0.02
  H + EC             0.86 ± 0.02    0.88 ± 0.02    0.87 ± 0.001   0.94 ± 0.02
  H + MTC            0.84 ± 0.02    0.90 ± 0.01    0.88 ± 0.01    0.93 ± 0.02
  WB + EC            0.81 ± 0.02    0.90 ± 0.01    0.87 ± 0.01    0.92 ± 0.03
  WB + FG            0.77 ± 0.03    0.84 ± 0.01    0.81 ± 0.01    0.89 ± 0.03
  WB + MTC           0.78 ± 0.01    0.88 ± 0.01    0.84 ± 0.01    0.90 ± 0.02
  EC + FG            0.84 ± 0.02    0.89 ± 0.001   0.87 ± 0.01    0.92 ± 0.03
  EC + MTC           0.82 ± 0.04    0.89 ± 0.01    0.86 ± 0.01    0.92 ± 0.02
  FG + MTC           0.81 ± 0.03    0.84 ± 0.01    0.83 ± 0.01    0.91 ± 0.03

Notes: Sensitivity is the proportion of correct AD predictions, specificity is the proportion of correct CN predictions, and AUC is the area under the ROC curve. The mean and standard deviation are based on 100 repeated results on test data sets. Probability threshold is 0.5.

Table A2: Combination of three ROIs

  ROIs (Mean ± sd)   sensitivity    specificity    accuracy        AUC
  H+WB+EC            0.86 ± 0.02    0.86 ± 0.02    0.86 ± 0.01     0.91 ± 0.05
  H+WB+FG            0.91 ± 0.02    0.80 ± 0.02    0.84 ± 0.01     0.91 ± 0.10
  H+WB+MTC           0.87 ± 0.05    0.86 ± 0.03    0.87 ± 0.02     0.93 ± 0.07
  H+EC+FG            0.85 ± 0.03    0.87 ± 0.02    0.86 ± 0.001    0.93 ± 0.03
  H+EC+MTC           0.83 ± 0.02    0.90 ± 0.01    0.87 ± 0.01     0.93 ± 0.03
  H+FG+MTC           0.86 ± 0.02    0.87 ± 0.01    0.86 ± 0.01     0.93 ± 0.02
  WB+EC+FG           0.84 ± 0.04    0.86 ± 0.01    0.85 ± 0.02     0.91 ± 0.03
  WB+EC+MTC          0.85 ± 0.02    0.89 ± 0.01    0.88 ± 0.01     0.94 ± 0.02
  WB+FG+MTC          0.82 ± 0.02    0.86 ± 0.03    0.84 ± 0.02     0.91 ± 0.03
  EC+FG+MTC          0.86 ± 0.02    0.85 ± 0.01    0.85 ± 0.01     0.92 ± 0.03

Notes: Sensitivity is the proportion of correct AD predictions, specificity is the proportion of correct CN predictions, and AUC is the area under the ROC curve. The mean and standard deviation are based on 100 repeated results on test data sets. Probability threshold is 0.5.
Table A3: Classification performance for combinations of four and five ROIs

ROI combination        Sensitivity    Specificity    Accuracy       AUC
H+WB+FG+MTC            0.91 ± 0.01    0.79 ± 0.04    0.84 ± 0.02    0.92 ± 0.06
H+WB+EC+MTC            0.77 ± 0.04    0.89 ± 0.03    0.84 ± 0.02    0.89 ± 0.10
H+WB+EC+FG             0.74 ± 0.08    0.90 ± 0.02    0.84 ± 0.03    0.85 ± 0.14
H+EC+FG+MTC            0.85 ± 0.02    0.84 ± 0.02    0.84 ± 0.02    0.91 ± 0.05
WB+EC+FG+MTC           0.85 ± 0.03    0.86 ± 0.02    0.86 ± 0.01    0.92 ± 0.03
H+WB+EC+FG+MTC*        0.81 ± 0.01    0.77 ± 0.01    0.79 ± 0.01    0.87 ± 0.02

Notes: Entries are means ± standard deviations (SD) over 100 repeated results on test data sets. Sensitivity is the proportion of correct AD predictions, specificity is the proportion of correct CN predictions, and AUC is the area under the ROC curve. The probability threshold is 0.5. * Normalized volumes were used for this model run.

A.2 Chapter 4: Posterior consistency proofs

Let us define $B(r_n) = \sup_{\gamma=\gamma(r_n)} Ch(G_{r_n}^{-1})$ and $\bar{B}(r_n) = \sup_{\gamma=\gamma(r_n)} Ch(G_{r_n})$, where $G_{r_n} = \mathrm{diag}(\tau^*_1 I_q, \ldots, \tau^*_{r_n} I_q)$, so that $\bar{B}(r_n)$ bounds the largest eigenvalue of $G_{r_n}$, and let $D(R) = 1 + R \cdot \sup_{|h| \leq R}|a'(h)| \sup_{|h| \leq R}|\psi(h)|$. Let $\epsilon_n \in (0,1]$ with $n\epsilon_n^2 \gg 1$, and assume the conditions below hold, which come from the Jiang (2007) [63] paper.

Conditions:
(i) $p_n \log(1/\epsilon_n^2) \prec n\epsilon_n^2$
(ii) $p_n \log(p_n) \prec n\epsilon_n^2$
(iii) $p_n \log\!\big(D(p_n \bar{B}(r_n)\, n\, \lambda_n)\big) \prec n\epsilon_n^2$
(iv) $r_n \prec p_n$
(v) $r_n \log(\bar{B}(r_n)\, n) \prec n\epsilon_n^2$ and $\Delta(r_n) \prec n\epsilon_n^2$
(vi) $p_n \log(r_n/p_n) \leq -4 n\epsilon_n^2$

Proof of Condition I: If $\psi(u) = e^u/(1+e^u)$, then $w(u) = \log[\psi(u)/(1-\psi(u))] = u$, so $w'(u) = 1$ and hence $|w'(u)| \leq C_q$.

Proof of Condition S: The proof starts by defining the sets and notation used in condition S. Let $r_n$ be a large integer $> 0$ and let $\eta > 0$ be small; then

$$S(r_n, \eta) = \left\{(\gamma, \beta_\gamma) : \gamma = \gamma(r_n),\ \beta_\gamma \in M(r_n, \epsilon_n)\right\}, \qquad M(r_n, \epsilon_n) = \left\{(b_1, \ldots, b_{r_n})^T : b_j \in \beta_j \pm \frac{\epsilon_n^2}{r_n},\ j = 1, \ldots, r_n\right\}.$$

Here $r_n$ is the model size and $\gamma(r_n) = (1, 1, \ldots, 1, 0, \ldots)$ is the inclusion vector whose first $r_n$ components take value 1. Let $1 \prec r_n \prec \min(p_n, n/\log(p_n))$ and $\sum_{j=1}^{\infty} \|\beta_j\|_2 < \infty$. Then

$$\pi_n\!\left[\beta_\gamma \in \beta_j \pm \frac{\epsilon_n^2}{r_n} \,\middle|\, \gamma = \gamma(r_n)\right] \geq \prod_{j=1}^{r_n} \frac{(\lambda_n^2)^{q/2}}{(2\pi)^{(q-1)/2}} \exp\!\left(-\lambda_n \sqrt{\bar\beta_j^T \bar\beta_j}\right) \left(\frac{\epsilon_n^2}{r_n}\right),$$

where $\bar\beta_j$ is some intermediate value which achieves the minimum density over $(\beta_j \pm \epsilon_n^2/r_n)_{j \in \gamma(r_n)}$. Then

$$\sum_{j=1}^{r_n} \lambda_n \sqrt{\bar\beta_j^T \bar\beta_j} \leq C_1 B(r_n),$$

as $\sum_{j=1}^{r_n} \sqrt{\bar\beta_j^T \bar\beta_j} \leq \sum_{j=1}^{r_n} \sqrt{\beta_j^T \beta_j} + \frac{\epsilon_n^2}{r_n} \leq \lim_{n\to\infty} \sum_{j=1}^{p_n} \|\beta_j\|_2$ is bounded. In addition, we can show that

$$\prod_{j=1}^{r_n} \frac{(\lambda_n^2)^{q/2}}{(2\pi)^{(q-1)/2}} \geq \exp\!\left(-C_2 r_n - C_3 r_n \log(\bar{B}(r_n))\right),$$

where $\bar{B}(r_n) = \sup_{\gamma = \gamma(r_n)} Ch(G_{r_n})$. Therefore,

$$\pi_n\!\left[\beta_\gamma \in \beta_j \pm \frac{\epsilon_n^2}{r_n} \,\middle|\, \gamma = \gamma(r_n)\right] \geq \exp\!\left(-C_2 r_n - C_3 r_n \log(\bar{B}(r_n)) - C_1 B(r_n) - r_n \log\frac{r_n}{\epsilon_n^2}\right) \geq \exp(-c\, n\epsilon_n^2).$$

To prove the prior condition: let $\bar{r}_n$ be such that $r_n < \bar{r}_n \leq p_n$ and $\bar{r}_n \prec n/\ln(p_n)$. For our model we have placed a Beta distribution on $\pi_{0,n}$, which is an equivalent way of proposing a Bernoulli distribution on $\gamma = \gamma(r_n)$, where $\gamma(r_n) \sim \text{Bernoulli}(\pi_{0,n})$ [115]. Now,

$$\ln \pi_n = r_n \ln \pi_{0,n} + (p_n - r_n) \ln(1 - \pi_{0,n}).$$

If $r_n \approx p_n \pi_{0,n}$, so that $\pi_{0,n} = r_n/p_n$ is small, and $1 \prec r_n \prec \min(p_n, n/\ln p_n)$, then

$$\ln \pi_n \geq -r_n \ln p_n > -c\, n\epsilon_n^2 \implies \pi_n[\gamma = \gamma(r_n)] > \exp(-c\, n\epsilon_n^2) \ \text{for large } n,$$

satisfying condition (S).
Proof of Condition L: Let us assume $D(R) = 1 + R \cdot \sup_{|h| \leq R}|a'(h)| \sup_{|h| \leq R}|\psi(h)|$ and that there exists some $C_n$ such that

$$\bar{r}_n \ln(1/\epsilon_n^2) \prec n\epsilon_n^2, \qquad \bar{r}_n \ln(p_n) \prec n\epsilon_n^2, \qquad \bar{r}_n \ln D(\bar{r}_n C_n) \prec n\epsilon_n^2.$$

Then,

$$\pi_n(|\gamma| > \bar{r}_n) = \pi_n(|\gamma| = p_n) = \left(\frac{r_n}{p_n}\right)^{p_n} \implies \ln \pi_n(|\gamma| > \bar{r}_n) = p_n \ln\!\left(\frac{r_n}{p_n}\right) \leq -c\, n\epsilon_n^2 \implies \pi_n(|\gamma| > \bar{r}_n) \leq e^{-c\, n\epsilon_n^2}.$$

Next,

$$\pi_n(\|\beta_j\|_2 > t \mid \gamma) \propto \int_t^{\infty} e^{-\lambda_n \sqrt{\beta_j^T \beta_j}}\, d\beta_j \leq \frac{1}{\lambda_n} e^{-\lambda_n t}.$$

If $t = C_n = \frac{c\, n\epsilon_n^2}{\lambda_n}$ and $n\epsilon_n^2 \gg 1$, then

$$\frac{1}{\lambda_n} e^{-\lambda_n t} \leq e^{-c\, n\epsilon_n^2}, \quad \text{as } \lambda_n \geq 1,$$

satisfying condition (L).

BIBLIOGRAPHY

[1] Stanisław Adaszewski, Juergen Dukart, Ferath Kherif, Richard Frackowiak, and Bogdan Draganski. How early can we predict alzheimer’s disease using computational anatomy? Neurobiology of aging, 34(12):2815–2826, 2013.

[2] James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association, 88(422):669–679, 1993.

[3] DR Anderson and K Burnham. Model selection and multi-model inference. Second. NY: Springer-Verlag, page 63, 2004.

[4] Sönke Arlt, Ralph Buchert, Lothar Spies, Martin Eichenlaub, Jan T Lehmbeck, and Holger Jahn. Association between fully automated mri-based volumetry of different brain regions and neuropsychological test performance in patients with amnestic mild cognitive impairment and alzheimer’s disease. European archives of psychiatry and clinical neuroscience, 263(4):335–344, 2013.

[5] Takeshi Asami, Sylvain Bouix, Thomas J Whitford, Martha E Shenton, Dean F Salisbury, and Robert W McCarley. Longitudinal loss of gray matter volume in patients with first-episode schizophrenia: Dartel automated analysis and roi validation. Neuroimage, 59(2):986–996, 2012.

[6] Jeffrey D Banfield and Adrian E Raftery. Model-based gaussian and non-gaussian clustering. Biometrics, pages 803–821, 1993.

[7] J Barnes, RI Scahill, C Frost, JM Schott, MN Rossor, and NC Fox. Increased hippocampal atrophy rates in ad over 6 months using serial mr imaging. Neurobiology of aging, 29(8):1199–1203, 2008.

[8] Douglas M. Bates and Donald G. Watts. Nonlinear Regression Analysis and Its Applications. 1988.

[9] Sam Behseta, Robert E Kass, and Garrick L Wallstrom. Hierarchical models for assessing variability among functions. Biometrika, 92(2):419–434, 2005.

[10] Andrew R Bender, Manuel C Völkle, and Naftali Raz. Differential aging of cerebral white matter in middle-aged and older adults: a seven-year follow-up. Neuroimage, 125:74–83, 2016.

[11] Jorge L Bernal-Rusiel, Douglas N Greve, Martin Reuter, Bruce Fischl, Mert R Sabuncu, Alzheimer’s Disease Neuroimaging Initiative, et al. Statistical analysis of longitudinal neuroimage data with linear mixed effects models. Neuroimage, 66:249–260, 2013.

[12] Jorge L Bernal-Rusiel, Martin Reuter, Douglas N Greve, Bruce Fischl, Mert R Sabuncu, Alzheimer’s Disease Neuroimaging Initiative, et al. Spatiotemporal linear mixed effects modeling for the mass-univariate analysis of longitudinal neuroimage data. Neuroimage, 81:358–370, 2013.

[13] James G Booth, George Casella, and James P Hobert. Clustering using objective functions and stochastic search. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):119–139, 2008.

[14] F Dubois Bowman. Spatiotemporal models for region of interest analyses of functional neuroimaging data. Journal of the American Statistical Association, 102(478):442–453, 2007.

[15] F DuBois Bowman, Brian Caffo, Susan Spear Bassett, and Clinton Kilts. A bayesian hierarchical framework for spatial modeling of fmri data.
NeuroImage, 39(1):146–156, 2008. [16] F DuBois Bowman and Clinton Kilts. Modeling intra-subject correlation among repeated scans in positron emission tomography (pet) neuroimaging data. Human brain mapping, 20(2):59–70, 2003. [17] Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009. [18] Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. [19] Ramon Casanova, Benjamin Wagner, Christopher T Whitlow, Jeff D Williamson, Sally A Shumaker, Joseph A Maldjian, and Mark A Espeland. High dimensional classification of structural mri alzheimer’s disease data based on large scale regularization. Frontiers in neuroinformatics, 5:22, 2011. [20] George Casella. Empirical bayes gibbs sampling. Biostatistics, 2(4):485–500, 2001. [21] George Casella and Elias Moreno. Objective bayesian variable selection. Journal of the American Statistical Association, 101(473):157–167, 2006. [22] George Casella, Elías Moreno, F Javier Girón, et al. Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, 9(3):613–658, 2014. [23] Arindam Chatterjee and Soumendra Nath Lahiri. Bootstrapping lasso estimators. Journal of the American Statistical Association, 106(494):608–625, 2011. [24] Ray-Bing Chen, Chi Hsiang Chu, Shinsheng Yuan, and Ying Nian Wu. Bayesian sparse group selection. Journal of Computational and Graphical Statistics, 25(3):665–683, 2016. [25] G Chetelat, B Landeau, F Eustache, F Mezenge, F Viader, V De La Sayette, B Desgranges, and J-C Baron. Using voxel-based morphometry to map the structural changes associated with rapid conversion in mci: a longitudinal mri study. Neuroimage, 27(4):934–946, 2005. [26] Hee Min Choi, James P Hobert, et al. The polya-gamma gibbs sampler for bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7:2054–2064, 2013. 135 [27] Evelyn M. Crowley. Product partition models for normal means. Journal of the American Statistical Association, 92(437):192–198, 1997. [28] Christophe Delaloye, Guenael Moy, Fabienne De Bilbao, K Weber, S Baudois, Sven Haller, Aikaterini Xekardaki, Alessandra Canuto, U Giardini, K-O Lövblad, et al. Longitudinal analysis of cognitive performances and structural brain changes in late-life bipolar disorder. International journal of geriatric psychiatry, 26(12):1309–1318, 2011. [29] Gordana Derado, F DuBois Bowman, Lijun Zhang, and Alzheimer’s Disease Neuroimaging Initiative. Predicting brain activity using a bayesian spatial model. Statistical methods in medical research, 22(4):382–397, 2013. [30] Vasilis K Dertimanis, Minas D Spiridonakos, and Eleni N Chatzi. Data-driven uncer- tainty quantification of structural systems via b-spline expansion. Computers & Structures, 207:245–257, 2018. [31] [32] Ilaria Dimatteo, Christopher R. Genovese, and Robert E. Kass. Bayesian curve-fitting with free-knot splines. Biometrika, 88(4):1055–1071, 2001. Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. [33] Yingying Fan, Gareth M James, Peter Radchenko, et al. Functional additive regression. The Annals of Statistics, 43(5):2296–2325, 2015. [34] Yong Fan, Nematollah Batmanghelich, Chris M Clark, Christos Davatzikos, Alzheimer’s Disease Neuroimaging Initiative, et al. 
Spatial patterns of brain atrophy in mci patients, identified via high-dimensional pattern classification, predict subsequent cognitive decline. Neuroimage, 39(4):1731–1743, 2008. [35] Anders M Fjell, Kristine B Walhovd, Christine Fennema-Notestine, Linda K McEvoy, Donald J Hagler, Dominic Holland, James B Brewer, and Anders M Dale. One-year brain atrophy evident in healthy aging. Journal of Neuroscience, 29(48):15223–15231, 2009. [36] Chris Fraley and Adrian E Raftery. How many clusters? which clustering method? answers via model-based cluster analysis. The computer journal, 41(8):578–588, 1998. [37] Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American statistical Association, 97(458):611–631, 2002. [38] Chris Fraley and Adrian E Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of classification, 24(2):155–181, 2007. Jerome H. Friedman and Bernard W. Silverman. Flexible parsimonious smoothing and additive modeling. Technometrics, 31(1):3–21, 1989. [39] 136 [40] Thomas Frodl, Annette Schaub, Sandra Banac, Marketa Charypar, Markus Jäger, Petra Kümmler, Ronald Bottlender, Thomas Zetzsche, Christine Born, Gerda Leinsinger, et al. Reduced hippocampal volume correlates with executive dysfunctioning in major depression. Journal of Psychiatry and Neuroscience, 31(5):316, 2006. [41] Sylvia Frühwirth-Schnatter and Rudolf Frühwirth. Auxiliary mixture sampling with appli- cations to logistic models. Computational Statistics & Data Analysis, 51(7):3509–3528, 2007. [42] Alexander Genkin, David D Lewis, and David Madigan. Large-scale bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007. [43] Edward I George and Robert E McCulloch. Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993. [44] Edward I George and Robert E McCulloch. Approaches for bayesian variable selection. Statistica sinica, pages 339–373, 1997. [45] Robert B Gramacy, Nicholas G Polson, et al. Simulation-based regularized logistic regres- sion. Bayesian Analysis, 7(3):567–590, 2012. [46] Mark S Handcock, Adrian E Raftery, and Jeremy M Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301–354, 2007. [47] Trevor J. Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. 2009. [48] Nicholas A Heard, Christopher C Holmes, and David A Stephens. A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of bayesian hierarchical clustering of curves. Journal of the American Statistical Association, 101(473):18–29, 2006. [49] WJP Henneman, JD Sluimer, J Barnes, WM Van Der Flier, IC Sluimer, NC Fox, Ph Scheltens, H Vrenken, and F Barkhof. Hippocampal atrophy rates in alzheimer disease added value over whole brain volume measures. Neurology, 72(11):999–1007, 2009. [50] Chris Hinrichs, Vikas Singh, Lopamudra Mukherjee, Guofan Xu, Moo K Chung, Sterling C Johnson, Alzheimer’s Disease Neuroimaging Initiative, et al. Spatially augmented lpboosting for ad classification with evaluations on the adni dataset. Neuroimage, 48(1):138–149, 2009. [51] David B Hitchcock, George Casella, and James G Booth. Improved estimation of dissim- ilarities by presmoothing functional data. Journal of the American Statistical Association, 101(473):211–222, 2006. 
[52] James P Hobert and Charles J Geyer. Geometric ergodicity of gibbs and block gibbs samplers for a hierarchical random effects model. Journal of Multivariate Analysis, 67(2):414–430, 1998. 137 [53] Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volinsky. Bayesian model averaging: a tutorial. Statistical science, pages 382–401, 1999. [54] Chris C Holmes, Leonhard Held, et al. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian analysis, 1(1):145–168, 2006. [55] Xue Hua, Suh Lee, Igor Yanovsky, Alex D Leow, Yi-Yu Chou, April J Ho, Boris Gutman, Arthur W Toga, Clifford R Jack Jr, Matt A Bernstein, et al. Optimizing power to track brain degeneration in alzheimer’s disease and mild cognitive impairment with tensor-based morphometry: an adni study of 515 subjects. Neuroimage, 48(4):668–681, 2009. Jian Huang, Joel L. Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. Annals of Statistics, 38(4):2282–2313, 2010. [56] [57] Hemant Ishwaran, J Sunil Rao, et al. Spike and slab variable selection: frequentist and bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005. [58] Clifford R Jack, Ronald C Petersen, Yue Cheng Xu, Peter C O’Brien, Glenn E Smith, Robert J Ivnik, Bradley F Boeve, Stephen C Waring, Eric G Tangalos, and Emre Kokmen. Predic- tion of ad with mri-based hippocampal volume in mild cognitive impairment. Neurology, 52(7):1397–1397, 1999. [59] Clifford R Jack, Ronald C Petersen, Yuecheng Xu, Peter C O’Brien, Glenn E Smith, Robert J Ivnik, Eric G Tangalos, and Emre Kokmen. Rate of medial temporal lobe atrophy in typical aging and alzheimer’s disease. Neurology, 51(4):993–999, 1998. [60] Clifford R Jack, Ronald Carl Petersen, Y Xu, PC O’brien, Glenn E Smith, Robert J Ivnik, Bradley F Boeve, Eric George Tangalos, and E Kokmen. Rates of hippocampal atrophy correlate with change in clinical status in aging and ad. Neurology, 55(4):484–490, 2000. [61] Gareth M James. Generalized linear models with functional predictors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):411–432, 2002. [62] Wenxin Jiang. On the consistency of bayesian variable selection for high dimensional binary regression and classification. Neural computation, 18(11):2762–2776, 2006. [63] Wenxin Jiang et al. Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities. The Annals of Statistics, 35(4):1487–1511, 2007. Iain M Johnstone, Bernard W Silverman, et al. Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649, 2004. [64] [65] Panos Kalnis, Nikos Mamoulis, and Spiridon Bakiras. On discovering moving clusters in spatio-temporal data. In International Symposium on Spatial and Temporal Databases, pages 364–381. Springer, 2005. [66] Yuwon Kim, Jinseog Kim, and Yongdai Kim. Blockwise sparse regression. Statistica Sinica, pages 375–390, 2006. 138 [67] Slava Kisilevich, Florian Mansmann, Mirco Nanni, and Salvatore Rinzivillo. Spatio- temporal clustering. In Data mining and knowledge discovery handbook, pages 855–874. Springer, 2009. [68] Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of statistics, pages 1356–1378, 2000. [69] Minjung Kyung, Jeff Gill, Malay Ghosh, George Casella, et al. Penalized regression, standard [70] [71] errors, and bayesian lassos. Bayesian Analysis, 5(2):369–411, 2010. John W Lau and Peter J Green. 
Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16(3):526–558, 2007. Jonathan C Lau, Jason P Lerch, John G Sled, R Mark Henkelman, Alan C Evans, and Barry J Bedell. Longitudinal neuroanatomical changes determined by deformation-based morphometry in a mouse model of alzheimer’s disease. Neuroimage, 42(1):19–27, 2008. [72] Sang Han Lee, Alvin H Bachman, Donghyeon Yu, Johan Lim, Babak A Ardekani, Alzheimer’s Disease Neuroimaging Initiative, et al. Predicting progression from mild cog- nitive impairment to alzheimer’s disease using longitudinal callosal atrophy. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 2:68–74, 2016. [73] Chenlei Leng, Minh-Ngoc Tran, and David Nott. Bayesian adaptive lasso. Annals of the Institute of Statistical Mathematics, 66(2):221–244, 2014. [74] Kelvin K Leung, Josephine Barnes, Gerard R Ridgway, Jonathan W Bartlett, Matthew J Clarkson, Kate Macdonald, Norbert Schuff, Nick C Fox, Sebastien Ourselin, Alzheimer’s Disease Neuroimaging Initiative, et al. Automated cross-sectional and longitudinal hip- pocampal volume measurement in mild cognitive impairment and alzheimer’s disease. Neu- roimage, 51(4):1345–1359, 2010. Jiahan Li, Zhong Wang, Runze Li, and Rongling Wu. Bayesian group lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies. The annals of applied statistics, 9(2):640, 2015. [75] [76] Qing Li, Nan Lin, et al. The bayesian elastic net. Bayesian analysis, 5(1):151–170, 2010. [77] Yingjie Li. High Dimensional Classification for Spatially Dependent Data with Application to Neuroimaging. Michigan State University, 2018. [78] Feng Liang, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger. Mixtures of g priors for bayesian variable selection. Journal of the American Statistical Association, 103(481):410–423, 2008. [79] Anastasia Lykou and Ioannis Ntzoufras. On bayesian lasso variable selection and the speci- fication of the shrinkage parameter. Statistics and Computing, 23(3):361–390, 2013. 139 [80] Benoît Magnin, Lilia Mesrob, Serge Kinkingnéhun, Mélanie Pélégrini-Issac, Olivier Col- liot, Marie Sarazin, Bruno Dubois, Stéphane Lehéricy, and Habib Benali. Support vector machine-based classification of alzheimer’s disease from whole-brain anatomical mri. Neu- roradiology, 51(2):73–83, 2009. [81] Atreyee Majumder. Variable Selection in High-Dimensional Setup: A Detailed Illustration Through Marketing and MRI Data. Michigan State University, 2017. [82] Gertraud Malsiner-Walli and Helga Wagner. Comparing spike and slab priors for bayesian variable selection. arXiv preprint arXiv:1812.07259, 2018. [83] Peter McCullagh and Jie Yang. Stochastic classification models. In International Congress of Mathematicians, volume 3, page 72, 2006. [84] CR McDonald, LK McEvoy, L Gharapetian, C Fennema-Notestine, DJ Hagler, D Holland, A Koyama, JB Brewer, AM Dale, Alzheimer’s Disease Neuroimaging Initiative, et al. Re- gional rates of neocortical atrophy from normal aging to early alzheimer disease. Neurology, 73(6):457–465, 2009. [85] Mathew W. McLean, Fabian Scheipl, Giles Hooker, Sonja Greven, and David Ruppert. Bayesian functional generalized additive models with sparsely observed covariates. arXiv preprint arXiv:1305.3585, 2013. [86] Mario Medvedovic, Ka Yee Yeung, and Roger Eugene Bumgarner. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics, 20(8):1222–1232, 2004. 
[87] Lukas Meier, Sara Van De Geer, and Peter Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008. [88] Volodymyr Melnykov, Ranjan Maitra, et al. Finite mixture models and model-based cluster- ing. Statistics Surveys, 4:80–116, 2010. [89] Alan Miller. Subset selection in regression. Chapman and Hall/CRC, 2002. [90] Chandan Misra, Yong Fan, and Christos Davatzikos. Baseline and longitudinal patterns of brain atrophy in mci patients, and their use in prediction of short-term conversion to ad: results from adni. Neuroimage, 44(4):1415–1422, 2009. [91] Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988. [92] Christian Montag and Martin Reuter. Internet addiction: Neuroscientific approaches and therapeutical interventions. 2015. [93] HANS-GEORG MÜLLER. Functional modelling and classification of longitudinal data. Scandinavian Journal of Statistics, 32(2):223–240, 2005. [94] Naveen Naidu Narisetty, Xuming He, et al. Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817, 2014. 140 [95] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8(May):1145–1164, 2007. [96] Trevor Park and George Casella. The bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. [97] Nicholas G Polson, James G Scott, and Jesse Windle. Bayesian inference for logistic models using pólya–gamma latent variables. Journal of the American statistical Association, 108(504):1339–1349, 2013. [98] Nicholas G Polson, James G Scott, and Jesse Windle. The bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):713–733, 2014. [99] Chengxuan Qiu, Miia Kivipelto, and Eva von Strauss. Epidemiology of alzheimer’s dis- ease: occurrence, determinants, and strategies toward intervention. Dialogues in clinical neuroscience, 11(2):111, 2009. [100] Olivier Querbes, Florent Aubry, Jérémie Pariente, Jean-Albert Lotterie, Jean-François Dé- monet, Véronique Duret, Michèle Puel, Isabelle Berry, Jean-Claude Fort, Pierre Celsis, et al. Early diagnosis of alzheimer’s disease using cortical thickness: impact of cognitive reserve. Brain, 132(8):2036–2047, 2009. [101] Adrian E Raftery and Nema Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006. [102] Naftali Raz and Kristen M Kennedy. A systems approach to the aging brain: Neuroanatomic changes, their modifiers, and cognitive correlates. In Imaging the aging brain, pages 43–70. Oxford University Press, 2009. [103] Naftali Raz, Ulman Lindenberger, Karen M Rodrigue, Kristen M Kennedy, Denise Head, Adrienne Williamson, Cheryl Dahle, Denis Gerstorf, and James D Acker. Regional brain changes in aging healthy adults: general trends, individual differences and modifiers. Cere- bral cortex, 15(11):1676–1689, 2005. [104] Naftali Raz and Karen M Rodrigue. Differential aging of the brain: patterns, cognitive correlates and modifiers. Neuroscience & Biobehavioral Reviews, 30(6):730–748, 2006. [105] H Rusinek, Y Endo, S De Santi, D Frid, W-H Tsui, S Segal, A Convit, and MJ de Leon. Atrophy rate in medial temporal lobe during progression of alzheimer disease. Neurology, 63(12):2354–2359, 2004. 
[106] Fabian Scheipl, Ludwig Fahrmeir, and Thomas Kneib. Spike-and-slab priors for function selection in structured additive regression models. Journal of the American Statistical Association, 107(500):1518–1532, 2012. [107] Fabian Scheipl, Thomas Kneib, and Ludwig Fahrmeir. Penalized likelihood and bayesian function selection in regression models. AStA Advances in Statistical Analysis, 97(4):349– 385, 2013. 141 [108] Flávio Luiz Seixas, Bianca Zadrozny, and Débora Christina Muchaluat Saade. A bayesian network decision model for supporting the di- agnosis of dementia, alzheimer’s disease and mild cognitive impairment. Computers in biology and medicine, 51:140–158, 2014. Jerson Laks, Aura Conci, [109] Philip Shaw, Noor J Kabani, Jason P Lerch, Kristen Eckstrand, Rhoshel Lenroot, Nitin Gog- tay, Deanna Greenstein, Liv Clasen, Alan Evans, Judith L Rapoport, et al. Neurodevelopmen- tal trajectories of the human cerebral cortex. Journal of Neuroscience, 28(14):3586–3594, 2008. [110] Li Shen, Yuan Qi, Sungeun Kim, Kwangsik Nho, Jing Wan, Shannon L Risacher, Andrew J Saykin, et al. Sparse bayesian learning for identifying imaging biomarkers in ad prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 611–618. Springer, 2010. [111] Guiling Shi. Bayesian Variable Selection: Extensions of Nonlocal Priors. Michigan State University. Statistics, 2017. [112] Russell T Shinohara, Ciprian M Crainiceanu, Brian S Caffo, and Daniel S Reich. Longi- tudinal analysis of spatiotemporal processes: a case study of dynamic contrast-enhanced magnetic resonance imaging in multiple sclerosis. 2011. [113] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013. [114] Martha Skup, Hongtu Zhu, and Heping Zhang. Multiscale adaptive marginal analysis of longitudinal neuroimaging data with time-varying covariates. Biometrics, 68(4):1083–1092, 2012. [115] Michael Smith and Robert Kohn. Nonparametric regression using bayesian variable selec- tion. Journal of Econometrics, 75(2):317–343, 1996. [116] Richard P. Stanley. Enumerative Combinatorics: Volume 1. 1986. [117] Adam J Suarez, Subhashis Ghosal, et al. Bayesian estimation of principal components for functional data. Bayesian Analysis, 12(2):311–333, 2017. [118] Tero Tapiola, Corina Pennanen, Mia Tapiola, Susanna Tervo, Miia Kivipelto, Tuomo Hän- ninen, Maija Pihlajamäki, Mikko P Laakso, Merja Hallikainen, Anne Hämäläinen, et al. Mri of hippocampus and entorhinal cortex in mild cognitive impairment: a follow-up study. Neurobiology of aging, 29(1):31–38, 2008. [119] Paul M Thompson, Kiralee M Hayashi, Greig I De Zubicaray, Andrew L Janke, Stephen E Rose, James Semple, Michael S Hong, David H Herman, David Gravano, David M Doddrell, et al. Mapping hippocampal and ventricular change in alzheimer disease. Neuroimage, 22(4):1754–1766, 2004. [120] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996. 142 [121] Angelika Van Der Linde. A bayesian latent variable approach to functional principal compo- nents analysis with binary and count data. AStA Advances in Statistical Analysis, 93(3):307– 333, 2009. [122] Geert Verbeke and Geert Molenberghs. A model for longitudinal data. Linear mixed models for longitudinal data, pages 19–29, 2000. [123] Xuejing Wang, Bin Nan, Ji Zhu, and Robert Koeppe. 
Regularized 3d functional regression for brain image data via haar wavelets. The annals of applied statistics, 8(2):1045, 2014. [124] Jennifer L Whitwell, Scott A Przybelski, Stephen D Weigand, David S Knopman, Bradley F Boeve, Ronald C Petersen, and Clifford R Jack Jr. 3d maps from multiple mri illustrate changing atrophy patterns as subjects progress from mild cognitive impairment to alzheimer’s disease. Brain, 130(7):1777–1786, 2007. [125] Jennifer L Whitwell, Maria M Shiung, SA Przybelski, Stephen D Weigand, David S Knop- man, Bradley F Boeve, Ronald C Petersen, and CR Jack. Mri patterns of atrophy associated with progression to ad in amnestic mild cognitive impairment. Neurology, 70(7):512–520, 2008. [126] Jesse Windle, Nicholas G Polson, and James G Scott. Sampling polya-gamma random variates: alternate and approximate techniques. arXiv preprint arXiv:1405.0506, 2014. [127] Mark William Woolrich, Mark Jenkinson, J Michael Brady, and Stephen M Smith. Fully bayesian spatio-temporal modeling of fmri data. IEEE transactions on medical imaging, 23(2):213–231, 2004. [128] Xiaofan Xu, Malay Ghosh, et al. Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936, 2015. [129] Lan Xue. Consistent variable selection in additive models. 2009. [130] Wenlu Yang, Xinyun Chen, Hong Xie, and Xudong Huang. Ica-based automatic classification In Life System Modeling and Intelligent of magnetic resonance images from adni data. Computing, pages 340–347. Springer, 2010. [131] Ka Yee Yeung, Chris Fraley, Alejandro Murua, Adrian E. Raftery, and Walter L. Ruzzo. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10):977–987, 2001. [132] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006. [133] Daoqiang Zhang, Dinggang Shen, Alzheimer’s Disease Neuroimaging Initiative, et al. Pre- dicting future clinical changes of mci patients using longitudinal and multimodal biomarkers. PloS one, 7(3):e33182, 2012. 143 [134] Lin Zhang, Veerabhadran Baladandayuthapani, Bani K Mallick, Ganiraju C Manyam, Pa- tricia A Thompson, Melissa L Bondy, and Kim-Anh Do. Bayesian hierarchical structured variable selection methods with application to molecular inversion probe studies in breast cancer. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(4):595–620, 2014. [135] Xiaoxi Zhang, Tim Johnson, Rod Little, and Yue Cao. Longitudinal image analysis of tumor/brain change in contrast uptake induced by radiation. 2009. [136] Hongxiao Zhu, Marina Vannucci, and Dennis D Cox. A bayesian hierarchical model for classification with selection of functional predictors. Biometrics, 66(2):463–473, 2010. [137] Vadim Zipunnikov, Sonja Greven, Haochang Shou, Brian S. Caffo, Daniel S. Reich, and Ciprian M. Crainiceanu. Longitudinal high-dimensional principal components analysis with application to diffusion tensor imaging of multiple sclerosis. Ann. Appl. Stat., 8(4):2175– 2202, 12 2014. [138] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. 144