VARIABLE SELECTION IN HIGH-DIMENSIONAL SETUP: A DETAILED ILLUSTRATION THROUGH MARKETING AND MRI DATA

By Atreyee Majumder

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics - Doctor of Philosophy

2017

ABSTRACT

VARIABLE SELECTION IN HIGH-DIMENSIONAL SETUP: A DETAILED ILLUSTRATION THROUGH MARKETING AND MRI DATA

By Atreyee Majumder

In the era of big data and ever-growing information, variable selection is an integral part of statistical analysis. With the advancement of technology, we are able to store and access large volumes of data, only part of which is required for inference. Variable selection is a statistical technique that helps us retain valuable information while discarding everything that is non-significant. To understand variable selection, we perform a comparative study of several popular frequentist variable selection techniques. This study analyses the difference in performance of models based on the ridge, LASSO and elastic net methods of penalized regression. The comparison of these methods is done for both continuous and binary outcomes. We further emphasize the importance of tuning parameter selection in penalized regression models by comparing six different methods of tuning parameter selection for each penalized approach. The best performing method is then chosen to build statistical models for market research data from four varied countries. This exercise is an application of variable selection. Here, we showcase the applicability of such models in handling large amounts of information efficiently for managerial decisions, and we show how managers can leverage this technique for better resource allocation in their business decisions. Next, we build a model for variable selection in a Bayesian setup, motivated by the fact that the frequentist approaches yield unstable inference. Here, we analyze Alzheimer's Disease Neuroimaging Initiative (ADNI) data with a Bayesian model. This is done by building a Bayesian hierarchical model with multivariate Laplace priors in a spike and slab style. This model is able to select a group of related variables. The frequentist counterpart of this estimator, the group lasso, is also discussed. We build a classification model that is able to select the significant brain regions in Alzheimer's disease with 80% accuracy. Instead of using standard MAP thresholding, we use posterior median thresholding for variable selection. Furthermore, the consistency of this estimator is also proved. Lastly, we build a Bayesian structured model for variable selection based on magnetic resonance imaging (MRI) data. This model is an extension of the second method but takes into account bi-level selection and spatio-temporal correlation. Voxels in brain regions exhibit spatial correlation, and repeated measurements for each voxel bring in temporal correlation. This model is applied to simulated functional MRI (fMRI) type data and to real data that records blood oxygenation level dependent (BOLD) activation. The data are large on account of the numerous voxels present in the brain. Our method successfully detects the activated brain regions in the presence of a stimulus. Thus, this thesis delves into various scenarios of variable selection with three different real data application studies. The focus is mainly on Bayesian variable selection and the use of hierarchical modeling with iterative sampling from the posterior distribution in the group lasso setup.
Our application of the group lasso structure to identify brain regions and voxels is an innovative approach in the context of the present literature. All of these methods have practical implications that can be used to solve relevant real-world problems.

To my beloved parents Subir Kumar Majumder and Kalyani Majumder for their unconditional love and support.

ACKNOWLEDGMENTS

I am extremely obliged to be a part of the Statistics and Probability Department at Michigan State University. In my tenure of five years, I have worked with some amazing professors who have been a building block in my academic advancement. I would like to thank my Ph.D. advisor, Dr. Tapabrata Maiti, for his immense support, encouragement and guidance in my research. I am grateful to him for believing in my potential and giving me the opportunity of working with him. I have been extremely fortunate to work with Dr. Alla Sikorskii for 4 years in her research projects with the College of Nursing. I would like to thank her for providing me this opportunity of learning and growth, and also for being a part of my research committee. My sincere gratitude goes to Dr. Roger Calantone for providing me with the market research dataset and collaborating on one of my projects. I would like to thank my research committee member, Dr. Gustavo de los Campos, for agreeing to be a part of my committee and sharing knowledge on Bayesian techniques. I also had the opportunity of working with Dr. David Zhu, Department of Radiology, MSU, who patiently described details of fMRI data and helped me understand and work with brain data. I am thankful to the Department of Statistics and Probability and the College of Nursing at Michigan State University for supporting me financially through my Ph.D. career. My sincere thanks to all professors in the department for helping me glide through these years of research. Special thanks to Dr. Ramamoorthy and Dr. Levental for preparing us for the PhD qualifier courses. I have worked closely with many of my peers and would like to thank them for engaging in useful conversations that led to the development of some useful ideas in my research. I would like to thank my family - my parents, Mr. Subir Kumar Majumder and Mrs. Kalyani Majumder, for their love and support through the tough times. They are my pillars of strength. Last but not the least, I feel immense gratitude towards my husband, Mr. Subha Tirtha Datta, for helping me pursue my dreams. This journey is rewarding because he is a part of it.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
  1.1 Variable Selection
    1.1.1 High-Dimensional Data
    1.1.2 Penalized Regression Approaches
    1.1.3 Bayesian Penalized Regression
    1.1.4 Frequentist Group Lasso
    1.1.5 Bayesian Group Lasso
    1.1.6 Bayesian Group Lasso with Logistic Regression
  1.2 Applications of Variable Selection
    1.2.1 Market Research Data: A Comparative Study
    1.2.2 Alzheimer's Disease Neuroimaging Initiative (ADNI) Data: Bayesian Group Lasso in Logistic Regression
    1.2.3 Functional Magnetic Resonance Imaging (fMRI) Data: Bayesian Spatiotemporal Inference
Chapter 2 Better Tools for Strategic Global Decision Makers Gaining Consumer Insights
  2.1 Introduction
    2.1.1 Motivation
  2.2 The penalized regression models
    2.2.1 Tuning parameter selection
  2.3 Simulation
  2.4 An Empirical Variable Selection Exercise
    2.4.1 Characterizing Lifestyle Segments
  2.5 Discussion
    2.5.1 Theoretical Implications
    2.5.2 Managerial Implications
Chapter 3 A Bayesian Group Lasso Classification For ADNI Volumetrics Data
  3.1 Introduction
  3.2 Group Lasso
    3.2.1 Bayesian Group Lasso
  3.3 Bayesian Group Lasso with Logistic Regression
  3.4 Posterior Consistency
  3.5 Simulation
  3.6 Classification of Alzheimer's Disease using ADNI MRI data
  3.7 Discussion
Chapter 4 Bayesian Spatiotemporal Model for Detecting Voxel-level Activation in fMRI Data
  4.1 Introduction
  4.2 Bayesian Bi-level Variable Selection
  4.3 Bayesian Spatiotemporal Model with Bi-level Selection
    4.3.1 Model Formulation
    4.3.2 Bi-level Spatiotemporal Model for fMRI Data
  4.4 Computational Algorithm
    4.4.1 Creating the Adjacency Matrix
    4.4.2 Matrix Tricks
  4.5 Simulation
    4.5.1 Generating Simulated fMRI Data
    4.5.2 Analysis Result of Simulation Study
  4.6 Single-Subject fMRI Data Analysis
    4.6.1 Data Acquisition
    4.6.2 fMRI Data Pre-processing and Analysis
  4.7 Discussion
Chapter 5 Future Work
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Penalized Regression Methods To Be Discussed
Table 2.2: Mean squared prediction error and model size when p=50 (without multicollinearity)
Table 2.3: Mean squared prediction error and model size when p=700 (without multicollinearity)
Table 2.4: Mean squared prediction error and model size when p=50 (with multicollinearity)
Table 2.5: Mean squared prediction error and model size when p=700 (with multicollinearity)
Table 2.6: Negative log-likelihood and model size when p=50 (without multicollinearity) for binary response
Table 2.7: Negative log-likelihood and model size when p=700 (without multicollinearity) for binary response
Table 2.8: Negative log-likelihood and model size when p=50 (with multicollinearity) for binary response
Table 2.9: Negative log-likelihood and model size when p=700 (with multicollinearity) for binary response
Table 2.10: Results of Variable Selection - USA
Table 2.11: Results of Variable Selection - Canada
Table 2.12: Results of Variable Selection - China
Table 2.13: Results of Variable Selection - Brazil
Table 3.1: Mean (Standard Error) True/False Positive Rate and Negative Log-Likelihood in 28 Simulations
Table 3.2: Demographics of patients in ADNI data used for analyses
Table 3.3: Mean (Standard Error) of Parameter Estimates of Selected ROIs
Table 4.1: Median accuracy (minimum and maximum) of correct classification and false positives over 10 simulated datasets

LIST OF FIGURES

Figure 1.1: Geometry of the ridge, LASSO and elastic net (α = 0.5)
Figure 1.2: Shrinkage in four different priors
Figure 2.1: Distributions and descriptions of the six psychographic segments
Figure 3.1: Some regions of interest selected by the Bayesian classification model
Figure 3.2: ROC curve for our classifier
Figure 4.1: Task stimuli based on three convolution functions
Figure 4.2: Illustration of neighborhood of a cube
Figure 4.3: Stimuli convolved with double-gamma density (red) and the corresponding boxcar function (green)
Figure 4.4: True and estimated binary map
Figure 4.5: Scene minus object contrast activation
Figure 4.6: Scene and object activation

Chapter 1 Introduction

1.1 Variable Selection

Variable selection is an extremely relevant area of statistical modeling, especially in modern times. The development of high-performance computing machines has paved the way for the storage of large volumes of data. Today, most fields of research are able to store data on a variety of aspects. For example, a financial firm or a medical researcher may include information about subjects that may not appear relevant at first glance but might prove to be important when explored with statistical analysis. Researchers do not want to miss out on useful information, so a lot of data is stored only to be used selectively at a later stage. Although the availability of large volumes of data is welcome, statisticians have to be careful to use a subset of it for valid statistical inference. In the real world, most data that is captured contains very little information for a specific outcome. In genetic data, for example, information on thousands of SNPs is available, but only a few are able to completely characterize a disease. When building a statistical model, we want to develop a relationship that is able to explain an association between an outcome and some features with a very small error rate. In doing so, we do not wish to sacrifice the optimality of the model. One might think that inclusion of more information will reduce the error rate, but we also need to consider the complexity of the model. Thus, building a model and selecting an optimal model have become interchangeable terms these days.

Variable selection is not a new concept in statistical model building, so many procedures have existed for a long time. An ordinary regression equation is as follows:

$$y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad i = 1, \ldots, n \qquad (1.1)$$

Here, there are p independent variables which are used to build a model for predicting y. Note that p < n and that all p covariates may not be statistically significant for predicting y. We use stepwise selection procedures for selecting the significant covariates. In these procedures, we start with all variables (backward selection) or no variable at all (forward selection) and eliminate or add covariates based on their corresponding p-values. This is continued until no more covariates are eliminated from or added to the model. From the description, it is evident that these methods are time-consuming, as we need to validate the significance of variables at each step until the procedure converges. The step-by-step procedure may miss the optimal model, and using p-values to add or drop variables may not be appropriate in some cases, such as when multicollinearity is present. Subset selection is another approach to variable selection where various candidate models are validated for optimality using a choice of metric. The problem here is that, with the increase in the number of covariates, the number of candidate models increases exponentially. So, computationally, this becomes a complicated problem. In the past few years, newer approaches use penalization of the likelihood (objective) function to obtain a sparse solution. The penalty is put on the parameter space so that the parameter estimates are encouraged to move to zero if they are not statistically significant. These methods are better understood in a high-dimensional setup where p > n.
In such a case, it is clear that ordinary regression breaks down due to issues of invertibility of the low-rank design matrix. Penalization approaches deal with this issue as well as perform variable selection, making these methods the most sought after for variable selection in this era.

1.1.1 High-Dimensional Data

A high-dimensional feature space arises when we have more feature variables than observations. This scenario is commonplace in various fields of study such as the social sciences, marketing research, genetic sequencing, machine learning, brain image analysis, etc. For instance, in brain image data, we have measurements on tens of thousands of voxels and only a few responses. In genomics, hundreds of thousands of SNPs are potential covariates for a particular phenotype. In market research or social science data, we have more feature variables, obtained from questionnaires, than individuals who can be interviewed. All of these cases show that high-dimensional data occur frequently. The problems that arise with the high-dimensional setup should be tackled appropriately, since its occurrence is unavoidable in many practical domains of research.

In the scenario of p > n, we can no longer pursue ordinary regression due to the non-invertibility of $X^TX$. In such a case, it is helpful to assume that our regression function lies in a low-dimensional space (Fan and Lv, 2010). This can be applied by introducing the concept of sparsity. If we assume that many covariates have zero coefficients, then we can reduce our dimension of interest to a much smaller space. Assuming sparsity means we believe that many of the covariates do not have a significant impact on the outcome. This assumption is practical in almost all the research areas that are affected by the curse of dimensionality. In genetic studies, there is information on numerous genes, but only a few are actually associated with the occurrence of a disease. The assumption of sparsity is thus valid for simpler statistical analysis and also helps researchers, who are able to make more scientific sense of a subset of associated genes. In brain magnetic resonance imaging, the scanner collects information on the entire brain (containing thousands of voxels), but only a few of these voxels are activated in the presence of a stimulus. Thus, including information from voxels that are not activated will mislead biologists. Therefore, the assumption of sparsity lays the ground for better model building as well as improved interpretation of the analysis.

Variable selection in the high-dimensional setup is an important area of research, and it is highly sensitive because it has real-world applications. A proper statistical model based on appropriate information, and consequently its precise implications, can answer a lot of questions in the medical, genetic, financial, economic, machine learning and social sciences domains. This research problem contributes broadly to the betterment of everyday issues at large.

1.1.2 Penalized Regression Approaches

Consider the vector of responses Y and the matrix of covariates X. Also, let ω be the vector of parameters and ε be the random error associated with the model. Assume $\epsilon \sim N(0, \sigma^2 I)$. Let n be the sample size and p be the number of parameters in the model; then the classical regression model is:

$$Y = X\omega + \epsilon$$

The ordinary least squares approach to estimating the parameters gives $\hat{\omega}_{ols} = (X^T X)^{-1} X^T Y$.
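As a quick numerical check of this breakdown (a minimal sketch on simulated Gaussian design matrices; the dimensions are arbitrary illustrative choices, not data from this thesis), the Gram matrix $X^TX$ loses full rank as soon as p exceeds n, so the inverse appearing in $\hat{\omega}_{ols}$ does not exist:

```python
import numpy as np

rng = np.random.default_rng(0)

for n, p in [(100, 50), (100, 700)]:            # n observations, p covariates
    X = rng.standard_normal((n, p))
    gram = X.T @ X                              # p x p Gram matrix X'X
    rank = np.linalg.matrix_rank(gram)
    print(f"n={n}, p={p}: rank(X'X) = {rank} out of {p}, invertible: {rank == p}")
# The rank of X'X is at most min(n, p); once p > n the matrix is singular
# and the OLS estimator cannot be computed.
```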
We know that $\hat{\omega}_{ols}$ is non-estimable when p > n or in the presence of collinearity. To overcome this issue we introduce the following penalized regression methods, which minimize the sum of squared errors subject to some constraints. We assume that the underlying feature (design) matrix is sparse with only a few predictive features. To elaborate this idea we look at the following three models that place constraints on the parameters to induce sparsity.

Ridge regression (Hoerl and Kennard, 1970) shrinks the estimates of the coefficients towards zero. In this approach we minimize

$$\sum_{i=1}^{n} (Y_i - X_i\omega)^2 + \lambda \sum_{j=1}^{p} \omega_j^2$$

where λ is the tuning parameter, which controls the trade-off between bias and variance of the estimates. The LASSO (Tibshirani, 1996) puts an $L_1$ penalty on the parameters. In this approach we minimize

$$\sum_{i=1}^{n} (Y_i - X_i\omega)^2 + \lambda \sum_{j=1}^{p} |\omega_j|$$

where λ is the tuning parameter. This penalized approach forces many ω to take the value 0 and works consistently when p > n. The LASSO is an attractive tool due to simultaneous estimation and variable selection. But it does not perform satisfactorily when multicollinearity occurs. To improve on these methods, Zou and Hastie (2005) proposed the elastic net, which includes features of both the ridge and the LASSO estimators. Here, we put a convex combination of the ridge and LASSO penalties on the parameters. Thus, the elastic net minimizes

$$\sum_{i=1}^{n} (Y_i - X_i\omega)^2 + \lambda P_\alpha(\omega), \qquad P_\alpha(\omega) = \sum_{j=1}^{p} \left[ \tfrac{1}{2}(1-\alpha)\omega_j^2 + \alpha |\omega_j| \right]$$

where λ is the tuning parameter. (Zou and Hastie (2005) called this penalty the naive elastic net penalty and a rescaled version the elastic net penalty, but we drop this distinction here.) Figure 1.1 (Source: http://scikit-learn.sourceforge.net/0.7/modules/sgd.html) shows that the edges of the LASSO and elastic net constraint regions create more opportunities for some estimates to be zero. This feature guarantees sparsity of the estimates. The convex edges of the ridge and elastic net regions encourage a grouping effect. Thus, the ridge penalty is good when there is multicollinearity; it does not generate a sparse model, although it does shrink the estimates towards 0. The LASSO, on the other hand, generates a sparse model but fails to do grouped selection. It generally selects one variable from a group and drops the others. Since the elastic net is a convex combination of the ridge and LASSO penalties, it reflects properties of both: the vertices guarantee sparsity and the convexity encourages a grouping effect.

Figure 1.1: Geometry of the ridge, LASSO and elastic net (α = 0.5)

Other penalized approaches include the SCAD, the adaptive LASSO, the group lasso, etc. All of these methods use some penalty parameter, also known as the tuning parameter, to induce sparsity in the model. Regularization in the group lasso is introduced on a group of related variables (dummy variables or basis splines).

1.1.3 Bayesian Penalized Regression

Both frequentist and Bayesian approaches have been explored extensively for variable selection. Although frequentist variable selection approaches are popular and useful, they provide unstable standard errors of the estimates. This makes it imperative for us to look for alternatives that are more statistically reliable. Bayesian approaches for variable selection are based on the assumption that there is an a priori probability associated with each subset model, and we select the model that has the highest posterior probability. The key notion in variable selection is shrinkage of parameter estimates towards zero. Bayesian methods offer natural shrinkage with a proper choice of prior.
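Before turning to how Bayesian priors induce this shrinkage, a minimal frequentist sketch (using scikit-learn on simulated data; the data, penalty levels, and mixing weight are illustrative assumptions, and scikit-learn's `alpha` and `l1_ratio` arguments play the roles of λ and α above) illustrates how the three penalties differ in the sparsity of the estimates they produce:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
n, p = 100, 200                                   # high-dimensional: p > n
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]       # sparse truth: 5 active covariates
y = X @ beta_true + rng.standard_normal(n)

models = {
    "ridge": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # convex mix of L1 and L2
}
for name, model in models.items():
    coef = model.fit(X, y).coef_
    print(f"{name:>12}: {np.sum(np.abs(coef) > 1e-8)} nonzero coefficients out of {p}")
# Ridge keeps all p coefficients nonzero (no selection), while the LASSO and
# the elastic net set many coefficients exactly to zero.
```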
Shrinkage in the Bayesian literature refers to the estimates being shrunk towards the prior belief. A careful choice of prior on the regression parameters will encourage sparsity in the model. Tibshirani (1996) noted that the posterior mode under Laplace priors on the parameters will give estimates identical to those of the lasso. This has motivated several authors (e.g., Figueiredo, 2003; Bae and Mallick, 2004; Yuan and Lin, 2005) to use i.i.d. Laplace priors on the coefficients, the β's, to develop Bayesian lasso-like estimates. A normal prior is equivalent to Bayesian ridge regression, where the estimates are obtained by maximizing the posterior. A spike and slab prior puts some weight on a flat density and the remaining weight on a spiked density, thus making it a weighted mixture of two densities. The spike part encourages shrinkage and is especially helpful when the number of predictors is larger than the number of observations. A mixture of a point mass density and a normal density is a special case of spike and slab; we will call this version the spike and slab prior throughout this thesis. Figure 1.2 (de los Campos et al., 2013) shows how the prior information in a Bayesian analysis naturally encourages the posterior estimates to shrink towards the prior belief.

Figure 1.2: Shrinkage in four different priors

Park and Casella (2008) developed a fully Bayesian lasso by using conditional Laplace priors on β with a non-informative prior on the variance parameter to ensure unimodality. A normal density whose variance is mixed over an exponential density results in a Laplace marginal prior. This property is used to formulate a hierarchical model amenable to Gibbs sampling, so we get full conditionals of all the parameters involved. The solution path of this Bayesian posterior estimate is similar to that of the lasso estimate. In most high-dimensional setups, rarely does a single model occur with very high posterior frequency, so instead of looking at the maximum posterior probability, it is often feasible to use posterior means as the estimates of β. Posterior means do not directly give zero estimates, so we use posterior median thresholding, since the posterior median is a natural thresholding estimator.

1.1.4 Frequentist Group Lasso

Variable selection is a technique for selecting an optimal model in predictive modeling. In many regression problems, we are interested in selecting feature variables that are important in predicting the response variable. The feature variables can be individual numeric variables, various levels of a categorical variable, or a number of basis functions of the original measured variables. Recently proposed methods like the LASSO, SCAD, etc. can efficiently perform variable selection by selecting individual feature variables. In the case of an ANOVA-type model, where there are multiple levels of a feature variable, or for an additive model, where each component is a linear combination of a number of basis functions, selecting the important variable amounts to selecting all levels of the variable. A very simple linear regression equation is of the form:

$$Y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + \epsilon_{n\times 1} \qquad (1.2)$$

Here, X is the design matrix whose columns are the feature variables, β is the vector of coefficients, ε is the error vector where each $\epsilon_i$ has a normal distribution with mean 0 and variance σ², and Y is the vector of observations. Each feature variable in equation (1.2) can be either categorical or continuous. ANOVA is a special case where all the input variables are categorical, whereas an additive model is a special case where all the input variables are continuous.
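For instance, a single categorical feature with several levels enters the design matrix of equation (1.2) as a block of indicator columns, and selecting that feature means selecting the whole block. A minimal sketch (the variable names and toy data below are made up purely for illustration):

```python
import pandas as pd

# A toy dataset with one categorical and one continuous feature.
df = pd.DataFrame({
    "region": ["north", "south", "east", "north", "east"],   # categorical, 3 levels
    "income": [52.0, 61.5, 48.2, 55.3, 59.9],                 # continuous
})

# One-hot encode the categorical feature: its levels form one group of columns.
X = pd.get_dummies(df, columns=["region"])
print(X.columns.tolist())
# Expected: ['income', 'region_east', 'region_north', 'region_south']

# Group structure for a group lasso: the three 'region_*' columns form one
# factor and should be selected or dropped together; 'income' is its own group.
groups = {"income": ["income"],
          "region": [c for c in X.columns if c.startswith("region_")]}
print(groups)
```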
However, the input variables could be a mixture of both numeric and categorical variables in a regression problem given by equation (1.2). When we want to work with factor variables with G factors (groups), we can modify our notation in equation (1.2) as follows:

$$Y_{n\times 1} = \sum_{g=1}^{G} X_g \beta_g + \epsilon \qquad (1.3)$$

where $\epsilon_{n\times 1} \sim N_n(0, \sigma^2 I_n)$, $\beta_g$ is a coefficient vector of length $m_g$, and $X_g$ is an $n \times m_g$ covariate (feature) matrix corresponding to the factor $\beta_g$, $g = 1, 2, \ldots, G$. Let p be the total number of predictors, so $p = \sum_{g=1}^{G} m_g$. To eliminate the intercept from equation (1.3), we center the response variable and each input variable so that the observed mean is 0.

The goal is to select important feature variables for accurate prediction. This amounts to selecting as well as estimating the parameter coefficients. Tibshirani (1996) proposed the LASSO method, which is an attractive tool due to simultaneous estimation and variable selection. When the need arises to select a group of levels of a categorical variable or a group of basis functions representing a numeric variable, these methods fail because they are designed to select individual feature variables and cannot select whole factors. Yuan and Lin (2006) proposed the group lasso as an alternative to the LASSO for factor selection; it also exhibits superior model selection performance. The group lasso penalty is a hybrid of the $l_1$ and $l_2$ penalties and encourages selection at the group level. The group lasso estimate, for linear regression, minimizes

$$\left\|Y - \sum_{g=1}^{G} X_g\beta_g\right\|_2^2 + \lambda \sum_{g=1}^{G} \|\beta_g\|_2 \qquad (1.4)$$

where λ is the tuning parameter. Note that, in (1.4), when all groups have size 1, i.e. $m_1 = m_2 = \cdots = m_G = 1$, we recover the lasso.

1.1.5 Bayesian Group Lasso

The limiting distribution of the group lasso estimator is complicated (Knight and Fu, 2000; Chatterjee and Lahiri, 2011). Thus, this estimator fails to give meaningful standard errors of the estimates, which affects the statistical significance of the covariates in the chosen model. To deal with this drawback of frequentist lasso-type estimators, Bayesian formulations have been developed. The Bayesian MAP estimators provide reliable standard errors for the estimates. It is known that the lasso estimator for linear regression is equivalent to the posterior mode with independent Laplace priors on each regression coefficient. Park and Casella (2008) developed a fully hierarchical Bayesian setup for the lasso using a scale mixture prior on the regression parameters. This mixture prior results in a Laplace marginal distribution for β. This idea has been further extended by Kyung et al. (2010) to build similar fully Bayesian hierarchical models for the group lasso, the fused lasso (Tibshirani et al., 2005) and the elastic net (Zou and Hastie, 2005). They employ a multivariate $m_g$-dimensional Laplacian prior over each group of regression coefficients:

$$\pi(\beta_g) \propto \exp\left(-\frac{\lambda}{\sigma}\|\beta_g\|_2\right) \qquad (1.5)$$

The classical group lasso is recovered as the MAP solution in log-space, with λ/σ playing the role of a fixed Lagrangian multiplier. For a full Bayesian treatment, however, we place hyperpriors on λ and σ, which leads to integrations that are analytically intractable. To find closed-form posterior distributions for all parameters, we extend the hierarchical scale mixture model approach of the lasso to grouped predictors. Thus, we express the prior as a scale mixture of multivariate normals over $\beta_g$ with Gamma hyperpriors over the variance hyperparameter.
Specifically, with

$$\beta_g \mid \tau_g^2, \sigma^2 \overset{\text{ind}}{\sim} N_{m_g}\!\left(0,\; \tau_g^2 \sigma^2 I_{m_g}\right), \qquad \tau_g^2 \overset{\text{ind}}{\sim} \text{Gamma}\!\left(\frac{m_g + 1}{2},\; \frac{\lambda^2}{2}\right) \qquad (1.6)$$

the marginal distribution of $\beta_g$ is of the form (1.5). This Bayesian formulation encourages shrinkage at the group level and provides prediction performance comparable with the group lasso. However, estimation of $\beta_g$ by its posterior mean or median does not produce exact 0 estimates; we need to bring in the concept of sparsity here. Thus, to introduce sparsity at the group level, Xu and Ghosh (2015) assumed a multivariate zero-inflated mixture prior, or spike and slab prior, for each $\beta_g$. For variable selection, we want the estimates to produce exact zeroes so that the corresponding covariates are dropped from the model. Zero-inflated mixture priors are such that the slab part draws values from a known distribution and the spike part is a degenerate distribution at zero. Xu and Ghosh (2015) further showed that median thresholding is better than using the posterior mean. The spike and slab prior keeps the scale mixture of normal and gamma priors intact, thus providing full conditionals. This approach is therefore computationally easy and gives exact zero estimates. Narisetty and He (2014) used shrinking and diffusing priors for variable selection in a hierarchical Bayesian setup. Zero-inflated mixture priors have, in recent years, been extensively utilized in Bayesian variable selection setups. George and McCulloch (1997) used zero-inflated normal mixture priors in a hierarchical formulation for variable selection in a linear regression model. Chen and Dunson (2003) used a spike and slab type prior for the random effects variances in a linear setup, allowing probabilistic selection of random effects.

1.1.6 Bayesian Group Lasso with Logistic Regression

So far we have talked about the group lasso in a linear regression setup, i.e. when the response variable has a Gaussian error. In many practical problems, we come across response values that cannot be fit into a linear model. For example, when the outcome is a binary categorical variable, count data or a multi-level categorical variable, the Gaussian error assumption does not hold. In such cases we have to use generalized linear models (GLM) with various link functions. In many financial, insurance and medical datasets the outcome has two values, making it a binary response variable. Since the occurrence of binary outcomes is very common in the real world, we will focus on the GLM with a logit link. Since the outcome is binary, we cannot model these data with (1.1), which has normal errors. Meier et al. (2008) developed the logistic group lasso in a frequentist setup. Before delving into its Bayesian counterpart, let us summarize the frequentist group lasso in the logistic regression setup.

Assume that we have independent and identically distributed observations $(x_i, y_i)$, $i = 1, \ldots, n$, of a p-dimensional vector $x_i \in \mathbb{R}^p$ of G predictors and a binary response variable $y_i \in \{0, 1\}$. Each group has $m_g$ levels. We can write $x_i = (x_{i1}^T, \ldots, x_{iG}^T)^T$. Linear logistic regression models the conditional probability $p_\beta(x_i) = P_\beta(Y = 1 \mid x_i)$ by

$$\log\!\left\{\frac{p_\beta(x_i)}{1 - p_\beta(x_i)}\right\} = \eta_\beta(x_i),$$

also known as the logit link, with the link function $\eta_\beta(x_i) = \sum_{g=1}^{G} x_{ig}^T \beta_g$. The logistic group lasso estimator, $\hat{\beta}_{GL}$, is given by the minimizer of the convex function

$$S_\lambda(\beta) = -l(\beta) + \lambda \sum_{g=1}^{G} \|\beta_g\|_2$$

where $l(\cdot)$ is the log-likelihood function, i.e.

$$l(\beta) = \sum_{i=1}^{n} \left[ y_i \eta_\beta(x_i) - \log\{1 + \exp(\eta_\beta(x_i))\} \right].$$

The tuning parameter λ ≥ 0 controls the amount of penalization.
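As a concrete illustration of this objective, the sketch below evaluates $S_\lambda(\beta)$ for a candidate coefficient vector (a minimal NumPy sketch; the group structure, toy data, and λ value are made-up illustrations, and the group penalty is written as the unweighted sum of $\|\beta_g\|_2$ exactly as in the display above):

```python
import numpy as np

def logistic_group_lasso_objective(beta, X, y, groups, lam):
    """S_lambda(beta) = -l(beta) + lam * sum_g ||beta_g||_2 for a logit link.

    groups is a list of index arrays, one per group of coefficients.
    """
    eta = X @ beta                                     # eta_beta(x_i) for each i
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log-likelihood
    penalty = sum(np.linalg.norm(beta[g]) for g in groups)
    return -loglik + lam * penalty

# Toy example: n = 50 observations, two groups of sizes 3 and 2.
rng = np.random.default_rng(2)
groups = [np.arange(0, 3), np.arange(3, 5)]
X = rng.standard_normal((50, 5))
beta = np.array([1.0, -0.5, 0.8, 0.0, 0.0])            # second group inactive
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))
print(logistic_group_lasso_objective(beta, X, y, groups, lam=2.0))
```

Because the second group is exactly zero, it contributes nothing to the penalty term, which is precisely the group-level sparsity the estimator exploits.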
Motivated by Xu and Ghosh's (2015) work, here we construct a Bayesian formulation for the logistic regression case. Here, our likelihood is the Bernoulli probability mass function with a logit link. We use a multivariate zero-inflated mixture prior with a point mass at zero and a double exponential distribution as the continuous part. Since a double exponential prior on $\beta_g$ can be expressed as a scale mixture of normal and Gamma priors (as in (1.6)), we use priors very similar to those of the linear setup.

1.2 Applications of Variable Selection

With the main focus on variable selection in mind, we have developed some Bayesian hierarchical models for model selection by employing an efficient Gibbs sampler. There are numerous areas in which our models can be applied; we have chosen brain imaging data to illustrate the usability of our methods. Brain imaging techniques are widely used to capture brain activity, which can be used to detect various diseases related to brain deformity. We use two types of brain image data to build Bayesian hierarchical models. Although brain image data analysis is the primary focus of this thesis, we start our illustrations with a comparative study based on market research data. We use market research questionnaire data to illustrate how three different variable selection methods differ in the frequentist approach. This exercise also emphasizes the need for meticulous selection of the tuning parameter and its effect on model building. We show how penalized approaches can be used to build predictive models that are relevant in a variety of areas.

1.2.1 Market Research Data: A Comparative Study

International market segmentation inevitably stands out as fundamental and crucial in global marketing strategy, matching the increase in international trade. Successful targeting of customers and positioning of firms' products requires insight into customer heterogeneity across borders, namely international segmentation. Moreover, international segmentation provides a strategic perspective on the balance between standardization and adaptation (Verhage, Dahringer, and Cundiff, 1989), where standardization has economic advantages and adaptation enables firms to better satisfy specific customer needs. In general, international market segmentation addresses heterogeneity at three levels: countries, regions, and consumers. The former two are usually defined by geography, while the last one has no aggregation level. Although geographic aggregation for international segmentation is widely accepted, it generally overlooks consumer-level heterogeneity within segments and, more importantly, "it is neither theoretically motivated nor is the managerial relevance of the segmentation variables established" (Nachum, 1994; Steenkamp and Ter Hofstede, 2002). Due to effective logistics capability, mass customization, and the death of distance due to advances in consumer communications, company offerings and consumers' preferences are rarely constrained by geographic aggregation. Hence, disaggregate international consumer segmentation offers a more effective way to examine and consider consumers' needs.
Additionally, we look at the very realistic situation where a significant disparity exists between the sample size and the number of variables, often referred to as the high dimensionality problem. Variable selection aims to obtain the maximum information required with optimal utilization of resources. Variable selection, when the number of predictors exceeds the number of observations, has largely been overlooked in marketing research. In many situations, survey questions exceed the number of people interviewed. This leaves us with the problem of variable selection in a high-dimensional setup. Conventional statistical methods fail in this case. We illustrate three statistical methods that are custom designed for such scenarios. Variable selection reduces the dimensionality of the wide data and selects the variables most significant for segmentation.

In the problem of assessing the relation of covariates with a response variable, we often face challenges such as multicollinearity and high dimensionality. Often, they occur simultaneously. The ordinary least squares method for regression fails to estimate the coefficients of a regression model in these cases due to the non-invertibility of the $X^TX$ matrix, where X is the design matrix. Typical examples of such scenarios arise in social networking sites, product reviews and market portfolio data. Penalized regression methods are gaining popularity for these scenarios; see Hastie et al. (2009) and Fan and Lv (2010) for overviews among a large amount of recent literature. Some of the penalized methods used to overcome these challenges are the ridge (Hoerl and Kennard, 1970), LASSO (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), adaptive LASSO (Zou, 2006), SCAD penalty (Fan and Li, 2001), etc. These methods can be used when the data exhibit multicollinearity or high dimensionality, but they differ in terms of their variable selection efficiency and predictive power. For instance, the ridge penalty shrinks the coefficients toward zero but fails to perform variable selection. The LASSO, although helpful in variable selection, shrinks the coefficients to zero and may provide larger prediction errors. The elastic net, SCAD and adaptive LASSO address a few of these inconsistencies, but none of these procedures has proved to be a generically best procedure. Although these methods are effective in dealing with the non-invertibility issue of OLS, their performance is highly dependent on the selection of the tuning parameter. The tuning parameter puts a penalty on the parameters of the model, affecting model selection. The performance of a model is highly sensitive to the selection of the tuning parameter, and one should proceed cautiously while dealing with penalized procedures. We compare 6 different methods of tuning parameter selection and conclude that the best way of selecting it is based on an information criterion. Both regularization and tuning parameter selection techniques are illustrated through simulation and real data analysis. We look at 4 countries and build separate models for each of them to address the problem of heterogeneity across geographical regions. This comparative study is useful for making managerial decisions, as managers will be able to optimize their resources in terms of advertisement and consumer targeting.

1.2.2 Alzheimer's Disease Neuroimaging Initiative (ADNI) Data: Bayesian Group Lasso in Logistic Regression

Alzheimer's disease (AD) is the sixth leading cause of death in the United States.
It is a form of dementia in which patients suffer loss of memory, fail to identify people or objects, have difficulty with speech and, in later stages, are unable to perform daily life activities like getting up from bed or brushing their teeth. Although it is mainly a disease of old age, affecting people who are 65 or older, early onset of the disease can occur in 40- or 50-year-olds in up to 5 percent of cases. AD is the most common form of dementia, accounting for 60 to 80 percent of all cases. It is a progressive disease where symptoms worsen over time. AD-affected patients live an average of 4 to 20 years after the symptoms become noticeable. Medical scientists are yet to find a cure for Alzheimer's, but it is possible to slow down the worsening of dementia and improve the lifestyles of both the affected people and their caregivers. Extensive studies are being conducted to find a treatment for AD, delay its onset or curb its advancement. More information and facts about Alzheimer's disease can be found at www.alz.org. According to recent studies (Leifer, 2003), early detection of AD is extremely helpful, as it can be treated with novel drugs to delay AD progression.

Numerous methods have been developed for the analysis of ADNI data to identify the brain subregions that are disease related. These methods usually single out a region of interest (ROI) and perform a univariate analysis based on the chosen ROI (Luo and Nichols (2003), Grimmer et al. (2009)). Univariate analyses of ROIs neglect the effects of other significant ROIs. These methods aim at analyzing each hypothesized significant ROI and then performing multiple hypothesis tests, where careful adjustment for multiple comparisons is required. To include all the ROIs in the analysis simultaneously, a regression framework seems plausible, such that the model itself selects the most significant regions. However, due to the high number of candidate ROIs, standard regression analysis is not possible. Fortunately, there is a medical belief that only a few ROIs are informative for characterizing AD. Thus, a dimension reduction technique such as penalized regression can be developed. The ADNI MRI database has volume, area and thickness measurements of various brain regions. Since these measurements are a direct manifestation of brain region atrophies, we should consider them as a single variable with multiple levels. Thus the number of regression parameters (brain subregions with all levels) may exceed the number of patients being studied. A number of competitive penalized regression techniques have been developed in recent years; the LASSO is perhaps the most popular technique among them (Tibshirani, 1997). However, the direct use of the LASSO is not appropriate in the presence of multiple levels of a covariate in feature selection models. We employ, instead, a group lasso technique to build a model, since it places a group penalty on the parameters of a variable (feature), which makes it easy to select the whole set of volumetric measurements for an ROI. We treat different measurements of the same subregion as different levels of a covariate in this regression setup. Thus, it is easy to visualize structured correlation in the matrix of covariates (subregions), establishing the motivation for using a group lasso-like method. The acuteness of Alzheimer's disease makes its early detection imperative, which is why classification of a subject as a healthy individual or an AD patient is of immense importance.
We develop logistic regression in a Bayesian setup for the detection of Alzheimer's disease, which yields more reliable standard error estimates. We have obtained structural magnetic resonance imaging (MRI) data from ADNI. The structural data are the volume, surface area and thickness measurements of various brain regions, such as the entorhinal cortex and putamen, on the left and right hemispheres. Since each region has a group of measurements, we know that there is an inherent correlation among these covariates. We group these variables, and the selection of a region amounts to the selection of all the corresponding measurement attributes. Thus, we can model this using a group lasso setup. We analyze baseline data where the outcome is whether a subject is a healthy control or has AD. A hierarchical Bayesian group lasso model with a binary outcome is a novel method. The Bayesian group lasso formulation is motivated by Xu and Ghosh (2015). We place independent multivariate Laplace priors on the β's and use the mixing property of normal and gamma densities to get full conditionals of all parameters involved. To ensure sufficient sparsity, we have used a spike and slab prior on β where the slab part is the normal mixing density of the multivariate Laplace prior. Instead of using the maximum posterior probability technique, we employ a posterior median thresholding approach to simultaneously select and estimate the parameters. Our model is chosen by the posterior median estimates. Johnstone and Silverman (2004) showed that the posterior median is a thresholding estimator under fairly general conditions, and their results were generalized to the multivariate spike and slab setting by Xu and Ghosh (2015). For the binary outcome, a logistic link is used. We have shown that the posterior median thresholding estimate is consistent. Our method gives 80% accuracy in classifying AD from healthy controls. It has an AUC (area under the ROC curve) of 0.867. The proposed approach has also been validated with simulation studies which compare the frequentist group lasso with our Bayesian version. It is seen that the false positive rate is higher for the frequentist group lasso, indicating the superiority of our method.

1.2.3 Functional Magnetic Resonance Imaging (fMRI) Data: Bayesian Spatiotemporal Inference

fMRI data are extensively used nowadays to detect and understand brain activity. Understanding brain activity helps explain how brain regions are associated with a particular disease or activity. This can reveal important insights, thus helping medical practitioners and scientists to develop cures for diseases or explain nervous aberrations. The fMRI scans reveal various degrees of brain activity when a person is exposed to some stimuli. Brain activity is measured by detecting changes in neural activity associated with blood flow, also known as blood oxygenation level dependent (BOLD) activation. This is measured by contrasting deoxygenated hemoglobin (a paramagnetic substance) with oxygenated hemoglobin (a diamagnetic substance) in brain regions. The brain regions where these measurements are obtained are spatial brain volumes, known as voxels. An fMRI scan is essentially a three-dimensional picture. The scans provide images of brain voxels slice by slice, and each voxel is uniquely identified by a 3-D coordinate identifier in the 3-D space. Brain activation does not occur in all the voxels but in certain regions of the brain, affecting a few voxels only.
Thus, we want to identify the voxels that are activated by some external stimuli. The fMRI scans are collected over a duration of time at certain intervals. Thus, a single scan will contain voxel-level information on a subject collected at T timepoints. There are more than a hundred thousand voxels in the brain, and measurements on these are collected at T timepoints. So, our dataset contains (number of voxels) × T observations for a single subject. This is a huge dataset, and our challenge is to develop a model that is able to detect the truly activated regions. This is, thus, a variable selection problem where detection of the voxels amounts to selecting variables from a large number of candidate voxels. An fMRI scan produces a highly resolved brain imaging dataset. Software like Analysis of Functional NeuroImages (AFNI) can be used to view the datasets as images. It is obvious that there exists spatial correlation among the voxels, and the repeated measurements at T timepoints induce a temporal correlation as well. It is a challenge to handle all of this data at once, so we need to come up with modeling techniques that can handle the data easily. Smith et al. (2003) were the first to use spatial Bayesian variable selection for fMRI data. Smith and Fahrmeir (2007) developed a method that incorporated spatial dependence by the use of an Ising prior. They considered a full-brain approach as well as a slice-by-slice analysis. This approach does not take into account the temporal dependence. Musgrove, Hughes and Eberly (2016) developed a fully spatiotemporal Bayesian model for fMRI data. Instead of dealing with all the voxels, they introduced a divide and rule approach. In their method, they partition the brain into several smaller parts and perform variable selection of activated voxels. This method is useful, but the partitioning of the brain seems arbitrary, and it is not known whether a different partitioning would lead to a different result. In a more recent work, Castruccio et al. (2016) used a multi-layer approach for variable selection after considering spatio-temporal correlations. Their method is useful in handling data of size 22 million. However, they used region of interest information in their model to build the multi-layer approach. Our data consist of voxel-level data only, and we introduce a novel spatio-temporal Bayesian variable selection technique by extending our group lasso idea into a bi-level selection type solution. To handle data of size 28 million, we have used a two-step approach. In the first step, we perform a simple analysis without considering spatial correlation and select the "significant" voxels based on p-values. We then use our Bayesian bi-level selection technique on this reduced dataset.

This work is a collaboration between the Department of Statistics and Probability and the Department of Radiology. We have obtained BOLD signal data on 64×64×36 voxels over 192 timepoints. The data were obtained at the Department of Radiology on a single subject. The subject was shown two stimuli, an object and a scene, and their BOLD signal was measured using a scanner. There is a baseline trend in the measurements, but the data we use here have been rescaled to eliminate the baseline trend. The stimulus onset times are convolved with the double-gamma density. The structure of our model is such that the N voxels corresponding to a given stimulus form a group. We have two groups of covariates, and the covariates in each group share some correlation.
So selecting a 22 stimuli means selecting the entire group (all voxels). However, all voxels are not activated for a given stimuli; only a few exhibit activation. So, we need to select the activated voxels within the selected stimuli. Thus, we need a second stage of covariate selection where our model should select individual voxels from the group of voxels of the selected stimuli. Our model incorporates bi-level selection in group lasso following Xu and Ghosh (2015). We place a temporal correlation on the error structure of the normal distribution of responses. We do not introduce a spatial structure through the errors, but employ it later through the prior of the voxel coefficients. Similar to the spike and slab prior on groups, we use a spike and slab prior for selecting the relevant voxels. We use an adjacency matrix for spatial dependence where we have 1 for a neighbor and 0 for non-neighbors. Note that, a 2-D space will have 4 a maximum of neighbors while a 3-D space will have 26 neighbors. We use an algorithm to build the adjacency matrix where absolute value 1 of at least one coordinates’ difference of two voxels mean neighbor and absolute value the coordinates difference of two voxels greater than 1 means non-neighbor. If difference of all three coordinates is zero then we are comparing a voxel with itself. An R package called NeuroSim is used to simulate the 2-D fMRI data. To build a close resemblance with actual fMRI data, it is important that we simulate data that is synthetic fMRI data. Our simulation results give a very high accuracy rate of selection. Our method is a very useful application of spatio-temporal data. It has practical relevance in brain imaging data where the volume of data is large. Our Bayesian model can effectively identify the truly activated brain regions from thousands of candidate voxels. This application is extremely relevant in the field of biology and neuroimaging where BOLD signals can be used to detect brain activation, thus, leading to breakthrough revelations about the association of neuronal activity and certain diseases/ disorders or human behavior. 23 Chapter 2 Better Tools for Strategic Global Decision Makers Gaining Consumer Insights 2.1 Introduction Marketing strategies are mostly based on consumer psychology and behavior. Since socioeconomic conditions and cultures vary significantly across international markets, use of culture based strategies to reach out to consumers seem to be an intriguing idea. By paying attention to these cultural insights, marketers can get ahead of the curve and offer messages that anticipate changing consumer attitudes rather than simply responding to the present demands of the market. Research in markets across culture can help us better understand behavioral affiliations of consumers and help marketers launch a better campaign for their products. It is obvious that cultures vary in terms of mood of the nation, language, personal values, religion, rituals, personal preferences, social behavior and infrastructural facilities like technology and transportation. These variations can explain substantially the variation in consumer choices for many categories across countries. It is thus important for firms to develop marketing strategies that use local behavioral trends rather than a global message 24 across cultures. Successfully targeting customers and positioning firms’ products requires subtle and deep insights into customers’ cultures and tastes across borders. 
For instance, Kumar and Pansari (2016) examined the importance of cultural and economic aspects of a country on customer lifetime value (CLV), where the economy of a target country directly influences customer profitability, and national culture (individualism, uncertainty avoidance, masculinity, etc.) influences customer profitability through purchase frequency and contribution margin. In this vein, national culture is significantly associated with customer behaviors such as innovativeness (Steenkamp, Ter Hofstede, and Wedel, 1999), new product development activity (Nakata and Sivakumar, 1996), word of mouth (Money, Gilly and Graham, 1998), and financial decision making (Petersen, Kushwaha, and Kumar, 2015). Customer heterogeneity is a function of numerous factors.

2.1.1 Motivation

In general, international market segmentation must adapt and expand to understand customer heterogeneity and address the issue at three levels: the country, region, and consumer level. The segments defined by countries and regions are based mainly on geography; accordingly, the culture corresponds to a specific geography. Although geographic aggregation for international segmentation is widely accepted and predominates in business practice, unfortunately it overlooks consumer-level heterogeneity within a geographic segment. More importantly, the aggregated segmentation approach does not theoretically motivate international marketing strategy decisions, nor is it managerially relevant (Nachum, 1994; Steenkamp and Ter Hofstede, 2002). Because of the accelerating trend toward globalization, consumers' information, knowledge, and needs are no longer limited by location, shared language, or presumed similar culture in this era of information explosion.

A lifestyle segmentation system collects information about leisure activities, topics of interest, media profiles, personal traits and values, and thus substantially enhances the accessibility and actionability of the segments via marketing mechanisms such as promotion and advertising. A shortcut for firms can be to extract the discerning characteristics of consumers in target markets from lifestyle segments, and design corresponding marketing strategies. Thus, variable selection from a great number of indicators in lifestyle surveys helps firms to identify target customers accurately and efficiently. Despite the fact that variable selection from lifestyle segments significantly reduces the amount of market research effort required to target customers, there are a number of challenges in making precise strategic inferences and decisions. First, lifestyle segmentation surveys usually include information as extensive and granular as possible. Hence, some information is relevant and some might not be. This happens quite often because firms tend to gather information exhaustively in the early stages of market entry. Second, although a maximum input of information should be appreciated, it is almost always accompanied by overlapping and redundant information. Thus, once consumer insights have been transformed into data, high dimensionality and multicollinearity are present in further analyses. A traditional approach is stepwise regression, with an automatic procedure to successively include variables in the model based purely on a t-value. However, as Olusegun, Dikko, and Gulumbe (2015) recognized, the drawback of stepwise regression for the selection of variables usually comes from omitting a suppression effect.
Stepwise regression for variable selection may overlook a variable that is not correlated (or weakly correlated) with the dependent variable but is significantly correlated with other predictor variables. Consequently, the predictive power of the model is discounted; worse, omitting such a variable may present the risk of rejecting a true hypothesis as false (Pandey and Elliott, 2010). More computationally efficient penalized likelihood approaches, namely Ridge, LASSO, and elastic net, are used widely in variable selection procedures. Nevertheless, the selected variables and the complexity of the model rely heavily on the choice of the tuning (also called regularization) parameter λ. Typically, cross-validation and information criteria are used for selecting the optimal tuning parameter. A topic of interest is to see which model selection criterion helps achieve the optimal model. In an exercise of variable selection, the goal is to optimize model selection by balancing minimization of the mean squared prediction error (MSPE) for Gaussian models, or the negative log-likelihood (NLL) for non-Gaussian models, against maintaining the sparsity closest to the true model. In practice, this is usually managed by developing models over a sequence of tuning parameter values and selecting the one that achieves the optimal criterion. The information criterion approach takes into account both model fit and complexity to select the optimal tuning parameter. A potential concern of using the penalized likelihood approach is that the choice of the tuning parameter can be arbitrary (Kim et al., 2012). Fan and Tang (2012) empirically demonstrated that neither the Akaike information criterion (AIC) nor the Bayesian information criterion (BIC) is adequate to identify the true model. The concept behind these criteria is to put a penalty on the degrees of freedom of the model. This penalty is 2 for AIC and log(n) for BIC. One may argue that in a high-dimensional setting the number of parameters grows with increasing sample size, and that the dimensionality p should affect the penalty of the information criterion. We use the generalized information criterion (GIC) (Fan and Tang, 2012), with penalty log(log(n)) log(p), as a third type of information criterion.

2.2 The penalized regression models

Table 2.1 summarizes the three penalized likelihood methods we will compare for variable selection. It is well known that ordinary regression fails in the presence of high dimensionality and/or multicollinearity. In market survey data, information is collected in excess, which brings along the risk of redundant information. Penalized or regularized likelihood methods penalize the parameter space. The penalty encourages shrinkage of the coefficient estimates. These methods maximize the likelihood subject to a constraint attached to the objective function. There is no closed-form solution for this approach; a popular way to find the estimates is the gradient descent algorithm. Penalized regression simultaneously selects variables and estimates the coefficients of the regression model. For a generalized linear model, our model setup is as follows:
$$g(\mu) = X\omega$$
Here, $g(\cdot)$ is a link function, which links the mean $\mu$ of the responses to a linear combination of the covariates. The penalized likelihood method is very similar to the Gaussian case. In this setup too, we are interested in maximizing the likelihood, but instead of minimizing the sum of squared errors we minimize the negative log-likelihood.
Thus, for penalized methods, we add the penalty term to the NLL, and our objective is to minimize this.

2.2.1 Tuning parameter selection

The penalized approach selects a model of reduced dimension with a few covariates. The number of questions that are selected from the market survey questionnaire depends on the tuning parameter that sets the penalty on the parameters.

Table 2.1: Penalized Regression Methods To Be Discussed

Method: Ridge Regression (Hoerl and Kennard, 1970)
Function to minimize: $\sum_{i=1}^{n} (Y_i - X_i\omega)^2 + \lambda \sum_{j=1}^{p} \omega_j^2$
Shrinks estimates of the coefficients toward zero.

Method: LASSO (Tibshirani, 1996)
Function to minimize: $\sum_{i=1}^{n} (Y_i - X_i\omega)^2 + \lambda \sum_{j=1}^{p} |\omega_j|$
This approach forces many $\omega_j$ to take 0 values.

Method: Elastic Net (Zou and Hastie, 2005)
Function to minimize: $\sum_{i=1}^{n} (Y_i - X_i\omega)^2 + \lambda P_\alpha(\omega)$
Works in the presence of multicollinearity, unlike LASSO.

Since the number of questions included in the final model may have an enormous financial impact on a market survey, careful selection of the tuning parameter is of great importance. The tuning parameter reduces the degrees of freedom of the model. Thus, selecting an appropriate tuning parameter affects model selection, which directly affects the predictive performance of the selected model. A tuning parameter is generally denoted as λ; we will call it λ henceforth. A small choice of λ selects a larger model and a large choice of λ selects a smaller model. Choice of an optimal λ is important for selecting the optimal model.

A popular method of λ selection is cross-validation. Here, the data is divided into k approximately equal parts. k − 1 parts are used as the training set and the kth part is used as the test dataset. A model with parameter λ is fit on the training set. The prediction error, computed on the test set based on $\hat\omega^{-k}$, is
$$PE_k(\lambda) = \sum_{i \in k\text{th part}} \left(y_i - x_i\hat\omega^{-k}\right)^2.$$
For a generalized linear model, we work with the value of the NLL. In the case of a binary outcome, our NLL is
$$NLL(\omega) = \sum_{i=1}^{n} \left[ y_i \log\left(1 + \exp(-x_i^T\omega)\right) + (1 - y_i) \log\left(1 + \exp(x_i^T\omega)\right) \right].$$
This is repeated k times, using every sub-dataset as the test set and the remaining parts as the training set for the same λ. The average cross-validation error is computed. This entire procedure is repeated for every candidate value of λ, and the model with the smallest average cross-validation error is selected.

AIC and BIC are well-known measures of goodness of fit in ordinary least squares; more recently, information criteria have also been used as a method of λ selection. An information criterion penalizes the degrees of freedom in a model, thus dealing with a trade-off between goodness of fit and complexity of the model. The more parameters in a model, the better the fit, but this often leads to overfitting. The penalty term is 2 for AIC and log(n) for BIC. Another information criterion is the generalized information criterion (GIC), which uses log(log(n)) log(p) as its penalty term. We often notice that the number of parameters grows with an increasing sample size, and thus it seems reasonable to introduce the parameter dimension into the penalty. The general formula for an information criterion is: Information Criterion = −2 log-likelihood + penalty × degrees of freedom. The log-likelihood term captures the goodness of fit, and the penalty times the degrees of freedom accounts for the complexity of the model. A model that maximizes the likelihood is preferred, but the second term introduces a trade-off with model complexity. We choose the model that minimizes the information criterion.
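To make the λ-selection recipes above concrete, here is a minimal illustrative sketch in R (our own toy example, not code or data from this chapter), assuming the glmnet package; the simulated X and y and the information-criterion formulas simply mirror the description above, with the −2 log-likelihood replaced, up to constants, by n log(RSS/n) in the Gaussian case:

# Compare tuning parameter selection rules on one LASSO path (illustrative sketch).
library(glmnet)

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(1, 2, 4, 3, 3, rep(0, p - 5))
y <- drop(X %*% beta) + rnorm(n, sd = 0.5)

# k-fold cross-validation: lambda minimizing the average CV error
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
lambda_cv <- cv_fit$lambda.min

# Information criteria over the same lambda path:
# IC(lambda) = -2 log-likelihood + penalty * degrees of freedom
fit <- glmnet(X, y, alpha = 1)
rss <- colSums((y - predict(fit, newx = X))^2)
df  <- fit$df
ic  <- function(penalty) n * log(rss / n) + penalty * df
lambda_aic <- fit$lambda[which.min(ic(2))]
lambda_bic <- fit$lambda[which.min(ic(log(n)))]
lambda_gic <- fit$lambda[which.min(ic(log(log(n)) * log(p)))]

c(CV = lambda_cv, AIC = lambda_aic, BIC = lambda_bic, GIC = lambda_gic)

The same construction carries over to ridge and elastic net by changing alpha, and to the binary-outcome case by fitting with family = "binomial" and replacing the RSS-based term with the NLL given above.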
The motivation in this chapter is to find an optimal model for a market survey dataset where the number of questions (and sub-questions) outnumbers the number of consumers who participated in the survey. The chosen model will minimize the prediction error. Selection of the model depends on λ selection. We want to find a method that consistently selects a λ value for which an optimal model is achieved. In this chapter, we compare 3 different information criteria and 5-, 10-, and 50-fold cross-validation to see which one of these chooses an optimal λ consistently.

Statistical model building is a commonly used technique in market research. In marketing data, when we want to build a model where there are too many variables as opposed to too few observations (e.g., data from a survey questionnaire), these variable selection methods can be very useful. A stepwise regression type approach has its limitations (Harrell, 2013). We are looking for a statistically valid variable selection methodology. Application of a penalized regression method, with a careful choice of tuning parameter, can be a powerful method in market research model building. The above-mentioned variable selection methods are not only valid for high-dimensional data; as we will see in the illustration, they are also valid when p < n. When we are trying to predict consumer behavior or segmentation, this can be an efficient way of dealing with the problem. An optimal model able to predict consumer behavior or segmentation by simultaneously selecting and estimating relevant variables has multi-dimensional cost-saving potential. A better model means better-designed, more cost-effective market strategies. A viable model does away with noninformative variables, thus saving on the cost of data collection. A good market strategy impacts returns from the market, which shows that the initial step of building an accurate model affects the long-term credibility of marketing strategies for a firm.

2.3 Simulation

We test our proposed approach for tuning parameter selection of penalized methods with a thorough simulation study. To demonstrate its performance and validity, we simulate data that represent four scenarios: low vs. high dimensionality, and the presence/absence of multicollinearity. These four scenarios are illustrated for both a linear case and a binary response case. It is expected that the presence of non-informative variables in a model can disturb the selection of the truly informative variables and skew parameter estimates. This simulation exercise compares 3 penalized variable selection approaches and 6 tuning parameter selection methods. We add non-informative variables in a sequentially increasing order to the set of informative variables. To illustrate, we use several different values of p for variable selection. Consider a linear model $Y_i = X_i\beta + \epsilon_i$, $i = 1, \ldots, N$, where N indicates the number of observations, $X_i$ is the vector of covariates for observation i, and $\epsilon_i$ is a Gaussian random error with zero mean and variance $\sigma^2$. The overall design matrix is generated from a multivariate normal distribution with zero mean and covariance matrix Σ with pairwise correlation $\Sigma_{kk'} = \rho^{|k-k'|}$, ρ = 0.2. To simulate scenarios in which multicollinearity is present or absent, we generate two different feature matrices: i) an X which does not have correlated columns; and ii) an X with a number of correlated columns (a sketch of this data-generation scheme is given below).
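The data-generation scheme just described can be sketched as follows; this is our own illustrative code under the settings described in this section (it is not the dissertation's code), assuming the MASS package for the multivariate normal draws. The multicollinearity variant additionally replaces a few columns of X with strongly correlated ones, as detailed later in the section.

# Illustrative sketch of the simulation design: AR(1)-correlated covariates,
# five informative coefficients, Gaussian and binary responses, train/test split.
library(MASS)

set.seed(2017)
N <- 500; p <- 1000; rho <- 0.2

Sigma <- rho^abs(outer(1:p, 1:p, "-"))     # Sigma[k, k'] = rho^|k - k'|
X <- mvrnorm(N, mu = rep(0, p), Sigma = Sigma)

beta <- c(1, 2, 4, 3, 3, rep(0, p - 5))    # only the first 5 covariates matter

y_lin <- drop(X %*% beta) + rnorm(N, sd = sqrt(0.25))          # error variance 0.25
y_bin <- rbinom(N, size = 1, prob = plogis(drop(X %*% beta)))  # binary response via logit link

train <- 1:360                              # first 360 rows used for training
test  <- setdiff(1:N, train)                # remaining 140 rows used for testing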
The number of parameters p is set to vary in order to illustrate whether the methods are able to reliably select the true factors and covariates. We examine each of the above scenarios with increasing p, starting from 5 up to 1000. X is generated to have 500 rows and 1000 columns. Our parameter vector has p entries, with the first 5 entries being 1, 2, 4, 3, 3, and the rest all 0. The first 360 rows of X are used as the training dataset, and the remaining 140 rows are used as the test dataset. We use a hold-out test set because the optimality of our models is assessed with a prediction metric. The error variance used to simulate Y is 0.25. The same design has been used to generate Y using a logit-link generalized regression structure. Here, X, β, and p are the same as in the previous case, the only difference being that we use them to generate a binary response through a logit link. For the linear setup, we find the prediction error on the test dataset, and for the binary response, we obtain the NLL on the test dataset based on the estimates obtained from the training dataset, using 6 different criteria (AIC, BIC, GIC, CV-5, CV-10, and CV-50) for ridge, LASSO, and elastic net, respectively. Bootstrapping estimation is used with 100 draws, with various levels of p, the number of parameters, such that both low-dimensional and high-dimensional settings are considered. Our true model size is 5, and we demonstrate which methods are able to produce a model closest to the true model.

Tables 2.2 and 2.3 exhibit the estimation results when we use an X that does not have correlated columns, namely in the absence of multicollinearity. Table 2.2 displays the scenario of low dimensionality, where the number of parameters is smaller than the number of observations (p = 50). Ridge selects all the covariates and gives the highest MSPE compared to LASSO and elastic net; this result reflects the overfitting tendency of ridge. The 3 cross-validations, CV-5, CV-10, and CV-50, perform consistently by choosing a comparatively smaller λ. LASSO and elastic net are both able to perform variable selection. BIC and GIC both select a reasonable model, as the average model size is very close to the true model. AIC and the CVs put a small penalty on the parameters, resulting in the selection of larger models. The standard errors of the model size are higher for AIC and the CVs, indicating that they fail to select a model consistently. BIC and GIC, on the other hand, have smaller standard errors of model size. However, the MSPE is slightly smaller for AIC and the CVs, mainly because AIC and the CVs include more variables. Thus, BIC and GIC outperform the other selection methods, as indicated by reasonable MSPE and model sizes close to the true model. Given that the differences in MSPE are not significant, the extra variables included by AIC and the CVs do not significantly improve the predictive power of the model.

Table 2.3 exhibits the scenario of high dimensionality, where the number of parameters is greater than the number of observations (p = 700). We drop ridge since it fails to select an optimal model. A similar pattern emerges in that AIC and the CVs select too many variables in the model. The MSPEs yielded by BIC and GIC remain reasonable even though the number of selected variables is much smaller; the MSPEs resulting from BIC and GIC are quite comparable, but GIC selects the model size that is closest to the true model.
This further reflects the fact that BIC and GIC strike a trade-off between an optimal model size and a low MSPE. For higher values of p, the AIC curve is a decreasing function of λ and never attains a minimum; to maintain consistency with the other criteria, we set a limit on the range of λ values for AIC in this study. Thus, AIC is not a good option for λ selection. Overall, LASSO, with GIC as the tuning parameter selection method, performs best in the scenario of high dimensionality that is free from multicollinearity.

Next, we apply our proposed approach to the simulated scenarios in which the first three columns of X are correlated, exhibiting multicollinearity. The first and third columns are set up with a moderate correlation (0.6) and the second and third columns with a high correlation (0.9). As before, we repeat the simulation for p = 5, 50, 100, 300, 500, 700, and 1000. We report model results for p = 50 and 700 in Tables 2.4 and 2.5, respectively. The active set in the tables indicates which of the true variables the model selects. LASSO is not able to select all of the true featuring variables that are correlated, in either the low- or the high-dimensional case, and thus fails to identify the true model. As expected, elastic net is able to include all true featuring variables. Therefore, in the presence of multicollinearity, elastic net is the better penalized approach, and the resulting MSPE is reasonably small. As in the non-multicollinearity case, BIC and GIC select higher values of λ. In the context of high dimensionality, AIC is the weakest selection criterion, indicated by the highest MSPE and an excessively large model size. The CVs exhibit smaller MSPE but, similar to AIC, select too many featuring variables. Correspondingly, BIC and GIC yield better results regarding both model selection and reasonably low MSPE, and GIC is able to identify the model closest to the true model. To conclude, in the case of high dimensionality with multicollinearity, elastic net, with GIC as the tuning parameter selection method, is the preferred penalized approach.
Table 2.2: Mean squared prediction error and model size when p = 50 (without multicollinearity)

Method       Criterion   λ (SE)                        Model Size (SE)   MSPE (SE)
Ridge        AIC         0.001021388 (0.0000000001)    50                0.2893 (0.03679)
Ridge        BIC         0.003562153 (0.0005126616)    50                0.2893 (0.03679)
Ridge        GIC         0.004184590 (0.0004470174)    50                0.2893 (0.03679)
Ridge        CV-5        0.507462118 (0.4643633689)    50                0.2897 (0.03682)
Ridge        CV-10       0.415566287 (0.3319358473)    50                0.2894 (0.03684)
Ridge        CV-50       0.380492323 (0.1616254741)    50                0.2894 (0.03683)
LASSO        AIC         0.02781584 (0.008342591)      16.56 (6.92)      0.26235 (0.03299)
LASSO        BIC         0.05170223 (0.008185595)      6.84 (1.47)       0.26776 (0.03358)
LASSO        GIC         0.05460872 (0.008521452)      6.38 (1.23)       0.26902 (0.03344)
LASSO        CV-5        0.03329435 (0.006880519)      14.83 (6.15)      0.26165 (0.03324)
LASSO        CV-10       0.03121886 (0.006198419)      15.54 (5.84)      0.26165 (0.03324)
LASSO        CV-50       0.03008664 (0.007338059)      15.59 (6.04)      0.26197 (0.03336)
Elastic Net  AIC         0.04052913 (0.012860055)      22.76 (7.67)      0.26663 (0.03384)
Elastic Net  BIC         0.09110559 (0.014255864)      8.45 (2.20)       0.27658 (0.03486)
Elastic Net  GIC         0.09683863 (0.013107877)      7.70 (1.74)       0.27898 (0.03481)
Elastic Net  CV-5        0.05252995 (0.008847460)      19.28 (5.90)      0.26555 (0.03335)
Elastic Net  CV-10       0.04963789 (0.009817753)      20.23 (6.32)      0.26555 (0.03354)
Elastic Net  CV-50       0.04773889 (0.011280188)      20.52 (6.76)      0.26586 (0.03346)

Table 2.3: Mean squared prediction error and model size when p = 700 (without multicollinearity)

Method       Criterion   λ (SE)                      Model Size (SE)   MSPE (SE)
LASSO        AIC         0.02338005 (0.007871480)    138.50 (37.02)    0.31144 (0.03719)
LASSO        BIC         0.07734480 (0.006230073)    6.96 (1.66)       0.29031 (0.0347)
LASSO        GIC         0.08481446 (0.007959283)    5.65 (0.83)       0.29667 (0.03585)
LASSO        CV-5        0.05652823 (0.008080458)    26.09 (13.50)     0.28022 (0.03396)
LASSO        CV-10       0.05387672 (0.009212344)    30.16 (16.78)     0.2800 (0.034)
LASSO        CV-50       0.05205381 (0.010043022)    32.67 (18.58)     0.28058 (0.03434)
Elastic Net  AIC         0.04130926 (0.01049730)     157.06 (28.78)    0.32671 (0.03858)
Elastic Net  BIC         0.15116164 (0.01347818)     9.23 (2.81)       0.32075 (0.04022)
Elastic Net  GIC         0.16840461 (0.01598920)     6.76 (1.42)       0.33463 (0.04147)
Elastic Net  CV-5        0.09941351 (0.01619864)     41.63 (18.90)     0.29747 (0.03698)
Elastic Net  CV-10       0.09406769 (0.01637588)     47.45 (21.04)     0.29666 (0.03747)
Elastic Net  CV-50       0.08952375 (0.01987736)     53.99 (27.89)     0.29788 (0.03839)

Similarly, Tables 2.6-2.9 illustrate the comparison of LASSO and elastic net on binary outcomes, both with and without multicollinearity. The conclusions we draw from these results are very similar to those of the linear case. Note that the NLL is lower for AIC and all cross-validation cases. Variable selection is evidently better for BIC and GIC and, more importantly, GIC offers a model closest to the true model. For the binary outcome case, BIC and GIC pay the price of a slightly higher NLL compared to the other methods. Given that our main objective is variable selection, we consistently propose to use GIC over the other methods as optimal for variable selection. One thing to note in this illustration is that AIC performs better than in the linear model setup. To summarize, LASSO and elastic net are both applicable variable selection methods for the scenario of high dimensionality and perform similarly in both a linear model and a binary outcome model. When there is no threat of multicollinearity, LASSO is the preferred penalized likelihood method because it selects a model closest to the true model and results in low MSPE.
However, high dimensionality is often accompanied by multicollinearity to varying extents. Thus, elastic net is considered more generalizable, since it is able to select all true featuring variables when there is potential multicollinearity. With regard to selecting the tuning parameter, GIC is considered the better alternative when the goal is to select informative variables precisely and efficiently. The simulation study demonstrates that in an exercise of variable selection in high dimensionality that presents multicollinearity, elastic net, together with the GIC approach, should be employed. We present a variable selection exercise with a global lifestyle segmentation in the next section.

Table 2.4: Mean squared prediction error and model size when p = 50 (with multicollinearity)

Method       Criterion   λ (SE)                      Model Size (SE)   Active Set (SE)   MSPE (SE)
LASSO        AIC         0.03181344 (0.009161650)    12.42 (6.12)      3.5 (0.50)        0.26497 (0.03133)
LASSO        BIC         0.05726272 (0.009371818)    4.24 (1.26)       3.44 (0.50)       0.2668 (0.03175)
LASSO        GIC         0.05964167 (0.009658534)    4.00 (1.11)       3.42 (0.50)       0.26749 (0.03193)
LASSO        CV-5        0.03815249 (0.007085186)    10.48 (4.88)      3.53 (0.50)       0.26402 (0.03121)
LASSO        CV-10       0.03643849 (0.007736441)    10.94 (5.45)      3.56 (0.50)       0.26429 (0.03138)
LASSO        CV-50       0.03527344 (0.009265170)    11.18 (6.27)      3.58 (0.50)       0.26476 (0.03133)
Elastic Net  AIC         0.05163878 (0.01170837)     16.99 (5.43)      5 (0)             0.26807 (0.03132)
Elastic Net  BIC         0.10042863 (0.01673992)     7.14 (1.88)       5 (0)             0.27405 (0.03308)
Elastic Net  GIC         0.10884811 (0.01832204)     6.38 (1.55)       5 (0)             0.27618 (0.03268)
Elastic Net  CV-5        0.06252737 (0.01147031)     14.81 (4.92)      5 (0)             0.2681 (0.03082)
Elastic Net  CV-10       0.05944682 (0.01124177)     15.60 (4.87)      5 (0)             0.26775 (0.03123)
Elastic Net  CV-50       0.05569424 (0.01259223)     16.16 (5.55)      5 (0)             0.2678 (0.03113)

Table 2.5: Mean squared prediction error and model size when p = 700 (with multicollinearity)

Method       Criterion   λ (SE)                      Model Size (SE)   Active Set (SE)   MSPE (SE)
LASSO        AIC         0.02738359                  117.55 (44.90)    3.53 (0.50)       0.29717 (0.03471)
LASSO        BIC         0.07876276 (0.006756074)    4.61 (1.37)       3.49 (0.50)       0.27272 (0.03042)
LASSO        GIC         0.08634825 (0.008313138)    3.62 (0.68)       3.44 (0.50)       0.27588 (0.03102)
LASSO        CV-5        0.06229525 (0.010551477)    18.16 (13.65)     3.32 (0.47)       0.26981 (0.03085)
LASSO        CV-10       0.05964375 (0.009620764)    19.6 (13.66)      3.32 (0.47)       0.26912 (0.03023)
LASSO        CV-50       0.05632187 (0.010686199)    23.73 (16.49)     3.32 (0.47)       0.26958 (0.03121)
Elastic Net  AIC         0.04487323 (0.01612084)     147.18 (42.90)    5 (0)             0.31471 (0.03655)
Elastic Net  BIC         0.15466768 (0.01366089)     7.37 (1.84)       5 (0)             0.29291 (0.03168)
Elastic Net  GIC         0.17007910 (0.01428129)     5.83 (1.01)       5 (0)             0.2999 (0.03405)
Elastic Net  CV-5        0.11063973 (0.01643746)     28.80 (13.61)     5 (0)             0.28104 (0.03077)
Elastic Net  CV-10       0.10529391 (0.01720958)     33.29 (16.64)     5 (0)             0.28127 (0.03036)
Elastic Net  CV-50       0.10155184 (0.01918456)     36.55 (19.12)     5 (0)             0.28189 (0.03069)

Table 2.6: Negative log-likelihood and model size when p = 50 (without multicollinearity) for binary response

Method       Criterion   λ (SE)          Model Size (SE)   NLL (SE)
LASSO        AIC         0.019 (0.001)   12.77 (2.52)      31.57 (2.70)
LASSO        BIC         0.024 (0.004)   7.5 (1.67)        33.53 (3.01)
LASSO        GIC         0.025 (0.004)   6.96 (1.42)       33.05 (2.98)
LASSO        CV-5        0.01 (0.002)    22.05 (4.03)      29.69 (3.68)
LASSO        CV-10       0.009 (0.001)   23.21 (4.01)      29.59 (3.87)
LASSO        CV-50       0.01 (0.001)    23.27 (3.87)      29.57 (3.86)
Elastic Net  AIC         0.02 (0.003)    23 (4.16)         33.73 (2.68)
Elastic Net  BIC         0.043 (0.012)   9.99 (3.69)       39.59 (3.60)
Elastic Net  GIC         0.049 (0.012)   8.34 (2.64)       40.08 (3.46)
Elastic Net  CV-5        0.01 (0.002)    34.37 (4.00)      31.91 (3.72)
Elastic Net  CV-10       0.01 (0.001)    35.02 (3.53)      31.81 (3.80)
Elastic Net  CV-50       0.009 (0.001)   35.24 (3.49)      31.8 (3.79)

Table 2.7: Negative log-likelihood and model size when p = 700 (without multicollinearity) for binary response

Method       Criterion   λ (SE)          Model Size (SE)   NLL (SE)
LASSO        AIC         0.03 (0.003)    23.67 (3.12)      37.23 (2.43)
LASSO        BIC         0.044 (0.004)   6.65 (1.51)       40.24 (2.67)
LASSO        GIC         0.05 (0.006)    5.31 (0.81)       42.34 (2.93)
LASSO        CV-5        0.015 (0.005)   68.86 (16.62)     32.64 (3.62)
LASSO        CV-10       0.015 (0.005)   69.3 (17.57)      32.74 (3.67)
LASSO        CV-50       0.015 (0.005)   68.81 (19.45)     32.82 (3.67)
Elastic Net  AIC         0.07 (0.007)    21.66 (7.13)      47.17 (2.27)
Elastic Net  BIC         0.099 (0.010)   7.66 (2.21)       53.02 (2.39)
Elastic Net  GIC         0.113 (0.012)   5.55 (1.13)       55.74 (2.78)
Elastic Net  CV-5        0.029 (0.009)   104.48 (25.92)    41.47 (3.01)
Elastic Net  CV-10       0.026 (0.009)   113.19 (27.18)    41.49 (3.15)
Elastic Net  CV-50       0.025 (0.010)   116.21 (31.64)    41.81 (3.43)

Table 2.8: Negative log-likelihood and model size when p = 50 (with multicollinearity) for binary response

Method       Criterion   λ (SE)          Model Size (SE)   Active Set (SE)   NLL (SE)
LASSO        AIC         0.019 (0.001)   16.47 (1.83)      3.45 (0.5)        18.45 (1.53)
LASSO        BIC         0.022 (0.003)   4.88 (1.38)       3.31 (0.46)       19.44 (1.82)
LASSO        GIC         0.023 (0.003)   4.53 (1.22)       3.28 (0.45)       19.83 (1.90)
LASSO        CV-5        0.009 (0.001)   17.4 (2.72)       3.72 (0.45)       14.21 (2.12)
LASSO        CV-10       0.009 (0.001)   17.41 (2.94)      3.71 (0.46)       14.19 (2.11)
LASSO        CV-50       0.009 (0.001)   17.26 (3.01)      3.71 (0.46)       14.21 (2.14)
Elastic Net  AIC         0.02 (0.002)    20.57 (3.08)      5 (0)             21.91 (1.72)
Elastic Net  BIC         0.042 (0.010)   9.52 (2.83)       5 (0)             28.53 (3.16)
Elastic Net  GIC         0.045 (0.009)   8.65 (2.16)       5 (0)             29.48 (2.81)
Elastic Net  CV-5        0.009 (0.001)   32.46 (3.02)      5 (0)             17.67 (2.39)
Elastic Net  CV-10       0.009 (0.001)   32.46 (3.06)      5 (0)             17.67 (2.39)
Elastic Net  CV-50       0.009 (0.001)   32.5 (3.00)       5 (0)             17.66 (2.39)

Table 2.9: Negative log-likelihood and model size when p = 700 (with multicollinearity) for binary response

Method       Criterion   λ (SE)          Model Size (SE)   Active Set (SE)   NLL (SE)
LASSO        AIC         0.024 (0.004)   14.95 (6.21)      3.29 (0.46)       21.26 (1.99)
LASSO        BIC         0.035 (0.004)   4.55 (1.67)       3.14 (0.35)       25.04 (2.02)
LASSO        GIC         0.04 (0.004)    3.46 (0.69)       3.11 (0.31)       26.39 (1.78)
LASSO        CV-5        0.01 (0.002)    57.88 (9.60)      3.49 (0.50)       17.25 (2.57)
LASSO        CV-10       0.01 (0.003)    56.95 (10.06)     3.49 (0.50)       17.28 (2.58)
LASSO        CV-50       0.011 (0.003)   55.97 (10.09)     3.48 (0.50)       17.34 (2.57)
Elastic Net  AIC         0.062 (0.007)   17.74 (6.28)      5 (0)             34.54 (1.87)
Elastic Net  BIC         0.085 (0.009)   7.07 (1.91)       5 (0)             39.4 (2.03)
Elastic Net  GIC         0.095 (0.009)   5.57 (0.77)       5 (0)             41.39 (2.06)
Elastic Net  CV-5        0.017 (0.006)   117.89 (21.31)    5 (0)             26.81 (3.08)
Elastic Net  CV-10       0.016 (0.006)   119.7 (20.85)     5 (0)             26.63 (3.05)
Elastic Net  CV-50       0.015 (0.007)   121.25 (23.50)    5 (0)             26.7 (3.18)

2.4 An Empirical Variable Selection Exercise

We apply our proposed penalized likelihood approach to a variable selection exercise on global lifestyle segments. The objective is to select characterizing features that reflect consumer heterogeneity, using data collected by an international marketing research firm specializing in consumer insights. The survey collects lifestyle measures of consumers worldwide. The original response variable is defined by six global lifestyle segments: fun-seekers, intimates, creatives, altruists, devouts, and strivers. Figure 2.1 shows the distributions of the six lifestyle segments across four countries: the US, Canada, Brazil, and China. The four selected countries are expected to vary in underlying constructs. In other words, individual items may share the same underlying structure, but the membership of items in each construct may differ across the four countries. We obtain 300 observations from random draws for each country. As expected, these reveal that the distribution of lifestyle segments is similar between the US and Canada, but this distribution differs strongly from those of China and Brazil. In the US and Canada, many people are fun-seekers (28.7% US and 29.7% Canada), followed by intimates (18.0% US and 20.3% Canada), creatives (15.3% US and 21.0% Canada), and strivers (15.3% US and 14.7% Canada). The US (18.0%) has more devouts than Canada (9.3%), and altruists are rare in both the US (4.7%) and Canada (5.0%).
Contrarily, strivers account for more than half (58.3%) of the China sample, followed by 13.3% of altruists, 7.7% of fun seekers, 7.0% of devouts and creatives, and 6.7% of intimates. The Brazil sample exhibits a more balanced distribution and has relatively the largest percentage of intimates (22.7%), followed by devouts (21.7%), altruists (19.7%), creatives (14.7%), fun-seekers (12.0%), and strivers (9.3%). The entire survey questionnaire includes 11 sections, each of which has a wide range in 41 the number of questions. We eliminated the section about evaluation of popular brands and brand involvement, because the specific brands are not available across the four countries. Following this logic, we deleted a number of items that do not apply to all four countries, but retained sections that are generalizable across countries, including mood of the nation, personal values, leisure and frequent activities, influential exposure to society, opinions about environmental issues, and demographics. We ended up having 416 explanatory variables, with 300 observations for each country sample. Thus, the data setup explicitly presents high dimensionality. 2.4.1 Characterizing Lifestyle Segments The primary interest is to identify the featuring variables for each lifestyle segments across the four countries. Thus, we first transform the original dependent variable into six binary dependent variables, representing each type of lifestyle. According to out simulation study, we employ elastic net as the penalized likelihood method and GIC as the criterion for tuning parameter selection. The overall model size of each lifestyle segment varies from 12-15 for the US, 9-15 for Canada, 5-14 for China, and 7-15 for Brazil. As expected, personal values are the strongest indicators of lifestyle segments among the 416 explanatory variables. The selected featuring variables characterizing lifestyle segments differ across the four countries, yet there are still degrees of mutual patterns. First, across the four countries, personal values such as pleasure are important to worldwide fun-seekers, while duty is not. Other than pleasure, US fun-seekers are into looking good; Canadian fun-seekers are into adventure and sex; Chinese fun-seekers are into romance; and Brazilian fun-seekers are into leisure, having fun, and adventure. In addition to duty, Canadian fun-seekers are not into spirituality, perseverance, and protecting the family; Chinese fun-seekers are not into 42 wealth; and Brazilian fun-seekers are not into spirituality. Second, the featuring variables characterizing intimates are more differential. Chinese and Brazilian intimates are fairly concentrated by one personal value, where protecting the family is important to Chinese intimates while sex is important to Brazilian intimates. It seems the concept of intimates is reflected in romantic relationships in Brazil (sex) and more broadly in China (family). Except for protecting the family, stable personal relationships are important to intimates in both the US and Canada. Interestingly, romance is important to US intimates, while it is an indicator for fun-seekers in China. US intimates don’t seem to care much about status or adventure, and they would not be fine financially if they stopped working. Canadian intimates don’t seem to care about status, beauty, or creativity. Third, the variables selected to characterize creatives are most differential across the four countries. 
US creatives advocate values such as open-mindedness, authenticity, creativity, self-reliance, and curiosity, whereas they consider wealth, pleasure, sex, tradition, and respecting ancestors not important. Except for openmindedness and curiosity, Canadian creatives consider learning very important and status and sex not important. In addition to personal values, Canadian creatives are likely to own a stereo. Chinese creatives do not care about wealth and are more likely to purchase travel insurance. Brazilian creatives are into creativity and freedom. Fourth, US altruists appreciate perseverance and helpfulness, and are not likely to own a scanner. Canadian altruists value duty, are relatively older, and don’t do fun things on a daily basis. Chinese altruists seem to be strongly affected by social indicators such as social stability, social responsibility, preserving the environment, and are more likely to make a sizable donation to a local or national organization. Brazilian altruists are characterized by the biggest number of personal values, including being in tune with nature, preserving the environment, perseverance, justice, social responsibility, helpfulness, and equality. Fifth, US devouts value 43 obedience, respecting ancestors, traditional gender roles, and faith, and they are interested in topics about personal finance. Similarly, Canada’s devouts value tradition, obedience, and traditional gender roles. Chinese devouts value tradition and modesty, but not power, and are very interested in gardening. Brazilian devouts value spirituality and faith but do not work just to earn a living. Finally, power, wealth, and status are strong values for strivers worldwide, except that power does not seem important to Brazilian strivers. Besides, US strivers spend more time with their spouses. Canadian strivers seem to be very self-centered, since they do not care about honesty, freedom, protecting the family, equality, or friendship. However, Canadian strivers are very interested in social issues. Chinese strivers seem to care less about stable personal relationship, friendship, and respective ancestors. Tables 2.10-2.13 summarizes the selected variable and parameter estimates. 
44 Figure 2.1: Distributions and descriptions of the six psychographic segments 45 Table 2.10: Results of Variable Selection - USA Funseekers Excitement Intimates Adventure Creatives Wealth Spirituality If stopped working wld be fine financially Status Honesty Power Open-mindedness Authenticity Honesty Authenticity Pleasure Status Self- esteem Enjoying Life OpenMindedness Creativity Having fun Beauty Self-reliance Adventure Sex Leisure Protecting the family Looking good Curiosity Knowledge Stable personal relationships Romance Modesty Pleasure Looking good Duty Faith Excitement Spirituality Honesty Authenticity Public image Sex Tradition Respecting ancestors Wealth Altruists Topics very interested in: Electronic games Financial services/products own/use: Hospital cash plans Devouts Topics very interested in: Personal Finance Topics very interested in: Wellness, Fitness, Exercise Financial services/products own/use: Home equity loans Attitude toward new technology: It scares me Tech Items owned: Portable Music Player (such as Walkman, Discman) Own inkjet/laser printer or scanner: Scanner Excitement Respondent main income earner Leisure Spirituality Tradition Activities on Internet past 30 days: Download video files Power Obedience Respecting ancestors Wealth Status Preserving the ment Perseverance Traditional gender roles Honesty Faith Enduring love Modesty Friendship Enjoying life environ- Personal support Helpfulness Equality Having fun Topics very interested in: Electronic games Pleasure Enjoying Life 46 Friendship Having fun Sex Strivers Weekly hrs-With spouse Environment-Things would like cos. to do: Invest in research of technologies to help preserve the envi Tech Items owned: Video camera/camcorder Tech Items owned: Digital Home Projector Activities on Internet past 30 days: Watch video, broadcasts and events Table 2.11: Results of Variable Selection - Canada Funseekers Have fun now future care of itself Types of TV programs watch: Music video and popular music Intimates Three things most concerned about: Terrorism Weekly hrsGardening or yard work Spirituality Items recommended to other people: Parenting/family issues Status Honesty Authenticity Perseverance Protecting the family Pleasure Enjoying life Having fun Live for today Adventure Sex Looking good Duty Beauty Creativity Protecting the family Social responsibility Stable personal relationships Enduring love Adventure Duty Obedience Creatives Topics very interested in: Arts and culture EnvironmentHow feel re cos. that make effort: I don’t know which companies address the environment or how Tech Items owned: Stereo (hi-fi) Wealth Status Open-mindedness Authenticity Perseverance Self-reliance Curiosity Knowledge Wisdom Sex Looking good Learning Altruists Weekly hrs-Socializing Devouts Topics very interested in: Religion Strivers Topics very interested in: Automobiles/Driving fun Tech Items owned: Cellular/Mobile phone Topics very interested in: Social Issues Environment-Things would like cos. to do: Make and sell products that do not harm the environment Have fun now - future care of itself Age of Respondent Are you a parent? 
Excitement Tech Items owned: Portable Music Player (such as Walkman, Discman) Most import.: brand offers good value for money Spirituality Traditional gender roles Duty Obedience Tech Items owned: camera Power Wealth Status Preserving the ment Social tolerance Traditional gender roles Honesty Patriotism Freedom Weekly things hrs-Doing environ- Adventure Duty Respecting ancestors 47 Protecting the family Equality Romance Friendship Enjoying life Film Table 2.12: Results of Variable Selection - China Funseekers Wealth Protecting family Intimates Are you a parent? the Power Romance Creativity Duty Protecting family Learning Stable personal relationships Enduring love Pleasure Enjoying life the Creatives Activ. Freq.attend lecturesnot reg study Topics very interested in: Cultures around the world Financial services/products own/use: Life insurance or endowment products that provide a lump sum Financial services/products own/use: Travel insurance Financial services/products own/use: Auto/car loan Financial services/products own/use: Home equity loans Financial services/products own/use: Overdraft facility Financial services/products own/use: Mutual Funds or other investment product Financial services/products own/use: Financial planning/advisors M30-Activities on Internet past 30 days: Get information related to my hobbies/interests Wealth Altruists Weekly hrs-Learning new things Political/societal activities in past year: Made a sizable donation to a local or national organization Percentage of pre-tax monthly income save Devouts Three things most concerned about: Environmental pollution Topics very interested in: Gardening Power Strivers Activ. Freq.-exercise to keep fit Environment-Things would like cos. to do: Support recycling or clean-up of polluted sites Financial services/products own/use: Credit card(s) I spend a lot of time researching brands Wealth Types of TV programs watch: Programs for children Wealth Tradition Power Status Modesty Wealth Honesty Status Preserving ment the environ- Ambition Protecting the family Honesty Social responsibility Justice Social stability Stable personal relationships Friendship Respecting ancestors Material security Social stability Learning 48 Table 2.13: Results of Variable Selection - Brazil Funseekers Weekly hrsSocializing Intimates -Activ. Freq.-sitdown meal in a restaurant Creatives Topics very interested in: Religion Altruists Weekly hrs-g-Socializing Devouts Three things most concerned about: AIDS Activ. Freq.-take pictures or photos Ambition Business overnight trips in past 12 mos Health and fitness Topics very interested in: Environmental issues Activ. Freq.-play indoor or parlor games Topics very interested in: Science Honesty Being in tune with nature Items recommended other people: Music Preserving ment Items recommended to other people: Restaurants or places to eat Buy whatever Importance Topics very interested in: Music Being in tune with nature Perseverance Work just to earn living Wealth Excitement Creativity Justice Not person that takes risks Status Leisure Spirituality Authenticity Knowledge Wisdom Protecting family Romance . 
Need help making financial decisions Types of TV programs watch: Learning/Educational programs for adults/children Technology effect on relationships: Strengthens my bonds with people I care about Activities on Internet past 30 days: Get access to news or other up-to-the-minute information Creativity Self-reliance Freedom Social responsibility Helpfulness Equality Health and fitness Spirituality Perseverance Honesty Knowledge Social tolerance Self-reliance Personal support Traditional gender roles Friendship Pleasure Friendship Having fun Looking good Duty Tradition Duty Faith Protecting family Justice Pleasure Having fun Adventure Duty the Authenticity Friendship Sex Faith Simplicity the the 49 environ- to Strivers Three things most concerned about: Money enough to live right and pay the bills Political/societal activities in past year: Attended a political rally, speech or event Do not like to be in debt I want- 2.5 2.5.1 Discussion Theoretical Implications The primary aim of this study is to propose an approach for variable selections in settings of high dimensionality with grouping variables. Despite the increasing phenomenon of big data, excessive consumer information becomes challenging in extracting effective and efficient consumer insights in order to properly implement market segmentation, target customers, and position advantageously. Traditional approaches of variable selection in high dimensionality may yield unstable coefficient estimation and inflated standard errors (Drolet and Morrison 2001). The family of penalized likelihood methods is advocated, but the choice of tuning parameter is fairly sensitive, since it controls the trade-off between bias and variance in the resulting parameter estimates (Hastie et al., 2009; Fan and Lv, 2010). Certain specifications of the optimal tuning parameter are challenging to quantify in practice, because the resulting tuning parameters are valid only asymptotically, and rely heavily on unknown nuisance parameters in the true model (Fan and Tang, 2013). The challenges are even amplified for variable selection in a setting of high dimensionality and multicollinearity (strong grouping effects). The proposed generalized information criterion (GIC) enables penalized likelihood methods to be flexible in choosing desired models, and, more notably, it addresses both consistent and efficient estimation. The simulation study demonstrates that our proposed approach is able to offer the best penalized likelihood approaches and tuning parameter selection in high-dimensional settings. Elastic net is considered a cutting-edge procedure in market research studies dealing with excessive and overlapping information. With the assistance of GIC, our proposed procedure is able to identify the true model in the presence of multicollinearity. 50 From a theoretical perspective, this study enables international market segmentation with a procedure for selecting insightful factors and/or covariates from an enormous amount of information. The study should also be able to assist uninvestigated empirical exploratory studies, which ask for simultaneously testing the relevance and importance of a large set of potential covariates and/or factors. Second, the proposed technique is flexible with multicollinearity and construct non-equivalence. In other words, our approach is able to address heterogeneity both in consumer and latent structures. 
Although we practice a variable selection exercise by sub-sampling among the four countries, this approach is also applicable in testing interactions for contingent and contrasting effects. 2.5.2 Managerial Implications One of the most conspicuous applications of the proposed variable selection approach is to gain consumer insights into the presence of excessive and overlapping information, which inevitably occurs to a firm that intends to expand to a new business territory. Thanks to advanced technology, firms now are able to access extensive information and store a large repository of data. Although rich information enables managers to gain sufficient consumer insights, too many variables may camouflage the underlying structure and result in misleading interpretations for managerial implications. Consequently, inaccurate variable selection harms resource allocation and ultimately firms’ profitability in the long run. This concern is particularly severe when the contexts are dynamic and diversified, e.g., among different countries as in our empirical illustration, where the variables and their associated impacts might be elementally different across contexts. An example of contextual differentiation emerges from our findings. If a firm intends to promote a romantic product, say a high-end perfume, the communication strategy should be 51 quite different between the US and China markets. In the US market, the concept of romance is acknowledged more by intimates, who appreciate family and stable personal relationships. Young couples should be considered the target customers. Since they may be in a financially precarious position, a zero-interest monthly payment might be an attractive marketing tool. However, the concept of romance is more likely to be acknowledged by fun-seekers in China. Catchy marketing strategies such as tailoring scents, where consumers can select fragrance elements and combine them to customize their ”own scents” would be uniquely appealing to the China market. Another interesting finding is that the perception of sex is recognized differently between Canada and Brazil. Sex is identified as the strongest value to indicate intimates in Brazil but is considered one of important values of fun-seekers in Canada. The differential perceptions of sex or being sexy motivates different messages that a firm wants to convey. Canadian fun-seekers appreciate the value of sex in a way that emphasizes adventure, while Brazilian customers may blend the concept of sex with the context of intimate relationships. The famous Victoria’s Secret show is regarded as more compelling in the Canadian market than in the Brazilian market, because it is exciting and intriguing. It might be more appealing for customers in Brazil to deliver a mingled image of being sexy and intimate. Global lifestyle segments seem to be generalizable and comparable across countries, yet collating holistic lifestyle indicators may not be the most efficient and profitable way for marketing communications. Effective variable selection techniques help firms use the best characteristics to identify customers. Broadly, variable selection in international market entry almost always presents high dimensionality and multicollinearity. There is a compelling need for managers to employ tools that retain relevant and indicative information. This study provides an implementation-friendly model for selecting insightful global consumerlevel information. 
Chapter 3 A Bayesian Group Lasso Classification For ADNI Volumetrics Data

3.1 Introduction

Dedicated research is conducted with neuroimaging techniques for early diagnosis of Alzheimer's Disease (AD). The Alzheimer's Disease Neuroimaging Initiative (ADNI) conducts a multi-center case-control study of elderly people that was designed to find more sensitive and accurate methods to diagnose AD at earlier stages. AD is a disease of older age in which patients develop deformities in brain structure. It is identified by loss of memory, speech inconsistencies, or inability to perform daily tasks of survival. ADNI studies use brain-imaging techniques such as positron emission tomography (PET) and magnetic resonance imaging (MRI). The ADNI database has data from three phases (ADNI1, ADNI GO and ADNI 2). Historically, studies have shown that AD causes abnormal changes in brain region volumes, such as shrinkage of hippocampal volume, reduction in its thickness, or enlargement of the internal ventricles. Smith et al. (2012) studied structural brain alterations before mild cognitive impairment (MCI). They had previously demonstrated that volume loss in the bilateral anteromedial temporal lobe is present at baseline in longitudinally followed normal subjects who later developed MCI or AD. Arlt et al. (2013) believed that fully automated MRI-based volumetric measurements may serve as a biomarker for the diagnosis in patients with MCI or dementia. They concluded that fully automated MRI-based volumetry allows detection of regional grey matter volume loss that correlates with neuropsychological performance in patients with amnestic MCI or mild AD.

Our objective in this chapter is to predict dementia in patients based on the volume measurements obtained from the MRI ADNI data. There is evidence of brain atrophy with increasing age, but the pattern of atrophy differs significantly between normal aging and AD patients. We use the differences in brain region atrophy to distinguish subjects with and without AD. The volumetric data contain parcellated subregions of the entire brain for the left and right hemispheres. Volume, area, and thickness measurements of brain subregions are a simple way of detecting atrophied brain regions, hence the motivation for using these measurements in combination. It is believed that not all of these regions are associated with dementia, but only a few. Identifying a few brain regions from the large number of candidate regions is, appropriately, a dimension reduction problem. In this chapter, we develop a Bayesian group lasso type technique with a spike and slab prior, following Xu and Ghosh (2015), over other types of penalized regression because this approach presents many natural advantages. The biggest advantage is that the Bayesian approach provides reliable estimates of uncertainty, which can be used for statistical inference beyond feature selection. A thorough literature review has shown that the Bayesian group lasso with a logistic regression model is largely overlooked. This chapter develops this novel method, motivated by the ADNI data. The Bayesian group lasso with a spike and slab prior deals with feature selection (dimension reduction) in a binary outcome scenario and produces reliable estimates for the regression coefficients. Unlike commonly used Bayesian variable selection methods, we propose median thresholding so that insignificant coefficients are set exactly to zero.
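To make the median thresholding rule concrete, the short sketch below is our own illustration with synthetic posterior draws (not code or output from this chapter): under a spike and slab prior, a coefficient's MCMC draws are exactly zero whenever the spike component is visited, so the componentwise posterior median is exactly zero for any coefficient that sits in the spike in more than half of the draws, and this is what performs the selection.

# Minimal sketch of posterior median thresholding with spike and slab output.
# `draws` stands in for MCMC output: one column per coefficient, with a draw
# equal to exactly 0 whenever the spike component is selected in that iteration.
set.seed(1)
n_samp <- 1000
draws <- cbind(
  active   = rnorm(n_samp, mean = 1.5, sd = 0.3),                         # slab visited every time
  inactive = rnorm(n_samp, mean = 0.2, sd = 0.3) * rbinom(n_samp, 1, 0.3) # spike visited ~70% of the time
)
beta_med <- apply(draws, 2, median)  # posterior median estimator
selected <- beta_med != 0            # exact zeros perform the variable selection
beta_med
selected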
Another major contribution of this chapter is that we look at the brain image volumetric data at a minute level, considering all available brain subregions mapped by FreeSurfer so as to include the effects of all ROIs rather than looking at individual ROIs, and identifying atrophied brain subregions by selecting a group of volumetric measurements of the corresponding selected ROIs. Zhang et al. (2011) performed classification with MRI data based on 93 manually labeled ROIs. We use data on 116 ROIs automatically labeled by FreeSurfer (each having 4 different measurements) and analyze them together to perform a dimension reduction analysis. Zhang et al. (2011) used a composite of 3 different modalities of biomarkers. Unlike Zhang et al. (2011), our method provides reliable parameter estimates, which can be used to calculate the log-odds or relative risk of AD based on the selected subregions instead of just classifying subjects. Group lasso encourages selection of all levels of a significant subregion, and the spike and slab prior on the coefficients ensures that the large number of subregions which have no impact on the disease are dropped from the model. So, the proposed method selects affected brain subregions automatically from a large pool of brain subregions. Furthermore, among the selected subregions, only a few serve as discriminative features in the model, as assessed by their statistical significance. So, we are able to narrow down the regions that should be studied by scientists to stop the progress of the disease or improve the quality of life of the affected individuals. Finally, we provide a theoretical foundation for our proposed methodology.

This chapter is organized into 7 sections. In the next section we review the literature on group lasso and Bayesian group lasso, and in the following section we elaborate on the Bayesian group lasso in the logistic regression setup. Section 3.4 shows the posterior consistency of our estimator, i.e., the model selected by the proposed method converges to the true model for sufficiently large n. In Section 3.5 we conduct a simulation study to test the performance of the proposed method. Section 3.6 contains the analysis of the ADNI dataset, where we detail our findings, and our concluding remarks are in Section 3.7.

3.2 Group lasso

Oftentimes, we are interested in selecting a group of variables that are predictive in a model. Popular high-dimensional penalized regression techniques used for variable selection are not built to select a group of related variables. For example, we may have a group of dummy variables that represent a significant categorical variable. To select this variable, it is necessary that we select all the corresponding dummy variables. LASSO fails to correctly perform this selection since it can never select all the dummy variables of a categorical variable. Similar scenarios arise when we have a number of basis functions representing a function, or an ANOVA model with multi-level factor variables. Group lasso, proposed by Yuan and Lin (2006), addresses this issue. It puts a hybrid $\ell_1$/$\ell_2$ penalty on the parameter space, thus encouraging selection or dropping of variables in groups. Note that this is feasible since a multi-level variable's parameter is represented as an $m_g$-tuple vector, $m_g$ being the number of levels for group g. Our regression model is as follows:
$$Y_{n\times 1} = \sum_{g=1}^{G} X_g\beta_g + \epsilon \qquad (3.1)$$
where, for g = 1, 2, …, G, $Y$ is the $n\times 1$ response vector, $X_g$ is the $n\times m_g$ design matrix for group g, $\beta_g$ is the $m_g\times 1$ vector of coefficients, and $\epsilon \sim N_n(0, \sigma^2 I_n)$. If p is the total number of predictors, then $p = \sum_{g=1}^{G} m_g$. We center both the response and the input variables to eliminate the intercept from (3.1). Our primary goal is to select important predictive features. The group lasso estimate, for linear regression, minimizes
$$\left\lVert Y - \sum_{g=1}^{G} X_g\beta_g \right\rVert_2^2 + \lambda \sum_{g=1}^{G} \lVert\beta_g\rVert_2, \qquad (3.2)$$
where λ is the tuning parameter. (3.2) is an evident extension of the LASSO criterion
$$\sum_{i=1}^{n} (Y_i - X_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (3.3)$$
where λ is the tuning parameter. Note that (3.3) is a special case of (3.2).

3.2.1 Bayesian Group lasso

A major issue with lasso-type estimates is that it is difficult to give satisfactory standard errors, since the limit distribution of the lasso estimator is very complicated (Knight and Fu, 2000; Chatterjee and Lahiri, 2011), but the Bayesian version of the lasso overcomes this by producing reliable standard errors. Tibshirani (1996) showed that the lasso estimator for linear regression is equivalent to the posterior mode with independent Laplace priors on each regression coefficient. Park and Casella (2008) developed a fully hierarchical Bayesian model using a scale mixture prior of normal distributions for the lasso. This idea was further extended by Kyung et al. (2010) to develop a general Bayesian formulation for a number of lasso variations, including the group lasso, the elastic net (Zou and Hastie, 2005) and the fused lasso (Tibshirani et al., 2005). Raman et al. (2009) developed a fully Bayesian formulation of the group lasso to tackle the problem of poor variance estimates of regression coefficients. From a probabilistic perspective, the group lasso with Gaussian likelihood can be seen as a linear regression model with normal errors and a product of multivariate Laplace priors over the regression coefficients. Thus,
$$p(y \mid X, \beta, \sigma^2) \propto (\sigma^2)^{-n/2}\exp\left\{-\frac{\lVert y - X\beta\rVert^2}{2\sigma^2}\right\} \propto (\sigma^2)^{-p/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\hat\beta)^t X^t X(\beta-\hat\beta)\right\}\,(\sigma^2)^{-\nu/2}\exp\left\{-\frac{SSE}{2\sigma^2}\right\} \qquad (3.4)$$
where $\hat\beta$ is the least squares solution, $SSE = (y - X\hat\beta)^t(y - X\hat\beta)$ is the sum of squared errors, and $\nu = n - p$. The last expression results from "completing the square", which is standard in Bayesian formulations. Assuming a generalized multivariate $m_g$-dimensional Laplacian prior over each group of regression coefficients,
$$\pi(\beta_g) \propto \exp\left\{-\frac{\lambda}{\sigma}\lVert\beta_g\rVert_2\right\}, \qquad (3.5)$$
the classical group lasso is recovered as the MAP solution in log-space, with $\lambda/\sigma$ playing the role of a fixed Lagrangian multiplier. For a full Bayesian treatment, however, we place hyperpriors on λ and σ, which leads to integrations that are analytically impossible to solve. Following the hierarchical scale mixture formulation of the lasso, we extend it to group lasso regression models. The lasso employs a scale mixture of normal and Gamma densities; here we form a mixture prior with multivariate normals and a Gamma hyperprior. With
$$\beta_g \mid \tau_g^2, \sigma^2 \sim_{ind} N_{m_g}\!\left(0, \tau_g^2\sigma^2 I_{m_g}\right), \qquad \tau_g^2 \sim_{ind} \text{Gamma}\!\left(\frac{m_g+1}{2}, \frac{\lambda^2}{2}\right), \qquad (3.6)$$
the marginal distribution of $\beta_g$ is of the form (3.5). This Bayesian formulation encourages shrinkage at the group level and provides prediction performance comparable to the frequentist group lasso. However, this approach, based on estimating $\beta_g$ by its posterior mean or median, does not produce exact 0 estimates. Thus, to introduce sparsity at the group level, we assume a multivariate zero-inflated mixture prior, or spike and slab prior, for each $\beta_g$. Xu and Ghosh (2015) further showed the superiority of posterior median thresholding.
The spike and slab type zero-inflated prior keeps the scale mixture prior of normals and gammas, so we obtain closed-form full conditionals, making derivation of the posterior distributions easier. This approach gives exact zero estimates and is easier to compute. In recent years, many studies have been conducted exploring the application of zero-inflated mixture priors (see Yuan and Lin (2012), Lykou and Ntzoufras (2013), and Zhang et al. (2014)). Heavy-tailed distributions, such as the double exponential, are often used as the slab part. The slab part can be further expressed as a scale mixture of normal and gamma distributions, as is done by Xu and Ghosh (2015). The following hierarchical Bayesian formulation with a spike and slab prior for the linear regression (3.1), comparable to a group lasso type estimator, is proposed by Xu and Ghosh (2015):
$$Y \mid X, \beta, \sigma^2 \sim N_n\!\left(X\beta, \sigma^2 I_n\right),$$
$$\beta_g \mid \sigma^2, \tau_g^2 \sim_{ind} (1 - \pi_0)\, N_{m_g}\!\left(0, \tau_g^2\sigma^2 I_{m_g}\right) + \pi_0\,\delta_0(\beta_g), \quad g = 1, 2, \ldots, G,$$
$$\tau_g^2 \sim_{ind} \text{Gamma}\!\left(\frac{m_g+1}{2}, \frac{\lambda^2}{2}\right), \quad g = 1, 2, \ldots, G,$$
$$\sigma^2 \sim \text{Inverse Gamma}(\alpha, \gamma), \quad \sigma^2 > 0,$$
$$\pi_0 \sim \text{Beta}(a, b),$$
where $\delta_0(\beta_g)$ denotes a point mass at $0 \in \mathbb{R}^{m_g}$ and $\beta_g = (\beta_{g1}, \ldots, \beta_{gm_g})^T$. The value of λ should be carefully tuned: a large value of λ will overshrink the estimates, while a small value will lead to overfitting. Xu and Ghosh (2015) suggested that a conjugate Gamma prior can be placed on λ². Using an empirical Bayes approach, λ can instead be estimated from the data by marginal maximum likelihood. Since the marginal maximum likelihood of λ does not have a closed form, a Monte Carlo EM algorithm (Casella, 2001; Park and Casella, 2008) can be used to estimate λ. The kth EM update for λ in the above setup is
$$\left(\lambda^{(k)}\right)^2 = \frac{p + G}{\sum_{g=1}^{G} E_{\lambda^{(k-1)}}\left[\tau_g^2 \mid Y\right]},$$
where the posterior expectation of $\tau_g^2$ is replaced by the sample average of the $\tau_g^2$ values generated in the Gibbs sampler run with $\lambda^{(k-1)}$.

3.3 Bayesian Group Lasso with Logistic Regression

Meier et al. (2008) developed the logistic group lasso in a frequentist setup. Most practical situations have binary outcomes: in risk analysis we want to know whether a person will default on a loan, in disease detection we want to classify subjects with or without a disease, and so on. Thus, here we need to use a generalized linear model (GLM) to model our data. The use of GLMs is not limited to binary data; there are numerous cases where the outcome is not binary but a GLM is needed, such as modelling income or count data. Since binary outcomes are very common in real life, we focus this chapter on logistic regression with a logit link. The concept here is to maximize the likelihood function (objective function) subject to a group lasso constraint on the parameters. Refer to Section 1.1.6 for a detailed overview of the frequentist group lasso in logistic regression. Although the group lasso for linear regression has been explored, a thorough literature review reveals that no work has been done on the Bayesian group lasso for logistic regression. Motivated by Xu and Ghosh's (2015) work, here we construct a Bayesian formulation for the logistic regression case. Our likelihood is the Bernoulli probability mass function with a logit link. We again use a multivariate zero-inflated mixture prior with a point mass at zero and a double exponential distribution as the continuous part. Since a double exponential prior on $\beta_g$ can be expressed as a scale mixture of normal and Gamma priors (as in (3.6)), we have
$$y_i \mid x_i, \beta \sim \text{Bernoulli}\!\left(\frac{e^{x_i^T\beta}}{1 + e^{x_i^T\beta}}\right), \quad i = 1, \ldots, n,$$
3.3 Bayesian Group Lasso with Logistic Regression

Meier et al. (2008) developed the logistic group lasso in a frequentist setup. Many practical situations have binary outcomes: in risk analysis we want to know whether a person will default on a loan, and in disease detection we want to classify subjects with or without a disease. Thus, we need a generalized linear model (GLM) to model such data. The use of GLMs is not limited to binary data; there are numerous cases where the outcome is not binary but a GLM is still needed, such as modelling income or count data. Since binary outcomes are very common in real life, we focus this chapter on logistic regression with a logit link. The idea is to maximize the likelihood (objective function) subject to a group lasso constraint on the parameters. Refer to Section 1.1.6 for a detailed overview of the frequentist group lasso in logistic regression.

Although the Bayesian group lasso for linear regression has been explored, a thorough literature review reveals that no work has been done on a Bayesian group lasso for logistic regression. Motivated by Xu and Ghosh's (2015) work, here we construct a Bayesian formulation for the logistic regression case. Our likelihood is the Bernoulli probability mass function with a logit link. We again use a multivariate zero-inflated mixture prior with a point mass at zero and a double exponential distribution as the continuous part. Since a double exponential prior on $\beta_g$ can be expressed as a scale mixture of normal and Gamma priors (as in (3.6)), we have
\[
\begin{aligned}
y_i\,|\,x_i,\beta &\sim \mathrm{Bernoulli}\Big(\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\Big), && i=1,\dots,n,\\
\beta_g\,|\,\tau_g^2 &\overset{ind}{\sim} (1-\pi_0)\,N_{m_g}\big(0,\tau_g^2 I_{m_g}\big)+\pi_0\,\delta_0(\beta_g), && g=1,2,\dots,G,\\
\tau_g^2 &\overset{ind}{\sim} \mathrm{Gamma}\Big(\frac{m_g+1}{2},\frac{\lambda^2}{2}\Big), && g=1,2,\dots,G,\\
\pi_0 &\sim \mathrm{Beta}(a,b).
\end{aligned}
\]
The full posterior conditional distribution is
\[
\begin{aligned}
p(\beta,\tau^2,\pi_0\,|\,Y,X)\;\propto\;&\prod_{i=1}^{n}\Big(\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\Big)^{y_i}\Big(\frac{1}{1+e^{x_i^T\beta}}\Big)^{1-y_i}\\
&\times\prod_{g=1}^{G}\Big[(1-\pi_0)\,(2\pi\tau_g^2)^{-m_g/2}\,e^{-\beta_g^T\beta_g/(2\tau_g^2)}\,I_{[\beta_g\neq 0]}+\pi_0\,\delta_0(\beta_g)\Big]\\
&\times\prod_{g=1}^{G}(\lambda^2)^{\frac{m_g+1}{2}}\,(\tau_g^2)^{\frac{m_g+1}{2}-1}\,e^{-\lambda^2\tau_g^2/2}\\
&\times\pi_0^{a-1}(1-\pi_0)^{b-1}. \qquad (3.7)
\end{aligned}
\]
An efficient block Gibbs sampler can be used to simulate from the posterior distribution above; details of the block Gibbs sampler are given in the appendix.

3.4 Posterior Consistency

Xu and Ghosh (2015) showed that the posterior median is an adaptive thresholding estimator in the linear regression setup; Theorem 1 of their paper gives a proof of this idea. We will extend this idea to the logistic regression model and study it numerically. To prepare the ground for posterior consistency, we rewrite our model using different notation so that it is aligned with the model setup of Jiang (2007). Our proof of consistency is in line with Jiang's (2007) paper, so similar notation eases understanding of the proof.

Let $D_n=\{y;X_1,\dots,X_{P_n}: y\in\{0,1\},\,X_i\in\mathbb{R}^n,\,i=1,2,\dots,P_n\}$ denote a dataset of $n$ observations, each consisting of $P_n$ predictors, where $P_n$ can increase with increasing $n$. We model these data using logistic regression. Let $\xi_n$ denote a chosen (subset) model, and $|\xi_n|$ denote the model size of $\xi_n$. Note that here $|\xi_n|$ is the total number of dummy variables (factor levels) of the groups that are chosen in the subset model. Let $G^*$ be the number of selected groups; then $G^*\leq|\xi_n|$. A major difference between this setup and Jiang's is the multivariate $\beta_g$, $g=1,\dots,G$. An interesting point is that, if we express our setup in terms of the dummy variables, the layout is similar to what Jiang proposed. Thus, when the chosen model is $\xi_n$, we are really considering the chosen group size to be $G^*$ and the model size to be $\sum_{g=1}^{G^*}m_g=|\xi_n|$. To keep the proof in line with Jiang (2007), we express the chosen model in terms of the dummy variables rather than the groups. Clearly, this is an extension of Jiang's model, since we consider a grouped structure for the $\beta$'s.

Conditional on $\xi_n$, the regression coefficients satisfy
\[
\beta_{\xi_n}\,|\,\tau_{\xi_n}\;\sim\;N(0,V_{\xi_n}),
\]
where $V_{\xi_n}$ is a $|\xi_n|\times|\xi_n|$ covariance matrix and a function of $\tau_{\xi_n}$. Here $\beta_{\xi_n}=(\beta_1^{*T},\dots,\beta_{G^*}^{*T})$ and $\tau_{\xi_n}=(\tau_1^*,\dots,\tau_{G^*}^*)$ denote the vectors of true regression coefficients and true variance parameters, respectively, with $\sum_{g=1}^{G^*}m_g=|\xi_n|$. Let $\{X_1^*,\dots,X_{|\xi_n|}^*\}\subset\{X_1,\dots,X_{P_n}\}$ denote the predictors chosen in model $\xi_n$. Note that
\[
V_{\xi_n}=\begin{pmatrix}
\tau_1^{*2}I_{m_1} & 0 & \cdots & 0\\
0 & \tau_2^{*2}I_{m_2} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \tau_{G^*}^{*2}I_{m_{G^*}}
\end{pmatrix},
\qquad
\tau_g^2\;\overset{ind}{\sim}\;\mathrm{Gamma}\Big(\frac{m_g+1}{2},\frac{\lambda^2}{2}\Big),\quad g=1,2,\dots,G.
\]
Let model $\xi_n$ have the prior $\Pi(\xi_n\,|\,\pi_0)=\pi_0^{P_n-|\xi_n|}(1-\pi_0)^{|\xi_n|}$ with $\pi_0\sim\mathrm{Beta}(a,b)$, where $a$ and $b$ are pre-specified hyperparameters. Thus,
\[
\Pi(\xi_n)=\frac{\mathrm{Beta}(a+P_n-|\xi_n|,\,|\xi_n|+b)}{\mathrm{Beta}(a,b)},
\]
and
\[
\Pi(\beta_{\xi_n})=\int\Pi(\beta_{\xi_n},\tau^2_{\xi_n})\,d\tau^2_{\xi_n}
=\prod_{g=1}^{G^*}\int_0^\infty\Pi(\beta_{\xi_n}\,|\,\tau^2_{\xi_n})\,\Pi(\tau^2_{\xi_n})\,d\tau_g^2
=\prod_{g=1}^{G^*}\int_0^\infty\frac{e^{-\beta_g^T\beta_g/(2\tau_g^2)}}{(2\pi\tau_g^2)^{m_g/2}}\,(\lambda^2)^{\frac{m_g+1}{2}}(\tau_g^2)^{\frac{m_g+1}{2}-1}e^{-\lambda^2\tau_g^2/2}\,d\tau_g^2.
\]
Substituting $\alpha_g^2=1/\tau_g^2$, we have
\[
\Pi(\beta_{\xi_n})=\prod_{g=1}^{G^*}\frac{(\lambda^2)^{m_g/2}}{(2\pi)^{(m_g-1)/2}}\,e^{-\lambda\sqrt{\beta_g^T\beta_g}}\int_0^\infty\sqrt{\frac{\lambda^2}{2\pi(\alpha_g^2)^3}}\,\exp\Big\{-\frac{\beta_g^T\beta_g}{2\alpha_g^2}\Big(\alpha_g^2-\frac{\lambda}{\sqrt{\beta_g^T\beta_g}}\Big)^{2}\Big\}\,d\alpha_g^2.
\]
The term in the integral is an Inverse Gaussian density, i.e.
αg2 ∼ Inverse Gaussian λ , λ2 T βg β g , thus the integrals integrate to 1 for all g = 1, . . . , G∗ . Therefore, G∗ Π(βξn ) = mg e−λ 2 (λ ) 2 βgT βg mg −1 (2π) 2 g=1 (3.8) Let ξn be the model obtained from the median thresholding posterior probability and ξn∗ be the true model. We want to show that the model ξn converges to the true model ξn∗ as the sample size n becomes sufficiently large. Define f ∗ as the true density under model ξn∗ and f as the density proposed under model ξn . Hellinger distance between f and f ∗ is defined as d(f, f ∗ ) = f− f ∗ νy (dy )νx (dx ). To investigate posterior convergence, we formulate the following theorem based on Theorem 4 in Jiang’s (2007) paper. We consider logistic regression in this paper with a density of the form p∗ (y|x) = exp {a(h∗ )y + b(h∗ ) + c(y)} ≡ f (y, h∗ ) where, h∗ = xT β ∗ is the linear parameter, a(h) and b(h) are continuously differentiable, and a(h) has non-zero derivative. The mean function µ∗ = E(y|x) = − b (h∗ ) a (h∗ ) ∗ h ≡ ψ(xT β ∗ ) = e h∗ . Thus ψ is the inverse of the 1+e logistic link function. Assume that limn→∞ G 1 βg∗T βg∗ < ∞. Let rn be the prior expecta- tion of model size and for simplicity ξ be the corresponding subset model for which |β| > 0, |ξ| 65 is the corresponding model size. Define, ∆(rn ) = inf ξ:|ξ|=rn ∗ j:j ∈ξ / |βj | < ∞, B(rn ) = ¯ n ) = sup ˜ sup ξ:|ξ|=rn ch1 (Vξ−1 )and B(r ξ:|ξ|=rn ch1 (Vξ ). Let, Bn = sup ξ:|ξ|≤Kn ch1 (Vξ ) where Kn is the maximal model size. Let, D(R) = 1 + R × sup |h|≤R |a (h)|. sup |h|≤R |ψ(h)| for any R > 0. Here, ch1 (Vξ ) and ch1 (Vξ−1 ) are the largest eigenvalues of Vξ and Vξ−1 respectively. Let n ∈ (0, 1] for each n and n n 1 and assume the following conditions hold: Assumption1 : A1. Kn log 1 2 n ≺ n 2n A2. Kn log (Pn ) ≺ n 2n A3. Kn log B˜ n 2 D Kn nλ n n ≺ n 2n A4. rn ≺ Pn A5. rn log B¯n (rn ) ≺ n 2n and ∆(rn ) ≺ n 2n A6. log rn Pn 4n 2 ≤− Pn n G∗ g=1 mg A7. mg is such that ≺ Pn , ∀g = 1 . . . G We will replace λ by λn since λ and τg2 , g = 1, . . . , G∗ are dependent on n. Also, λn is inversely proportional to the sum of all τg2 ’s. Theorem 1. Assume the prior setting on 3.8 is used and the Assumption 1 hold. Let P {.} denote the probability measure for the data Dn . Assume, G ≺ Pn , 1 ≤ λn ≤ B(rn ), |xj | ≤ 1 for all j and limn→∞ G 1 βg∗T βg∗ < ∞ where Pn is a nondecreasing sequence in n. Also, let Vξ be such that B˜n ≥ 4. Let n be a sequence such that n ∈ (0, 1] for each n and n 2n 1, . . . , G∗ . Then, we have, (i) For some co > 0, 66 1 and τg2 < ∞, g = 2 lim P π d(f, f ∗ ) ≤ n |Dn ≥ 1 − e−c0 n n = 1 n→∞ (ii) For some C1 > 0 and for all sufficiently large n, 2 2 P π d(f, f ∗ ) > n |Dn ≥ e−0.5c1 n n ≤ e−0.5c1 n n Proof. The proof follows by checking conditions N and O of Theorem 4 from Jiang’s paper (2007) since our prior falls under the category of general prior in their paper. If we can show that our setup satisfies these two conditions under the given assumptions then by Theorem 4 in Jiang (2007) we have the posterior consistency of our spike and slab prior in this logistic regression model. Statement of Condition (O): Let D(R) = 1 + R × sup |h|≤R |a (h)|. sup |h|≤R |ψ(h)| for any R > 0. There exist some Cn > 0 and some Kn satisfying 1 ≤ Kn ≤ Pn , such that Kn ln 1 2 n ≺ n 2n , Kn ln Pn ≺ n 2n , Kn ln D(Kn Cn ) ≺ n 2n . Furthermore, for all large enough n, the following two equations hold: 2 π(|ξ| > Kn ) ≤ e−4n n , 67 and for all ξ such that |ξ| ≤ Kn , for all j ∈ ξ, 2 π(|βgj | > Cn |ξ) ≤ e−4n n . Checking Condition (O) : We have, 1 ≤ Kn < Pn . 
rn Pn Pn π(|ξ| > Kn ) = π(|ξ| = Pn ) = log π(|ξ| = Pn ) = Kn log rn Pn ≤ −4n 2n (3.9) 2 (3.9) holds from assumption (A6). Thus, π(|ξ| > Kn ) ≤ e−4n n is checked. Now, we 2 need to show that π(|βgj | > Cn |ξ) ≤ e−4n n holds. π(|βgj | > x|ξ) ∝ e x ∞ ≤ −λn βgj e x =− e B˜n n 2 n λn and n 2n dβgj −λn βgj ∞ λn e−λn x = λn Choose, x = Cn = mg 2 j=1 βgj ∞ −λn 1. Then, 68 x dβgj ˜ 2 e−λn x e−Bn n n = λn λn 2 ˜ = e−Bn n n , since λn ≥ 1, 2 ≤ e−4n n This implies, that Condition (O) is checked. Note that, all the other conditions are also satisfied with the choice of Cn and assumptions of the theorem. 2 π(|βgj | > Cn |ξ) ≤ e−4n n Statement of Condition (N): Assume that a sequence of (nonempty) models ξn exists such that, as n increases, |βj∗ | ≺ 2n , j:j ∈ξ / n and for any sufficiently small η > 0, there exists Nη such that, for all n > Nη , we have π(ξ = ξn ) ≥ and 69 n 2 − 8n e , 2 π β∈ βj∗ n n η 2 ± n ξn ≥ e− 8 |ξn | j∈ξn Checking Condition (N) : Take the sequence of models ξn , such that, for each n, ξ = ξn reached its infimum in ∆(rn ) = inf ξ:|ξ|=rn ∗ j:j ∈ξ / |βj |. Then, η 2 π β∈ ξn : j∈ξn mg 2 (λn ) 2 − βT β e mg −1 g=1 (2π) 2 G∗ 2 n η ± rn |βj∗ | = ∆(rn ) ≺ n 2n . βj∗ ± rnn For the condition on prior, π β ∈ βj∗ j ∈ξ / n ξn ≥ j∈ξn η 2n rn for some intermediate value of β achieving the infimum of the density over η 2 βj∗ ± rnn . j∈ξn Note that, G∗ G∗ βT β λn g=1 ≤ ||βg ||B(rn ) g=1   βjT βj  B(rn ) = j∈ξn ≤ C1 B(rn ) for some constant C1 > 0, since we have assumed that λn ≤ C1 B(rn ) for all large enough n and j∈ξn βjT βj ≤ limn→∞ G∗ g=1 r η 2 βgT βg + nrn n is bounded. Also note that, 70 G∗ mg |ξn | (λ2n ) 2 (λ2n ) 2 = mg −1 g=1 (2π) 2 |ξn ∗ (2π) 2 − G2 ¯ ≥ e−C2 rn +C3 rn ln B(rn ) ¯ n ) for all large for some constant C2 > 0 and C3 > 0. This is due to the fact λ1 ≤ B(r n enough n. Therefore, π β∈ βj∗ ± η 2n ¯ n ) − C1 B(rn ) − rn ln ξn ≥ exp − C2 rn − C3 rn ln B(r rn j∈ξn rn η 2n 2 This will be greater in order than any e−cn n (c > 0), satisfying a requirement of Condition ¯ n ) and B(rn ) are all smaller than n 2 in order and so are (N), since rn , rn ln B(r n rn ln rn ≤ Kn ln Pn rn ln 1 2 n ≤ Kn ln 1 2 n Now, consider the condition on π(ξn ): π(ξn ) = ≈ Beta (a + Pn − |ξn |, |ξn | + b) Beta (a, b) |ξn | |ξn | Pn (3.10) (3.10) is derived by choosing a = |ξn | and b = 1 and using approximations for factorial 71 terms. The approximation also ignores O(1) terms. Notice that ξn is chosen such that |ξn | = rn so, log π(ξ = ξn ) ∼ rn log rn Pn ≥ −rn log Pn and rn log Pn So, π(ξ = ξn ) ≥ 3.5 n 2n . n 2 − 8n e . Thus, Condition (N) is checked. Simulation Before applying the group level Bayesian selection method on the brain image data we run a simulation study. The simulation study has the unknown parameters in control and tests the method on controlled inputs. We work on two different scenario where the first case is high-dimensional while the second case in large n small p scenario. • Example 1: The number of observations is 60 and there are 16 predictors each with 5 levels. Thus the number of parameters here is really 80. So, we are essentially looking at a small n large p problem here. The setup here is adapted from Example 2 of Xu and Ghosh’s (2015) paper. We define the j th predictor as Xgj = zg + zgj , where zg and zgj are independent standard normal variables and g = 1, . . . , 16, j = 1, . . . , 5. Thus, the predictors in a group are correlated but the predictors in different groups are independent. 
Assign true parameter 72 values as follows: β = ((7, 6, 3, 4, 5), 0, (4, 5, 6, 10, 7), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, (2, 3, 4, 5, 6)) where 0 is a 0 vector of length 5. Use the simulated X and β to generate 60 independent Bernoulli random values using the logit link. Here, 40 observations are used to train the model and the rest are used as a test dataset. • Example 2: The number of observations is 100 and there are 4 predictors each with 10 levels which makes p = 40. The design matrix is generated exactly as in Example 1. Let β = (0, 2, 0, 2) where 0 and 2 are vectors of length 10 with all elements 0 and 2 respectively. Use the simulated X and β to generate 100 independent Bernoulli random values using the logit link. 60 randomly selected rows were used as train dataset and the remaining as test data. Hyperparameters, for both cases, a and b were both set to 1.5. 20,000 Monte-Carlo iterations were implemented. 28 bootstrapped samples were used to average out bias in estimates. Table 3.1 summarizes the true and false positive rates and the negative log-likelihood of the two examples mentioned above. Both the methods are able to identify the true variables although the frequentist group lasso has a high false positive rate. This indicates that the group lasso tends to select more variables for an optimal tuning parameter. On the other hand, the model selected by median thresholding gives excellent result in terms of variable selection. It not only identifies the true positives correctly in all cases, it also estimates the 73 Table 3.1: Mean (Standard Error) True/False Positive Rate and Negative Log-Likelihood in 28 Simulations Example 1 Example2 TPR FPR Neg log-likelihood TPR FPR Neg log-likelihood Bayesian Spike and Frequentist Slab Group Lasso Group Lasso 1.00 (0) 1.00 (0) 0.00 (0) 0.78 (0.50) 2.65 (0.53) 6.52 (0.57) 1.00 (0) 1.00 (0) 0.00 (0) 0.32 (0.24) -2.35 (2.31) 5.45 (1.59) unimportant values to be 0. We see that the method proposed in this paper gives a smaller negative log-likelihood indicating a better model fit. Thus, we see that in a simulated dataset, the median thresholding method is able to classify variables very well as compared to the conventional group lasso method when we have variables that have a structured correlation. 3.6 Classification of Alzheimer’s Disease using ADNI MRI data The MRI data used in this section of the paper was obtained from ADNI database. The main objective of ADNI has been to test whether serial MRI, PET, other biological biomarkers, and clinical and neuropsychological assessment can be used to detect dementia or measure its progression. Both normal aging and AD patients have brain region atrophies but it is essential to identify the abnormalities that lead to dementia. Some studies are done to study the differences of brain atrophy in these two categories of subject (Double et.al ., 1996). Such studies have shown that there is a significant difference in the atrophies of normal aging and AD patients so we use this idea to classify the subjects. In this paper we delve into classification of Alzheimer’s disease (AD) patients from normally aging control 74 (CN) subjects at the baseline and estimation of parameters of selected volumetrics. The parameter estimates give us the log of odds of being AD at baseline for a subject with a given set of volumetric measurements. Thus, baseline volumetric values for AD and normal controls from ADNI dataset serve our purpose. 
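Referring back to Example 1 of the simulation study in Section 3.5, the following R sketch shows one way that data-generating design could be reproduced. The design values are taken from the text; the random seed and the particular training split are illustrative.

```r
# Example 1 design from Section 3.5: 16 groups of 5 correlated predictors,
# X_gj = z_g + z_gj with z_g, z_gj iid N(0,1); binary responses via a logit link.
set.seed(123)
n <- 60; G <- 16; m <- 5                      # observations, groups, levels per group

X <- matrix(0, n, G * m)
for (g in seq_len(G)) {
  z_g <- rnorm(n)                             # shared group component
  for (j in seq_len(m)) {
    X[, (g - 1) * m + j] <- z_g + rnorm(n)    # within-group predictor
  }
}

# True coefficients: groups 1, 3 and 16 are active, all other groups are zero.
beta <- rep(0, G * m)
beta[1:5]   <- c(7, 6, 3, 4, 5)
beta[11:15] <- c(4, 5, 6, 10, 7)
beta[76:80] <- c(2, 3, 4, 5, 6)

eta <- X %*% beta
y   <- rbinom(n, size = 1, prob = 1 / (1 + exp(-eta)))

train <- sample(n, 40)                        # 40 training rows, 20 test rows
```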
ADNI data have been collected from 2003 onwards by NIA, NIBIB, FDA and a few pharmaceutical companies as a public-private partnership. The ADNI project is a large project involving subjects across the USA and Canada from more than 50 sites. This initiative was launched to develop new treatments and to follow subjects through time to monitor the effectiveness of those treatments. For more information about ADNI, visit www.adni-info.org.

The volumetric segmentation and cortical reconstruction of the brain were done with the freely available software FreeSurfer (http://surfer.nmr.mgh.harvard.edu/). An early version of the longitudinal image processing framework (Reuter et al., 2012) was used to process the sequential scans. This process performs motion correction and averaging of multiple volumetric T1-weighted images, removes non-brain segments, automates the Talairach transformation, and segments subcortical white matter (WM) and deep gray matter (GM) volumetric structures. It also automates topology correction and surface deformation of the brain. MRI data from 1737 subjects with baseline diagnoses of Normal, MCI and AD were collected. For all subjects at each visit, structural MRI scans were acquired on 1.5T scanners for ADNI1 subjects and on 3T scanners for ADNIGO and ADNI2 subjects. MRI protocols were harmonized across a variety of scanners such as GE, Siemens and Philips to ensure comparability. MRI volumes were computed using FreeSurfer by the UCSF/SF VA Medical Center. ADNI1's 1.5T data were processed with FreeSurfer version 4.3, and ADNIGO and ADNI2's 3T data with FreeSurfer version 5.1. For a detailed guide please refer to the UCSF FreeSurfer Methods documented by Hartig et al. (2014).

Many studies have been done to identify the ROIs associated with AD, but using the entire brain segmentation to identify four different volumetric aspects of each region has not been explored. This technique includes all available subregion data in the model and identifies the subregions that are potentially associated with AD. Previous studies isolated one or a few brain regions and used their volume measurements as predictors of AD or MCI versus CN (Jack et al., 1999; Jack et al., 1997). We want the model to select the atrophied regions automatically rather than subsetting a brain region before the analysis begins. Classification of AD versus CN has been done with FDG-PET scans (Herholz et al., 2002) using comparative statistical methods such as t-statistics, but researchers have yet to explore variable selection techniques using the entire volumetric data. Wang et al. (2014) used Haar wavelets on voxel-level data to identify ROIs via dimension reduction; this method identifies ROIs successfully but does not narrow down the brain hemisphere of the ROIs. Since our data are available for each region in both the left and right hemispheres, we are able to identify the exact part of an ROI that is more significantly associated with AD.

For the analysis, we use AD and CN patients to clearly understand the differences in the brain regions that determine whether a subject remains cognitively normal or progresses to AD. We have used the longitudinally processed data for our analysis. With advances in computing technology, quantitative assessments of brain volumes obtained through volumetric MRI are being used extensively in studies of Alzheimer's disease. Volumetric measurements are mainly based on brain segmentation done at reliable MR centers.
The demographic characteristics of the 421 subjects are given in Table 3.2. The age and sex distributions in our dataset show that the data are not skewed with respect to these two variables. The minimum and maximum ages are 55.1 and 90.9 for AD and 59.9 and 89.6 for CN, respectively. Thus, the effect of age on the outcome has been controlled for in the ADNI dataset.

Table 3.2: Demographics of patients in ADNI data used for analyses

Category                 AD (n = 191)     CN (n = 230)
Sex (male), Count (%)    100 (52.36%)     120 (52.17%)
Age, Mean (SD)           75.27 (7.46)     75.86 (5.01)

We have used baseline data on 421 subjects, of whom 191 have Alzheimer's disease (AD) and 230 are cognitively normal. There are 72 predictors (brain regions segmented with FreeSurfer) with 4 levels each, namely volume, surface area, cortical thickness average and cortical thickness standard deviation; 46 further brain regions had volume data only. The regions marked 'Unknown' and 'Undetermined' were discarded beforehand because these regions were not identified in the MR scans. We used all of the remaining 116 brain regions (single- and four-leveled) as predictors of dementia. The analysis identifies the (few) significant regions from the entire set of brain regions; the model identifies a unique set of brain regions significant for classifying the binary outcome. Our objective is to select an optimal model that identifies the brain regions important for distinguishing the two kinds of cognitive functionality. Early diagnosis of dementia is very important, as it can support prompt treatment of subjects, thus delaying the progression of AD, and can help monitor treatment efficiency and reduce the time and cost of clinical trials.

The brain regions are segmentations of both gray matter (GM) and white matter (WM). Studies suggest that gray matter is associated with cognitive disorders in elderly people. We keep both GM and WM to test the efficacy of our model, i.e. whether the model is efficient in selecting the correct brain regions. Variable selection picks the significant brain regions from a large collection of brain regions, and the model then successfully classifies the subjects in the test dataset.

We use a logistic model for the two outcomes of the response variable. Around 70% of the data is used to train the model. The prior placed on the coefficients is a spike and slab type prior that encourages zero estimates for predictors which are not significant. The model is selected using the median thresholding method. We run 10,000 iterations of the MCMC chain, of which the first 5,000 are used as burn-in, and the usual convergence diagnostics are performed. 295 subjects were randomly selected from the 421 subjects to train the model.

The median thresholding model selects 29 of the 116 brain subregions. These correspond to the following ROIs: right bankssts, right pallidum, pars opercularis, left pars orbitalis, right precuneus, putamen, right anterior cingulate cortex, superior frontal, entorhinal cortex, supramarginal gyrus, right transverse temporal, left hippocampus, left inferior lateral ventricle, middle temporal gyrus, inferior temporal gyrus, left precentral gyrus, right fusiform gyrus, left parahippocampal, paracentral, third ventricle (Feng et al., 2004) and right inferior parietal. Not all of these estimates are, however, statistically significant: some regions contribute a negligible amount to the model, so the corresponding credible intervals do not support retaining those subregion measurements.
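A minimal R sketch of this kind of posterior screening is given below. It assumes `draws` is an (iterations x coefficients) matrix of post burn-in MCMC samples; the object name and the 95% level are illustrative, and this is not the dissertation's own code.

```r
# Posterior median thresholding plus credible-interval screening.
# 'draws' is assumed to be an (iterations x p) matrix of MCMC samples of beta,
# already stripped of burn-in; spike-and-slab draws contain exact zeros.
screen_coefficients <- function(draws, level = 0.95) {
  alpha <- (1 - level) / 2
  post_median <- apply(draws, 2, median)
  ci_lower    <- apply(draws, 2, quantile, probs = alpha)
  ci_upper    <- apply(draws, 2, quantile, probs = 1 - alpha)

  selected    <- post_median != 0                          # median thresholding
  significant <- selected & (ci_lower > 0 | ci_upper < 0)  # credible interval excludes zero

  data.frame(estimate = post_median, lower = ci_lower,
             upper = ci_upper, selected = selected,
             significant = significant)
}
```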
We only keep the subregions in our model which are statistically significant as given by the corresponding credible intervals. 78 Figure 3.1: Some regions of interest selected by the Bayesian classification model 79 Table 3.3: Mean (Standard Error) of Parameter Estimates of Selected ROIs ROI Volume Surface Area Cortical Cortical Thickness Thickness Average Standard Deviation Right Entorhinal Cortex 2.97 (0.03) 1.83 (0.02) 2.14 (0.01) 1.69 (0.03) Right Pallidum -0.27 (0.12) ∗ ∗ ∗ Right Pars Orbitals -0.33 (0.16) Right Precuneus 0.39 (0.05) -0.42 (0.07) 1.0 (0.09) Right Putamen -0.21 (0.07) ∗ ∗ ∗ Left Anterior Cingulate -0.46 (0.12) -0.12 (0.04) 0.15 (0.05) 0.09 (0.04) Right Trasnverse Temporal -0.54 (0.22) Left Entorhinal Cortex 0.22 (0.07) 0.21 (0.07) -0.41 (0.13) Left Precentral Gyrus 0.25 (0.08) 0.38 (0.12) 0.45 (0.15) 0.25 (0.09) Left Parahippocampal Gyrus 1.97 (0.11) 1.90 (0.11) 2.0 (0.11) 2.07 (0.11) Right Bankssts 0.06 (0.02) Left Middle Temporal Gyrus 1.93 (0.04) 1.57 (0.04) 1.91 (0.02) 1.63 (0.03) Left Hippocampus 1.50 (0.02) ∗ ∗ ∗ ∗ means these region measurements were not captured in data. − means that these regions were not statistically significant although the region was selected by median thresholding. 80 The statistically significant ROI’s are given in Table 3.3. Previous studies have established the association of these regions in AD. Volumes of right entorhinal cortex are severely diminished in AD patients. The other regions selected are also coherent with relevant literature (Juottonen et.al ., 1998 and Galton et.al ., 2001). The precentral gyrus controls motor skills, middle temporal gyrus regulates semantic memory processing; hippocampus, parahippocampal and entorhinal regulate memory and navigation. Putamen, pallidum, transverse temporal and bankssts are all known to be affected by AD (Clerx et.al ., 2013). Recent studies have separately analyzed all these regions and found atrophies in those areas. The proposed method identifies a subset of all atrophied regions and then selects fewer regions as discriminative features for classification. The functions of the selected regions also intuitively implicate the precision of the model. Zhang et.al . (2011) classified AD and CN using support vector machine (SVM) thus leading to non-interpretability of the associated coefficients. The method achieved fairly high accuracy of 80%. Cuingnate et.al . (2011) classified AD and CN based on ROI’s but they restricted their analysis to a few selected ROI’s namely the entorhinal thickness, supramarginal cortex thickness and hippocampal volume. Their sensitivity ranged from 69% to 70% whereas our method gives a sensitivity of 76%. The specificity in their study (90%) is, however, higher than ours (83%). These two studies are, however, not directly comparable except that they are both classification studies because the datasets used in these studies are different. The drawback of their method is that they pre-select a few ROI’s and perform classification, unlike our method. On the other hand, the proposed method is based on statistical foundations that account and measure uncertainties due to randomness in the data set. Our logistic model coded CN as 1 and AD as 0 so the parameter estimates should be 81 interpreted accordingly. Table 3.3 gives the mean parameter estimate and standard error (within parentheses) of the selected ROI’s. To diagnose the accuracy of our test we build an ROC curve. An ROC (Receiver operating characteristic) curve is a plot of the sensitivity vs. 1-specificity. 
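To make the ROC construction concrete, here is a small R sketch that computes sensitivity, specificity and the AUC by simple trapezoidal integration from predicted probabilities. The objects `prob` and `truth` are placeholders for the test-set predicted probabilities and 0/1 labels; they are not defined in the dissertation.

```r
# ROC curve and AUC from predicted probabilities (truth coded 1 = CN, 0 = AD,
# matching the coding described in the text; the coding choice only flips labels).
roc_points <- function(prob, truth) {
  thresholds <- sort(unique(c(0, prob, 1)), decreasing = TRUE)
  sens <- sapply(thresholds, function(t) mean(prob[truth == 1] >= t))
  spec <- sapply(thresholds, function(t) mean(prob[truth == 0] <  t))
  data.frame(threshold = thresholds, sensitivity = sens, specificity = spec)
}

auc_trapezoid <- function(prob, truth) {
  r   <- roc_points(prob, truth)
  fpr <- 1 - r$specificity
  tpr <- r$sensitivity
  ord <- order(fpr, tpr)
  sum(diff(fpr[ord]) * (head(tpr[ord], -1) + tail(tpr[ord], -1)) / 2)
}

# plot(1 - r$specificity, r$sensitivity, type = "l")   # ROC curve as in Figure 3.2
```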
Figure 3.2 shows we have an area under the curve(AUC) of 0.867. This means that if we randomly draw samples for classification then our classifier will accurately classify 86.7% of the time. This AUC tells us that our classifier is, indeed, a good one. Figure 3.2: ROC curve for our classifier 82 3.7 Discussion In this chapter we propose a Bayesian approach of variable selection with spike and slab prior to identify cognitively healthy controls from Alzheimer’s patients. This method uses whole brain parcellation data to classify dementia as well as interpret the association of each significant volumetric measure of a brain subregion. This technique captures the structured correlation in this type of data to retain all levels of the subregion that are disease related. The Bayesian approach guarantees better standard error estimates. Also, the median thresholding method for posterior model selection together with the use of spike and slab prior is a more efficient method than the frequentist group lasso method as shown in the simulation study. Liang et.al . (2013) developed a Bayesian subset selection method for generalized linear models which can select individual variables only. In their method, they place a prior on the model to perform subset selection unlike our approach of using a spike and slab prior which directly drops out variables in the many Monte Carlo iterations. Our median thresholding, as opposed to their MAP posterior probability, is able to choose the best model without comparing information criterion type quantities among several candidate models. Thus, the spike and slab prior median thresholding Bayesian group lasso has attractive properties of high dimensional variable selection and performs efficiently with structured correlated covariates. Most of the other dimension reduction techniques are unable to tackle correlated variables in variable selection. The significant regions selected by our model is identified from a large number of subregions thus accounting for the effect of the whole brain while performing dimension reduction. Wang et.al . (2014) in their paper have introduced a dimension reduction technique using HAAR wavelet based on voxel level data with ADNI PET data. Their method builds on 83 continuous outcomes and does not perform variable selection. Our approach builds the model with the MRI brain parcellated volumetric data which are a direct indicator of dementia. This Bayesian formulation not only tackles ANOVA type dummy variables but also deals with the high dimension problem. The simulation results show that this method is effective in both low and high dimension. AUC of 0.867 shows that our model is a good classifier. Most authors have only reported the accuracy of their model but a look at AUC helps us better understand the diagnostic ability of a classifier. The greatest advantage of this method is that it considers all the subregions while building the model and efficiently drops the ones that are not disease related and, can also easily interpret the risk of dementia from the parameter estimates. This is a novel contribution to classification for Alzheimer’s disease to the best of our knowledge. 84 Chapter 4 Bayesian Spatiotemporal Model for Detecting Voxel-level Activation in fMRI Data 4.1 Introduction With functional magnetic resonance imaging (fMRI), brain activation can be recorded. This is done by measuring blood oxygenation level dependent (BOLD) activation through MRI techniques. 
An external stimulus is administered to a subject and the BOLD MR signal is recorded at each voxel as a time series. This is useful for understanding the functioning of the human brain with respect to brain development, diseases related to brain functionality, neuronal activity in alcoholism, and so on. A single subject is given an external stimulus, such as seeing an object, or a physical stimulus such as touching a hot surface, and the MR scanner captures a 3D image of the entire brain every 2 to 3 seconds. These 3D images comprise voxel-level data on a lattice for multiple brain slices. There are approximately 60x60 voxels in a 2D lattice and approximately 30 such brain slices, so the number of voxels is of the order of $10^5$. The BOLD signal responses of these voxels are recorded at around 190 timepoints, so we have on the order of 20 million data points for a single subject, making the data structure highly complicated.

The stimulus follows a so-called "boxcar" design, with alternating periods of activation and rest. This stimulus is a 0-1 input and is transformed by a hemodynamic response function (HRF), Gitelman et al. (2003) (see Figure 4.1, Welvaert et al. (2011)). The transformation reflects the assumption that the blood oxygenation response is delayed and proceeds continuously. Thus, the 0-1 boxcar stimulus is convolved with an HRF density, such as the double-gamma density. The transformed stimulus is the regressor in the model and the corresponding coefficient is called the "amplitude" of activation. Variable selection is equivalent to retaining the voxels whose amplitude is non-zero.

Figure 4.1: Task stimuli based on three convolution functions

The time series nature of the images arises because voxel-level images are captured at multiple timepoints. Since voxels in the brain are interconnected in an arbitrary fashion, the presence of spatial correlation cannot be ignored. Standard fMRI analysis ignores spatial dependence and performs correlation analysis or parametric regression analysis voxelwise. The objective of fMRI data analysis is to identify the activated voxels (brain regions) that respond to the stimuli. This is clearly a variable selection problem in which the covariates are brain voxels. We build a Bayesian variable selection model that accounts for the spatiotemporal correlation inherent in fMRI data.

Friston et al. (1995) applied a separate regression at each voxel to obtain t-statistics for the activation amplitude. This popular method, however, does not account for spatial dependence. More recently, Bayesian approaches have been developed using Gaussian MRF priors (Gossl, Auer, and Fahrmeir, 2001; Fahrmeir and Gossl, 2002), with MCMC algorithms used to obtain posterior distributions. Penny, Trujillo-Barreto, and Friston (2005) used Gaussian spatial priors on the regression coefficients and on the autoregressive coefficients of the noise process; although this procedure reduces computational time considerably, it replaces the true posterior with an approximate posterior. Smith and Fahrmeir (2007) introduced spatial variable selection for fMRI data by placing an Ising prior on the activation amplitudes, but they ignore the temporal correlation in the data. Recently, Musgrove et al. (2016) developed a Bayesian variable selection method that parcellates the brain into several smaller units. Their method takes into account both spatial and temporal correlation.
They assume a sparse areal generalized linear model (Hughes and Haran, 2013) with spatial random effects and use Bernoulli-distributed latent variables to indicate the presence or absence of activation at a voxel. They perform variable selection based on the posterior probability of activation, thresholding it at 0.8722. A major drawback of this method is that its accuracy is not validated for random parcellations of the dataset.

We develop a Bayesian variable selection method by incorporating the group lasso technique into a bi-level selection method. Our regressors are task-based (stimulus-dependent) voxel effects. We are interested in knowing both which stimuli activate brain voxels and which voxels are activated. A stimulus, as a regressor, comprises N voxels; in this case, selecting a stimulus is equivalent to selecting a group of covariates, where the group consists of the chosen stimulus at each voxel. We can therefore approach this as a group lasso problem, but we know that, for a chosen stimulus, not all voxels will be activated. So we want a second level of selection in which we select the activated voxels: first we select a stimulus for which the amplitudes are non-zero, and then a second selection step drops some amplitudes of activation, indicating that those voxels are not activated. Motivated by Xu and Ghosh's (2015) bi-level variable selection, we extend their method by including spatiotemporal correlation. The large size of the data presents a host of data-handling issues, and we discuss thoroughly how we have built a Gibbs sampler for efficient computation.

To enable shrinkage both at the group level and within a group, we use a spike and slab prior as before, and we reparametrize the parameters to ensure sparsity. Park and Casella (2008) and Kyung et al. (2010) have worked with two-level hierarchical structures with mixture priors that have shrinkage effects but fail to produce sparse solutions, since the posterior mean/median is never exactly zero; the spike and slab prior ensures sparsity. Rather than selecting variables whose posterior probability exceeds a threshold, we use posterior median thresholding for variable selection: the posterior medians of our MCMC samples tell us which amplitudes are active.

4.2 Bayesian Bi-level Variable Selection

Before we introduce the spatiotemporal model, consider the bi-level selection method under the assumption that the responses are independent. Since the BOLD signals are normally distributed, we are concerned with linear models here. Bi-level selection in the frequentist approach was proposed by Simon et al. (2012), where the objective function is regularized using a combination of $\ell_1$ and $\ell_2$ penalties:
\[
\min_{\beta}\;\Big\|Y-\sum_{g=1}^{G}X_g\beta_g\Big\|_2^2+\lambda_1\|\beta\|_1+\lambda_2\sum_{g=1}^{G}\|\beta_g\|_2. \qquad (4.1)
\]
Thus, a MAP solution with the prior on $\beta_g$ in (1.2) given by
\[
\pi(\beta_g)\;\propto\;\exp\Big\{-\frac{\lambda_1}{2\sigma^2}\|\beta_g\|_1-\frac{\lambda_2}{2\sigma^2}\|\beta_g\|_2\Big\},\qquad g=1,\dots,G,
\]
is equivalent to (4.1). The coefficients are reparametrized to introduce the two types of sparsity. Let $\beta_g=A_g^{1/2}b_g$, where $A_g^{1/2}=\mathrm{diag}\{\tau_{1g},\dots,\tau_{m_g g}\}$, $g=1,\dots,G$, $j=1,\dots,m_g$. Here $b_g$ has a spike and slab prior:
\[
b_g\;\overset{ind}{\sim}\;(1-\pi_0)\,N_{m_g}(0,I_{m_g})+\pi_0\,\delta_0(b_g),\qquad g=1,\dots,G.
\]
Thus $A_g^{1/2}$ controls the magnitude of the elements of $\beta_g$; evidently, a non-zero $\tau_{jg}$ keeps $\beta_{jg}$ in the model. So we place a spike and slab prior on the $\tau_{jg}$'s to ensure exact sparsity:
\[
\tau_{jg}\;\overset{ind}{\sim}\;(1-\pi_1)\,N^{+}\big(0,s^2\big)+\pi_1\,\delta_0(\tau_{jg}),\qquad g=1,\dots,G;\;j=1,\dots
, mg where N + 0, s2 is a normal distribution truncated at 0. We place some non-informative priors n other hyperparameters , thus, forming a hierarchical model. σe2 ∼ Inverse Gamma(α = 0.1, γ = 0.1) π0 ∼ Beta(a1 , a2 ), π1 ∼ Beta(c1 , c2 ) s2 ∼ Inverse Gamma(1, t). 89 t is updated with EM algorithm as was done in chapter 3. Note that, this setup is a modification of our algorithm in chapter 3 in a linear setup. We assume independence of our responses in this case. In the next section, we extend this idea to handle datasets that have spatiotemporal correlation. 4.3 Bayesian Spatiotemporal Model with Bi-level Selection 4.3.1 Model Formulation Consider the following model: yvt = xt1 βv1 + xt2 βv2 + · · · + xtp βvp + vt (4.2) v = 1, . . . , N t = 1, . . . , T Here, yvt is the BOLD signal contrast at timepoint t and voxel v. xtj , j = 1, . . . , p are the HRF transformed stimuli input and βvj are the activation amplitudes. We have N voxels 90 and T timepoints. Let t = 1, then   y N ×1 x11 β11 +x12 β12 +···+x1p β1p 11 y21 x11 β21 +x12 β22 +···+x1p β2p  . y1 =  ..  = .  . . x11 βN 1 +x12 βN 2 +···+x1p βN p yN 1 x  11 x . ⊗β  11 . 1 . x11 = x  12 x12 . ⊗β  + . 2 . x12    +   11 21 .  . . N1  x1p  x1p + · · · +  ..  ⊗ βp . x1p  +  11 21 .  . . N1 (4.3)   where βg = β1g  β2g   . , g  .  . βN g = 1, . . . , p. Thus, we can write equation (4.3) as: x11 0 ... 0 y1 = 0 x11 ... 0 ... 0 ... 0 ... ... ... x11  β11 N ×N x1p β   21 0 .   .  + ··· + ... . 0 βN 1   p Xg1 βg + 1 = g=1  where, Xg1 = x1g  0 ... 0 0 x1g ... 0 N ×N ... 0 ... 0  ,g ... ... ... x1g  = 1, . . . , p Thus, we have, yt = p t g=1 Xg βg + t , t = 1, . . . , T Stacking all of this together we have, y1 = y2 = p 1 g=1 Xg βg p 2 g=1 Xg βg + 1 + 2 ... 91 0 x1p ... 0 N ×N ... 0 ... 0  ... ... ... x1p    β1p  β2p   . +  .  . βN p  11 21 .  . . N1 ... p T g=1 Xg βg + T  N ×1  y1  y2 N ×1  = . ⇐⇒    . . yT N ×1 yT =       p 1  g=1 Xg βg p 2β  X g=1. g g  +   .  . p T g=1 Xg βg 1 .2 . . T   TASK 1  TASK p N ×N β11 x1p 0  β21  0 0   . +···+  ...  .  ... . x11 0 βN 1    0 ...   N ×N β1p 0  β2p   .  0   .   ...  .  x1p βN p  x11 0 ...   x1p ... ...    ...0 x...11 ... ... ...   0 0 ...   0 ...    ...................................................................  N T ×1  + ⇐⇒ Y =  .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..        β   1p  N ×N β11 N ×N  x xT p 0 ... 0  β2p   T 1 0 ... 0   β21   .  0 xT p ... 0    0 xT 1 ... 0   . +···+  .   ...  .   ... ... ... ... ... ... ...     0   . βN 1 0 ... xT 1 0 0 ... xT p . βN p  11 21 .  . . N1 N T ×1  ⇐⇒ Y = X11 β1  X 2β   1. 1   .   .  X1T β1 N T ×1  Xp1 βp  X 2β  p  + · · · +  p..   .  XpT βp  +  11 21 .  . . N1 = X 1 β1 + X 2 β2 + · · · + X p βp + = p g=1 Xg βg + Thus, we have p Y N T ×1 = X g βg + (4.4) g=1 In this setup, we have p tasks and N voxels. Measurements from each voxel is taken at T timepoints. We assume a first order auto-regressive correlation between the T timepoints for each voxel. ×1 Let, N = t  ρ1  0.  . . 0 0 ... ρ2 ... 0 0   N ×1  t−1 + et , t = 1, . . . , T. 0 ... ρN 92     σe2 IN 0 ... 0 0   0   0 σe2 IN ... 0   . , . . . . . . . . . . . . .  .  ∼ N     ..   .. .. .. .. .. .. .. .. .. .. .. .. .. 
  0 0 0 ... σe2 IN  p  1  y  g=1 Xg βg 1  p X 2 βg  y2 g   E  ..  =  g=1.  and variance-covariance .   . . yT p T g=1 Xg βg e  Here, Thus, 1 e2  .. . eT  matrix of Y is V . We want to find the structure of V. Assume that t is a stationary process i.e. the distribution of t does not depend on t. Then, it = ρi it−1 + eit , ∀i = 1 . . . N, t = 1 . . . T 0 = 0. Now, E( it 2 ) = E(ρi it−1 + eit )2 = ρ2i E( 2it−1 ) + E(eit 2 ) σ2 =⇒ E( it 2 ) = e 2 , ∀i = 1 . . . N 1−ρi Similarly, it = ρi it−1 + eit = ρ2i it−2 + ρi eit−1 + eit σ2 =⇒ E( it it−2 ) = E(ρ2i 2it−2 + ρi et−1 it−2 + it−2 eit ) = ρ2i e 2 , ∀i = 1 . . . N 1−ρ i |t−s| σe2 , ∀i 1−ρ2i In general, E( it is ) = ρi = 1...N 93 y Var-Cov = 1 y2 .  . . yT  y 11 y21  ..  . y  N1  y12   y22  . .  = Var-Cov y .  N. 2  .  . y  1T  y2T  .. . yN T           =V           cov(y11 ,y12 ) ... 0 cov(y11 ,y1T ) ... 0 . .   ...   . . 0 ... cov(yN 1 ,yN 2 ) 0 ... cov(yN 1 ,yN T )         cov(y12 ,y11 ) ... 0 V (y12 ) ... 0 cov(y12 ,y1T ) ... 0  . . .      ...  . . .  0 ... cov(y ,y 1 ) 0 ... V (y 2 ) 0 ... cov(y ,y T )  . . . . . . . . . . . . . . . . . . . . .N. .2. . N . . . . . . . . . . . . . . . . . . . . . . . . . .N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .N. .2. . N ......   V (y11 ) ... 0 .   . 0 ... V (yN 1 )          =  ............................... ............................... .  ............................... ............................... .  ............................... ............................... .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..  0 cov(y1T ,y12 ) ... 0  cov(y1T ,y11 ) ... . .    ... . .  0 ... cov(yN T ,yN 1 ) 0 ... cov(yN T ,yN 2 ) 94 ............................... ............................... ............................... . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . .  V (y1T ) ... 0 .  . 0 ... V (yN T )          σe2  1−ρ2 ...   . 1  .   0 ...   ρ1 σe2 0  1−ρ2 ...  1  .  .  ρ σ2  0 ... N 2e 1−ρN  σe2 0  1−ρ2 ...   . 1  .   σe2 0 ... 1−ρ2N   ρT −1 σe2 0  1 2 ...  1−ρ  1 . ...  .   ρT −1 σe2  0 ... N 2 1−ρN  T −2 ρ σe2 0  1 2 ...  1−ρ  1 . ...  .   ρT −2 σe2  0 ... N 2 1−ρN 0               2    σ e   1−ρ2 N      2    ρ1 σe ... 0     1−ρ2     . 1     .       ρN σe2     0 ... = 1−ρ2   . . . . . . . . . . . . . . . .N  . . . . . . . . . . . . . . . . .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..  ...................... ...................... .  ...................... ...................... .  . . . . . . . . . . . . . . . . . . . . ..  . . . . . . . . . . . . . . . . . . . . .. .  −2 σ 2 ρT  ρT1 −1 σe2 e 1  ... 0 ... 0    2 2       1−ρ1 1−ρ1     . .    ...  . .     T −1 T −2   2 2 ρN σe   ρN σe    0 ... 0 ...  2 2 1−ρN 1−ρN                     ......................  ......................  ......................  ......................  ......................      σe2 0    1−ρ2 ...     . 1    .       σe2   0 ...  1−ρ2  N = σe2 V We will need the inverse of the variance-covariance matrix. To find the inverse let us redefine a few notations. Let Ej = ( 1 , . . . , j ) = ( 11 , . . . , N 1 , . . . , 1j , . . . , N j ), j = 1, . . . 
, T and 95  1   1−ρ2 ...  . 1  .   0 ...   j−1  ρ1 ... 0    1−ρ2    . 1      ...  .   j−1    ρN    0 ...  2 1−ρN   j−2    1  ρ 0  ... 0  1 2 ... 2     1−ρ1   1−ρ  . 1  .      .  ...  .   j−2   1   0 ... ρ    1−ρ2N 0 ... N 2  1−ρN .................. . ..................    ρ1 ... 0 0   1−ρ21  .  . 1   0 ... ρN 1−ρ2 1−ρ2N N       ρ  1    1−ρ2 ... 0   . 1   .   ρN   0 ... Aj = σe2  1−ρ2 N   ..................  .................. .................. .  .................. .................. .  .................. .................. .  .................. .................. .   j−2    j−1  ρ1 ρ1  ... 0     1−ρ2 ... 0    1−ρ2   . 1  1  .     ...  .  .   j−1 j−2     ρN ρN     0 ... 0 ... 1−ρ2 N 1−ρ2N ..................  ..................  ..................  ..................     1  ... 0   1−ρ21  .    .  1   0 ...  1−ρ2N Write the first order auto-regressive equations as: a0 it + ai it−1 = eit . (4.5) t = 1, . . . , T where, a0 = 1 and ai = −ρi . The distribution of ET : NT 1 − 1 E A−1 E dF (ET ) = (2π)− 2 |AT |− 2 e 2 T T T dET (4.6) And, the distribution of 1 , e2 , . . . , eT : dF ( 1 , e2 , . . . , eT ) = (2π) − N2T − 21 |A1 | 1 − 12 E1 A−1 1 E1 + σ 2 e e T e e t=2 t t d 1 de2 . . . deT (4.7) 96 Assume, T > 2. Considering (4.5) as a transformation from et to t , t = 2, . . . , T we have, − N2T dF (ET ) = (2π) 1 − 21 E1 A−1 1 E1 + σ 2 −1 N T −N e |A1 | 2 a0 e T t=2 N (a 2 i=1 0 it +ai it−1 ) dET (4.8) Comparing (4.6) and (4.8): 1 − 1 E A−1 E NT (2π)− 2 |AT |− 2 e 2 T T T − N2T = (2π) 1 − 21 E1 A−1 1 E1 + σ 2 −1 N T −N 2 e |A1 | a0 e T t=2 N (a 2 i=1 0 it +ai it−1 ) Thus, 2N T |A | and a2N T 0 |A1 | = a0 −1 1 ET A−1 T ET = E1 A1 E1 + σ 2 e Let, CT = A−1 1 0 0 0 T t=2 N i=1 (a0 it and ET BT ET = 12 σe T t=2 + ai it−1 )2 N i=1 (a0 it + ai it−1 )2 ∴ A−1 T = CT + BT So, ET BT ET = 12 σe T t=2 N 2 2 i=1 (a0 it + 2ai a0 it−1 it + a2i it−1 2 ) Now, BT is completely known. Thus, the inverse follows from the inversion of autocovariance matrix by Siddiqui, 1958. Let,  B11 B21 B12 ... B22 ... ....... . B1T B2T   ....... ........  . . . . . . . . . . . . . . . ........  BTN T ×N T =  B  (T −1)1 B(T −1)2 ... B(T −1)T BT 1 BT 2 ... BT T where Bij = N × N matrix correspond to N voxels, i, j = 1, . . . , T . Assuming, j ≥ i, 97 Bji = Bij =      0N ×N ,       i ≤ T − 1, i + 1 < j ≤ T     −ρ1    0 1    ...  2  σe . . . 0  1+ρ2 1 0 ... 0 ... −ρ2 ... ... . 0  . . . . , ... . .... i ≤ T − 1, i ≤ j ≤ i + 1 0 ... −ρN  0  0 1+ρ2 ... 0   Bii =   .. .. .. .. .. .. .. ..2 .. .. .. .. .. .. , 0 0 0 ∀i = 1, T ... 1+ρ2 N ρ21 0 ... 0  0 ρ2 ... 0  . . . 2. . . .    , B11 =  .. .. . ..  BT T = IN 0 0 ... ρ2 N  A11 A21  .......  ....... If, A−1 = T A (T −1)1 AT 1 A12 A22 ... ... A1T A2T  ....... . ........  ....... . ........   A ... A (T −1)2 AT 2 ... (T −1)T AT T where Aij = N × N submatrix, i, j = 1, . . . , T . We know all Aij s except for i and j which is less than or equal to 1, i.e. we do not know A11 . Since, AT is persymmetric, A−1 T is also persymmetric. Therefore, A11 = AT T . Also, A−1 T = CT + BT = 2N T |A |, we have, |A |−1 = a2N T T 0 |A1 | = a0 4.3.2 N (1−ρ2 ) i=1 i . σ 2N A−1 1 0 0 0 + BT Since, Bi-level Spatiotemporal Model for fMRI Data (4.2) is the regression setup where we have a single brain data. The brain is divided into N voxels. 
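Before the priors of this model are specified, it may help to see the banded temporal precision structure derived in the previous subsection in code form. The following R sketch is an illustration, not the dissertation's implementation: for a single voxel with AR(1) parameter rho it assembles the T x T matrix proportional to the inverse autocovariance, i.e. the single-voxel block of $V^{-1}=C_T+B_T$, and stores it sparsely with the Matrix package.

```r
library(Matrix)

# Per-voxel temporal precision matrix (inverse of the AR(1) autocovariance):
# diagonal (1, 1+rho^2, ..., 1+rho^2, 1), off-diagonals -rho, all divided by sigma_e^2.
ar1_precision <- function(T, rho, sigma2_e = 1) {
  main <- c(1, rep(1 + rho^2, T - 2), 1)
  Q <- bandSparse(T, T,
                  k = c(-1, 0, 1),
                  diagonals = list(rep(-rho, T - 1), main, rep(-rho, T - 1)))
  Q / sigma2_e
}

# Quick check against the covariance sigma_e^2 * rho^|t-s| / (1 - rho^2):
T <- 5; rho <- 0.4
V1 <- outer(1:T, 1:T, function(t, s) rho^abs(t - s)) / (1 - rho^2)
max(abs(as.matrix(ar1_precision(T, rho)) %*% V1 - diag(T)))   # ~ 0 up to rounding
```

Because this matrix is tridiagonal, the full $NT \times NT$ precision never needs to be formed or inverted densely, which is the point exploited in the computational algorithm described later.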
p tasks(stimuli) are administered to the subject and their reactions are recorded at T time points. βg , g = 1, . . . , p are the magnitude of the response to the corresponding stimuli. 98 We are interested in selecting a stimuli that significantly affects the response and also the voxel that is significantly associated with the corresponding task. Thus our regression setup has groups of N voxels for each stimuli. Response for this group of N voxels is present for T time points for each stimuli. Thus, (4.4) is of the group lasso form (Xu and Ghosh, 2015). Here, to select a stimuli and a voxel that are associated with the brain response accounts for bi-level selection. The first level of selection selects groups of voxel corresponding to a stimuli and then selects the voxels from among the selected stimuli. We can, thus, apply Bayesian sparse group lasso with bi-level selection. Our response is normally distributed so let us formulate the hierarchical Bayesian structure. Y |X, β, σe2 , ρ ∼ NN T (Xβ, V ) βgN ×1 |τg , σe2 ∼ N (0, σe2 Σg ), g = 1, . . . , p (4.9) 1 Let βg = Ag2 bg i.i.d. bg ∼ (1 − πo )NN (0, IN ) + π0 δ0 (bg ), τ 0 ... 0  1 Here, Ag2 =  1g 0 ... ... 0 Note that, τjg = 0 ... 0  ... ... ... ... ... τN g =⇒ βjg is dropped g = 1, . . . , p τ2g ... ... 0 out of the model even when bjg = 0. This means τg controls a within group level selection for a selected group βg . In our problem, we know that there is some correlation among adjacent voxels. Consider a spatial adjacency structure among the N spatial locations that can be represented through a known spatial weight matrix W = ((wij )), i, j = 1, . . . , N.wij is the weight corresponding to the presence or absence of adjacency between location i and j. For the prior selection of within group, we assume a spatial cross-sectional dependence is convoluted within the covariate structure of the model. To perform a group lasso variable 99 selection in the above model, we need to consider a proper prior for βg , g = 1, . . . , p that considers the spatial (voxel) relationships among the covariates. We assume a conditional auto-regressive prior for τ as: τjg |τig : i = j ∼ (1 − π1 )N + wj+ = wij N s2 i=1,i=j wj+ τig , wj+ + π1 δ0 (τjg ), g = 1, . . . , p N i=1 wij N + = folded normal towards positive side of the real line. Also, σe2 ∼ Inverse Gamma(α = 0.1, γ = 0.1) π0 ∼ Beta(a1 , a2 ) π1 ∼ Beta(c1 , c2 ) s2 ∼ Inverse Gamma(1, t) ρ∼ t(k) = 1 1 − ρ2 Uniform(−1, 1) 1 (4.10) E (k) 12 |Y t s We have an improper prior for ρ for the ease of posterior calculations. Thus, with the above model specification the joint posterior of b, τ 2 , σe2 , π0 , π1 , ρ conditional on observed data is: 100 p(b, τ 2 , σe2 , π0 , π1 , ρ|Y, X) = P (Y, b, τ 2 , σe2 , π0 , π1 , ρ|X) P (Y |X) ∝ P (Y |b, τ 2 , σe2 , π0 , π1 , ρ, X)P (b|π0 , τ 2 , σe2 , X)P (σe2 |π1 , X)P (π0 )P (π1 )P (ρ) T 1 1 p p 1 1 2 − 2 − 2 Y − g=1 Xg Ag bg V −1 Y − g=1 Xg Ag2 bg |V | = e NT (2π) 2 p 1 T N × (1 − π0 )(2π)− 2 e− 2 bg bg I[b =0] + π0 δ0 (bg ) b g=1 p N × g=1 j=1   (1 − π1 )2 − 2πs2 wj+ − 12 − wj+ τ − 2 jg e 2s γ × (σ 2 )−α−1 e σ2 a −1 × π0 1 c −1 × π11 (1 − π0 )a2 −1 (1 − π1 )c2 −1 − t × t(s2 )−2 e s2 1 × 2 N i=1 1 1 − ρ2i I[−1≤ρ ≤1] i 101 2 wij N i=j=1 wi+ τig I  [τjg >0]  + π1 δ1 (τjg ) We can simulate an efficient block Gibbs sampler to simulate from the posterior distribution above. Define the following notations: T , β T , . . . , β T ), β(g) = (β1T , . . . , βg−1 p g+1 X(g) = (X1 , . . . , Xg−1 , Xg+1 , . . . , Xp ), β(jg) = (β11 , β21 , . . . , βN 1 , . . . 
, β1g , . . . , βj−1g , βj+1g , . . . βN g, . . . , β1p . . . , βN p ), X(jg) = (x11 , . . . , xN 1 , . . . , x1g , . . . , xj−1g , xj+1g . . . , xN g , . . . , x1p , . . . , xN p ) where Xg is the design matrix corresponding to βg . Detailed computation of the posterior full conditional is given below. (1) We want to find the posterior distribution of bg . Let lg = P (bg = 0|rest) = P (bg = 0|Y, X, τg2 , π0 , π1 , ρ, σe2 ) = π0 A π0 A + (1 − π0 )B where rem ≡ X, τg2 , π0 , π1 , ρ, σe2 , 1 1 1 Y −X A 2 b −1 Y −X A 2 b − V 1 (g) (g) (g) (g) (g) (g) 2 A = |V |− 2 e T 1 p 1 2 V −1 Y − 1 bT b − 2 Y − g=1 Xg Ag bg − 1 g g e and B = |V |− 2 bg =0 e 2 N (2π) 2 π0 Thus we get, lg = T 1 µg µg π0 +(1−π0 )|Σg | 2 e 2 1 1 1 2 X T V −1 X A 2 and µ = Σ A 2 X T V −1 where, Σ−1 = I + A g g g g g g g g N g 1 p 2 bg X A g g g=1 dbg 1 2 b Y − X(g) A(g) (g) From the posterior full conditionals we have, bg |rest ∼ lg δ0 (bg ) + (1 − lg )NN (µg , Σg ). 102 (2) − 12 Y − P (τjg |rest) ∝ e  p N × g=1 j=1  (1 − π1 )2 − 21 Y − ∝e × (1 − π1 )2 − 12 Y − +e T 1 p 2 b X A V −1 Y − g=1 g g g 2πs2 wj+ − 12 − wj+ τ − 2 jg e 2s T 1 p −1 Y − 2 g=1 Xg Ag bg V 2πs2 wj+ − 12 − wj+ τ − 2 jg e 2s T 1 p −1 Y − 2 g=1 Xg Ag bg V 1 p 2 bg X A g g g=1 2 wij i=j wi+ τig I  [τjg >0]  + π1 δ1 (τjg ) 1 p 2 bg X A g g g=1 wij i=j wi+ τig 1 p 2 bg X A g g g=1 2 I[τ >0] jg π1 δ1 (τjg ) Thus, we see that, 2 ), τjg |rest ∝ qjg δ1 τjg + (1 − qjg )N + (ujg , vjg g = 1, . . . p; j = 1, . . . N where qjg = P (τjg = 0|rest) = P (τjg = 0|Y, X, b, π0 , π1 , ρ, σe2 ) = P (Y |τjg = 0, rem1)P (τjg = 0|rem1) P (Y |τjg = 0, rem1)P (τjg = 0|rem1) + τ =) P (Y |τjg = 0, rem1)P (τjg = 0|rem1) jg where rem1 = X, b, π0 , π1 , ρ, σe2 . Now, P (τjg = 0|rem1) = π0 103 T 1 p −1 Y − 2 g=1 Xg Ag bg V − 12 Y − P (Y |τjg = 0, rem1) = e τjg =) 1 p 2 bg X A g g g=1 and P (Y |τjg = 0, rem1)P (τjg = 0|rem1) e = 1 1 − 21 (Y −XA 2 b)T V −1 (Y −XA 2 b) τjg =) × (1 − π1 )2 2πs2 wj+ − 12 − wj+ τ − 2 jg e 2s 2 wij i=j wi+ τig I [τjg >0] dτjg After rigorous computation, we get, π1 qjg = π1 + 2(1 − π1 ) s2 wj+ 1 2 vjg e wj+ − 2 2s 2 µ2 wij jg i=j wi+ τjg + 2v 2 jg Φ −1 2 = wj+ + b2 xT V −1 x and vjg jg jg jg 2 s 2 µjg = vjg wj+ s2 wij i=j wj+ τig − Y − X(jg) β(jg) T V −1 xjg bjg (3) P (σe2 |rest) ∝ γ 1 − 1 (Y −Xβ)T V −1 (Y −Xβ) 2 −α−1 − 2 σ 2 2 |V | e (σe ) e e ∝ γ − 1 (Y −Xβ)T V −1 (Y −Xβ)− 2 − N2T −α−1 2σe2 2 σ e (σe ) e = (Y −Xβ)T V −1 (Y −Xβ) − 12 +γ N T 2 − 2 +α −1 σe (σe2 ) e 104 µjg vjg Therefore, σe2 |rest ∼ IG NT 2 + α, (Y −Xβ)T V −1 (Y −Xβ) 2  P (π0 |rest) ∝ +γ (4)  p N 1 T a −1 π0 1 (1 − π0 )a2 −1  (1 − pi0 )(2π)− 2 e− 2 bg bg I[bg =0] g=1 a −1 ∝ π0 1 + π0 δ0 (bg ) (#bg =0) (1 − π0 )a2 −1 (1 − π0 )[p−(#bg =0)] π0 Therefore, π0 |rest ∼ Beta(a1 + (#bg = 0), a2 + p − (#bg = 0)) (5) c −1 P (π1 |rest) ∝ π11 (1 − π1 )c2 −1 (1 − π1 ) (#τjg =0)] (#τjg =0) π1 Therefore, π1 |rest ∼ Beta(c1 + (#τjg = 0), c2 + (#τjg = 0)) (6) P (s2 |rest) ∝ = − t − 1 (#τ =0) − e (s2 )−2 e s2 (s2 ) 2 jg p g=1 p g=1 − 1 t+ (#τjg =0)+1 −1 s2 − 1 2 2 e (s ) Therefore, s2 |rest ∼ IG 1 2 (#τjg p g=1 = 0), t + (7) We have,  y11 −        (Y − Xβ) =        y21 − yN 1 − y12 − yN 2 − y1T − yN T −  p x β1g 1g g=1  p x1g β2g  g=1  ...  p x βN g  g=1 1g  p  x β1g 2g g=1  ...  p x2g βN g  g=1  ...  p xT g β1g  g=1  ... p x βN g g=1 T g  c 11 c21  c ...  N1  c12 =  ...  cN  c...2 1T ... cN T 105        N wj+ τ − jg j=1 2s2 N wj+ τ − jg j=1 2 N wj+ j=1 2 τjg − 2 wij i=j wj+ τij 2 wij τ ij i=j wj+ 2 wij τ i=j wj+ ij Then P (ρ1 , . . . 
, ρN |rest) ∝ (Y − Xβ)T V −1 (Y − Xβ) N i=1 1 I[−1≤ρ ≤1] i 1−ρ2i = ( c11 ... cN 1 c12 ... cN 2 ... c1T ... cN T )  −ρ1 . 0 1+ρ2 . 1 0 1 ... 0 . ... . 0 ... 1    −ρ ... 0  ... .  0. 1 ... −ρN  ×   ...    0  c  11 c21 ... c  N1  c12 × ...  cN  c...2 1T ... cN T    ×   Collecting ρi , N i=1  ... 0 ... . ... −ρN ... 0 . . ... 1+ρ2N 0 ... −ρ1 ... 0 . ... . 0 ... −ρN ... ... ... ... 0 ... −ρ1 ... 0 . ... . 0 ... −ρN 0      0    ...    1 ... 0  . . . 0 ... 1 1 I[−1≤ρ ≤1] i 1−ρ2 i i = 1, . . . , N terms, we get, −2ρi [ci1 ci2 + ci3 ci2 + ci4 ci3 + · · · + ciT ciT −1 ] + ρ2i c2i2 + c2i3 + · · · + c2iT −1 ∝ c2 + c2i3 + · · · + c2iT −1 ×  i2 2  ci1 ci2 +ci3 ci2 +ci4 ci3 +···+ciT ciT −1 c c +c c +c c +···+ciT ciT −1  ρ2 − 2ρi i1 i2 i3 i2 i4 i3 + i 2 2 2 2 2 2 ci2 +ci3 +···+ciT −1 ci2 +ci3 +···+ciT −1  2 2  − 12 c2 i2 +ci3 +···+ciT −1 ρi − 2σ e 1 ∴ ρi |rest ∝ 2 e σe ρi |rest ∼ N (µρi , vρ2 ) i = 1, . . . , N where i ci1 ci2 +ci3 ci2 +ci4 ci3 +···+ciT ciT −1 µρ i = 2 2 c2 i2 +ci3 +···+ciT −1 σ2 vρ2 = 2 2 e 2 i ci2 +ci3 +···+ciT −1 106 ci1 ci2 +ci3 ci2 +ci4 ci3 +···+ciT ciT −1 c2i2 +c2i3 +···+c2iT −1 2  4.4 Computational Algorithm Voxel level brain data for a single subject has a very complicated structure since the size of data is huge, We have to deal with a data size of more than 20 million. Our covariance matrix is of the order of N T × N T . Obtaining posterior distribution requires inversion of this matrix. It is almost impossible to deal with such large matrix inversions so we need to come up with efficient algorithms that are able to handle this data size. Computational efficiency in modern times is an inevitable step towards model building. Building a model is not sufficient if we cannot find a way of proper implementation within humanly controlled time periods. Apart from the covariance matrix, adjacency matrix and Ag is also of high order (i.e. N × N . Even storage of such matrices require a supercomputer. We have used inverse algorithms and matrix multiplication tricks to bypass the use of too many memory resources. Note that, the variance-covariance matrix in our setup is a block diagonal matrix for every N × N block. It has an autoregressive structure and as seen in the previous section, we do not need to invert the matrix using matrix algebra since we already know the elements of the inverted matrix. Thus, we build the variance-covariance matrix using the inversion of autocovariance matrix by Siddiqui, 1958 and just place all the elements in their true location. We have extensively used ”Matrix” package in R for the computation. This package saves sparse matrices with ”sparseMatrix” function by allocating it to a much smaller memory space. The inverse of variance-covariance matrix is sparse so we use this advantage while writing our code. 107 4.4.1 Creating the Adjacency Matrix We require an adjacency matrix for specifying the adjacency of voxels in the brain. This is used when we specify weights for spatial dependence through the variance of amplitudes. For a 2D lattice, a voxel has four neighbors, sharing boundary with each of its face. For a 3D cube, a voxel has 26 neighbors. Figure 4.2 (Source: http://www.math-only-math.com/commonsolid-figures.html) shows there are 8 vertices, 6 plane surfaces and 12 edges in a cube making te maximum number of neighboring voxels 26. Figure 4.2: Illustration of neighborhood of a cube For 2D lattice, we have two coordinates a and y. 
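A minimal R sketch of this adjacency construction for a 2D lattice is given here; the neighbor rule itself is spelled out in the next paragraph. The use of the Matrix package for sparse storage mirrors the computational strategy of Section 4.4, but the lattice size and function names are illustrative.

```r
library(Matrix)

# Sparse 4-neighbour adjacency matrix for an nx-by-ny 2D voxel lattice:
# two voxels are neighbours when their x or y coordinates differ by exactly 1.
lattice_adjacency <- function(nx, ny) {
  idx <- function(x, y) (y - 1) * nx + x          # voxel index on the lattice
  from <- integer(0); to <- integer(0)
  for (x in seq_len(nx)) {
    for (y in seq_len(ny)) {
      if (x < nx) { from <- c(from, idx(x, y)); to <- c(to, idx(x + 1, y)) }
      if (y < ny) { from <- c(from, idx(x, y)); to <- c(to, idx(x, y + 1)) }
    }
  }
  sparseMatrix(i = c(from, to), j = c(to, from), x = 1, dims = c(nx * ny, nx * ny))
}

W      <- lattice_adjacency(50, 50)   # e.g. the 50 x 50 lattice used in Section 4.5
w_plus <- rowSums(W)                  # w_{j+}, the neighbour counts used in the CAR prior
```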
If two voxels' x- or y-coordinates differ by exactly 1 (with the other coordinate equal), we call them neighbors, as in the sketch above; if the absolute difference in either x or y is larger, they are not neighbors. In the 3-dimensional setup we have three coordinates x, y and z: here we fix one coordinate, check whether the absolute differences of the other two are 1, and then assign weight 1, meaning neighbor, to that voxel with respect to the focal voxel. This algorithm is straightforward, but the adjacency matrix is huge for a large number of voxels. Note that the adjacency matrix is sparse, as most of its elements are 0.

4.4.2 Matrix Tricks

The largest matrix we need to invert in our model is of order $N\times N$. Since it is the variance-covariance matrix of the posterior of $\beta_g$, it is positive definite, so we use a Cholesky decomposition instead of a traditional matrix inverse. For generating $\tau_g$, $g=1,\dots,p$, we perform the matrix multiplications carefully so as to save computation time. For example, we need to evaluate $(Y-X_{(jg)}\beta_{(jg)})^T V^{-1}x_{jg}b_{jg}$ for the posterior distribution of $\tau_g$. Evaluating $(Y-X_{(jg)}\beta_{(jg)})^T$ from scratch means finding $NT$ elements $N$ times, making the computation time of order $N^2T$. Instead, we calculate $(Y-X\beta)^T$ once, which uses $NT$ operations, and then, $N$ times, we adjust the stored value of $(Y-X\beta)^T$ to obtain $(Y-X_{(jg)}\beta_{(jg)})^T$; this modification uses roughly $NT+N$ operations. Using such matrix algebra tricks we are able to bring down the Gibbs sampler run time considerably.

4.5 Simulation

To understand how our proposed Bayesian hierarchical, sparsity-inducing model works, we perform a simulation study. In the two previous simulation studies we simulated datasets by assigning true parameter values and generating responses from the relevant distribution; that was easy to build because the responses were independent. Now we have dependent responses. There are ways of generating dependent responses, but what is unique in this chapter is that we want a simulated dataset that represents an fMRI dataset. This is not an easy task, since an fMRI dataset has very specific patterns. Welvaert et al. (2011) created an R package called "neuRosim" that produces fMRI-like data replicating real fMRI data, and we have used this package to generate the data in this section.

4.5.1 Generating Simulated fMRI Data

We generate a stimulus with a total time of 390 seconds and a repetition time of 2 seconds, giving 195 time series points in the BOLD signals. The stimulus is a boxcar function which is then convolved with a double-gamma density to make it continuous; Figure 4.3 plots the simulated stimulus. The stimulus onset times are at 60, 120, 210, 270 and 330 seconds.

Figure 4.3: Stimuli convolved with double-gamma density (red) and the corresponding boxcar function (green)

We generate 4 regions of activation on a 2D lattice at coordinates (10,42), (25,35), (10,10) and (40,31), with radii 3, 4, 3 and 3, respectively. A mixture noise with autocorrelations 0.4 and -0.25 is used to simulate 2,500 voxels, with a signal-to-noise ratio of 4.5.
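As an illustration of this stimulus construction, the following R sketch builds the 0-1 boxcar with the onset times given above and convolves it with a double-gamma HRF. The HRF parameter values and the 20 s block duration are illustrative assumptions, not values taken from the text.

```r
# Boxcar stimulus (TR = 2 s, 195 scans, onsets as in the text) convolved with a
# double-gamma HRF. The HRF parameters and the 20 s block length are assumptions.
TR     <- 2
n_scan <- 195
t_scan <- (seq_len(n_scan) - 1) * TR
onsets <- c(60, 120, 210, 270, 330)
dur    <- 20                                     # assumed block duration (s)

boxcar <- as.numeric(sapply(t_scan, function(t) any(t >= onsets & t < onsets + dur)))

# Double-gamma HRF: difference of two gamma densities.
hrf <- function(t, a1 = 6, a2 = 16, b = 1, c = 1/6) {
  dgamma(t, shape = a1, rate = b) - c * dgamma(t, shape = a2, rate = b)
}
h <- hrf(seq(0, 30, by = TR))

# Discrete convolution, truncated back to the scan length; this is the regressor x_t.
x <- convolve(boxcar, rev(h), type = "open")[seq_len(n_scan)]
x <- x / max(x)                                  # rescale for plotting/comparison
# plot(t_scan, x, type = "l"); lines(t_scan, boxcar, lty = 2)   # cf. Figure 4.3
```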
4.5 Simulation

To understand how our proposed Bayesian hierarchical, sparsity-inducing model works, we perform a simulation study. In the previous two simulation studies, we simulated the datasets by assigning true parameter values and generating data from the relevant distribution. That worked well and was easy to set up because the responses were independent; now we have dependent responses. There are ways of generating dependent responses, but what is unique in this chapter is that we want a simulated dataset that represents an fMRI dataset. This is not an easy task, since an fMRI dataset has very specific patterns. Welvaert et al. (2011) created an R package called "neuRosim" that produces data replicating real fMRI data. We have used this package to generate the data in this section.

4.5.1 Generating Simulated fMRI Data

We generate a stimulus with a total time of 390 seconds and a repetition time of 2 seconds. Thus, there are 195 time series points in the BOLD signals. The stimulus is a boxcar function which is then convolved with a double-gamma density to make it continuous. Figure 4.3 plots the simulated stimulus. The stimulus onset times are at 60, 120, 210, 270 and 330 seconds.

Figure 4.3: Stimulus convolved with a double-gamma density (red) and the corresponding boxcar function (green)

We generate 4 regions of activation on a 2D lattice at coordinates (10,42), (25,35), (10,10) and (40,31), with radii 3, 4, 3 and 3 respectively. A mixture noise with autocorrelations 0.4 and −0.25 is used for simulating 2,500 voxels, and a signal-to-noise ratio of 4.5 is used. A hedged sketch of the stimulus construction is given below.
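The stimulus construction can be illustrated as follows. This is a hedged sketch rather than the exact simulation code: only the onset times, total time, TR and the double-gamma convolution are taken from the description above, the block duration and effect size are assumed for illustration, and the argument names follow the neuRosim documentation (Welvaert et al., 2011) but may differ across package versions.

```r
## Sketch: a boxcar design convolved with a double-gamma HRF via neuRosim.
library(neuRosim)

design <- specifydesign(onsets     = c(60, 120, 210, 270, 330),  # onsets from the text (s)
                        durations  = 20,                         # assumed block length (s)
                        totaltime  = 390,                        # total scan time (s)
                        TR         = 2,                          # repetition time (s)
                        effectsize = 1,                          # assumed amplitude
                        conv       = "double-gamma")             # HRF convolution

plot(seq(2, 390, by = 2), design, type = "l",
     xlab = "time (s)", ylab = "expected BOLD response")
```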
4.5.2 Analysis Results of the Simulation Study

The left-hand panel of Figure 4.4 shows the activated regions in our simulated 2D lattice. Our aim is to analyze the simulated dataset of 50 × 50 voxels on a 2D lattice for 195 time points. Our Bayesian hierarchical bi-level selection method is able to detect the true activated regions with very high accuracy. To assess the accuracy of our method, we compare it to the results of Musgrove et al. (2016) for the cases where they analyzed the entire brain and the parcellated brain in their simulation study. The comparison is valid since both studies simulated 2,500 voxels on a 2D lattice with 195 time points, and the stimuli in both studies have the same total duration and onset times. Table 4.1 gives the results. The median accuracy of our method is 99.6%, which is better than both methods of Musgrove et al. (2016); the false positive rates are also lower for our approach.

Table 4.1: Median accuracy (minimum and maximum) of correct classification and false positives over 10 simulated datasets

Method                                         Accuracy (%)         False positive
Bayesian hierarchical bi-level selection       99.6 (98.9, 100)     0.006 (0, 0.008)
Musgrove et al. (2016), full dataset           98.7 (96.0, 99.3)    0.015 (0.007, 0.044)
Musgrove et al. (2016), parcellated dataset    99.4 (98.9, 99.7)    0.007 (0.003, 0.011)

The right-hand panel of Figure 4.4 shows the selected voxels in white. Thus, it is evident from the results that our proposed method is an improved variable selection method. Musgrove et al. (2016) parcellated the brain in order to run the analysis in parallel for a significant reduction in time. In our real data analysis section, we use a two-stage procedure, thus reducing our computation time significantly.

Figure 4.4: True and estimated binary map

4.6 Single-Subject fMRI Data Analysis

4.6.1 Data Acquisition

A healthy college student from Michigan State University volunteered to participate in this study and signed consent forms approved by the Michigan State University Institutional Review Board. The experiment was conducted on a 3T GE Signa HDx MR scanner (GE Healthcare, Waukesha, WI) with an 8-channel head coil. An fMRI dataset was collected under a visual stimulation condition with a scene-object fMRI paradigm. The parameters for the fMRI scan were: gradient-echo EPI, 36 contiguous 3-mm axial slices in an interleaved order, time of echo (TE) = 27.7 ms, time of repetition (TR) = 2500 ms, flip angle = 80°, field of view (FOV) = 22 cm, matrix size = 64 × 64, ramp sampling, and with the first four data points discarded. Each volume of images was acquired 192 times (8 min) while the subject was presented with 12 blocks of visual stimulation after an initial 10 s "resting" period. In a predefined randomized order, scenery pictures were presented in 6 blocks and object pictures were presented in the other 6 blocks. All pictures were unique. In each block, 10 pictures were presented continuously for 25 s (2.5 s for each picture), followed by a 15 s baseline condition (a white screen with a black fixation cross at the center). The subject needed to press his/her right index finger once when the screen switched from the baseline to the picture condition. Stimuli were displayed in color in full screen on a 1024 × 768 32-inch LCD monitor (Salvagione Design, Sausalito, CA) placed at the back of the magnet room. The LCD subtended 10.2° × 13.1° of visual angle. After the functional data acquisition, high-resolution volumetric T1-weighted spoiled gradient-recalled (SPGR) images with cerebrospinal fluid suppressed were obtained to cover the whole brain with 120 1.5-mm sagittal slices, an 8° flip angle and a 24 cm FOV. These images were used to identify anatomical locations.

4.6.2 fMRI Data Pre-processing and Analysis

All stimulus fMRI data pre-processing and analysis for each subject were conducted with the AFNI software (Cox, 1996), as described in Henderson et al. (2011). Essentially, slice-timing correction and rigid-body motion correction were carried out. Spatial blurring with a full width at half maximum of 4 mm was applied to reduce random noise. Multiple linear regressions (using the "3dDeconvolve" routine in AFNI) were applied on a voxel-wise basis to find the magnitude change when each picture condition was presented, followed by general linear tests to find the statistical significance of differences between stimulus conditions. We apply a two-stage method: first we reduce the number of voxels based on a p-value analysis for each voxel, ignoring spatial correlation, and then we apply our method in the second stage. The first step is a data processing step, since many of the voxels captured in the MRI scan lie outside the brain, and the p-value based analysis can easily identify the non-affected voxels. We have 64 × 64 voxels and 36 slices of the brain, and each of these voxels was captured at 192 time points. This makes our data size of the order of 28 million. We applied a voxel-by-voxel regression analysis and retained voxels that had a p-value < 5 × 10⁻⁴. This reduces the number of voxels to 6,118, so the data size is now of the order of 1 million. In the second step, we apply our proposed Bayesian spike and slab bi-level selection to the reduced data. A hedged sketch of the first-stage screening is given below.
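The first-stage screening can be sketched as follows. This is an illustration rather than the AFNI pipeline actually used: it assumes a voxel-wise linear regression of each BOLD time series on the stimulus regressor(s) with an overall F-test, with only the retention threshold p < 5 × 10⁻⁴ taken from the text; the object names (bold, design, screen_voxels) are ours.

```r
## Sketch: voxel-wise regression screening; retain voxels with p < 5e-4.
screen_voxels <- function(bold, design, alpha = 5e-4) {
  # bold: time-by-voxel matrix of BOLD signals; design: stimulus regressor(s).
  pvals <- apply(bold, 2, function(y) {
    fit <- summary(lm(y ~ design))
    f <- fit$fstatistic                       # overall F statistic and its degrees of freedom
    pf(f[1], f[2], f[3], lower.tail = FALSE)  # p-value of the voxel-wise regression
  })
  which(pvals < alpha)                        # indices of the retained voxels
}

## Toy usage with 192 time points and 100 voxels of pure noise:
bold   <- matrix(rnorm(192 * 100), nrow = 192)
design <- rnorm(192)
keep   <- screen_voxels(bold, design)
```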
The subject was shown two stimuli, a scene and an object. From the first step, we know that the 6,118 retained voxels represent the regions of interest in the brain that are activated in some way by the visual stimulation. Thus, we expect both the stimuli and the most activated voxels to be selected by our model. Henderson et al. (2007) conducted a similar study with indoor and outdoor scenes and faces as stimuli. They conclude that there is activation in the posterior parahippocampal cortex (pPHC), including the parahippocampal place area (PPA), for scenes, while the face stimuli activate the fusiform gyrus and amygdala. Epstein et al. (1999, 2003) and Epstein and Kanwisher (1998) have shown that scenes activate the PPA more than faces, single objects and object arrays. Our results show the activation of the PPA in the scene-object contrast. Figure 4.5 shows the activated areas in orange. The figure illustrates that scenes preferentially activate the PPA over objects. This finding is in line with Epstein et al. (1999, 2003) and Henderson et al. (2007). Blue regions show activation by the object stimuli, indicating activation of the fusiform gyrus and amygdala. The scene and object activations of brain regions can also be looked at separately (as in Figure 4.6). Figure 4.6 shows that both scene and object activate similar regions of interest in the brain that are traditionally related to visual activation. The coronal view specifically shows that objects preferentially activate the fusiform gyrus, whereas scenes activate the pPHC more than objects.

Figure 4.5: Scene minus object contrast activation

Figure 4.6: Scene and object activation

From our results, we see that the activation behavior of objects is very similar to that of the face stimuli in Henderson et al. (2007). Our model is able to select both stimuli, which activate the regions of interest associated with the visual cortex. The model performs a bi-level selection by choosing not just the stimuli but also the activated voxels, and the selection keeps the spatial clustering of the voxels intact by choosing clustered voxels in the visual cortex. Henderson et al. (2007) performed their analysis without considering any spatial correlation. Although our results are aligned with theirs, it is difficult to judge directly which method is better, since the stimuli used and the objectives of the two studies are different. Our method has an advantage because we take into account the spatial information which they completely ignored.

4.7 Discussion

In this chapter we have extended our model from Chapter 3 to introduce bi-level selection of covariates. The idea is to select a significant group and a few significant levels within that group. This has further been combined with a spatio-temporal structure, in which we introduce temporal correlation in the likelihood and spatial correlation in the prior of β. This novel method of Bayesian hierarchical modeling is applied to single-subject fMRI data. Our goal here is to select the voxels activated in the presence of an external stimulus. Alongside the proposed variable selection method, we introduce BOLD activation in fMRI experiments. We want to find how the brain is affected (activated) by a stimulus; this will help medical professionals understand the neuronal activity in human brains, which can further help treat various diseases or find the root causes of certain human behavior.

A thorough simulation study compares our results with those of Musgrove et al. (2016). Our method gives a lower false positive rate, indicating that it performs better. Our simulation was conducted in a 2D setup. We extend our method to 3D real data, acquired from the Department of Radiology, MSU. Our method correctly identifies the activation of the visual cortex in the presence of scene and object stimuli. It shows that scenes preferentially select PPA regions of interest while objects select the fusiform gyrus and amygdala. These results align with those of Epstein et al. (1999, 2003). Thus, we are able to verify that our novel approach to variable selection has a very relevant application to fMRI data. The relevance of understanding fMRI data is immense, and handling such huge data sizes is a challenge. We overcome this by introducing a two-step procedure: the first step filters the data voxel by voxel, and the second step incorporates the proposed method for variable selection. Castruccio et al. (2016) used a similar three-step approach with region of interest (ROI) information added to the voxel-level data. Our data do not include ROI information, yet the method does a very good job of selecting the truly activated voxels. Thus, we use a less complicated model than Castruccio et al. (2016) but a more statistically relevant model than Henderson et al. (2007) to identify BOLD activation in fMRI data.

Chapter 5

Future Work

This dissertation presents three varied applications of statistical methodology in fields ranging from marketing research to brain image data. Our proposed methods are state-of-the-art approaches for dealing with the analyzed data. Our main focus has been variable selection. When the number of parameters becomes larger than the number of observations, ordinary regression methods fail. This has led to the development of numerous regularized modeling approaches. In our marketing research data application, we have used frequentist-style variable selection methods, namely LASSO and elastic net. For the real data analysis, we convert the 6 categories of the response into 6 different sets of binary variables, one versus the others. Our comparative study has been performed on linear and binary outcomes. A very interesting piece of future work will be to build a model for the multi-category response variable directly: a multinomial logistic regression with LASSO and/or elastic net regularization can be used, as in the sketch below.
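A hedged sketch of what such an extension might look like is given below; it is not part of the thesis analysis, and the predictors and 6-category response are simulated stand-ins for the survey data. Setting alpha = 1 gives the LASSO and 0 < alpha < 1 the elastic net.

```r
## Sketch: penalized multinomial logistic regression with glmnet.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(300 * 20), nrow = 300, ncol = 20)   # stand-in predictors
y <- factor(sample(1:6, 300, replace = TRUE))         # stand-in 6-category response

cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 0.5)  # tuning parameter by CV
coef(cvfit, s = "lambda.min")   # one sparse coefficient vector per response category
```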
Next, we have performed variable selection on a binary response (Alzheimer's disease or normal control) in a Bayesian group lasso setup. We propose a posterior median thresholding estimator of the βs and use a spike and slab type prior. A logit link used for this setup works well in terms of the true and false positive rates of prediction. A high area under the ROC curve (AUC) also indicates that our method performs competitively. For further research, it will be interesting to explore the impact of a probit link for this model setup. The logit link helps us calculate relative risks and odds ratios for the significant ROIs; however, for another interpretation of the βs, we can explore the use of a probit link in our model. There is scope for comparing models that use two different link functions but are essentially the same model. Another potential variation of the model is to approximate the intractable posterior of β by a numerical method (e.g., a Laplace approximation). This proposed method has tremendous scope for future work, given the numerous possible trajectories of research.

Our last application is based on a giant dataset of brain voxels in a 3D image observed at hundreds of time points. We have used spatio-temporal modeling to perform variable selection by extending our Bayesian group lasso approach to select both the stimuli administered for monitoring brain activity and the brain voxels. The biggest challenge with this data is its size. We perform the analysis by filtering the more active brain regions in a preliminary analysis to reduce the data size; in the second step, we apply our Bayesian variable selection method. Further research should be continued to find a method that is able to handle the entire dataset at once; such a finding would be a great step ahead in today's challenge of handling big data. Our model does a great job of giving a low false positive rate and is shown to be better than the method of Musgrove et al. (2016). This piece of work builds an associative model that finds the most active voxels in order to further identify the relation between neuronal dysfunction and a disease. We can extend this model to a predictive model, where one may want to predict the state of a patient given the activity in the selected voxels. This requires building a model, inclusive of the variable selection phase, on a training dataset and validating it using a test dataset. Our contribution through the novelty of these applications is substantial and has carved a way for intriguing future research work.

APPENDIX

Appendix A. Gibbs Sampler

The posterior full conditional distributions are derived from the joint posterior
$$p(\beta, \tau^2, \pi_0 \mid Y, X) \;\propto\; \prod_{i=1}^{n}\left(\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\right)^{y_i}\left(\frac{1}{1+e^{x_i^T\beta}}\right)^{1-y_i} \times \prod_{g=1}^{G}\left[(1-\pi_0)\,(2\pi\tau_g^2)^{-\frac{m_g}{2}}\, e^{-\frac{\beta_g^T\beta_g}{2\tau_g^2}}\, I_{[\beta_g \neq 0]} + \pi_0\,\delta_0(\beta_g)\right] \times \prod_{g=1}^{G}(\lambda^2)^{\frac{m_g+1}{2}}\,(\tau_g^2)^{\frac{m_g+1}{2}-1}\, e^{-\frac{\lambda^2\tau_g^2}{2}} \times \pi_0^{a-1}(1-\pi_0)^{b-1}. \tag{.1}$$

Let $\beta_{(g)}$ and $X_{(g)}$ be defined as follows:
$$\beta_{(g)} = (\beta_1^T, \dots, \beta_{g-1}^T, \beta_{g+1}^T, \dots, \beta_G^T), \qquad X_{(g)} = (X_1, \dots, X_{g-1}, X_{g+1}, \dots, X_G),$$
where $X_g$ is the design matrix corresponding to $\beta_g$. Detailed computation of the posterior full conditional distributions is given below.

(1) Note that $e^{x_i^T\beta} = e^{x_{i1}^T\beta_1 + \cdots + x_{iG}^T\beta_G}$. If $\beta_g = 0$ then $e^{x_i^T\beta} = e^{x_{i1}^T\beta_1 + \cdots + x_{ig}^T\cdot 0 + \cdots + x_{iG}^T\beta_G} = e^{x_{i(g)}^T\beta_{(g)}}$, where $x_{i(g)}$ corresponds to $\beta_{(g)}$.
$$p(\beta_g = 0 \mid \mathrm{rest}) = p(\beta_g = 0 \mid Y, X, \tau_g^2, \pi_0) = \frac{p(\beta_g = 0, Y \mid X, \tau_g^2, \pi_0)}{\int_{\beta_g} p(\beta_g, Y \mid X, \tau_g^2, \pi_0)\, d\beta_g} = \frac{A\pi_0}{A\pi_0 + B(1-\pi_0)},$$
where $\pi_0 = p(\beta_g = 0 \mid \tau_g^2, \pi_0)$,
$$A = p(Y \mid \beta_g = 0, X, \tau_g^2, \pi_0) = \prod_{i=1}^{n} \frac{e^{y_i x_{i(g)}^T\beta_{(g)}}}{1+e^{x_{i(g)}^T\beta_{(g)}}},$$
$$B = \int_{\beta_g \neq 0} p(Y \mid \beta_g, X, \tau_g^2, \pi_0)\,(2\pi\tau_g^2)^{-\frac{m_g}{2}}\, e^{-\frac{\beta_g^T\beta_g}{2\tau_g^2}}\, d\beta_g = \int_{\beta_g \neq 0} \left[\prod_{i=1}^{n}\frac{e^{y_i x_i^T\beta}}{1+e^{x_i^T\beta}}\right](2\pi\tau_g^2)^{-\frac{m_g}{2}}\, e^{-\frac{\beta_g^T\beta_g}{2\tau_g^2}}\, d\beta_g.$$
The integral in the term B is complicated, so we approximate it by Monte Carlo: we draw i.i.d. samples of $\beta_g$ from a multivariate normal distribution with mean 0 and variance $\tau_g^2 I_{m_g}$ and average the likelihood $\prod_{i=1}^{n} e^{y_i x_i^T\beta}/(1+e^{x_i^T\beta})$ over these draws.

Let $l_g = p(\beta_g = 0 \mid \mathrm{rest})$. From the posterior full conditionals, we have that
$$\beta_g \mid \mathrm{rest} \;\sim\; (1 - l_g)\, F + l_g\, \delta_0(\beta_g), \tag{.2}$$
where $F$ is some distribution whose form is unattainable. By collecting the terms of $\beta_g$ from (.1) we do not get a closed form for the distribution of $\beta_g \mid \mathrm{rest}$; instead, we use (.2) and the Metropolis algorithm to draw samples from $\beta_g \mid \mathrm{rest}$ whenever $\beta_g \neq 0$.

The Metropolis algorithm: the Metropolis algorithm is a method of drawing samples from a posterior distribution when the posterior distribution does not have a closed form. Suppose we have a working collection $(\theta^{(1)}, \dots, \theta^{(s)})$ to which we would like to add a new value $\theta^{(s+1)}$. Consider adding a value $\theta^*$ that lies in the vicinity of $\theta^{(s)}$, sampled from a symmetric distribution. Let
$$r = \frac{p(\theta^* \mid y)}{p(\theta^{(s)} \mid y)},$$
and set
$$\theta^{(s+1)} = \begin{cases}\theta^*, & \text{with probability } \min(r, 1),\\ \theta^{(s)}, & \text{with probability } 1 - \min(r, 1).\end{cases}$$
Equivalently, sample $u \sim \mathrm{Uniform}(0, 1)$ and set $\theta^{(s+1)} = \theta^*$ if $u < r$, otherwise set $\theta^{(s+1)} = \theta^{(s)}$. We sample $\theta^*$ from a normal distribution with a proposed mean and variance. Then the Metropolis ratio is
$$r = \frac{p(\theta^* \mid y)}{p(\theta^{(s)} \mid y)} = \frac{p(y \mid \theta^*)\,p(\theta^*)}{p(y \mid \theta^{(s)})\,p(\theta^{(s)})} = \frac{\prod_{i=1}^{n} \mathrm{dbinom}(y_i, \theta^*)\,\mathrm{dnorm}(\theta^*, \mu, \tau)}{\prod_{i=1}^{n} \mathrm{dbinom}(y_i, \theta^{(s)})\,\mathrm{dnorm}(\theta^{(s)}, \mu, \tau)}.$$
Thus, for our problem,
$$\log r = \sum_{i=1}^{n}\left[\log \mathrm{dbinom}(y_i, \beta_g^*) - \log \mathrm{dbinom}(y_i, \beta_g^{(s)})\right] + \sum_{j=1}^{m_g}\left[\log \mathrm{dnorm}(\beta_{gj}^*) - \log \mathrm{dnorm}(\beta_{gj}^{(s)})\right].$$
Using this algorithm we approximate $F$ in (.2), where $l_g$ is approximated by the Monte Carlo method. Thus, (.2) generates $\beta_g \mid \mathrm{rest}$.

(2)
$$p(\tau_g^2 \mid \mathrm{rest}) \;\propto\; (\lambda^2)^{\frac{m_g+1}{2}}(\tau_g^2)^{\frac{m_g+1}{2}-1}e^{-\frac{\lambda^2\tau_g^2}{2}} \times \left[(1-\pi_0)(2\pi\tau_g^2)^{-\frac{m_g}{2}}\, e^{-\frac{\beta_g^T\beta_g}{2\tau_g^2}}\, I_{[\beta_g \neq 0]} + \pi_0\,\delta_0(\beta_g)\right].$$
If $\beta_g = 0$:
$$p(\tau_g^2 \mid \mathrm{rest}) \;\propto\; (\tau_g^2)^{\frac{m_g+1}{2}-1}\, e^{-\frac{\lambda^2\tau_g^2}{2}}.$$
Letting $\tau_g^2 = 1/\alpha_g^2$ we have
$$p(\alpha_g^2 \mid \mathrm{rest}) \;\propto\; (\alpha_g^2)^{-\frac{m_g+1}{2}-1}\, e^{-\frac{\lambda^2}{2\alpha_g^2}}.$$
If $\beta_g \neq 0$:
$$p(\tau_g^2 \mid \mathrm{rest}) \;\propto\; (\lambda^2)^{\frac{m_g+1}{2}}(\tau_g^2)^{\frac{m_g+1}{2}-1}e^{-\frac{\lambda^2\tau_g^2}{2}}\,(1-\pi_0)(2\pi\tau_g^2)^{-\frac{m_g}{2}}\, e^{-\frac{\beta_g^T\beta_g}{2\tau_g^2}} \;\propto\; (\tau_g^2)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\left(\frac{\beta_g^T\beta_g}{\tau_g^2} + \lambda^2\tau_g^2\right)}.$$
Letting $\tau_g^2 = 1/\alpha_g^2$ we have
$$p(\alpha_g^2 \mid \mathrm{rest}) \;\propto\; (\alpha_g^2)^{-\frac{3}{2}}\, e^{-\frac{\beta_g^T\beta_g\,\alpha_g^2}{2} - \frac{\lambda^2}{2\alpha_g^2}}.$$
Thus,
$$\alpha_g^2 \mid \mathrm{rest} \;\sim\; \begin{cases}\text{Inverse Gamma}\left(\dfrac{m_g+1}{2}, \dfrac{\lambda^2}{2}\right) & \text{if } \beta_g = 0,\\[8pt] \text{Inverse Gaussian}\left(\sqrt{\dfrac{\lambda^2}{\|\beta_g\|^2}},\, \lambda^2\right) & \text{if } \beta_g \neq 0.\end{cases}$$

(3)
$$p(\pi_0 \mid \mathrm{rest}) \;\propto\; \prod_{g=1}^{G}\left[(1-\pi_0)(2\pi\tau_g^2)^{-\frac{m_g}{2}}\, e^{-\frac{\beta_g^T\beta_g}{2\tau_g^2}}\, I_{[\beta_g \neq 0]} + \pi_0\,\delta_0(\beta_g)\right] \times \pi_0^{a-1}(1-\pi_0)^{b-1}.$$
Let $t$ be the number of non-zero $\beta_g$'s, and define $Z_g = 0$ if $\beta_g = 0$ and $Z_g = 1$ if $\beta_g \neq 0$, so that $\sum_{g=1}^{G} Z_g = t$. Then
$$p(\pi_0 \mid \mathrm{rest}) \;\propto\; (1-\pi_0)^{t+b-1}\, \pi_0^{G-t+a-1},$$
$$\pi_0 \mid \mathrm{rest} \;\sim\; \mathrm{Beta}\left(G - \sum_{g=1}^{G} Z_g + a,\; \sum_{g=1}^{G} Z_g + b\right).$$
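As an illustration of the Metropolis update of a non-zero $\beta_g$ in step (1), the following is a minimal sketch under a random-walk proposal; the objects Xg, y, beta_rest_eta and the step size are illustrative stand-ins rather than the thesis code.

```r
## Sketch: random-walk Metropolis for beta_g with a Bernoulli-logit likelihood
## and a N(0, tau_g^2 I) prior, on simulated stand-in data.
set.seed(1)
n <- 100; m_g <- 3
Xg <- matrix(rnorm(n * m_g), n, m_g)          # design columns of group g
y  <- rbinom(n, 1, 0.5)                       # binary response
beta_rest_eta <- rep(0, n)                    # contribution of X_(g) beta_(g), held fixed here
tau2_g <- 1

log_post <- function(bg) {
  eta <- beta_rest_eta + Xg %*% bg
  sum(dbinom(y, 1, plogis(eta), log = TRUE)) +          # log-likelihood
    sum(dnorm(bg, 0, sqrt(tau2_g), log = TRUE))         # log N(0, tau_g^2 I) prior
}

beta_g <- rep(0, m_g); step <- 0.2
for (s in 1:2000) {
  prop <- beta_g + step * rnorm(m_g)                    # symmetric random-walk proposal
  log_r <- log_post(prop) - log_post(beta_g)            # log Metropolis ratio
  if (log(runif(1)) < log_r) beta_g <- prop             # accept with probability min(r, 1)
}
```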
(4) $\lambda$ is the tuning parameter: a large value of $\lambda$ shrinks the coefficients excessively, while a small value of $\lambda$ results in a diffuse distribution, so the value of $\lambda$ should be chosen carefully. Aligning with the spirit of Xu and Ghosh (2015), we update
$$\lambda_{(k)} = \frac{p + G}{\sum_{g=1}^{G} E_{\lambda_{(k-1)}}\big[\tau_g^2 \mid Y\big]},$$
where the posterior expectation of $\tau_g^2$ is approximated by the sample average of the $\tau_g^2$ values generated in the Gibbs sampler run with $\lambda_{(k-1)}$.

BIBLIOGRAPHY

[1] Akaike, H. (1973). “Maximum likelihood identification of Gaussian autoregressive moving average models.” Biometrika, 255-265.
[2] Albaum, G., Wiley, J., Roster, C., and Smith, S. M. (2011). “Visiting item non-responses in internet survey data collection.” International Journal of Market Research, 53(5), 687-703.
[3] Arlt, S., Buchert, R., Spies, L., Eichenlaub, M., Lehmbeck, J. T., and Jahn, H. (2013). “Association between fully automated MRI-based volumetry of different brain regions and neuropsychological test performance in patients with amnestic mild cognitive impairment and Alzheimer’s disease.” European Archives of Psychiatry and Clinical Neuroscience, 263(4), 335-344.
[4] Bae, K., and Mallick, B. K. (2004). “Gene Selection Using a Two-Level Hierarchical Bayesian Model.” Bioinformatics, 20(18), 3423-3430.
[5] Breiman, L. (1995). “Better subset regression using the nonnegative garrote.” Technometrics, 37(4), 373-384.
[6] Cai, S., Huang, L., Zou, J., Jing, L., Zhai, B., Ji, G., von Deneen, K. M., Ren, J., Ren, A., and Alzheimer’s Disease Neuroimaging Initiative. (2015). “Changes in thalamic connectivity in the early and late stages of amnestic mild cognitive impairment: a resting-state functional magnetic resonance study from ADNI.” PloS One, 10(2), e0115573.
[7] Casella, G. (2001). “Empirical Bayes Gibbs sampling.” Biostatistics, 2(4), 485-500.
[8] Castruccio, S., Ombao, H., and Genton, M. G. (2016). “A multi-resolution spatio-temporal model for brain activation and connectivity in fMRI data.” arXiv preprint arXiv:1602.02435.
[9] Chatterjee, A., and Lahiri, S. N. (2011). “Bootstrapping lasso estimators.” Journal of the American Statistical Association, 106(494), 608-625.
[10] Chen, Z., and Dunson, D. B. (2003). “Random effects selection in linear mixed models.” Biometrics, 59(4), 762-769.
[11] Clerx, L., Jacobs, H. I. L., Burgmans, S., Gronenschild, E. H. B. M., Uylings, H. B. M., Echavarri, C., Visser, P. J., Verhey, F. R. J., and Aalten, P. (2013). “Sensitivity of different MRI-techniques to assess gray matter atrophy patterns in Alzheimer’s disease is region-specific.” Current Alzheimer Research, 10(9), 940-951.
[12] Cox, R. W. (1996). “AFNI: software for analysis and visualization of functional magnetic resonance neuroimages.” Computers and Biomedical Research, 29(3), 162-173.
[13] Craig, C. S., and Douglas, S. P. (2000). “Conducting international marketing research in the twenty-first century.” International Marketing Review, 18(1), 80-90.
[14] Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehéricy, S., Habert, M. O., Chupin, M., Benali, H., Colliot, O., and Alzheimer’s Disease Neuroimaging Initiative. (2011). “Automatic classification of patients with Alzheimer’s disease from structural MRI: a comparison of ten methods using the ADNI database.” Neuroimage, 56(2), 766-781.
[15] de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D., and Calus, M. P. (2013). “Whole-genome regression and prediction methods applied to plant and animal breeding.” Genetics, 193(2), 327-345.
[16] D’Elia, A., and Piccolo, D. (2005). “A mixture model for preferences data analysis.” Computational Statistics and Data Analysis, 49(3), 917-934.
[17] Didow, N. M., Perreault, W. D., and Williamson, N. C. (1983).
“A cross-sectional optimal scaling analysis of the index of consumer sentiment.” Journal of Consumer Research, 10(3), 339-347. [18] Dolnicar, S., and Gruen, B. (2014). ”Including Don’t know answer options in brand image surveys improves data quality”. International Journal of Market Research, 56(1), 35-50. [19] Double, K.L., Halliday, G.M., Krill, J.J., Harasty, J.A., Cullen, K., Brooks, W.S., Creasey, H. and Broe, G.A. (1996). “Topography of brain atrophy during normal aging and Alzheimer’s disease.” Neurobiology of aging, 17(4), 513-521. [20] Drolet, A. L., and Morrison, D. G. (2001). ”Do we really need multiple-item measures in service research?” Journal of Service Research, 3(3), 196-204. [21] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). “Least angle regression.” The Annals of Statistics, 32(2), 407-499. [22] Epstein, R., and Kanwisher, N. (1998). “A cortical representation of the local visual environment.” Nature, 392(6676), 598-601. [23] Epstein, R., Harris, A., Stanley, D., and Kanwisher, N. (1999). “The parahippocampal place area: recognition, navigation, or encoding?” Neuron, 23(1), 115-125. [24] Epstein, R., Graham, K. S., and Downing, P. E. (2003). “Viewpoint-specific scene representations in human parahippocampal cortex” Neuron, 37(5), 865-876. [25] Fahrmeir, L., and Gossl, C. (2002). “Semiparametric Bayesian models for human brain mapping.” Statistical Modelling, 2(3), 235-249. 129 [26] Fan, J., and Li, R. (2001). ”Variable selection via nonconcave penalized likelihood and its oracle properties.” Journal of the American statistical Association, 96(456), 13481360. [27] Fan, J., and Lv, J. (2010). “A selective overview of variable selection in high dimensional feature space.” Statistica Sinica, 20(1), 101. [28] Fan, Y., and Tang, C. Y. (2013). ”Tuning parameter selection in high dimensional penalized likelihood.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 531-552. [29] Feng, R., Wang, H., Wang, J., Shrom, D., Zeng, X., and Tsien, J. Z. (2004). “Forebrain degeneration and ventricle enlargement caused by double knockout of Alzheimer’s presenilin-1 and presenilin-2.” Proceedings of the National Academy of Sciences of the United States of America, 101(21), 8162-8167. [30] Figueiredo, M. A. (2003). “Adaptive sparseness for supervised learning.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9), 1150-1159. [31] Friston, K., Ashburner, J., Frith, C. D., Poline, J. B., Heather, J. D., and Frackowiak, R. S. (1995). “Spatial registration and normalization of images.” Human brain mapping, 3(3), 165-189. [32] Galton, C. J., Patterson, K., Graham, K., Lambon-Ralph, M. A., Williams, G., Antoun, N., Sahakian,B. J. and Hodges, J. R. (2001). “Differing patterns of temporal atrophy in Alzheimer’s disease and semantic dementia.” Neurology, 57(2), 216-225. [33] George, E. I., and McCulloch, R. E. (1997). “Approaches for Bayesian variable selection.” Statistica sinica, 339-373. [34] Gitelman, D. R., Penny, W. D., Ashburner, J., and Friston, K. J. (2003). “Modeling regional and psychophysiologic interactions in fMRI: the importance of hemodynamic deconvolution.” Neuroimage, 19(1), 200-207. [35] Gossl, C., Auer, D. P., and Fahrmeir, L. (2001). “Bayesian spatiotemporal inference in functional magnetic resonance imaging.” Biometrics, 57(2), 554-562. [36] Grimmer, T., Riemenschneider, M., Forstl, H., Henriksen, G., Klunk, W.E., Mathis, C.A., Shiga, T., Wester, H.J., Kurz, A. and Drzezga, A (2009). 
“Beta amyloid in Alzheimer’s disease: increased deposition in brain is reflected in reduced concentration in cerebrospinal fluid.” Biological psychiatry, 65(11), 927-934. [37] Hartig, M., Truran-Sacrey, D., Raptentsetsang, S., Simonson, A., Mezher, A., Schuff, N., and Weiner, M. (2014). “UCSF FreeSurfer Methods.” 130 [38] Hastie, T., Tibshirani, R., and Friedman, J. (2009). ”The elements of statistical learning: data mining, inference and prediction.” The Mathematical Intelligencer, 27(2), 83-85. [39] Hawkins, D. I., and Coney, K. A. (1981). ”Uninformed response error in survey research.” Journal of Marketing Research, 370-374. [40] Hawkins, D. I., Coney, K. A., and Jackson Jr, D. W. (1988). ”The impact of monetary inducement on uninformed response error. ” Journal of the Academy of Marketing Science, 16(2), 30-35. [41] Henderson, J. M., D. C. Zhu and C. L. Larson (2011). “Functions of parahippocampal place area and retrosplenial cortex in real-world scene analysis: An fMRI study.” Visual Cognition, 19(7), 910-927. [42] Henderson, J. M., Larson, C. L., and Zhu, D. C. (2007). “Cortical activation to indoor versus outdoor scenes: an fMRI study.” Experimental Brain Research, 179(1), 75-84. [43] Herholz, K., Salmon, E., Perani, D., Baron, J. C., Holthoff, V., Frlich, L., Schnknecht, P., Ito, K., Mielke, R., Kalbe, E. and Zndorf, G. (2002). “Discrimination between Alzheimer dementia and controls by automated analysis of multicenter FDG PET.” Neuroimage, 17(1), 302-316. [44] Hoerl, A. E., and Kennard, R. W. (1970). ”Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics, 12(1), 55-67. [45] Hughes, J., and Haran, M. (2013). “Dimension reduction and alleviation of confounding for spatial generalized linear mixed models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 139-159. [46] Jack, C.R., Petersen, R.C., Xu, Y.C., Waring, S.C., O’Brien, P.C., Tangalos, E.G., Smith, G.E., Ivnik, R.J. and Kokmen, E. (1997). “Medial temporal atrophy on MRI in normal aging and very mild Alzheimer’s disease.” Neurology, 49(3), 786-794. [47] Jack, C.R., Petersen, R.C., Xu, Y.C., O’Brien, P.C., Smith, G.E., Ivnik, R.J., Boeve, B.F., Waring, S.C., Tangalos, E.G. and Kokmen, E. (1999) “Prediction of AD with MRIbased hippocampal volume in mild cognitive impairment.” Neurology, 52(7), 1397-1397. [48] Jiang, W. (2007). “Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities.” The Annals of Statistics, 35(4), 14871511. [49] Johnstone, I. M., and Silverman, B. W. (2004). “Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences.” Annals of Statistics, 1594-1649. 131 [50] Juottonen, K., Laakso, M. P., Insausti, R., Lehtovirta, M., Pitknen, A., Partanen, K.,and Soininen, H. (1998). “Volumes of the entorhinal and perirhinal cortices in Alzheimer’s disease.” Neurobiology of aging, 19(1), 15-22. [51] Kim, Y., Kwon, S., and Choi, H. (2012). ”Consistent model selection criteria on high dimensions.” Journal of Machine Learning Research, 13(Apr), 1037-1057. [52] Knight, K., and Fu, W. (2000). “Asymptotics for lasso-type estimators.” Annals of statistics, 1356-1378. [53] Kumar, V. (2000). ”International marketing research (pp. 225-226).” Upper Saddle River, NJ: Prentice Hall. [54] Kumar, V., and Pansari, A. (2016). ”Competitive advantage through engagement.” Journal of Marketing Research, 53(4), 497-514. [55] Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). 
“Penalized regression, standard errors, and Bayesian lassos.” Bayesian Analysis, 5(2), 369-411. [56] Leifer, B. P. (2003). “Early diagnosis of Alzheimer’s disease: clinical and economic benefits.” Journal of the American Geriatrics Society, 51(5s2), S281-S288. [57] Liang, F., Song, Q., and Yu, K. (2013). “Bayesian subset modeling for high-dimensional generalized linear models” Journal of the American Statistical Association, 108(502), 589-606. [58] Lpez, M.E., Brua, R., Aurtenetxe, S., Pineda-Pardo, J.., Marcos, A., Arrazola, J., Reinoso, A.I., Montejo, P., Bajo, R. and Maest, F. (2014). “Alpha-band hypersynchronization in progressive mild cognitive impairment: a magnetoencephalography study.” The Journal of Neuroscience, 34(44), 14551-14559. [59] Lykou, A., and Ntzoufras, I. (2013). “On Bayesian lasso variable selection and the specification of the shrinkage parameter.” Statistics and Computing, 23(3), 361-390. [60] Manisera, M., and Zuccolotto, P. (2014). ”Modeling ”don’t know” responses in rating scales.” Pattern Recognition Letters, 45, 226-234. [61] Manisera, M., and Zuccolotto, P. (2014). ”Modeling rating data with nonlinear CUB models.” Computational Statistics and Data Analysis, 78, 100-118. [62] Meier, L., Van De Geer, S., and Bhlmann, P. (2008). “The group lasso for logistic regression.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 53-71. [63] Mitchell, T. J., and Beauchamp, J. J. (1988).“Bayesian variable selection in linear regression.” Journal of the American Statistical Association, 83(404), 1023-1032. 132 [64] Money, R. B., Gilly, M. C., and Graham, J. L. (1998). ”Explorations of national culture and word-of-mouth referral behavior in the purchase of industrial services in the United States and Japan.” The Journal of Marketing, 76-87. [65] Musgrove, D. R., Hughes, J., and Eberly, L. E. (2016). “Fast, fully Bayesian spatiotemporal inference for fMRI data.” Biostatistics, 17(2), 291-303. [66] Nachum, L. (1994). ”The choice of variables for segmentation of the international market.” International Marketing Review, 11(3), 54-67. [67] Nakata, C., and Sivakumar, K. (1996). ”National culture and new product development: An integrative review.” The Journal of Marketing, 61-72. [68] Narisetty, N. N., and He, X. (2014). “Bayesian variable selection with shrinking and diffusing priors.” The Annals of Statistics, 42(2), 789-817. [69] Olusegun, A. M., Dikko, H. G., and Gulumbe, S. U. (2015). ”Identifying the Limitation of Stepwise Selection for Variable Selection in Regression Analysis.” American Journal of Theoretical and Applied Statistics, 4(5), 414-419. [70] Pandey, S., and Elliott, W. (2010). ”Suppressor variables in social work research: Ways to identify in multiple regression models.” Journal of the Society for Social Work and Research, 1(1), 28-40. [71] Park, T., and Casella, G. (2008). “The Bayesian Lasso.” Journal of the American Statistical Association, 103(482), 681-686. [72] Penny, W. D., Trujillo-Barreto, N. J., and Friston, K. J. (2005). “Bayesian fMRI time series analysis with spatial priors.” NeuroImage, 24(2), 350-362. [73] Petersen, J. A., Kushwaha, T., and Kumar, V. (2015). ”Marketing communication strategies and consumer financial decision making: The role of national culture.” Journal of Marketing, 79(1), 44-63. [74] Piccolo, D. (2003). ”On the moments of a mixture of uniform and shifted binomial random variables.” Quaderni di Statistica, 5(1), 85-104. [75] Reuter, M., Schmansky, N. J., Rosas, H. D., and Fischl, B. (2012). 
“Within-subject template estimation for unbiased longitudinal image analysis.” Neuroimage, 61(4), 14021418. [76] Rokeach, M. (1973). Value Survey. [77] Rutz, O. J., Trusov, M., and Bucklin, R. E. (2011). ”Modeling indirect effects of paid search advertising: which keywords lead to more future visits?.” Marketing Science, 30(4), 646-665. 133 [78] Schuman, H., and Presser, S. (1980). ”Public opinion and public ignorance: The fine line between attitudes and nonattitudes.” American Journal of Sociology, 85(5), 1214-1225. [79] Schwartz, S. H. (1992). ”Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries.” Advances in Experimental Social Psychology, 25, 1-65. [80] Schwarz, G. (1978). ”Estimating the dimension of a model.” The Annals of Statistics, 6(2), 461-464. [81] Siddiqui, M. M. (1958). “On the inversion of the sample covariance matrix in a stationary autoregressive process.” The Annals of Mathematical Statistics, 29(2), 585-588. [82] Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013). “A sparse-group lasso.” Journal of Computational and Graphical Statistics, 22(2), 231-245. [83] Smith, M., Ptz, B., Auer, D., and Fahrmeir, L. (2003). “Assessing brain activity through spatial Bayesian variable selection.” NeuroImage, 20(2), 802-815. [84] Smith, M., and Fahrmeir, L. (2007). “Spatial Bayesian variable selection with application to functional magnetic resonance imaging.” Journal of the American Statistical Association, 102(478), 417-431. [85] Smith, C. D., Andersen, A. H. and Gold, B. T. (2012). “Structural brain alterations before mild cognitive impairment in ADNI: validation of volume loss in a predefined antero-temporal region.” Journal of Alzheimer’s Disease, 31(s3), S49-S58. [86] Steenkamp, J. B. E., and Ter Hofstede, F. (2002). ”International market segmentation: issues and perspectives.” International Journal of Research in Marketing, 19(3), 185213. [87] Steenkamp, Jan-Benedict EM, Frenkel ter Hofstede, and Michel Wedel. (1999) ”A crossnational investigation into the individual and national cultural antecedents of consumer innovativeness.” The Journal of Marketing, 55-69. [88] Tadesse, M. G., Sha, N., and Vannucci, M. (2005).” Bayesian variable selection in clustering high-dimensional data.” Journal of the American Statistical Association, 100(470), 602-617. [89] Tibshirani, R. (1996). “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 267-288. [90] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). “Sparsity and smoothness via the fused lasso.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91-108. 134 [91] Verhoef, P. C., and Donkers, B. (2001). ”Predicting customer potential value an application in the insurance industry.” Decision Support Systems, 32(2), 189-199. [92] Wang, X., Nan, B., Zhu, J., and Koeppe, R. (2014). “Regularized 3D functional regression for brain image data via Haar wavelets.” The annals of applied statistics,8(2), 1045. [93] Welvaert, M., Durnez, J., Moerkerke, B., Verdoolaege, G., and Rosseel, Y. (2011). “neuRosim: An R package for generating fMRI data.” Journal of Statistical Software, 44(10), 1-18. [94] Xu, X., and Ghosh, M. (2015). “Bayesian variable selection and estimation for group lasso.” Bayesian Analysis, 10(4), 909-936. [95] Yuan, M., and Lin, Y. (2005). 
“Efficient empirical Bayes variable selection and estimation in linear models.” Journal of the American Statistical Association, 100(472), 1215-1225. [96] Yuan, M., and Lin, Y. (2006). “Model selection and estimation in regression with grouped variables.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67. [97] Zhang, Y., Li, R., and Tsai, C. L. (2010). ”Regularization parameter selections via generalized information criterion.” Journal of the American Statistical Association, 105(489), 312-323. [98] Zhang, D., Wang, Y., Zhou, L., Yuan, H., Shen, D., and Alzheimer’s Disease Neuroimaging Initiative. (2011). “Multimodal classification of Alzheimer’s disease and mild cognitive impairment.” Neuroimage, 55(3), 856-867. [99] Zhang, L., Baladandayuthapani, V., Mallick, B. K., Manyam, G. C., Thompson, P. A., Bondy, M. L., and Do, K. A. (2014). “Bayesian hierarchical structured variable selection methods with application to molecular inversion probe studies in breast cancer.” Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(4), 595-620. [100] Zou, H., and Hastie, T. (2005). “Regularization and variable selection via the elastic net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320. [101] Zou, H. (2006). ”The adaptive lasso and its oracle properties.” Journal of the American Statistical Association, 101(476), 1418-1429. 135