VARIABLE SELECTION FOR SPATIAL DATA AND ITS APPLICATION TO NEUROIMAGING

By

Abdhi Amitabha Sarkar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics - Doctor of Philosophy

2017

ABSTRACT

VARIABLE SELECTION FOR SPATIAL DATA AND ITS APPLICATION TO NEUROIMAGING

By

Abdhi Amitabha Sarkar

Ecological research, geological studies, and image analysis are a few examples of high-resolution spatial data where proximity describes the relationship between data points collected at various locations. Such dependencies play a vital role in modeling the data accurately, improving both predictive capacity and parameter estimation. Rapid technological advancement has brought about an abundance of such information. To better understand this information, we need feature selection techniques for spatially dependent data that can tease out relevant predictors associated with the response of interest. When the response variable at the various sites is in the form of discrete binary or count data, we face an added layer of complexity due to the inability to explicitly describe a joint parametric distribution. This dissertation explores the benefits of adopting a penalized quasi-likelihood approach to model a fixed number (p) or an expanding dimension (p_n) of predictor variables with regard to a discrete spatial response variable. In the past this approach has been extensively studied in longitudinal data analysis. By introducing random fields that exhibit certain ρ-mixing conditions, we provide general theoretical results for the estimator obtained by solving the penalized score equation. The oracle properties of the estimator as described by J. Fan & Li (2001) are provided, followed by an algorithm to successfully implement the method. Multiple simulation studies showcase the effectiveness of the method under covariance misspecification. We apply this technique to real data obtained from the Michigan Natural Features Inventory.

Variable selection in neuroimaging has a unique formulation that leads to the selection of activated regions of the brain in Task-based fMRI. As one of the most non-invasive formats for studying an active brain, Task-based fMRI provides a unique opportunity in neuroscience to study the dynamic aspects of brain function. Crude statistical techniques such as voxel-wise regression analysis have been used in the past with some success to identify active brain regions based on the blood-oxygen-level dependent (BOLD) signal of the image. Inspired by graphical covariate models proposed for genetic data, we incorporate a similar idea and extend penalized regression via weighted least squares with a separable space-time covariance model to this setup. Two penalty terms are introduced as a result: one for selection (LASSO) and another for smoothing (Ridge-type). We explore the interpretability of the proposed model as opposed to its Bayesian counterparts, its computational feasibility, and various approaches to selecting an optimal tuning parameter in the case of a single-subject study. The description of the model and its implementation are presented with discussions of theoretical implications. Extensive simulation studies and a real data example of a human brain subject to two visual stimuli are also given to provide evidence of the capability of this method.
Copyright by ABDHI AMITABHA SARKAR 2017

To the memory of my family and friends who passed away during the course of my PhD, for reminding me to cherish every moment and always strive for something better.

ACKNOWLEDGMENTS

I would like to take this opportunity to sincerely thank Dr. Chae Young Lim, who introduced me to spatial statistics; her seamless understanding of the subject was truly inspiring. Her incredible dedication to her students is what pulled us through in our research, and I truly appreciate the time she gave to work with us without any hesitation. My chair advisor, Professor Tapabrata Maiti, took me in at his busiest time, having so many students under his wing, just to see me succeed. I was introduced to a more modern approach to statistics and I could not be more grateful for all the opportunities given to me. My committee member Dr. Andrew Finley was incredibly generous in providing me with opportunities and resources during my studies. I would also like to thank Dr. Ping-shou Zhong for being a part of my committee and for our discussions. My sincerest gratitude to Steve Pierce at CSTAT for his extensive support and opportunities to collaborate on a variety of applications; this provided me with a unique insight into how statistical methods are viewed and interpreted. A very special mention to Andy Hufford for being incredibly patient with me and trying to improve resources for the graduate students; his advice was very accurate on many occasions and it was a real pleasure to work with him. My friends Pei, Sneha, Atreyee, Danielle, Guiling, Siddhartha and officemates Vojtech, Jimmy, Yingjie made East Lansing home and I cannot thank them enough. A special mention to my fiancé Michael Frisby for his unwavering support during the entire process and pure faith in my abilities. Lastly, I would like to thank my parents and sister for standing by my side always.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO SYMBOLS
Chapter 1 Introduction and Literature Review
Chapter 2 Variable Selection for Discrete Spatial Data using a Penalized Quasi-likelihood Approach
  2.1 Model Setup
  2.2 Penalized Quasi-likelihood Estimating Equations
  2.3 Asymptotic Properties of the Penalized Quasi-Likelihood Estimator (p-fixed)
    2.3.1 Selection Consistency and Oracle Property
    2.3.2 Model Implementation
      2.3.2.1 MM-Algorithm
      2.3.2.2 Tuning Parameter Selection
      2.3.2.3 Construction of Working Correlation
    2.3.3 Simulation
      2.3.3.1 Synthetic Analysis: Lansing Woods
    2.3.4 Real Data Analysis
      2.3.4.1 Crawford County Fire Disturbances
      2.3.4.2 Lung Cancer Incidence of Counties in Iowa
  2.4 Asymptotic Properties of the Penalized Quasi-Likelihood Estimator (p - expanding dimension)
    2.4.1 Score Function Asymptotics
      2.4.1.1 Regularity Conditions
    2.4.2 Penalized Score Function Asymptotics
  2.5 Discussion
  APPENDIX
Chapter 3 Voxel Selection using Penalized Least Squares for a Separable Space-Time Covariance Model
  3.1 Functional Magnetic Resonance Imaging (fMRI)
  3.2 Model Description
  3.3 Estimation and Selection
    3.3.1 Coordinate Descent Algorithm
    3.3.2 Algorithm Implementation
      3.3.2.1 Pathwise Coordinate Descent
      3.3.2.2 Active Set Convergence
      3.3.2.3 Cholesky Decomposition Properties
      3.3.2.4 Tuning Parameter Selection
  3.4 Simulation Study
  3.5 Real Data Analysis
  3.6 Discussion
    3.6.1 Empirical Bayes Estimate using Bayesian LASSO
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Simulation results for spatial binary data generated using the power correlation model with ρ = 0.1. QL represents a quasi-likelihood approach without a penalty term, PQL.LASSO represents a penalized quasi-likelihood approach with the LASSO penalty and PQL.SCAD represents a penalized quasi-likelihood approach with the SCAD penalty. PQL.LASSO.Ind and PQL.SCAD.Ind use the identity matrix as working correlation, assuming independence.

Table 2.2: Simulation results for spatial binary data generated using the power correlation model with ρ = 0.3. QL represents a quasi-likelihood approach without a penalty term, PQL.LASSO represents a penalized quasi-likelihood approach with the LASSO penalty and PQL.SCAD represents a penalized quasi-likelihood approach with the SCAD penalty. PQL.LASSO.Ind and PQL.SCAD.Ind use the identity matrix as working correlation, assuming independence.

Table 2.3: Simulation results when a Matérn correlation model with θ = 0.7 and ν = 0.3 is used to generate spatial binary data. The results are compared to misspecification of the Matérn parameters (θ̃ = 0.8 and ν̃ = 0.4) and misspecification of the structure (exponential with ρ̃ = 0.1).

Table 2.4: Simulation results when the covariates are correlated. Spatial binary data with a power correlation model (ρ = 0.3) on a 20 × 20 grid are considered. A power correlation model was used to construct a working correlation matrix (ρ̃). t is the level of dependence among covariates.

Table 2.5: Simulation results when 10 covariates are correlated and 10 are independent.
Spatial binary data with a power correlation model (ρ = 0.3) on a 20 × 20 grid are considered. A power correlation model was used to construct a working correlation matrix (ρ̃). t is the level of dependence among covariates. β_0 = (1, −1.5, 0, ..., 1, 0, ...)^T, i.e. 2 non-zero coefficients are correlated and 1 is independent.

Table 2.6: Simulation results when the power correlation model with ρ = 0.1 is used to generate spatial Poisson data. The power correlation model with ρ̃ is used for the working correlation matrix for estimation.

Table 2.7: Simulation results when the power correlation model with ρ = 0.3 is used to generate spatial Poisson data. The power correlation model with ρ̃ is used for the working correlation matrix for estimation.

Table 2.8: Lansing Woods: Estimates and 95% C.I.

Table 2.9: Crawford County: Estimates and 95% C.I.

Table 2.10: Iowa Lung Cancer Incidence Rate: Estimates and 95% C.I.

Table 2.11: Iowa Lung Cancer Incidence Rate: Estimates and 95% C.I.

Table 3.1: Choice of lambda sequence parameters based on MSE

Table 3.2: Choice of lambda sequence parameters based on MSE

Table 3.3: Choice of lambda sequence parameters based on MSE

Table 3.4: Consolidated results for the AR(1) simulation with regard to all 3 algorithms

LIST OF FIGURES

Figure 2.1: Lansing Woods data on a 24 × 24 grid showing the presence/absence of a specific tree

Figure 2.2: Crawford County fire disturbance map

Figure 2.3: Lung cancer incidence map of Iowa and bubble plot indicating county-wise population

Figure 3.1: A theoretical double-gamma HRF characterizing the BOLD signal of a voxel that may respond to the stimulus with some lag and undershoot over time

Figure 3.2: Simulated fMRI data with different grid sizes and time structures. (a) contains 5000 data points and (b) contains 11250 data points

Figure 3.3: AIC for sequences of λ1 and λ2 ∈ (0.1, 1000)

Figure 3.4: GIC for sequences of λ1 and λ2 ∈ (0.1, 1000)

Figure 3.5: MSE for sequences of λ1 and λ2 ∈ (0.1, 1000)

Figure 3.6: True Positives (max 5 active) for sequences of λ1 and λ2 ∈ (0.1, 1000)

Figure 3.7: False Positives (max n−5) for sequences of λ1 and λ2 ∈ (0.1, 1000)

Figure 3.8: AIC for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent

Figure 3.9: GIC for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent

Figure 3.10: MSE (mean squared error) for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent

Figure 3.11: True Positives (max 5 active) for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent
Figure 3.12: False Positives (max n−5) for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent

Figure 3.13: MSPE (mean squared prediction error) for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent

Figure 3.14: AIC for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent

Figure 3.15: GIC for sequences of the indexed λ values given α ∈ (0, 1) using pathwise coordinate descent

Figure 3.16: MSE (mean squared error) for sequences of the indexed λ values given α ∈ (0, 1) using active sets

Figure 3.17: True Positives (max 5 active) for sequences of the indexed λ values given α ∈ (0, 1) using active sets

Figure 3.18: False Positives (max n−5) for sequences of the indexed λ values given α ∈ (0, 1) using active sets

Figure 3.19: MSPE (mean squared prediction error) for sequences of the indexed λ values given α ∈ (0, 1) using active sets

Figure 3.20: 4-fold CV on time for the design of the single-subject fMRI study. The blue solid line indicates the boxcar stimulus and the dotted line the convolved HRF

Figure 3.21: Simulation results of 4-fold cross-validation on a 10 × 10 grid

Figure 3.22: AIC for the model simulated using an AR(1) time correlation structure

Figure 3.23: GIC for the model simulated using an AR(1) time correlation structure

Figure 3.24: MSE for the model simulated using an AR(1) time correlation structure

Figure 3.25: MSPE for the model simulated using an AR(1) time correlation structure

Figure 3.26: True Positives (max 5) for the model simulated using AR(1) time correlation

Figure 3.27: False Positives (max n−5) for the model simulated using AR(1) time correlation

Figure 3.28: Axial view, λ = 0.0019

Figure 3.29: Coronal view, λ = 0.0019

Figure 3.30: Sagittal view, λ = 0.0019

Figure 3.31: Axial view, λ = 0.19

Figure 3.32: Coronal view, λ = 0.19

Figure 3.33: Sagittal view, λ = 0.19

KEY TO SYMBOLS

1. ≤ denotes the positive semidefinite ordering, i.e. we write A ≤ B if the matrix B − A is positive semidefinite.

2. | · | denotes the cardinality of a set.

3. ∇_β K is the gradient of a vector K with respect to β.

4. ∘ signifies the Schur or Hadamard product of two matrices.

5. ⊗ signifies the Kronecker product of two matrices.

6. µ̈ denotes the second derivative of the function µ; similarly, µ with three dots denotes the third derivative, and so on.

7. := signifies assignment, or is read as "denoted by".

8. Tr A denotes the trace of a matrix A.

9. LHS stands for the Left Hand Side of an equation.

10. ∼ denotes neighbors, i.e. u ∼ v implies u and v are adjacent voxels.

11. ⟨·, ·⟩ signifies the inner product of two vectors.

12. sign(a) signifies −1 or +1 depending on whether a < 0 or a > 0, respectively.

13. I(·) denotes the indicator operator.
14. TP (True Positives) represents the average number of correctly detected nonzero coefficients.

15. FP (False Positives) represents the average number of incorrectly detected nonzero coefficients.

16. CP gives the average empirical Coverage Probability of the 95% confidence intervals.

17. CD stands for Coordinate Descent.

Chapter 1
Introduction and Literature Review

Spatial data innately incorporate dependencies based on proximity. Ignoring such dependencies in the analysis gives rise to inconsistent and inaccurate results, as formally stated in Fitzmaurice (1995) and other sources. There is a very intuitive notion of short-range dependence which suggests that observations at locations closer to each other are more likely to be homogeneous than observations at locations farther apart. A variety of examples can be seen in climate and meteorological data, geological data on Earth and in outer space, demographic social/economic data, and image analysis. The focus of this dissertation lies in studying discrete spatial data with binary or count-like information at different sites.

Several approaches to modeling these dependencies in the form of latent Gaussian processes are explained in P. J. Diggle et al. (1998), with early beginnings seen in the seventies in the extensive work by Matheron (1970) on the development of kriging. A parallel, more statistical approach with auto-normal and auto-logistic models was considered using Markov Random Fields by Besag (1974), with the use of the positivity condition (Hammersley & Clifford, 1971). A very comprehensive take on spatial data analysis can be seen in Cressie (1993). In chapter 6 of his book, Cressie discusses the initial pairwise-dependence approaches that were employed for exponential distributions of the response (Poisson, Gamma, Binomial, etc.), and chapter 7 elaborates that, in the non-Gaussian case, the maximum pseudo-likelihood estimator introduced in Besag (1975) may not have a structure that yields an optimal solution. P. Diggle et al. (2007) give an overview of using latent processes to interpolate surfaces via generalized linear geospatial models (GLGM). However, the lack of closed-form expressions for the full and marginal likelihoods leads to adopting Expectation-Maximization (EM) algorithms or Bayesian methods. Bayesian techniques in the form of hierarchical mixed models have been prodigious in fitting parametric spatial models (Banerjee et al., 2014), but heavily rely on maximum likelihood methods that suffer from model misspecification. A similar model for spatial binary data was proposed by Heagerty & Lele (1998) using a penalized composite likelihood in the form of probit models. The model specifically assumes that there exists a Gaussian spatial process, with binary responses that are indicators of whether the spatial process was measured with error. Here certain score function asymptotic properties, with regularity conditions formalized by Guyon (1995), are used. These regularity conditions are required in the study of GEEs in chapter 2 of this dissertation.

In the context of spatial regression for discrete data, not much has been explored in great detail. Wedderburn (1974) published seminal work on quasi-likelihood functions, describing regression-based techniques that require only parametric structures to be imposed on the first two moments of a random variable: a semi-parametric method in place of the full distributional assumption of a maximum likelihood (MLE) method.
No closed-form expression is necessary for the objective quasi-likelihood function. Further, conditions were cautiously provided for the case when the observations are dependent (McCullagh & Nelder, 1989, Chapter 9). Gotway & Stroup (1997) propose a quasi-likelihood method for estimation and prediction, with generalizations to universal and indicator kriging, but provide no theoretical justification.

The use of generalized estimating equations (GEE) has been very well developed in longitudinal data analysis, also known as growth curve analysis. The methods established in Liang & Zeger (1986) do not directly translate into the spatial setting, but are modified with the assumption of a latent process as in Zeger (1988) and McShane et al. (1997). Lin & Clayton (2005) provided very strong theoretical justifications for binary spatial data and asymptotics under the increasing-domain framework using GEE and the quasi-likelihood. A generalization given in Lin (2008) extends the technique to both binary and count data, while also allowing the underlying covariance structure to be more flexible. A working correlation setup was incorporated into the existing methodology by Lin (2010) to accommodate covariance misspecification.

Model selection is an incredibly practical problem, especially in the initial exploratory stage of any data analysis. For estimating equations the usual criteria for stepwise selection are inadequate, and a modified version of the Akaike information criterion was therefore introduced by Pan (2001). Prior to that, other attempts were made using Mallow's Cp and AICc based on the Kullback-Leibler divergence (Hurvich & Tsai, 1995). L. Wang & Qu (2009) used the quadratic inference function, which generalizes the method of moments to longitudinal data, and formed a novel BIC-type model selection procedure, as well as a test for checking goodness of fit. This method requires neither the full likelihood nor the quasi-likelihood to have an explicit closed-form expression.

Modern penalized regularization techniques in machine learning and statistics, with faster convergence algorithms, have gained immense popularity and credibility over previous techniques, especially with the success of the LASSO (Tibshirani, 1996). Essentially, they have been formulated to specifically cater to model complexity in the form of high-dimensional data. The review article by Bickel et al. (2006) describes methods for independent samples that are commonly adopted (regularized least squares) and have succeeded in addressing complex problems. Better penalty functions, referring to those satisfying oracle properties like the Smoothly Clipped Absolute Deviation (SCAD; J. Fan & Li, 2001), were soon established in the form of hard-thresholding by Donoho et al. (1994), the Adaptive LASSO by Zou (2006), the Dantzig selector by Candes & Tao (2007), and the Minimax Concave Penalty (MCP) by Zhang et al. (2010), among others. Some penalized methods for dependent samples have been investigated and developed in a time series model by H. Wang et al. (2007); the method establishes a LASSO tuning parameter for both the covariate regression coefficients and the autoregressive covariance parameter. A computationally efficient algorithm for GIS model selection of neighborhood extents and patterns on lattice data was explored by H.-C. Huang et al. (2010) in the form of the spatial LASSO, but lacked detailed theoretical backing. In general there appears to be a gap in regularization techniques applied to dependent data.
J. Zhu et al. (2010) established a penalized maximum likelihood method using a spatial adaptive LASSO for data following a conditional auto-regressive (CAR) model, to simultaneously select relevant covariates and a neighborhood structure. The beginnings of penalized quasi-likelihood methods are seen in Breslow & Clayton (1993), with the use of Laplacian methods originally formulated by Green (1987) for semiparametric regression models and an iterated weighted least squares algorithm for correlated responses. These models too incorporate dependence in the form of a random effect. However, the only known spatially correlated penalized quasi-likelihood methods were explored by Dean et al. (2004). No asymptotic properties are shown there, and the penalty is not in the context of regularization techniques. Additionally, the spatial relationships discussed are in the form of adjacency matrices or a Laplacian approximation of the integral of the quasi-likelihood under a GLMM setup.

In chapter 2 of this dissertation, GEE techniques are considered in a context similar to longitudinal data analysis. Specifically, we look at a model selection procedure for a large number of predictors associated with a discrete spatial response on a lattice. Here the within-cluster covariance has a spatial structure satisfying certain ρ-mixing conditions, with no replicates at each location. This chapter considers two specific cases: the first where the number of predictors p is fixed, and the second where the dimension, subscripted by n, is expanding, i.e. p_n → ∞ as n → ∞. The penalized quasi-likelihood score function is shown to possess properties in the current spatial setup that match Johnson et al. (2008), similar to results shown in Feng et al. (2016), with a working correlation that satisfies the ρ-mixing condition in Lin (2008). Extensive simulations under various working correlations with both count and binary data are conducted and showcased in this chapter.

Portnoy (1988) addressed a very fundamental question regarding the number of parameters required for valid statistical analysis using asymptotic results. It was shown that for exponential families with MLE methods of estimation, p²/n should be very small, thereby formalizing methods where the number of parameters p_n goes to infinity. In the case of expanding dimensions we therefore require comparable selection techniques, since no results have been established for this scenario with regard to GEE in the spatial context. Section 2.4 addresses this by using (Ortega & Rheinboldt, 1970, Theorem 6.3.4), as given in L. Wang (2011), to prove the existence and consistency of the estimate obtained from solving the penalized quasi-likelihood estimating equations. Similar to L. Wang et al. (2012), we show selection consistency obtained from the estimating equations of the penalized score function. Specifically, a theoretical explanation is provided that under this setup the rate at which the dimension may expand differs significantly from the longitudinal or maximum likelihood setups (Portnoy, 1988). This has been briefly touched upon by Xie et al. (2003) under the setup where n is bounded and m → ∞, m denoting the cluster of time points repeated for every individual. We require p_n^4 n^{-1} = o(1), as opposed to p_n^2 n^{-1} = o(1), as n → ∞. In section 2.3.2 we describe the model implementation and the use of the Majorize-Minimize (MM) algorithm of Hunter & Li (2005).
Another important aspect of regularization techniques is finding an optimal tuning parameter. Cross-validation techniques tend to select over-fitted models, but are widely popular and successful in methods involving independent data. Since the concept of a complete replicate guides the method, this intuition is not apparent in the context of dependent data. Therefore optimal values are obtained based on a grid search over a range of values, with root mean squared error as the optimality criterion. The purpose of this work is to lay down foundations and explore possibilities for solving ecological or geospatial applications involving high-dimensional data. Section 2.4.2 of chapter 2 establishes theoretical results with regard to variable selection for expanding dimensions of covariates. The oracle properties and selection consistency, along with a characterization of the solution of the penalized equations, are provided in detail. The method in its current form can handle scenarios where the true underlying number of covariates (denoted by s_n) satisfies s_n^4 n^{-1} = o(1) and p_n = o(n), where p_n is the total number of covariates. Two real data analyses are performed in section 2.3.4. The first dataset is obtained from the Michigan Natural Features Inventory, affiliated with the State of Michigan and Michigan State University, to study fire disturbances in the region and obtain predictors associated with such disturbances in the early 1800s. The second example is county-wise data on lung cancer incidence in the state of Iowa, obtained from the SEER (Surveillance, Epidemiology, and End Results) program database. Here we are interested in learning which factors are associated with higher or lower rates of lung cancer incidence in particular areas of Iowa.

Image analysis comprises the restoration of an image from contamination, the extraction of vital information such as bar codes, facial recognition, and the interpretation of images. Data from these images are obtained in different forms: for example, remote sensing by satellites; digital or chemical photography using electronic photo-detectors or exposure to film, respectively; or medical imaging such as recording the transmission of X-rays, electron microscope imaging, ultrasound, magnetic resonance imaging (MRI), positron emission tomography (PET), diffusion tensor imaging (DTI), etc. With the advent of computer technology in the processing of images and pattern recognition, spatial statistics has played a key role; (Cressie, 1993, Section 7.4) provides details of spatial methodologies employed in the study of images. Neuroimaging has in the past few decades (since 1990) improved greatly in terms of data acquisition, and the most noninvasive procedure for taking an image of the brain is MRI. As an application of variable selection for spatial data, the focus of chapter 3 is to identify activated regions of the brain responding to a stimulus, based on a task performed. fMRI (functional magnetic resonance imaging) detects active brain parts by measuring changes in the blood-oxygen-level dependent (BOLD) signal. For an in-depth understanding of fMRI data, Lindquist (2008) describes the nature of the field from a statistician's perspective. In general there are two major classes of experimental designs: block designs and event-related (Task-based) designs. Another version of fMRI data is resting-state fMRI, famously introduced by Biswal et al.
(1995), investigating connectivity in the brain when the subject is at resting state (performing no task). This version of the data has received mixed critiques, but has seen success in studying diseased brains compared to healthy controls. Due to its unique structure in the context of variable selection, the focus of chapter 3 of this dissertation is to propose a selection technique for Task-based fMRI.

Unlike the data we have seen so far, Task-based fMRI results in large amounts of noisy data with a complicated spatio-temporal correlation structure. Additionally, in the context of selection, the design matrix associated with the response has a handful of stimuli randomly assigned over time. The objective therein is to select only those brain regions that are responding to the stimulus provided over time. Thus there are coefficients attached to each spatial unit of the image (known as a voxel), and the data are collected over time at each voxel. Section 3.1 provides details of the experiment conducted and the nature of the data collected. It is important to note that the BOLD signal is regressed against a modeled hemodynamic response function (HRF), with the intuition that the coefficient associated with the voxel quantifies the activation amplitude (the magnitude of correlation) between the stimulus and the voxel over time.

In the early nineties, voxel-wise regression analysis was extensively used due to its simplicity with respect to both the implementation and the interpretation of the method. However, a blatant drawback was almost immediately recognized: an extremely large number of non-independent univariate comparisons are performed, sizeably inflating the Type-I error. Statistical Parametric Maps (SPMs; Friston et al., 1994) were among the earliest and most well-established techniques, appealing to the probabilistic behavior of stationary Gaussian random fields (Adler, 1981) to construct a threshold using the Euler characteristic (Worsley et al., 1992). These techniques were studied for PET scans to calculate approximate p-values in order to find the statistically significant regions of activation. The analysis was therefore done in two steps: the first being a voxel-wise analysis of the time series data, and the second being the spatial adjustment of the p-values using random field theory (RFT).

Bayesian methods had been very successfully implemented in image restoration by Geman & Geman (1984), who judiciously exploited the equivalence of the Gibbs distribution and the Markov Random Field (MRF) to construct a maximum a posteriori (MAP) estimate of the degraded image. Gössl et al. (2001) introduced a hierarchical Bayesian method that incorporated modified intrinsic autoregressive priors as a kind of stochastic interpolation of neighbors for modeling spatial dependence (Gaussian MRFs), and random walks for modeling temporal dependence. The aim of their formulation was to simultaneously model spatial and temporal dependencies directly in a single step. Some improvements in spatial adaptivity were made by a few others by extending the above technique to Laplacian and Cauchy-type priors. A slightly different interpretation of voxel selection was adopted by Smith & Fahrmeir (2007): instead of priors on the amplitudes themselves, they impose an Ising prior (Ising, 1925) on the unobservable latent indicator variables of the activation effects. The selection is thus based on a probability map of the more likely activated areas of the brain.
The Ising prior, or binary spatial MRF, treats the lattice of voxels as a graph, incorporating neighboring interactions to determine clusters of activated regions. This spatial Bayesian variable selection (SBVS) method empirically refined issues of over-smoothing and edge effects, but was not generalized to model time dependencies due to the additional computational burden. Lee et al. (2014) integrated auto-regressive (AR) and moving average (MA) time series into the setup of Smith & Fahrmeir (2007) and rigorously applied it to a longitudinal study of Alzheimer's disease (AD). For the block-design experiment they found the setup not conducive to moving averages, but implemented an Empirical Bayes approach to estimate the parameter (ρ) associated with AR(1). They also compared this with AR(2), ARMA(1,1) and MA(1) models, but found AR(1) to provide the best results. In general, for multiple subjects with each subject having over 100,000 voxels over a 20-30 minute study, the data are unimaginably massive. Furthermore, with Bayesian MCMC methods applied to the data, a healthy compromise between a suitable method and computational complexity must be sought. Musgrove et al. (2016) identify this issue and propose optimally partitioning the brain and then fitting models to the partitions. They show through simulations that this kind of parcellation does not exhibit unnecessary edge effects and boundary problems, but no theoretical justification is provided. Although the probability maps based on the latent indicator process determine whether the effect of a voxel is active or inactive, they do not necessarily render a simple interpretation of the activation amplitude itself (the coefficient of interest).

Regularization methods in bioinformatics have seen significant success with widely available data and greater computational flexibility. Infusing complex neighborhood structures that model correlated predictors using graph-structured regression covariates for variable selection in high-dimensional genomic data, as seen in Li & Li (2010), is one such example. A generalization of this method was introduced as sparse Laplacian shrinkage (SLS; J. Huang et al., 2011), where the LASSO penalty for sparsity is replaced by MCP for its oracle properties. In a similar vein, to study the HIV type I protease structure, Xue et al. (2012) incorporated coupling coefficients using the Ising model to select the true underlying structure of interactions using the nonconcave SCAD penalty. These rather non-Bayesian methods for massive data incorporate the coordinate descent algorithm, which produces remarkable computational efficiency. Some key aspects of this algorithm, based on a technical report by Tseng (1988), are touched upon in section 3.6 of this dissertation. Further, Grosenick et al. (2013) provide a comprehensive report of attempts to use off-the-shelf machine learning methods to analyze fMRI data. For a whole-brain analysis with correlated data, they provide a handful of variants of the Graph-constrained Elastic Net (GraphNet). These methods are proposed and implemented on whole brains of multiple subjects based on event designs with fewer than 10 time points.

In chapter 3 of this dissertation we venture to propose a rather frequentist approach using the regularization methods discussed above, treating the 2D or 3D image as a graph and applying a method similar to that employed by Li & Li (2010) to a single-subject study with fMRI block-design experiments.
There exists a fundamental difference in the implementation of the method in a single-subject fMRI study, unlike genomic data on multiple subjects, due to the spatio-temporal setup. Replicates for each node on the graph, i.e. each voxel, are in the form of a time series, and no other independent replicates exist. Thus the responses are directly associated with the graphical structure imposed in the smoothing penalty of Li & Li (2010). Details of the method are provided in section 3.3. Two penalty terms are applied to a weighted least squares objective function: the first is the LASSO, a continuous, convex, non-differentiable penalty for selection, and the second is a smoothing penalty that uses an adjacency matrix sharing properties of a graph Laplacian to obtain a convex penalty. The method is demonstrated through a variety of simulation studies that use a known spatio-temporal formulation, with some examples incorporating covariance tapering (e.g., Wendland) to infuse additional sparsity into a rather dense matrix formulation. This induced sparsity improves the computational feasibility of the method. Additionally, some variations of the coordinate descent method, such as pathwise coordinate descent, warm starts, active sets and full tuning parameter search grids, are discussed in section 3.3.2. We also use the block design to our advantage and investigate the performance of cross-validation in time for finding optimal tuning parameters, as explained in section 3.3.2.4. One more attempt is made at efficiently locating the tuning parameter using the empirical Bayes estimate obtained for the LASSO in the Bayesian LASSO approach introduced by Park & Casella (2008): using Zou & Hastie (2005), we can reformulate the two-penalty problem as a LASSO regularization method and then use the empirical Bayes estimate of the LASSO tuning parameter.

The empirical investigation of this method suggests that there is a need for an adequate criterion for finding optimal tuning parameters. Furthermore, the extent of spatio-temporal variance may affect the selection of these parameters. In the future we would like to explore the theoretical properties of the estimator obtained from this method. Lastly, as a real data application we consider a single-subject study where an individual's brain is scanned through a series of visual stimuli. Specifically, two stimuli, an object and a landscape image, are shown to the individual randomly at regular intervals with some rest. The results show which portions of the brain appear to be activated by these stimuli. This study was conducted at Michigan State University, Department of Radiology, by David Zhu.

Chapter 2
Variable Selection for Discrete Spatial Data using a Penalized Quasi-likelihood Approach

2.1 Model Setup

Let us consider a discrete, binary or count random field denoted by a random vector Y = (Y(s_1), Y(s_2), ..., Y(s_n))^T, an n-dimensional vector associated with sampling sites s_1, ..., s_n arranged on a spatial domain in R^d, where d ≥ 2. Let X denote the n × p design matrix. In the context of regression, the corresponding coefficient vector β = (β_1, ..., β_p)^T denotes the magnitude and direction of the impact of each predictor on the response. The mean of the response vector, E(Y) = µ = (µ(s_1), ..., µ(s_n))^T, is assumed to be related to the matrix X of p predictors through a link function g(·) such that g(µ) = Xβ. Each x_{ij} is finite and corresponds to the j-th predictor at site s_i, i.e. max_{ij} |x_{ij}| < ∞.
We further impose that the link function g(·) is smooth over µ, such that the first and second order derivatives of g^{-1}(·) are finite. The variance function v(·) of Y is assumed to be smooth, such that Var(Y(s_i) | x_i) = v(g^{-1}(x_i^T β)). Let Cov(Y) = V(µ)σ² = Vσ², where σ² is independent of µ. Because the data are collected at different spatial locations, a dependence structure is assumed for the variance of the response Y, i.e.

σ²V = Σ^{1/2} Γ Σ^{1/2},

where Γ is the true underlying symmetric, isotropic and positive definite spatial correlation matrix and Σ = diag{v(g^{-1}(x_1^T β)), ..., v(g^{-1}(x_n^T β))} is the diagonal matrix of variance components. In order to alleviate distributional restrictions for correlated discrete data whose likelihood function cannot be expressed explicitly, we use a quasi-likelihood approach, previously used by Lin & Clayton (2005). In this chapter we explore two cases for the design matrix X:

(Case 1) The design matrix X has a fixed number of predictors, i.e. p < n with p fixed.

(Case 2) The design matrix X has a number of predictors that grows with the sample size, i.e. p_n → ∞ as n → ∞, but p_n < n.

Below is a brief review of quasi-likelihood and score functions, as described in (McCullagh & Nelder, 1989, Chapter 9), with regard to the structure and attributes of both cases considered above.

2.2 Penalized Quasi-likelihood Estimating Equations

The quasi-likelihood estimating equations for the parameter of interest β are obtained by differentiating an objective quasi-likelihood function QL(β), resulting in the system U(β) = 0, where the quasi-score function is given by

U(β) = D^T V^{-1} (Y − µ) / σ²,    (2.1)

with D_{ij} = ∂µ_i/∂β_j, the partial derivative of the i-th mean function with respect to the j-th regression coefficient. The second derivatives ∂²µ_i/∂β_k∂β_j are assumed to be finite, and the expected gradient of the score function equals the negative covariance of U(β). Specifically,

Cov(U(β)) = D^T Σ^{-1/2} Γ Σ^{-1/2} D    (2.2)

E[∇_β U(β)] = −D^T Σ^{-1/2} Γ Σ^{-1/2} D    (2.3)

For correlated observations, however, there may be multiple roots solving equation (2.1). As suggested by (McCullagh & Nelder, 1989, Section 9.3.2), to ensure that β̂ is unique and globally maximizes the quasi-likelihood, we require a V^{-1} such that ∇_β U(β) is symmetric, so that the integral of the score function defining the quasi-likelihood is path independent. This is satisfied by our choice of Γ explained in section 2.1.

In order to obtain asymptotic properties of the score function, we require some additional conditions on the dependence structure of the matrix Γ. Let Λ denote a lattice subset of the locations. Lin (2008) defined the ρ-product mixing coefficients as a generalization of (Guyon, 1995, page 112), as given below:

ρ_{k,l;p}(m) = sup{ |Cov( ∏_{s_i ∈ Λ_1} y(s_i), ∏_{s_j ∈ Λ_2} y(s_j) )| : E(|y(s)|²) ≤ 1, |Λ_1| ≤ k, |Λ_2| ≤ l, d_p(Λ_1, Λ_2) ≥ m }    (2.4)

where d_p(Λ_1, Λ_2) = inf{ ||s_1 − s_2||_p : s_i ∈ Λ_i } and F_{Λ_i} is the σ-algebra generated by the random variables corresponding to Λ_i. Further, it is assumed that sup{ E(|Y(s)|^{2+δ}) : s ∈ Λ } < ∞ for some δ > 0. The mixing coefficient ρ_{k,l;p}(m) is defined to control the extent of correlation between random variables at the different sites. The central limit theorem associated with these conditions is described in section 2.3.
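To make the estimating equation concrete, the following is a minimal R sketch of the quasi-score (2.1) for a logit link, with the true Γ replaced by a user-supplied working correlation; the function name and arguments are illustrative, not taken from the dissertation's implementation.

```r
# Quasi-score U(beta) = D' V^{-1} (Y - mu) / sigma^2 for a logit link,
# with V^{-1} = Sigma^{-1/2} Gamma^{-1} Sigma^{-1/2} and a (working)
# spatial correlation matrix 'Gamma_w'; sigma2 is the over-dispersion.
quasi_score <- function(beta, X, y, Gamma_w, sigma2 = 1) {
  mu <- plogis(as.vector(X %*% beta))  # g^{-1}(X beta) for the logit link
  v  <- mu * (1 - mu)                  # variance function v(mu)
  D  <- X * v                          # D_ij = dmu_i/dbeta_j = v_i * x_ij
  r  <- (y - mu) / sqrt(v)             # Sigma^{-1/2} (Y - mu)
  drop(t(D) %*% (solve(Gamma_w, r) / sqrt(v))) / sigma2
}
```

A root of U(β) = 0 can then be located by Fisher-scoring-type iterations; the penalized version developed in section 2.2 simply subtracts the penalty term n q_λ(|β|) sign(β) from this score.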
The objective of this chapter is to construct robust statistical methodology that can simultaneously select variables and estimate the parameters of interest. Quasi-likelihood estimating equations have been proven to provide unbiased estimates of the β coefficients (Lin & Clayton, 2005). We impose a penalty on these score functions (Johnson et al., 2008) under the current model setup and study its properties. We therefore consider obtaining an estimate for β that minimizes the objective function

−QL(β) + n p_λ(|β|),    (2.5)

where p_λ(|β|) is the penalty function with tuning parameter λ and QL(β) is the quasi-likelihood function whose derivative yields the score function in equation (2.1). The penalized score function is then given by

U^p(β) = U(β) − n q_λ(|β|) sign(β),    (2.6)

where q_λ(|β|) sign(β) := (q_λ(|β_1|) sign(β_1), ..., q_λ(|β_p|) sign(β_p))^T and q_λ(|β_j|) = ∂p_λ(|β_j|)/∂|β_j|. Therefore β̂ is the solution of U^p(β) = 0. It is important to note that U(β) is a p × 1 or p_n × 1 dimensional vector for Case 1 and Case 2 respectively, and the theoretical justifications of the asymptotic properties in these scenarios vary considerably. The chapter is henceforth split into two sections addressing the two cases.

2.3 Asymptotic Properties of the Penalized Quasi-Likelihood Estimator (p-fixed)

Lin (2008) developed a central limit theorem (CLT) for a random field on a lattice under L_p metrics in the increasing-domain framework, using the mixing conditions defined in (2.4). Under similar conditions, for sampling sites {s_1, ..., s_n} ∈ R² on an n = n_1 × n_2 regular grid, the consistency and asymptotic properties of the estimate obtained from solving the following system of equations are established:

U^p(β) = D^T Σ^{-1/2} Γ^{-1} Σ^{-1/2} (Y − µ) − n q_λ(|β|) sign(β)    (2.7)

Notice that the portion of equation (2.7) associated with the penalty term is discontinuous and non-differentiable at |β_j| = 0 for any j ∈ {1, ..., p}. Therefore we study the two portions of this equation separately, beginning with the score function part, which resembles equation (2.1).

Asymptotics for the Score Function: Let S_n = \sum_{s ∈ Λ_n} {Y(s) − µ(s)} and σ_n² = Var(S_n), where {Λ_n} is a strictly increasing sequence of lattice sets. With ρ_{k,l;p}(m) as defined in (2.4):

Lemma 2.3.1 (Lin, 2008). If a random field satisfies the conditions

(i) ρ_{1,2;p}(m) = O(m^{−d−ε}) for some ε > 0,
(ii) ρ_{k,l;p}(m) = O(m^{−d}) for k + l ≤ 4,

then S_n/σ_n converges in distribution to N(0, 1) as n → ∞.
Henceforth, we investigate the true Γ with a working correlation Γ properties of the score function with the working correlation matrix denoted by Uw (β) as shown below, ˆ −1 Σ−1/2 (Y − µ) Uw (β) = DT Σ−1/2 Γ (2.8) Since Y (s) follows a central limit theorem based on the above Lemma, if all elements of ˆ −1 Σ−1/2 are finite, then asymptotic normality for the score function 2.8 can be DT Σ−1/2 Γ established using the following assumption. ˆ such that n−1 DT Σ−1/2 ΓΣ ˆ −1/2 D → I0 (β) (A1) There exists a working correlation matrix Γ as n → ∞ where I0 (β) is bounded and positive definite. Based on the model setup in section 2.1, maxi |µ(si )| and maxi |µ(si )−1 | are finite and ∃ a constant M0 > 0 such that maxij |Dij | ≤ M0 . Also Lin (2008) showed that for a correlation matrix Γ where each element is denoted by γij = co ||i − j||−d− which satisfies 18 the mixing conditions in 2.3.1, we obtain that each element of the inverse matrix Γ−1 is also ˆ −1 ΓΓ ˆ −1 Σ−1/2 D ≤ bounded. Hence a constant M1 > 0 may be found such that DT Σ−1/2 Γ n j=1 γij M1 J T Σ1/2 ΓΣ1/2 J where Jn×p is a matrix of 1s. Since = O(1) we have ˆ −1 ΓΓ ˆ −1 Σ−1/2 D = O(n) DT Σ−1/2 Γ (2.9) Thus we require the following assumption to establish asymptotic properties of the score function in the context of replacing the true underlying correlation with a working correlation satisfying (i),(ii) in lemma 2.3.1 and (A1). ˆ such that (A2) There exists a working correlation matrix Γ 1 T −1/2 Γ ˆ −1 ΓΓ ˆ −1 Σ−1/2 D nD Σ → I1 (β) as n → ∞ where I1 (β) is bounded positive definite matrix. Refer to (Lin, 2010, Appendix Proof of Theorem 1) for details. Thus an adequate working correlation maybe found to satisfy condition 2.9. Let us consider an estimate βˆ which is the solution to Uw (β) = 0. Lemma 2.3.2. (Lin, 2008, Theorem 2) Under current model setup assumptions (A1) and a working correlation satisfying (i),(ii) and equation 2.9, we get, n−1/2 Uw (β) → N (0, I0 (β)). Let us now consider the asymptotic properties of the gradient matrix that is denoted by β Uw (β). Specifically, β Uw (β) ˆ −1/2 ), ..., DT ◦ ( β1 (Σ−1/2 ΓΣ = ( β1 DT , .., βp D −1/2 ΓΣ ˆ −1/2 ))(Y βp (Σ T ) ◦ Σ−1/2 ΓΣ ˆ −1/2 (Y − µ) + ˆ −1/2 D. Due to − µ) − DT Σ−1/2 ΓΣ the finiteness and smoothness assumptions of the link function g(·), it can be seen that ( β1 DT , .., βp D T ) ◦ Σ−1/2 ΓΣ ˆ −1/2 ˆ −1/2 ), ..., and DT ◦ ( β1 (Σ−1/2 ΓΣ 19 −1/2 ΓΣ ˆ −1/2 ) βp (Σ are both bounded vectors. A similar justification as given in (Lin, 2008, Theorem 3) with bounded vectors and the asymptotic normality of Y (s) we have, a multivariate normal random variable Z ∗ such that n−1/2 {( β1 DT , .., ˆ −1/2 ), ..., DT ◦( β1 (Σ−1/2 ΓΣ n−1 β βp D T ) ◦ Σ−1/2 ΓΣ ˆ −1/2 (Y − µ) + −1/2 ΓΣ ˆ −1/2 )(Y −µ)} is equivalent to Z ∗ +op (1).Therefore, βp (Σ ˆ −1/2 D Uw (β) = n−1 Op (n1/2 ) − n−1 DT Σ−1/2 ΓΣ (2.10) Hence by assumption (A1) we have that, n−1 β Uw (β) → −I0 (β) in probability as n → ∞. The Inverse function theorem and an open ball argument can now be used for this score ˆ to show that the solution βˆ of the system of equations Uw (β) = 0 function 2.8 with a Γ is consistent with regard to the underlying true β0 asymptotically (Lin, 2008, Theorem 3 Proof). We can therefore represent these results in a compact form that is exactly an assumption used in (Johnson et al., 2008, Condition C.1) for most commonly used estimating equations. 
The combined results of asymptotic normality of the score function, the positive definite limit in probability for the gradient function and the consistency of the βˆ estimate result in the following corollary. Corollary 2.3.2.1. There exists a positive definite matrix A, such that for a given constant M > 0, sup |n−1/2 Uw (β) − n−1/2 Uw (β0 ) − n1/2 A(β − β0 )| = op (1) |β−β0 |≤Mn−1/2 So far consistency results have been established for estimates of β that solves 2.8. However situations rise where we are interested in selecting only those relevant predictors for a relatively large number of available predictors. Incorporating a penalty term like LASSO 20 (Tibshirani, 1996) or SCAD (J. Fan & Li, 2001) would allow us to simultaneously select the variables and estimate them. There is a natural sparsity assumption that builds into this formulation of the penalized score function. We proceed by investigating properties of the penalty term additionally attached to 2.1. Asymptotics for Penalty Term: Consider the true β0 = (β01 , ..., β0p )T for a fixed p. Without loss of generality, suppose β0j = 0 for j ≤ s and β0j = 0 for j > s, where s denotes the true number of non-zero βs. We now consider solving the following system of equations and investigate the relationship p p ˆ = 0 where Uw (β) is given by, of between β0 and the solution βˆ obtained from solving Uw (β) p ˆ −1 Σ−1/2 (Y − µ) − nqλ (|β|) Uw (β) = DT Σ−1/2 Γ (2.11) The penalty function, specifically its derivative must posses certain properties that basically ensures a consistent solution with regard to selection and estimation. Therefore we consider assumptions identical to (Johnson et al., 2008, Condition C.2). (A3) For all nonzero θ, limn→∞ n1/2 qλn (|θ|) = 0 and limn→∞ q λn (|θ|) = 0 ∂ q (|θ|) where q λn (|θ|) := ∂θ λn (A4) For any M > 0, limn→∞ inf |θ| s the penalty term diverges. These properties allows us to consistently distinguish significantly large coefficients. Further, (A4) also implies that for any consistent solutions of equation 2.11 there must satisfy zero estimates i.e. βˆj = 0. An alternative and equivalent version of 21 consistency can be established by showing 2.22 described in section 2.4 for the case p → ∞ as n → ∞. Finally both these conditions ensure the oracle property of the penalty term as defined by J. Fan & Li (2001). Some examples of penalty terms that satisfy (A4) and (A4) are SCAD, Adaptive LASSO (Zou, 2006), hard-thresholding (Donoho et al., 1994) and MCP (Zhang et al., 2010). Examples of those penalty terms that do not satisfy these conditions are LASSO and Elastic-Net (EN,(Zou & Hastie, 2005)). 2.3.1 Selection Consistency and Oracle Property Singularity at the origin: The derivative of the penalty terms under consideration used in equation 2.11 for a given βi = 0 is discontinuous. The sign function that results from modulus-type and indicator functions fluctuates at the origin rendering it difficult to obtain an exact zero-crossing solution. Therefore as suggested by (Johnson et al., 2008, Theorem 1a) a solution βˆ is zero-crossing if, p p lim sup n−1 Uwj (βˆ + ej )(Uwj (βˆ − ej ) ≤ 0 (2.12) →0+ where ej is the j th canonical unit vector. Additionally, an approximate zero crossing estimator is defined by, p p lim sup lim sup n−1 Uwj (βˆ + ej )(Uwj (βˆ − ej ) ≤ 0 n→∞ (2.13) →0+ Consequently, the zero-crossing is an exact solution to the penalized score function whenp ever Uw is continuous. 
Using this definition we show that the system of equations maybe 22 solved to obtain a √ n-consistent estimator with oracle properties. Lemma 2.3.3. Existence: Under assumptions (i),(ii) in lemma 2.3.1, (A1)-(A4) there exists a √ p n consistent approximate zero-crossing of Uw (β). With an approximate zero-crossing solution we proceed to check if the estimator posses selection consistency of the true zeroes in the true sparse β co-efficient vector. Based on the sparsity of β as defined in section 2.3, we assume that there are a fixed number of true non-zero βs s << p. Then we show the following. Lemma 2.3.4. Selection Consistency: For any √ n consistent zero-crossing solution of p Uw (β) under assumptions (A3) and (A4), lim P (βˆj = 0, j > s) = 1 n→∞ The current estimator thus exhibits selection consistency indicating potential for quality variable selection with a working correlation. The oracle property introduced by J. Fan & Li (2001) is established for non-concave penalty functions such as SCAD that satisfies assumption (A3) and (A4). The oracle property means that the penalized estimator is asymptotically equivalent to the oracle estimator that is the ideal estimator obtained only using the true non-zero (signal) variables without subjecting it to regularization. Theorem 2.3.5. Oracle Property: Under assumptions (i),(ii) and (A1)-(A4), for s << p, we denote βˆ1 = (βˆ1 , .., βˆs )T as the estimates of the true non-zero βs denoted by β01 = (β01 , .., β0s )T . Then, √ n(B11 + I011 ){βˆ1 − β01 + (B11 + I011 )−1 bn } →d N (0, V ) 23 where B11 = diag{−qλ (|β01 |)sign(β01 }, I011 is an s × s sub-matrix of I0 , n bn = −(qλn (|β01 |)sign(β01 ), ..., qλn (|β0s |)sign(β0s ))T and V is the s × s sub-matrix of I0 (β0 )−1 I1 (β0 )I0 (β0−1 ). Limiting conditions of assumptions (A1) and (A2) result in matrices I0 and I1 . The proofs of 2.3.3, 2.3.4 and 2.3.5 are provided in the Appendix of chapter 2. The theoretical results of consistency and oracle properties of the estimator provides strong evidence of the performance of the proposed statistical methodology. We have addressed the discontinuity of the score function in this section by defining approximate zero-crossing. In general numerical methods like Newton-Raphson algorithm are used to obtain the solution for the simultaneous equations using a simple update rule. However, the penalized score function in 2.11 is also non-differentiable and a modification is therefore needed to successfully implement the method. The following section provides an update rule using the MM-algorithm (Hunter & Li, 2005) and a way to incorporate a data-driven working correlation into the rule to be used in practice. 2.3.2 Model Implementation 2.3.2.1 MM-Algorithm Maximum likelihood estimation (MLE) have dominated the field of Statistics for a long time with a very natural theoretical intuition and efficiency. Thus they have also been very well formulated and studied. In the context of MLE, regularization methods have been extensively applied to create variable selection methods. It is known that the Quasi-likelihood functions coincide with maximum likelihood for data having distributions in the exponential family. When a penalty term is added to this setup it too faces discontinuity and non-differentiable 24 issues as mentioned in section 2.3.1. J. Fan & Li (2001) introduced a unified framework for the optimization using local quadratic approximations to be able to use a Newton-Raphson like update rule. 
This approximation for βj, with an initial starting value βj0, is given by,

  [p_λ(|βj|)]' = q_λ(|βj|) sign(βj) ≈ (q_λ(|βj0|)/|βj0|) βj

i.e. for βj0 ≠ 0 and βj ≈ βj0 we have,

  p_λ(|βj|) ≈ p_λ(|βj0|) + (q_λ(|βj0|)/(2|βj0|)) (βj² − βj0²)    (2.14)

Since 2.14 is undefined at βj0 = 0, the associated βs are set to 0 and the approximation is used only for the non-zero βs. Those βs that are set to 0 remain so for all subsequent iterations. Alternatively, Hunter & Li (2005) introduced the Majorize-Minimize (MM) algorithm in the context of MLE, which both generalizes the EM algorithm and circumvents the discontinuity issue, containing the local quadratic approximation above as a special case. By incorporating a perturbation into 2.14 that smooths out the denominator, the resulting function majorizes not the original piecewise-differentiable p_λ(·) but a perturbed version of it. Hunter & Li (2005) then showed that, asymptotically, this perturbation does not affect the optimization under a Newton-Raphson type update rule. As in the MLE setting, the objective is to minimize the objective function 2.5. A small perturbation ε > 0 renders the local quadratic approximation differentiable. We therefore construct a modified version of the penalized score function 2.11 in the following way,

  U^p_{w,ε}(β) = D^T Σ^{-1/2} Γ̂^{-1} Σ^{-1/2} (Y − µ) − n q_{λ,ε}(|β|)    (2.15)

where q_{λ,ε}(|β|) = (q_λ(|β1|+) sign(β1) |β1|/(ε + |β1|), ..., q_λ(|βp|+) sign(βp) |βp|/(ε + |βp|))^T and q_λ(|βj|+) = lim_{x→|βj|+} p'_λ(x), j = 1, ..., p. Similar to the solution for MLE provided by (Hunter & Li, 2005, Section 3.3), a solution β̂ of 2.15 can be found provided p_λ(·) is piecewise differentiable, non-decreasing and concave on (0, ∞), continuous at 0, and p'_λ(0+) < ∞. To obtain the solution one iteratively solves,

  β^{k+1} = β^k + [H(β^k) + n E(β^k)]^{-1} U^p_{w,ε}(β^k)    (2.16)

where H(β^k) = D(β^k)^T Σ^{-1/2}(β^k) Γ̂^{-1} Σ^{-1/2}(β^k) D(β^k) and

  E(β^k) = diag{ q_λ(|β^k_1|+)/(ε + |β^k_1|), ..., q_λ(|β^k_p|+)/(ε + |β^k_p|) }

A choice of ε was proposed by Hunter & Li (2005) ensuring that ‖U^p_{w,ε} − U^p_w‖ < τ/2, where τ is a very small predetermined tolerance (e.g. τ = 10^{-5}). This results in,

  ε = τ min{|β̂_j^0| : β̂_j^0 ≠ 0} / (2 n p'_λ(0+))    (2.17)

Several starting values β̂^0 must nevertheless be employed to ensure that the algorithm converges to a global maximum; otherwise the algorithm may oscillate near local maxima. Convergence is declared when ‖β̂^{k+1} − β̂^k‖ < δ for some δ > 0. It is important to note that the smoothing perturbation and the update rule do not drive the βs to exactly 0; thus, whenever convergence is attained, β̂^k_j is set to 0 if |U_{wj}(β̂^k_j)| > τ.

SCAD penalty: The Smoothly Clipped Absolute Deviation penalty (SCAD, J. Fan & Li (2001)) satisfies all the assumptions associated with equation 2.15 as well as (A3)-(A4), and is the penalty implemented in our current work to showcase the methodology. The explicit form of the derivative of the penalty is given by,

  q_λ(β) = λ { I(β ≤ λ) + ((aλ − β)_+ / ((a − 1)λ)) I(β > λ) }    (2.18)

The recommended value a = 3.7 from (J. Fan & Li, 2001, Section 2.1) is used for the simulations in section 2.3.3.
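For concreteness, the sketch below implements the SCAD derivative (2.18) and a single ε-perturbed update (2.16) for the simplest Gaussian, identity-link case, where D = X and Σ can be absorbed into the working correlation; this is a minimal illustration under those assumptions, not the full implementation used for the simulations, and all function names are our own.

```r
# A minimal sketch of (2.16)-(2.18), assuming a Gaussian response with
# identity link so that D = X and Sigma is absorbed into Gamma_inv.
scad_deriv <- function(beta, lambda, a = 3.7) {
  b <- abs(beta)
  lambda * ifelse(b <= lambda, 1, pmax(a * lambda - b, 0) / ((a - 1) * lambda))
}

mm_update <- function(beta, X, y, Gamma_inv, lambda, eps = 1e-4) {
  beta <- as.numeric(beta)
  n <- nrow(X)
  q <- scad_deriv(beta, lambda)                    # q_lambda(|beta_j|+)
  # epsilon-perturbed penalized score as in (2.15)
  U <- as.numeric(t(X) %*% Gamma_inv %*% (y - X %*% beta)) -
       n * q * sign(beta) * abs(beta) / (eps + abs(beta))
  H <- t(X) %*% Gamma_inv %*% X                    # H(beta^k), identity link
  E <- diag(q / (eps + abs(beta)), nrow = length(beta))
  as.numeric(beta + solve(H + n * E, U))           # one step of (2.16)
}
```

Iterating mm_update from several starting values until ‖β̂^{k+1} − β̂^k‖ < δ mirrors the convergence rule described above.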
2.3.2.2 Tuning Parameter Selection

In regularization methods, selection of the optimal tuning parameter is crucial for the efficiency of the method. Cross-validation (CV) techniques have proven fruitful for this purpose, but their premise rests on an independence assumption. Spatial data violates this independence, making CV techniques unreliable in this paradigm. The most natural cross-validation technique, as usually used in kriging, would be a jackknife-like n-fold CV estimate; however, applied over a sequence of tuning parameter values it is computationally highly inefficient. Likelihood-based criteria such as AIC and BIC are not available either. One might consider the modified AIC for GEE proposed by Pan (2001), but the lack of a closed-form expression for the quasi-likelihood is yet another barrier to using that method. We therefore resort to a grid search of the tuning parameter over plausible values, choosing the value that satisfies an optimality criterion such as minimizing prediction error. Specifically, we implement the MM-algorithm with multiple starts over a sequence of tuning parameters and select the β̂ with the minimum mean squared error.

2.3.2.3 Construction of Working Correlation

To implement the update rule 2.16, the working correlation Γ̂ is required. Since the underlying true spatial covariance function is unknown, we use the Pearson residuals at each iteration k. The starting value β̂^{(0)} is obtained using a GLM under independence. At each iteration of the update rule the Pearson residuals are calculated as,

  r̂_i^{(k)} = (y(s_i) − µ_i(β̂^{(k)})) / √(Σ_{ii}(µ(β̂^{(k)})))

The over-dispersion parameter σ² is then estimated by σ̂²^{(k)} = Σ_{i=1}^n (r̂_i^{(k)})² / (n − p). The residuals are subsequently used to fit a variogram model γ(h) at lag h, using parametric models that satisfy (i) and (ii) of lemma 2.3.1. Γ̂ is then constructed from the estimated γ̂(h) as Γ̂(h) = 1 − γ̂(h). Fitting by least squares may increase computation time significantly, and certain instabilities creep in when each iteration produces a slightly different variogram model shape; fixing the model structure at the outset instills some stability in the method. Further details on this approach may be found in Feng et al. (2016), where it is employed specifically for spatial binary data.

2.3.3 Simulation

In order to investigate the effectiveness of the proposed method we consider two scenarios for the response: binary and count spatial responses. Let us first consider the binary case. Three different unit-distance grids, 15 × 15, 20 × 20 and 25 × 25, are considered, i.e. n = 225, 400, 625 respectively. The number of predictors associated with the response is 20, with 3 non-zero coefficients, i.e. β⁰_{20×1} = (1, −1.5, 1, 0, ..., 0)^T. The corresponding design matrix X_{20×n} is generated from a Uniform(0,1) distribution. The mean of the spatial binary distribution is associated with the predictors through a logit link, i.e. E(y(s_i)) = π_i with log(π_i/(1 − π_i)) = x_i^T β⁰ for i = 1, ..., n. Two spatial correlation functions are considered: the power correlation ρ^{d_{ij}}, where d_{ij} = ‖s_i − s_j‖, and the Matérn covariance function with parameters θ and ν. In the tables below, ρ̃ denotes the working correlation value substituted. The binary data are generated using the archived R package mvtBinaryEP, based on the paper by Emrich & Piedmonte (1991), which describes generating correlated binary data via the tetrachoric correlation. A renewed version of the generating function may be found in the R package MultiOrd. For each setup, 100 data sets were simulated. The estimated mean squared error (MSE) is (1/100) Σ_{j=1}^{100} ‖β̂_j − β⁰‖².
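The sketch below reproduces the shape of this setup with a Gaussian-copula style generator, thresholding a latent Gaussian field so that each y(s_i) has the intended marginal probability π_i. The simulations reported here used the tetrachoric-correlation construction of mvtBinaryEP, so the binary dependence induced below is only an approximation of that scheme.

```r
# A minimal sketch of the binary simulation setup; the latent-Gaussian
# thresholding is an approximation of the mvtBinaryEP generator.
set.seed(1)
g <- 15; n <- g^2; p <- 20
sites <- expand.grid(x = 1:g, y = 1:g)       # unit-distance grid
D <- as.matrix(dist(sites))                  # pairwise distances d_ij
rho <- 0.1
Sigma <- rho^D                               # power correlation rho^{d_ij}
X <- matrix(runif(n * p), n, p)
beta0 <- c(1, -1.5, 1, rep(0, p - 3))
pi_i <- plogis(X %*% beta0)                  # logit-link marginal means
z <- t(chol(Sigma)) %*% rnorm(n)             # latent correlated Gaussians
y <- as.integer(pnorm(z) <= pi_i)            # binary with marginals pi_i
```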
We compare the results obtained using SCAD and LASSO as the penalty function and, since p < n, the score function from the original QL method in equation 2.1 without a penalty. Both tables 2.1 and 2.2 indicate that the penalized quasi-likelihood approach provides considerably better estimates than the plain quasi-likelihood approach. The choice of working correlation parameter does not have a significant impact on estimation efficiency, though there is a tendency for the method to over-select under stronger spatial dependence. Compared to the SCAD penalty, the LASSO penalty yields a less sparse result, a more complex model with many more predictors selected, without significantly improving the MSE. Consequently, even though the TP results with the LASSO penalty are slightly better, its FP results are also larger than those with the SCAD penalty. A noticeable improvement is observed when a working correlation with a misspecified parameter is incorporated, in comparison to employing the method under the independence assumption, i.e. using the identity matrix as the working correlation. Finally, both the MSE and the selection results improve significantly as the sample size increases.
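The tabled summaries can be computed as below, under the (assumed) reading that TP and FP count, per replicate, the truly non-zero coefficients correctly selected and the truly zero coefficients wrongly selected, averaged over the 100 replicates; CP, presumably the coverage probability of the 95% intervals, additionally requires the interval estimates and is omitted here.

```r
# A minimal sketch of the reported MSE/TP/FP summaries; `beta_hats` is a
# (replicates x p) matrix of estimates. The TP/FP definitions are our
# assumed reading of the tables, not stated explicitly in the text.
sim_summary <- function(beta_hats, beta0) {
  nonzero <- beta0 != 0
  c(MSE = mean(rowSums(sweep(beta_hats, 2, beta0)^2)),
    TP  = mean(rowSums(beta_hats[, nonzero,  drop = FALSE] != 0)),
    FP  = mean(rowSums(beta_hats[, !nonzero, drop = FALSE] != 0)))
}
```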
Table 2.1: Simulation results for spatial binary data generated using the power correlation model with ρ = 0.1. QL represents a quasi-likelihood approach without a penalty term, PQL.LASSO a penalized quasi-likelihood approach with the LASSO penalty and PQL.SCAD a penalized quasi-likelihood approach with the SCAD penalty. PQL.LASSO.Ind and PQL.SCAD.Ind use the identity matrix as working correlation, assuming independence.

  Method          Grid Size   MSE    TP     FP      CP
  PQL.LASSO.Ind   15 × 15     0.10   2.70   7.31    0.85
  PQL.LASSO.Ind   20 × 20     0.08   2.82   4.98    0.92
  PQL.LASSO.Ind   25 × 25     0.07   2.99   3.97    0.95
  PQL.SCAD.Ind    15 × 15     0.12   2.27   3.44    0.72
  PQL.SCAD.Ind    20 × 20     0.11   2.60   2.37    0.88
  PQL.SCAD.Ind    25 × 25     0.14   2.94   0.70    0.96
  ρ̃ = 0.1
  QL              15 × 15     0.30   2.94   14.42   −
  QL              20 × 20     0.15   2.99   13.61   −
  QL              25 × 25     0.08   3.00   12.14   −
  PQL.LASSO       15 × 15     0.11   2.66   7.16    0.87
  PQL.LASSO       20 × 20     0.08   2.79   4.46    0.93
  PQL.LASSO       25 × 25     0.07   2.99   3.68    0.94
  PQL.SCAD        15 × 15     0.12   2.23   3.17    0.69
  PQL.SCAD        20 × 20     0.11   2.62   2.26    0.87
  PQL.SCAD        25 × 25     0.14   2.94   0.57    0.96
  ρ̃ = 0.3
  QL              15 × 15     0.33   2.95   14.65   −
  QL              20 × 20     0.15   2.99   13.53   −
  QL              25 × 25     0.09   3.00   12.76   −
  PQL.LASSO       15 × 15     0.10   2.63   6.22    0.86
  PQL.LASSO       20 × 20     0.08   2.80   4.94    0.87
  PQL.LASSO       25 × 25     0.06   2.99   3.54    0.93
  PQL.SCAD        15 × 15     0.12   2.35   3.11    0.78
  PQL.SCAD        20 × 20     0.11   2.70   1.95    0.88
  PQL.SCAD        25 × 25     0.14   2.92   0.78    0.93

The proposed method, with assumptions (i) and (ii) of the ρ-mixing conditions in lemma 2.3.1, allows for misspecification of the parameters and of the covariance structure, as long as those assumptions are satisfied.

Table 2.2: Simulation results for spatial binary data generated using the power correlation model with ρ = 0.3. Labels are as in Table 2.1.

  Method          Grid Size   MSE    TP     FP      CP
  PQL.LASSO.Ind   15 × 15     0.10   2.58   6.59    0.91
  PQL.LASSO.Ind   20 × 20     0.08   2.84   4.13    0.96
  PQL.LASSO.Ind   25 × 25     0.06   2.99   4.13    0.96
  PQL.SCAD.Ind    15 × 15     0.11   2.26   3.04    0.87
  PQL.SCAD.Ind    20 × 20     0.10   2.67   2.12    0.90
  PQL.SCAD.Ind    25 × 25     0.15   2.89   0.72    0.92
  ρ̃ = 0.1
  QL              15 × 15     0.28   2.95   14.72   −
  QL              20 × 20     0.15   3.00   13.74   −
  QL              25 × 25     0.10   3.00   12.85   −
  PQL.LASSO       15 × 15     0.10   2.61   6.73    0.83
  PQL.LASSO       20 × 20     0.07   2.90   4.42    0.92
  PQL.LASSO       25 × 25     0.07   2.97   3.74    0.84
  PQL.SCAD        15 × 15     0.11   2.16   2.86    0.78
  PQL.SCAD        20 × 20     0.10   2.74   2.44    0.90
  PQL.SCAD        25 × 25     0.15   2.90   0.55    0.89
  ρ̃ = 0.3
  QL              15 × 15     0.25   2.98   14.32   −
  QL              20 × 20     0.13   2.98   13.39   −
  QL              25 × 25     0.08   3.00   12.44   −
  PQL.LASSO       15 × 15     0.09   2.67   5.94    0.78
  PQL.LASSO       20 × 20     0.07   2.88   4.34    0.85
  PQL.LASSO       25 × 25     0.06   2.99   3.64    0.93
  PQL.SCAD        15 × 15     0.12   2.28   3.11    0.71
  PQL.SCAD        20 × 20     0.10   2.72   1.94    0.81
  PQL.SCAD        25 × 25     0.14   2.93   0.63    0.88

Therefore, in what follows (see table 2.3) we generate spatial binary data using the Matérn correlation model,

  C(θ, ν) = (√(2ν) θ)^ν K_ν(√(2ν) θ) / (2^{ν−1} Γ(ν))    (2.19)

where K_ν(·) is the modified Bessel function of the second kind. The true parameters are θ0 = 0.7 and ν0 = 0.3. The working correlation is substituted with the true underlying spatial correlation, with the misspecified parameters θ̃ = 0.8 and ν̃ = 0.4, and with a misspecified exponential correlation structure with ρ̃ = 0.1.
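A direct transcription of (2.19) in R is sketched below; besselK and gamma are base R functions. How the pairwise distance enters the argument is an assumption here, since the source equation is ambiguous on that point; we evaluate the correlation at a scaled distance t.

```r
# A minimal sketch of the Matern correlation in (2.19); the argument t is
# taken to be the (scaled) distance at which the correlation is evaluated,
# an assumption not fully recoverable from the source equation.
matern_corr <- function(t, nu) {
  u <- sqrt(2 * nu) * t
  out <- u^nu * besselK(u, nu) / (2^(nu - 1) * gamma(nu))
  out[t == 0] <- 1                 # limit of the correlation at distance 0
  out
}
matern_corr(c(0, 0.5, 1, 2), nu = 0.3)
```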
The results in table 2.3 show that the performance of the method under misspecification is comparable to the results obtained under the true correlation structure. The LASSO penalty still tends to over-select, worsening the FP rate. A significant improvement in selection is observed as the sample size increases in all scenarios, i.e. the TP rate increases and the FP rate decreases as n increases.

Table 2.3: Simulation results when a Matérn correlation model with θ = 0.7 and ν = 0.3 is used to generate spatial binary data. The results are compared to misspecification of the Matérn parameters (θ̃ = 0.8, ν̃ = 0.4) and misspecification of the structure (exponential with ρ̃ = 0.1).

  Method       Grid Size   MSE    TP     FP      CP
  θ̃ = 0.7, ν̃ = 0.3
  QL           15 × 15     0.30   2.93   14.19   −
  QL           20 × 20     0.13   2.98   13.32   −
  QL           25 × 25     0.08   3.00   12.42   −
  PQL.LASSO    15 × 15     0.10   2.78   7.58    0.84
  PQL.LASSO    20 × 20     0.07   2.94   5.01    0.91
  PQL.LASSO    25 × 25     0.06   2.99   3.80    0.94
  PQL.SCAD     15 × 15     0.18   2.59   4.21    0.76
  PQL.SCAD     20 × 20     0.15   2.71   1.38    0.77
  PQL.SCAD     25 × 25     0.15   2.94   0.56    0.93
  θ̃ = 0.8, ν̃ = 0.4
  QL           15 × 15     0.28   2.97   14.24   −
  QL           20 × 20     0.14   2.98   13.24   −
  QL           25 × 25     0.08   3.00   12.46   −
  PQL.LASSO    15 × 15     0.10   2.83   7.46    0.85
  PQL.LASSO    20 × 20     0.07   2.94   5.10    0.90
  PQL.LASSO    25 × 25     0.06   2.99   3.88    0.95
  PQL.SCAD     15 × 15     0.18   2.51   3.91    0.73
  PQL.SCAD     20 × 20     0.15   2.81   1.47    0.91
  PQL.SCAD     25 × 25     0.14   2.92   0.67    0.91
  ρ̃ = 0.1
  QL           15 × 15     0.31   2.98   14.29   −
  QL           20 × 20     0.14   2.99   13.63   −
  QL           25 × 25     0.09   3.00   12.43   −
  PQL.LASSO    15 × 15     0.10   2.79   7.54    0.84
  PQL.LASSO    20 × 20     0.07   2.95   5.39    0.93
  PQL.LASSO    25 × 25     0.06   2.98   4.01    0.93
  PQL.SCAD     15 × 15     0.17   2.45   2.48    0.70
  PQL.SCAD     20 × 20     0.15   2.85   1.24    0.86
  PQL.SCAD     25 × 25     0.15   2.86   0.71    0.89

We further explore whether multicollinearity, i.e. correlated predictors, affects the performance of the proposed method. In the setup of table 2.4 we again consider spatial binary data generated with the exponential covariance with ρ = 0.3. There is a marked breakdown of the method for very highly correlated variables, with correlations as high as 0.9. We also investigated a setup involving a mixture of correlated and independent variables, with the true non-zero coefficients shared among both sets of variables. The results in table 2.5 indicate a slight improvement in identifying the variables, confirming the notion that, as with most methods, very high multicollinearity naturally produces identifiability issues and may thus result in poorer performance.

Table 2.4: Simulation results when the covariates are correlated. Spatial binary data with a power correlation model (ρ = 0.3) on a 20 × 20 grid are considered. A power correlation model was used to construct the working correlation matrix (ρ̃); t is the level of dependence among covariates.

  Penalty   t   ρ̃     MSE    TP     FP
  LASSO     1   0.1   0.14   2.55   6.19
  LASSO     1   0.3   0.16   2.36   5.52
  LASSO     3   0.1   0.19   1.72   6.75
  LASSO     3   0.3   0.58   1.60   7.68
  SCAD      1   0.1   0.17   2.24   3.93
  SCAD      1   0.3   0.18   1.98   3.49
  SCAD      3   0.1   0.20   1.38   4.71
  SCAD      3   0.3   0.20   1.28   4.31

Table 2.5: Simulation results when 10 covariates are correlated and 10 are independent. Spatial binary data with a power correlation model (ρ = 0.3) on a 20 × 20 grid are considered. A power correlation model was used to construct the working correlation matrix (ρ̃); t is the level of dependence among covariates. β0 = (1, −1.5, 0, ..., 1, 0, ...)^T, i.e. 2 non-zero coefficients are correlated and 1 is independent.

  Penalty   t   ρ̃     MSE    TP     FP
  LASSO     1   0.1   0.11   2.62   5.61
  LASSO     1   0.3   0.12   2.62   5.59
  LASSO     3   0.1   0.15   2.29   4.9
  LASSO     3   0.3   0.20   2.32   5.74
  SCAD      1   0.1   0.18   2.28   2.68
  SCAD      1   0.3   0.19   2.41   4.58
  SCAD      3   0.1   0.25   2.13   4.63
  SCAD      3   0.3   0.45   2.34   7.89

For the second case, we investigate the method for count data. As in the binary setup, we use three different unit-distance grids, 15 × 15, 20 × 20 and 25 × 25, i.e. n = 225, 400, 625 respectively. The number of predictors associated with the response is 20, with 3 non-zero coefficients, i.e. β⁰_{20×1} = (1, −1.5, 1, 0, ..., 0)^T. The corresponding design matrix X_{20×n} is generated from a Uniform(0,1) distribution. The mean of the spatial Poisson distribution is associated with the predictors through a log link, i.e. E(y(s_i)) = λ_i with log(λ_i) = x_i^T β⁰ for i = 1, ..., n. The spatial correlation function considered is once again the power correlation ρ^{d_{ij}}, where d_{ij} = ‖s_i − s_j‖. In study areas such as crime statistics and disease mapping, spatial count data play a key role. However, without the existence of a known joint distribution, ensuring that the moment conditions assumed in the proposed method hold may be challenging.
The review paper by Inouye et al. (2017) covers the known approaches for simulating correlated count data with a particular Poisson-like marginal distribution, using copula methods, mixed models, Gaussian random fields, etc. While these methods may approximate the intended model, there is a fair amount of deviation that may not be easily accounted for. We therefore resort to generating spatial Poisson data from a Poisson-Normal model with a conditional autoregressive (CAR) correlation structure for a Normal error with mean Xβ. Tables 2.6 and 2.7 provide the results, which are not nearly as satisfactory: there is an unusual amount of selection, with very high FP values. We attribute the failure of the method to the inability to generate data satisfying the assumptions of the method, especially with respect to correlation satisfying the ρ-mixing conditions.
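A sketch of one plausible reading of this generator is given below: a mean-zero proper CAR field z is added to Xβ inside the exponential Poisson mean. The precision parameterization Q = τ(Dg − αW) on the grid's rook neighbourhood is an assumption; the exact CAR specification used for tables 2.6 and 2.7 may differ.

```r
# A minimal sketch of the Poisson-Normal CAR generator described above;
# the proper-CAR precision Q = tau * (Dg - alpha * W) is assumed.
set.seed(1)
g <- 15; n <- g^2
sites <- expand.grid(x = 1:g, y = 1:g)
W  <- (as.matrix(dist(sites)) == 1) * 1     # rook adjacency on the grid
Dg <- diag(rowSums(W))
Q  <- 4 * (Dg - 0.9 * W)                    # proper CAR precision matrix
z  <- backsolve(chol(Q), rnorm(n))          # z ~ N(0, Q^{-1})
X  <- matrix(runif(n * 20), n, 20)
beta0 <- c(1, -1.5, 1, rep(0, 17))
y  <- rpois(n, exp(X %*% beta0 + z))        # Poisson given the latent field
```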
Table 2.6: Simulation results when the power correlation model with ρ = 0.1 is used to generate spatial Poisson data. The power correlation model with ρ̃ is used for the working correlation matrix in estimation.

  Method      Grid Size   MSE    TP     FP
  ρ̃ = 0.1
  QL          15 × 15     0.24   2.83   14.10
  QL          20 × 20     0.15   3.00   9.96
  QL          25 × 25     0.07   3.00   10.74
  PQL.LASSO   15 × 15     0.19   2.74   11.49
  PQL.LASSO   20 × 20     0.12   3.00   7.22
  PQL.LASSO   25 × 25     0.04   3.00   6.91
  PQL.SCAD    15 × 15     0.24   2.95   14.31
  PQL.SCAD    20 × 20     0.13   3.00   9.84
  PQL.SCAD    25 × 25     0.07   3.00   10.81
  ρ̃ = 0.3
  QL          15 × 15     0.25   2.76   13.71
  QL          20 × 20     0.19   3.00   11.22
  QL          25 × 25     0.07   3.00   10.84
  PQL.LASSO   15 × 15     0.21   2.54   11.50
  PQL.LASSO   20 × 20     0.16   3.00   9.11
  PQL.LASSO   25 × 25     0.04   3.00   7.65
  PQL.SCAD    15 × 15     0.24   2.87   14.52
  PQL.SCAD    20 × 20     0.16   3.00   9.96
  PQL.SCAD    25 × 25     0.06   3.00   10.61

Table 2.7: Simulation results when the power correlation model with ρ = 0.3 is used to generate spatial Poisson data. The power correlation model with ρ̃ is used for the working correlation matrix in estimation.

  Method      Grid Size   MSE    TP     FP
  ρ̃ = 0.1
  QL          15 × 15     0.24   2.99   14.38
  QL          20 × 20     0.11   3.00   9.69
  QL          25 × 25     0.07   3.00   11.49
  PQL.LASSO   15 × 15     0.19   2.92   12.16
  PQL.LASSO   20 × 20     0.09   3.00   7.84
  PQL.LASSO   25 × 25     0.05   3.00   10.74
  PQL.SCAD    15 × 15     0.25   2.98   14.62
  PQL.SCAD    20 × 20     0.10   3.00   10.35
  PQL.SCAD    25 × 25     0.08   3.00   12.14
  ρ̃ = 0.3
  QL          15 × 15     0.25   2.99   13.69
  QL          20 × 20     0.13   3.00   10.68
  QL          25 × 25     0.06   3.00   11.21
  PQL.LASSO   15 × 15     0.20   2.88   11.28
  PQL.LASSO   20 × 20     0.11   3.00   7.49
  PQL.LASSO   25 × 25     0.03   3.00   7.35
  PQL.SCAD    15 × 15     0.25   2.99   14.54
  PQL.SCAD    20 × 20     0.11   3.00   9.33
  PQL.SCAD    25 × 25     0.07   3.00   11.42

2.3.3.1 Synthetic Analysis: Lansing Woods

To illustrate how this method may perform on real data, we use the well-known Lansing Woods data, originally used as an application by (Lin & Clayton, 2005, Section 3). The data, obtained from (Fingleton, 1986, Page 49), consist of 576 sites located on a 24 × 24 grid. At each site there is a recording of whether an oak, hickory or maple tree is present nearby in that gridded block; see Figure 2.1. Lin & Clayton (2005) specifically studied whether the presence of a hickory tree inhibits the presence of a maple tree. They note that a conditional logistic model is a viable technique for this data set and that, if the spatial dependence were ignored, the method would asymptotically result in a chi-square test for independence (Agresti & Kateri, 2011). The estimates obtained from the quasi-likelihood approach for two predictors, presence of oak and presence of hickory, with presence of maple as the response, lie within the liberal estimates from the deflated chi-square test of Fingleton (1986) and the conditional binary logistic model.

Figure 2.1: Lansing Woods data on a 24 × 24 grid showing the presence/absence of a specific tree (panels: Maple, Hickory, Oak).

For our purposes of implementing a variable selection technique, we considered a set of spurious covariates comprising 18 arbitrary covariates generated from 9 standard normal and 9 standard uniform distributions, together with the two indicator variables of interest for oak and hickory. By employing the SCAD penalty and using the power correlation to construct a working correlation as described in section 2.3.2.3, we successfully discarded all artificially added variables. The estimates obtained are shown in Table 2.8. These estimates are similar to the results of Lin & Clayton (2005), where the estimate for the coefficient of the presence of hickory is −0.26 with standard deviation 0.001, indicating that the method efficiently selects and estimates the parameters of interest.

Table 2.8: Lansing Woods: Estimates and 95% C.I.

  Variable Name         Estimate   0.025%   0.975%
  Presence of Hickory   -0.24      -0.25    -0.23
  Presence of Oak        0.29       0.28     0.30

2.3.4 Real Data Analysis

In this section we discuss two examples showcasing the technique described above. The first example is binary response data describing the presence or absence of fire disturbances in Crawford County, Michigan, U.S. The second is an example of count data relating to county-wise lung cancer incidence in Iowa, U.S. The results of these two studies are published in Feng et al. (2016). The distances between locations in these datasets are computed in two alternative ways:

I. Euclidean distance between the latitude and longitude coordinates, in degrees.

II. Geodesic distance between the latitude and longitude coordinates, in km, accounting for the curvature of the Earth's surface.

Coordinate transformation for geodesic distance - bearing measurement: To convert the distances to km with respect to the Earth's curvature, we begin with an average radius from the center of the Earth, R = 6378.137 km. Consider two distinct points w1 = (κ1, ψ1) and w2 = (κ2, ψ2), and define the differences between the coordinates of w1 and w2, converted to radians, as ∆Lat = (κ2 − κ1)·π/180 and ∆Lon = (ψ2 − ψ1)·π/180. We then calculate,

  a = sin(∆Lat/2)·sin(∆Lat/2) + cos(κ1·π/180)·cos(κ2·π/180)·sin(∆Lon/2)·sin(∆Lon/2)
  b = cos(κ1·π/180)·sin(κ2·π/180) − sin(κ1·π/180)·cos(κ2·π/180)·cos(∆Lon)
  c = 2·arctan(√a / √(1 − a))
  d = sin(∆Lon)·cos(κ2·π/180)

We then use a, b, c, d to calculate the length r = R·c and its corresponding angle θ = arctan(d/b). Taking w1 as the reference point, the distance r and angle θ of every other point may be obtained relative to this reference. These points are then projected into a Cartesian coordinate system such that the Euclidean distance between two points in the new coordinate system is the kilometer distance between them.
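The transformation is summarized in the sketch below, which maps each (latitude, longitude) pair to km-scale Cartesian coordinates relative to the reference point w1; this is a minimal transcription of the formulas above, with the example coordinates purely illustrative.

```r
# A minimal sketch of the distance-bearing projection described above;
# (lat0, lon0) is the reference point w1 and output coordinates are in km.
project_to_km <- function(lat, lon, lat0, lon0, R = 6378.137) {
  rad <- pi / 180
  dLat <- (lat - lat0) * rad
  dLon <- (lon - lon0) * rad
  a <- sin(dLat / 2)^2 + cos(lat0 * rad) * cos(lat * rad) * sin(dLon / 2)^2
  b <- cos(lat0 * rad) * sin(lat * rad) -
       sin(lat0 * rad) * cos(lat * rad) * cos(dLon)
  d <- sin(dLon) * cos(lat * rad)
  r <- R * 2 * atan2(sqrt(a), sqrt(1 - a))  # r = R * c (haversine distance)
  theta <- atan2(d, b)                      # bearing relative to w1
  cbind(x = r * sin(theta), y = r * cos(theta))
}
# Example: project two points relative to the first one
project_to_km(lat = c(44.68, 44.70), lon = c(-84.61, -84.55),
              lat0 = 44.68, lon0 = -84.61)
```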
2.3.4.1 Crawford County Fire Disturbances

Within the spatial framework considered so far, we base this real data example on the regularly shaped political border of Crawford County, in the high plains region of Michigan's southern peninsula. The response variable Y(s) is the presence or absence of fire disturbances at a 1.2 mile resolution, with a total of 1484 locations. The data date back to the early 1800s, drawing on satellite information and land use surveys conducted by the Michigan Natural Features Inventory, associated with Michigan State University and the State Department.

Figure 2.2: Crawford County fire disturbance map.

The covariates, or explanatory variables, include several vegetation types, soil types, drainage indices and multiple topographical indices at various spatial resolutions. Preliminary interpolation techniques were used to standardize the scale of resolution for each of these variables. To avoid the issue known in statistics as separation, we only considered covariates that covered over 25 percent of the spatial area under investigation. Specifically, since we are looking at binary (presence/absence) responses, we want to ensure that multiple combinations of the explanatory variables exist for both outcomes, i.e. that there is sufficient variability among the covariates. We also excluded certain highly correlated vegetation types, soil types and indices (e.g. correlation = 0.9). After screening, 11 covariates remained under consideration: three vegetation types (Jack Pine-Red Pine, mixed coniferous swamp, Pine Barrens), one soil cover type (glacial out-wash sand and gravel and post-glacial alluvium), topographical indices (TPI) at one mile and 240 meter radii, Schaetzl's drainage index from Schaetzl (1986), the water capacity of soils in each 1 mile polygon, sand cover, elevation and slope of the region. The distances between points are measured based on the geographic coordinate system (latitude and longitude) for rectangular polygons of 1.2 square miles covering Crawford County uniformly. We then applied the variable selection method for binary responses to the presence/absence data, imposing a power correlation structure and obtaining a working correlation as explained in section 2.3.2.3. Two relevant variables were selected using the SCAD penalty with distances computed as in I of section 2.3.4. Table 2.9 provides the estimates and the 95% confidence intervals based on the asymptotic normality obtained from the oracle property.

Table 2.9: Crawford County: Estimates and 95% C.I.

  Variable Name                         Estimate   0.025%    0.975%
  Proportion of Pine Barrens coverage    0.0204     0.0199    0.0220
  TPI 1-mile                            -0.0293    -0.0318   -0.0267

Diagnostically, the consulting geologist provided evidence that the Pine Barrens vegetation type is known to be associated with the presence of fire, as observed in the state of New Jersey by Givnish (1981). Compared with a regular generalized linear model for independent data, the selected variables were indeed highly statistically significant.

Crawford County fire disturbances (in km): Applying transformation II of section 2.3.4 to the coordinates of Crawford County and then performing the variable selection method with the SCAD penalty as described in section 2.3.1, exactly one variable was selected: the proportion of Pine Barrens at each location, with a coefficient estimate of 0.0275 and 95% CI (0.0248, 0.0302).

2.3.4.2 Lung Cancer Incidence of Counties in Iowa

To showcase the method for spatial count data, we present results from using this selection technique to find relevant socio-economic predictors of lung cancer incidence in the state of Iowa. The county-wise rate of lung cancer incidence represents a time-normalized aggregate over the years 2008 to 2012.
The data are combined from sources such as the Iowa State Cancer Registry (SEER data, National Cancer Institute; Ries et al. (2006)), the State Data Center of Iowa (census data) and Iowa Public Health Tracking, to obtain a complete spectrum of socio-economic covariates for each of the 99 counties in Iowa. Due to the naturally regular shape of the administrative county boundaries, we used the geographic coordinate system to locate the centers of the counties; the distances between sites are based on these central locations. The response variable Y(s_i) is the rate of incidence of lung cancer in the i-th county.

Figure 2.3: Lung cancer incidence map of Iowa and bubble plot indicating county-wise population.

The model applied to the count data is a Poisson distribution, i.e. Y(s_i) ∼ Poisson(µ_i) with µ_i = E_i e^{x_i^T β}, where E_i is the estimated population at risk in the i-th county, calculated by considering the age-adjusted risk at both the county and population levels. Details and formulae can be found in Dass et al. (2012). Among the 33 variables originally collected, a screening process was implemented to discard repetitive and highly correlated variables. To the 19 variables deemed suitable for the analysis we applied the proposed method, using a power correlation structure based on the empirical variogram with the iterative working correlation estimation algorithm described in section 2.3.2.3. This resulted in the selection of three variables, shown in table 2.10.

Table 2.10: Iowa Lung Cancer Incidence Rate: Estimates and 95% C.I.

  Variable Name                                               Estimate   0.025%    0.975%
  Percentage population below 18 years of age                -0.0806    -0.0873   -0.0739
  Percentage population with education less than 9th grade    0.0652     0.0583    0.0722
  Percentage population that moved within the same county    -0.0675    -0.0743   -0.0607

Additional diagnostics with the lung cancer data indicated that, under the independence assumption with a generalized linear model approach, only the selected variables are significant. Moreover, when the analysis is performed on the selected variables alone, a likelihood ratio test determined that the model with all variables present and the model with only the selected variables are statistically similar.

Iowa county-wise lung cancer data (in km): Applying the geodesic transformation II of section 2.3.4 and then performing the proposed variable selection method with the SCAD penalty and the corresponding log link for the Poisson rate data, the same three variables were selected; their estimates are shown in table 2.11.

Table 2.11: Iowa Lung Cancer Incidence Rate: Estimates and 95% C.I.

  Variable Name                                               Estimate   0.025%    0.975%
  Percentage population below 18 years of age                -0.0925    -0.0928   -0.0922
  Percentage population with education less than 9th grade    0.09404    0.0937    0.09438
  Percentage population that moved within the same county    -0.0839    -0.0842   -0.0835
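The independence-GLM diagnostic mentioned above corresponds to a standard Poisson regression with the expected counts as an offset; a minimal sketch is given below, with a hypothetical data frame iowa whose column names are purely illustrative.

```r
# A minimal sketch of the independence GLM baseline, assuming a hypothetical
# data frame `iowa` with observed counts, expected counts E and the three
# selected covariates; this is the diagnostic fit, not the spatial PQL fit.
fit <- glm(cases ~ pct_under_18 + pct_less_than_9th_grade +
             pct_moved_within_county,
           offset = log(E), family = poisson, data = iowa)
summary(fit)  # under independence, only the selected variables are significant
```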
In the following section, we build on the existing setup to address selection when the predictor space p_n → ∞ as n → ∞, creating a premise for solving high-dimensional problems in the discrete spatial paradigm.

2.4 Asymptotic Properties of the Penalized Quasi-Likelihood Estimator (p - expanding dimension)

2.4.1 Score Function Asymptotics

Until now we have demonstrated the need for variable selection techniques for discrete spatial data with a fixed number p of covariates. A natural extension of this method is the case where the number of predictors in the design matrix X increases with the size of the data. Let us denote the dimension of the new design matrix by p_n × n. These predictors are regressed onto a response Y(s) at sites {s_1, ..., s_n} ∈ R^d, as discussed in section 2.1. In this portion we study the asymptotics of the estimator β̂_n obtained from solving generalized estimating equations where p_n → ∞ as n → ∞. The parameters of interest β_n carry the subscript n to distinguish the solution from the fixed-p case. We continue to assume that the working correlation used in the score function and the true underlying spatial covariance satisfy the ρ-mixing conditions. It is also important to note that, to the best of our knowledge, no asymptotics for the score function within this spatial architecture have been established. We therefore begin by studying the properties of the score function for p_n → ∞. We characterize the score function in the following way,

  U_n(β_n) = (1/√n) X^T Σ^{1/2}(β_n) Γ^{-1} Σ^{-1/2}(β_n) (Y − µ(β_n))    (2.20)

Unlike the original score function in equation 2.1, we consider only those smooth link functions g(·) such that D = ∂µ/∂β^T = ΣX. It is known that responses Y(s) from a marginal canonical exponential family share this property. The corresponding score function in which a working correlation Γ̂, with properties similar to those described in section 2.3, is used is denoted by Û_n(β_n). Further, L. Wang et al. (2012) provide a decomposition of the derivative of 2.20 in their supplementary material, using the matrix product derivative properties specified in Pan (2002).

Lemma 2.4.1. The gradient of the score function can be decomposed as,

  ∇_β U_{nk}(β_n) := ∂U_{nk}(β_n)/∂β_n^T = H_{nk}(β_n) + E_{nk}(β_n) + G_{nk}(β_n)    (2.21)

where U_{nk}(β_n) = e_k^T U_n(β_n), e_k is the p_n-dimensional basis vector with k-th element equal to 1, and

  H_{nk}(β_n) = −(1/√n) e_k^T X^T Σ^{1/2}(β_n) Γ^{-1} Σ^{1/2}(β_n) X
  E_{nk}(β_n) = −(1/(2√n)) e_k^T X^T Σ^{1/2}(β_n) Γ^{-1} Σ^{-3/2}(β_n) C(β_n) D(β_n) X
  G_{nk}(β_n) = (1/(2√n)) e_k^T X^T Σ^{1/2}(β_n) D(β_n) J(β_n) X

for D(β_n) = diag(µ̈(X_1^T β_n), ..., µ̈(X_n^T β_n)), C(β_n) = diag(Y_1 − µ_1, ..., Y_n − µ_n) and J(β_n) = diag(Γ^{-1} Σ^{-1/2}(β_n)(Y − µ(β_n))).

In order to show that there exists a sequence of roots β̂_n of Û_n(β_n) = 0 such that ‖β̂_n − β_{n0}‖ = O_p(√(p_n/n)), it is sufficient to show that for all ε > 0 there exists a constant ∆ > 0 such that, for n sufficiently large,

  P( sup_{‖β_n − β_{n0}‖ = ∆√(p_n/n)} (β_n − β_{n0})^T Û_n(β_n) < 0 ) ≥ 1 − ε    (2.22)

This follows from results for GEE in longitudinal models for clustered binary data in (L. Wang et al., 2012, Section 3). Before formally verifying 2.22, we require certain regularity conditions to hold. In the context of longitudinal studies, the absence of a summation in the score function corresponds to a single-subject study in which the cluster size goes to infinity and the cluster exhibits spatial autocorrelation.

2.4.1.1 Regularity Conditions

Below are regularity conditions that appear commonly in the longitudinal data analysis literature on score functions and GEEs, e.g. in L. Wang (2011), Xie et al. (2003) and Balan et al. (2005).

(A5) sup_{i,j} |X_{ij}| = O(√p_n).

(A6) The parameters β_n belong to a compact subset B ⊆ R^{p_n}, the true unknown parameter lies in the interior of B, and there exists a positive constant b1 such that 0 < b1 < g(β_{n0}).

(A7) There exist two positive constants c1 and c2 such that, for some α ∈ [0, 1], c1 ≤ λ_min(n^{-α} X^T X) ≤ λ_max(n^{-α} X^T X) ≤ c2.
(A8) Let S_n = {β_n : ‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)}. Then µ̇(Xβ_n) is uniformly bounded above and below by positive constants, and µ̈(Xβ_n) and the third derivative µ⃛(Xβ_n) are uniformly bounded by a finite positive constant M on S_n.

(A9) ‖Γ̂^{-1} − Γ^{-1}‖ = O_p(√(p_n/n)), where both Γ̂ and Γ satisfy the ρ-mixing conditions and are positive definite.

(A10) Let r = Σ^{-1/2}(Y − µ); then E(r^T r)/n < M* for all n > N.

Assumptions (A5) and (A7) are used in the study of the asymptotics of M-estimators with large p in Portnoy (1985), and impose the condition that X^T X is positive definite. Assumption (A6) ensures that the minimum eigenvalue of Σ, whose diagonal entries are functions of g, has a positive lower bound, i.e. 0 < b3 ≤ λ_min(Σ(β_{n0})). Assumption (A8) is identical to (A6) in L. Wang et al. (2012), is common in the GEE literature, and holds for the log and logit link functions. Condition (A9) was shown to hold for longitudinal data with a robust non-parametric working correlation in (L. Wang, 2011, Example 2); here we require a similar assumption, with the covariance structures satisfying (i) and (ii) of lemma 2.3.1.

Remark 1. It is also necessary to highlight that the asymptotics presented below are valid under the increasing-domain framework for the spatial random field. Under this framework, Bachoc & Furrer (2016) establish a lower bound, bounded away from zero, on the minimum eigenvalue of the covariance matrix; this is a vital consequence needed for the asymptotic results that follow. (Lin, 2008, Section 2) shows algebraically that the maximum eigenvalue of the inverse correlation matrix can be bounded above. Assumption (A10) is needed to control the sum of the variances of all pairwise residuals of a naturally unbounded random variable.

To show the consistency of the estimator obtained from solving 2.20, we note below some standard facts from matrix theory that are essential. For some C > 0:

(Note 1) For a symmetric positive definite n × n matrix A, λ_max(A) ≤ C·n.

(Note 2) Whenever A is invertible, −1/(C·n) ≥ −λ_min(A^{-1}).

The inclusion of the working correlation, and how it relates to the original score function as the dimension p_n expands, requires investigation. Based on the assumptions above we have the following lemma.

Lemma 2.4.2. Under assumptions (A5)-(A10) and p_n² n^{-1} = o(1), we have

  ‖Û_n(β_{n0}) − U_n(β_{n0})‖ = O_p(n^{α/2} √p_n)

Similar results need to be established for the terms in the decomposition of lemma 2.4.1 with respect to the working correlation, and we therefore have the following lemma.

Lemma 2.4.3. Assume conditions (A5)-(A10). If n^{-1} p_n² = o(1), then for all ∆ > 0 and b_n ∈ R^{p_n},

  sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} sup_{‖b_n‖=1} |b_n^T [∇_β U_n(β_n*) − ∇_β Û_n(β_n*)] b_n| = O_p(n^α √p_n)

Similar to the identities established in (L. Wang, 2011, Section 3), we also need the following relationships between the terms of the decomposition 2.4.1, where β_n* lies between β_n and β_{n0}.

Lemma 2.4.4. Assume conditions (A5)-(A10). If n^{-1} p_n² = o(1), then for all ∆ > 0,

  sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} |(β_n − β_{n0})^T [Ĥ_n(β_n*) − Ĥ_n(β_{n0})](β_n − β_{n0})| = o_p(p_n² n^{α−3/2})

Lemma 2.4.5. Assume conditions (A5)-(A10). If n^{-1} p_n⁴ = o(1), then for all ∆ > 0,

  sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} (β_n − β_{n0})^T [∇_β Û_n(β_n*) − Ĥ(β_n*)](β_n − β_{n0}) = o_p(p_n)

Proofs of the lemmas above can be found in the Appendix of chapter 2.
Based on all the notes and conditions considered above, the following theorem provides both the existence and the consistency of the estimator obtained from solving the system of equations 2.20.

Theorem 2.4.6. Existence and Consistency: Assume (A5)-(A10) with α ∈ (0.5, 1], and let p_n⁴ n^{-1} = o(1). Then Û_n(β_n) = 0 has a root β̂_n such that ‖β̂_n − β_{n0}‖ = O_p(√(p_n/n)).

Proof. On the set S_n = {β_n : ‖β_n − β_{n0}‖ = ∆√(p_n/n)} we would like to show 2.22, so we begin by examining the Taylor expansion,

  (β_n − β_{n0})^T U_n(β_n) = (β_n − β_{n0})^T U_n(β_{n0}) − (β_n − β_{n0})^T ∇_β U_n(β_n*)(β_n − β_{n0}) := I_{n1} + I_{n2}

where I_{n1} = (β_n − β_{n0})^T Û_n(β_{n0}) + (β_n − β_{n0})^T [U_n(β_{n0}) − Û_n(β_{n0})] := I_{n11} + I_{n12}.

From lemma 2.4.2 and n^{-1} p_n² = o(1) we have,

  |I_{n12}| ≤ ‖β_n − β_{n0}‖ ‖Û_n(β_{n0}) − U_n(β_{n0})‖ = ∆√(p_n/n) O_p(n^{α/2} √p_n) = ∆ o_p(p_n n^{(α−1)/2}) ≤ ∆ o_p(p_n)

The last inequality holds since α ∈ [0, 1]. For I_{n11}, consider

  E(‖Û_n(β_{n0})‖²) = (1/n) E[(Y − µ)^T Σ^{-1/2} Γ̂^{-1} Σ^{1/2} X X^T Σ^{1/2} Γ̂^{-1} Σ^{-1/2}(Y − µ)]
    ≤ (1/n) λ_max(X X^T) λ_max(Γ̂^{-2}) λ_max(Σ) E(r^T r)
    ≤ (1/n) Tr(X^T X) λ_max(Γ̂^{-2}) λ_max(Σ) E(r^T r)
    ≤ C p_n · M* n = C n p_n

so that |I_{n11}| ≤ ‖β_n − β_{n0}‖ ‖Û_n(β_{n0})‖ ≤ k ∆ √(p_n/n) √(n p_n) = k ∆ p_n for some k > 0; hence |I_{n1}| ≤ ∆ O_p(p_n).

Now consider I_{n2}, rewritten with the working correlation,

  I_{n2} = −(β_n − β_{n0})^T ∇_β Û_n(β_n*)(β_n − β_{n0}) − (β_n − β_{n0})^T [∇_β U_n(β_n*) − ∇_β Û_n(β_n*)](β_n − β_{n0}) := I_{n21} + I_{n22}

where,

  I_{n21} = −(β_n − β_{n0})^T ∇_β Û_n(β_n*)(β_n − β_{n0})
    = −(β_n − β_{n0})^T [Ĥ_n(β_{n0})](β_n − β_{n0}) − (β_n − β_{n0})^T [Ĥ_n(β_n*) − Ĥ_n(β_{n0})](β_n − β_{n0})
      − (β_n − β_{n0})^T [∇_β Û_n(β_n*) − Ĥ_n(β_n*)](β_n − β_{n0})
    := I_{n211} + I_{n212} + I_{n213}

Evaluating the leading term,

  I_{n211} = −(β_n − β_{n0})^T [H_n(β_{n0})](β_n − β_{n0}) ≤ −(1/√n) λ_min(Γ^{-1}) λ_min(X^T X) λ_min(Σ) ‖β_n − β_{n0}‖² ≤ −c̃ ∆² n^α p_n / n^{3/2}

Thus |I_{n211}| = o(1) for α ∈ (0, 1), applying assumptions (A7), (A6) and (Note 2). Further, from lemma 2.4.4, |I_{n212}| = o_p(p_n² n^{α−3/2}); from lemma 2.4.5, |I_{n213}| = o_p(p_n^{3/2}); and I_{n22} = o_p(p_n n^{α−1}). For a choice of α ∈ (0.5, 1) we therefore see that (β_n − β_{n0})^T U_n(β_n) on the set S_n is asymptotically dominated in probability by I_{n2}, which is negative for a large enough ∆. □

To explore the distributional behavior of the score function 2.20, we build on the conditions considered in the fixed-p case. Consider the following.

Lemma 2.4.7. If p_n⁴ n^{-1} = o(1) and assumptions (i), (ii), (A3)-(A10) hold, then for ‖a‖ = 1,

  (1/n^{α/2}) a^T Û_n(β_{n0}) := (1/n^{(1+α)/2}) a^T X^T Σ^{1/2} Γ̂^{-1} Σ^{-1/2} (Y − µ) →_d N(0, a^T I_2(β_{n0}) a)

as n → ∞, where I_2(β_{n0}) is a bounded positive definite matrix.

Sketch of proof: Since Y(s) obeys a CLT from Lin (2008), as shown in section 2.3, under the ρ-mixing conditions the linear combination of the score function is asymptotically normally distributed provided the variance of the score function has a positive definite limit. Now,

  Var((1/n^{α/2}) a^T Û_n(β_{n0})) = (1/n^{α+1}) a^T X^T Σ^{1/2} Γ̂^{-1} Γ Γ̂^{-1} Σ^{1/2} X a := a^T V̂_n a    (2.23)

We can further show that,

  (1/n^{α+1}) a^T X^T Σ^{1/2} Γ̂^{-1} Γ Γ̂^{-1} Σ^{1/2} X a ≤ (1/n^{α+1}) λ_max(X^T X) a^T Σ^{1/2} Γ̂^{-1} Γ Γ̂^{-1} Σ^{1/2} a ≤ K* (1/n) a^T Σ^{1/2} Γ̂^{-1} Γ Γ̂^{-1} Σ^{1/2} a

due to assumption (A7). Condition (A1) of section 2.3 provides a positive definite limit in the special case of the identity link function. Consequently,

  K* (1/n) a^T Σ^{1/2} Γ̂^{-1} Γ Γ̂^{-1} Σ^{1/2} a → a^T I_2(β_{n0}) a as n → ∞    (2.24)

where I_2(β_{n0}) is a positive definite matrix. Thus we are able to establish the asymptotic distribution of the score function 2.20.
As a consequence we also have,

Corollary 2.4.7.1. (1/n^{α/2}) a^T V̂_n^{-1/2} Û_n(β_{n0}) →_d N(0, 1) as n → ∞, for all a ∈ R^{p_n} with ‖a‖ = 1.

We now establish the asymptotic normality of the GEE estimator β̂_n obtained from solving 2.20. Following results similar to (L. Wang, 2011, Theorem 3.8), we can show:

Theorem 2.4.8. Asymptotic Normality: Under assumptions (i), (ii), (A1), (A5)-(A10), with α = 1, n^{-1} p_n⁴ = o(1) and ‖a‖ = 1, as n → ∞ we have,

  (1/n^{α/2}) a^T V̂_n^{-1/2} Ĥ_n(β_{n0})(β̂_n − β_{n0}) →_d N(0, 1)

Proof. Consider the following,

  (1/n^{α/2}) a^T V̂_n^{-1/2} Û_n(β_{n0})
    = (1/n^{α/2}) a^T V̂_n^{-1/2} U_n(β_{n0}) + (1/n^{α/2}) a^T V̂_n^{-1/2} [Û_n(β_{n0}) − U_n(β_{n0})]
    = (1/n^{α/2}) a^T V̂_n^{-1/2} ∇_β U_n(β_n*)(β̂_n − β_{n0}) + (1/n^{α/2}) a^T V̂_n^{-1/2} [Û_n(β_{n0}) − U_n(β_{n0})]
    = (1/n^{α/2}) a^T V̂_n^{-1/2} Ĥ_n(β_{n0})(β̂_n − β_{n0}) + (1/n^{α/2}) a^T V̂_n^{-1/2} [∇_β U_n(β_n*) − Ĥ_n(β_{n0})](β̂_n − β_{n0})
      + (1/n^{α/2}) a^T V̂_n^{-1/2} [Û_n(β_{n0}) − U_n(β_{n0})]    (e.1)

The second equality is due to the Taylor expansion of U_n(β_{n0}), where U_n(β̂_n) = 0. Since the left-hand side of equation e.1 is asymptotically normal by lemma 2.4.7, it suffices to show,

  sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} |(1/n^{α/2}) a^T V̂_n^{-1/2} [∇_β U_n(β_n) − Ĥ_n(β_{n0})](β̂_n − β_{n0})| = o_p(1)    (T.1)

  (1/n^{α/2}) a^T V̂_n^{-1/2} [Û_n(β_{n0}) − U_n(β_{n0})] = o_p(1)    (T.2)

The proofs of statements T.1 and T.2 are deferred to the Appendix of chapter 2. □

The theoretical properties of the score function under expanding dimension thus preserve consistency and asymptotic normality, though only under the condition that the number of predictors p_n grows very slowly relative to n, i.e. p_n⁴ n^{-1} = o(1). In reality, data in exploratory studies often have a large number of covariates, and the goal is to extract only those covariates significantly correlated with the response. In the context of an expanding number of variates, even if the initial number of parameters is not very large, once one includes interactions of variables with enough degrees of freedom allocated, the number of covariates tends to increase with the sample size, while only a fraction of the considered covariates may have an important effect. Thus a penalized approach, in which the s_n truly non-zero coefficients can be successfully identified under a natural sparsity assumption, may be a more effective method.

2.4.2 Penalized Score Function Asymptotics

The method described in the remainder of this section emulates established results in longitudinal data analysis by L. Wang, Zhou, & Qu (2012), but under the current setup. We proceed using the above formulation specifically with α = 1. The new set of penalized estimating equations, which has a normalizing factor similar to that used for the consistency results under expanding parameter dimension for responses in an exponential family by Strawderman & Tsiatis (1996), is given by,

  Û_n^p(β_n) = (1/n) X^T Σ^{1/2} Γ̂^{-1} Σ^{-1/2} (Y − µ(β_n)) − q_{λn}(|β_n|) sign(β_n)    (2.25)

where q_{λn} is the derivative of the penalty function described for equation 2.6, satisfying assumptions (A3) and (A4). The estimator β̂_n is obtained by solving the equations in 2.25. Once again, Û_n^p(β_n) is discontinuous at β_{nj} = 0 for any j. We therefore consider an approximate solution, Û_n^p(β̂_n) = o(ω_n) where ω_n → 0 as n → ∞. Due to the discontinuity, the MM-algorithm may once again be implemented as shown in section 2.3.2.
Let us denote the true β_{n0} = (β_{n10}^T, β_{n20}^T)^T where, without loss of generality, the first s_n coefficients, collected in β_{n10}, are the truly non-zero coefficients and the remaining p_n − s_n coefficients, collected in β_{n20}, are truly zero. Based on the results obtained in 2.4.1 we require s_n = o(n^{1/4}). We include one more assumption on the penalty function, similar to (A7) in L. Wang et al. (2012):

(A11) min_{1≤j≤s_n} |β_{n0j}|/λ_n → ∞, λ_n → 0, n λ_n²/log(n)² → ∞, s_n⁴ n^{-1} → 0, and s_n^{3/2} log(n)/(λ_n √n) = o(1) as well as s_n² log(n)/(√n λ_n) = o(1); these conditions imply p_n s_n³ log²(n)/(n² λ_n²) = o(1) and p_n s_n³ log⁴(n)/(n² λ_n⁴) = o(1) as n → ∞.

A stronger condition on the design matrix is also needed:

(A12) The X_{ij} are uniformly bounded, i = 1, ..., n, j = 1, ..., p_n.

To establish the selection consistency of the score function 2.25 we show, as in (L. Wang et al., 2012, Theorem 1), the following, where the associated penalty function is the SCAD penalty (J. Fan & Li, 2001) detailed in section 2.3.2.1.

Theorem 2.4.9. Under assumptions (i), (ii) in lemma 2.3.1 and (A1)-(A12), there exists an approximate solution β̂_n of equation 2.25 such that, as n → ∞,

  P( |Û^p_{nj}(β̂_n)| = 0, j = 1, ..., s_n ) → 1    (2.26)

  P( |Û^p_{nj}(β̂_n)| ≤ λ_n/log(n), j = s_n + 1, ..., p_n ) → 1    (2.27)

Properties 2.26 and 2.27 characterize the solution of the estimating equations 2.25. This technique is an alternative approach to the zero-crossing solution defined by Johnson et al. (2008). The oracle properties now follow readily from the results obtained for the un-penalized score function 2.20, and can be derived from lemma 2.4.7 with a minor adjustment to the normalizing factor.

Corollary 2.4.9.1. Under assumptions (i), (ii), (A4)-(A12) and α = 1 we have P(β̂_{n2} = 0) → 1 and, for all a ∈ R^{p_n} with ‖a‖ = 1, a^T V̂_n^{-1/2} Ĥ_{n1}(β_{n0})(β̂_{n1} − β_{n10}) →_d N(0, 1) as n → ∞.

Now that the oracle properties of the estimator are established, we need to deduce the order at which the number of covariates p_n may grow. It is rather realistic to suppose a fixed dimension for the underlying true model, i.e. a fixed s_n. (L. Wang et al., 2012, Remark 3) provides the orders of λ_n and p_n; similarly, using the conditions imposed in assumption (A11), we have the following remark.

Remark 2. Under the conditions assumed in (A11), the above results apply to p_n = o(n^η) with 0 < η < 2/3, restricted to λ_n = O(n^{-ν}) with 0 < ν < (2 − η)/4, which satisfies all the impositions in (A11).

Therefore, similar to the results obtained in the longitudinal case, we are indeed able to use the proposed method for p_n = o(n). However, in order to truly address statistical problems of high-dimensional spatial regression, with dimensions of exponential order relative to the sample size, a stronger theoretical justification may be required.

2.5 Discussion

This chapter establishes a variable selection method in the framework of discrete spatial observations. It provides foundational scope for high-dimensional variable selection and bridges a gap in statistical methodology that had not previously been explored. Using the GEE formulation, well studied in longitudinal data analysis, we have provided rigorous asymptotic theory for the estimators that solve the penalized quasi-likelihood score equation 2.11. We began by considering the fixed p < n scenario.
Under the somewhat strong assumption of the ρ-mixing coefficient conditions (i) and (ii), we obtained selection consistency and oracle properties of the estimator. The simulation results in the binary response case appear efficient and perform well under covariance misspecification. To justify the technique, we investigated the performance of the method under various scenarios. The synthetic example provides evidence for the method, with results validated against the real data analysis of Lin & Clayton (2005).

In general, the implementation of the proposed method via the MM-algorithm can be improved with regard to tuning parameter selection. Without an available likelihood expression, the regular criteria such as AIC and BIC cannot be used. Based on the evidence provided by J. Fan & Li (2004) and Johnson et al. (2008) for selection in semi-parametric models, we adopt a variation of the generalized cross-validation (GCV) statistic over a sequential grid of tuning parameters. More efficient methods catering to spatial dependence may be explored.

With regard to the simulations for spatial count responses, a more meticulous exploration of successfully generating multivariate Poisson models is needed. The review article by Inouye et al. (2017) provides well established methods for pairwise dependence. Certain copula-based techniques could be explored, such as the one suggested by Yahav & Shmueli (2012), who approximately obtained correlated Poisson data with a small bias, but with restrictions on the amount of correlation that can be imposed. The method used to generate the binary data likewise could not accommodate a higher level of dependence.

We then established results for expanding dimensions at a rate, p_n⁴ n^{-1} = o(1), that is much slower than the usual rate for longitudinal data, p_n² n^{-1} = o(1), established by L. Wang (2011). Both the consistency and the asymptotic normality of the estimator were established under the regularity conditions of section 2.4.1.1. Most noticeably, a normalizing factor as in Strawderman & Tsiatis (1996) was required to achieve these results; in the spatial paradigm no such results exist, unlike those that exist and were used in the fixed-p case. Focusing on the goal of selection, we then used a penalized approach similar to that of L. Wang et al. (2012) and obtained consistency and oracle properties of the estimator. In its current form the theory allows p_n = o(n), but we eventually seek modifications of the method for dimensions that increase exponentially. To fully study and investigate these methods we plan a series of simulation studies in future work. The theoretical results suggest that these methods may be successful for p_n of the order of n, but additional theoretical backing may be required for a truly high-dimensional setup. Having formulated these basic ideas, there is tremendous scope for future work on our current model setup.

APPENDIX

Fixed Dimension p - Proofs

Lemma 2.3.3

Proof. Since β̂ obtained from solving 2.11 is √n-consistent, i.e. β̂ = β0 + O_p(n^{-1/2}), for all non-zero βs, j = 1, ..., s, under assumption (A3) and lemma 2.3.2 we get,

  lim sup_{ε→0+} n^{-1/2} U^p_{wj}(β̂ ± ε e_j) = lim sup_{ε→0+} [n^{-1/2} U_{wj}(β̂ ± ε e_j) − √n q_{λn}(|β̂_j ± ε|)] → 0

as n → ∞. Under assumption (A4), for j > s, the term √n q_{λn}(ε‖e_j‖) dominates in equation 2.11, with opposing signs, for the βs that are necessarily 0.
Thus there exists a β̂ that is an approximate zero-crossing and a √n-consistent solution of U_w^p(β). □

Lemma 2.3.4

Proof. (Johnson et al., 2008, Appendix), (Feng et al., 2016, Appendix). Consider the events C_j = {β̂_j ≠ 0}, j = s+1, ..., p. We need to show that for any ε > 0 there exists an N such that for all n > N, P(C_j) < ε. Since β̂ is √n-consistent, there exists an M for sufficiently large n such that

  P(C_j) = P(β̂_j ≠ 0, |β̂_j| < M n^{-1/2}) + P(β̂_j ≠ 0, |β̂_j| ≥ M n^{-1/2}) ≤ P(β̂_j ≠ 0, |β̂_j| < M n^{-1/2}) + ε/2

Considering the Taylor expansion of the non-penalized score function together with equation 2.8, we have,

  n^{-1/2} U_w(β) = n^{-1/2} U_w(β0) + n^{-1/2} ∇_β U_w(β̃)(β̂ − β0) + o(1)
    = n^{-1/2} U_w(β0) + n^{1/2} (−I_0(β) + o_p(1))(β̂ − β0) + o(1)
    = n^{-1/2} U_w(β0) − n^{1/2} I_0(β)(β̂ − β0) + o_p(1)

Thus on the set {β̂_j ≠ 0, |β̂_j| < M n^{-1/2}}, for the penalized score function we get,

  {n^{-1/2} U_{wj}(β0) − n^{1/2} I_{0j}(β)(β̂ − β0) + o_p(1) − n^{1/2} q_{λn}(|β̂_j|) sign(β̂_j)}² = o_p(1)

where I_{0j} is the j-th row of I_0. This implies that for large n there exists an M* such that,

  P(β̂_j ≠ 0, |β̂_j| < M n^{-1/2}, √n q_{λn}(|β̂_j|) > M*) < ε/2

Assumption (A4) implies that if β̂_j ≠ 0 and |β̂_j| < M n^{-1/2}, then for large n, √n q_{λn}(|β̂_j|) > M*. Thus P(β̂_j ≠ 0, |β̂_j| < M n^{-1/2}, √n q_{λn}(|β̂_j|) > M*) = P(β̂_j ≠ 0, |β̂_j| < M n^{-1/2}) and P(C_j) < ε. □

Theorem 2.3.5 ((Johnson et al., 2008, Appendix))

Proof. Corollary 2.3.2.1 for the true non-zero βs can be rewritten as,

  n^{-1/2} U_{w1}(β0) + n^{1/2} I_{01}(β)(β̂_1 − β_{01}) − n^{1/2} q_{λn}(|β̂_1|) sign(β̂_1) = o_p(1)

where U_{w1} is the s × 1 vector associated with the signal βs. Using a Taylor expansion of n^{1/2} q_{λn}(|β̂_1|) sign(β̂_1), we have,

  n^{-1/2} U_{w1}(β0) + n^{1/2} I_{01}(β)(β̂_1 − β_{01}) − n^{1/2} q_{λn}(|β_{01}|) sign(β_{01}) − n^{1/2} q'_{λn}(|β_{01}|) sign(β_{01})(β̂_1 − β_{01}) = o_p(1)

Using lemma 2.3.2 for the first s elements of the score function, and re-arranging the terms so that B_{11} = −diag{q'_{λn}(|β_{01}|) sign(β_{01})} is the s × s diagonal matrix and b_n = −q_{λn}(|β_{01}|) sign(β_{01}) is the s × 1 vector, we get the desired result. □

Expanding Dimension p - Proofs

Lemma 2.4.2

Proof.

  Û_n(β_{n0}) − U_n(β_{n0}) = (1/√n) X^T Σ^{1/2} Γ̂^{-1} Σ^{-1/2}(Y − µ) − (1/√n) X^T Σ^{1/2} Γ^{-1} Σ^{-1/2}(Y − µ)
    = (1/√n) X^T Σ^{1/2} (Γ̂^{-1} − Γ^{-1}) Σ^{-1/2}(Y − µ)
    = (1/√n) X^T Σ^{1/2} (Γ̂^{-1} − Γ^{-1}) r

Consider,

  E‖U_n(β_{n0}) − Û_n(β_{n0})‖² ≤ (1/n) ‖Γ̂^{-1} − Γ^{-1}‖² λ_max(X^T X) λ_max(Σ) E‖r‖² ≤ K n^{α+1} · (p_n/n)

By (Note 1) for λ_max(Σ) and λ_max(X^T X), assumptions (A9) and (A10), and Chebyshev's inequality, the result follows for some K > 0. □

Lemma 2.4.3

Proof. To prove lemma 2.4.3 it is sufficient to show the following:

(L.1) sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} sup_{‖b_n‖=1} |b_n^T [H(β_n*) − Ĥ(β_n*)] b_n| = O_p(n^α √p_n)

(L.2) sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} sup_{‖b_n‖=1} |b_n^T [E(β_n*) − Ê(β_n*)] b_n| = O_p(n^{α−1/2} √p_n)

(L.3) sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} sup_{‖b_n‖=1} |b_n^T [G(β_n*) − Ĝ(β_n*)] b_n| = O_p(1)

Unlike statements L.1 and L.2, condition L.3 is obtained under n^{-1} p_n² = o(1). Let us begin with L.1:

  |b_n^T [H(β_n*) − Ĥ(β_n*)] b_n| = |b_n^T [−(1/√n) X^T Σ^{1/2} Γ^{-1} Σ^{1/2} X + (1/√n) X^T Σ^{1/2} Γ̂^{-1} Σ^{1/2} X] b_n|
    = (1/√n) |b_n^T X^T Σ^{1/2} [Γ^{-1} − Γ̂^{-1}] Σ^{1/2} X b_n|
    ≤ B (1/√n) ‖b_n‖² λ_max(Σ) λ_max(X^T X) ‖Γ^{-1} − Γ̂^{-1}‖
    ≤ B (1/√n) n · n^α · √(p_n/n)

The third inequality follows from the Cauchy-Schwarz inequality and bounds via the largest eigenvalues; the last follows from assumptions (A7), (A9) and (Note 1), for some constant B > 0.
Now consider L.2:

  |b_n^T [E(β_n*) − Ê(β_n*)] b_n| = (1/(2√n)) |b_n^T X^T Σ^{1/2} [Γ^{-1} − Γ̂^{-1}] Σ^{-3/2} C D(β_n*) X b_n|
    ≤ K (1/√n) λ_max(D) λ_max(X^T X) ‖b_n‖² λ_max(Σ^{-1}) ‖Γ^{-1} − Γ̂^{-1}‖
    ≤ K n^α √(p_n/n)

Here λ_max(Σ^{-1}) = 1/λ_min(Σ) ≤ k by assumption (A6), for some k ≥ 0. The inequalities follow as in the proof of L.1, and assumption (A8) provides the finite upper bound for λ_max(D).

To prove L.3, consider

  E‖(1/(2√n)) b_n^T X^T Σ^{1/2} D [J(β_n*) − Ĵ(β_n*)] X b_n‖²
    ≤ (1/(4n)) K ‖X^T X‖² λ_max(Σ^{-1}) [λ_max(Γ^{-1} − Γ̂^{-1})]² E(‖r‖²)
    ≤ K* (1/(4n)) (p_n/n) E(‖r‖²)
    ≤ K p_n²/n

Note that (A9) implies λ_max(Γ^{-1} − Γ̂^{-1}) = O_p(√(p_n/n)), since the covariance matrices are symmetric ((L. Wang, 2011, Remark 2, pg 396)). Further, assumption (A10) is needed for the last inequality and (A6) for the penultimate one. From the assumption p_n² n^{-1} = o(1) we obtain L.3, and lemma 2.4.3 is proved. □

Lemma 2.4.4

Proof. Consider,

  |I_{n212}| = |(β_n − β_{n0})^T [Ĥ_n(β_n*) − Ĥ_n(β_{n0})](β_n − β_{n0})|
    = (1/√n) |(β_n − β_{n0})^T [X^T Σ^{1/2}(β_n*) Γ̂^{-1} Σ^{1/2}(β_n*) X − X^T Σ^{1/2}(β_n*) Γ̂^{-1} Σ^{1/2}(β_{n0}) X
      + X^T Σ^{1/2}(β_n*) Γ̂^{-1} Σ^{1/2}(β_{n0}) X − X^T Σ^{1/2}(β_{n0}) Γ̂^{-1} Σ^{1/2}(β_{n0}) X](β_n − β_{n0})|
    ≤ (1/√n) |(β_n − β_{n0})^T X^T [Σ^{1/2}(β_n*) Γ̂^{-1} Σ^{1/2}(β_n*) − Σ^{1/2}(β_n*) Γ̂^{-1} Σ^{1/2}(β_{n0})] X (β_n − β_{n0})|
      + (1/√n) |(β_n − β_{n0})^T X^T [Σ^{1/2}(β_n*) Γ̂^{-1} Σ^{1/2}(β_{n0}) − Σ^{1/2}(β_{n0}) Γ̂^{-1} Σ^{1/2}(β_{n0})] X (β_n − β_{n0})|
    := I_{n2121} + I_{n2122}

We evaluate the two portions separately, as in (L. Wang, 2011, Supplementary Material, Lemma 3.5):

  I_{n2121} = (1/√n) |(β_n − β_{n0})^T X^T Σ^{1/2}(β_n*) Γ̂^{-1} [Σ^{1/2}(β_n*) − Σ^{1/2}(β_{n0})] X (β_n − β_{n0})|
    ≤ (1/√n) ‖(β_n − β_{n0})^T X^T Σ^{1/2} Γ̂^{-1/2}‖ ‖Γ̂^{-1/2} [Σ^{1/2}(β_n*) − Σ^{1/2}(β_{n0})] X (β_n − β_{n0})‖

by the Cauchy-Schwarz inequality. Now consider the two factors:

  ‖(β_n − β_{n0})^T X^T Σ^{1/2} Γ̂^{-1/2}‖² = (β_n − β_{n0})^T X^T Σ^{1/2} Γ̂^{-1} Σ^{1/2}(β_n*) X (β_n − β_{n0}) ≤ λ_max(Γ̂^{-1}) λ_max(Σ(β_n*)) ‖X(β_n − β_{n0})‖²

  ‖Γ̂^{-1/2} [Σ^{1/2}(β_n*) − Σ^{1/2}(β_{n0})] X (β_n − β_{n0})‖ ≤ λ_max^{1/2}(Γ̂^{-1}) λ_max(Σ^{1/2}(β_n*) − Σ^{1/2}(β_{n0})) ‖X(β_n − β_{n0})‖

Hence,

  |I_{n2121}| ≤ (1/√n) λ_max(Γ̂^{-1}) λ_max^{1/2}(Σ(β_n*)) max_{ij} |Σ_{ij}(β_n*) − Σ_{ij}(β_{n0})| λ_max(X^T X) ‖β_n − β_{n0}‖²
    ≤ (1/√n) max_{ij} |X_{ij}| ‖β_n − β_{n0}‖ K̃ n^α (p_n/n) ∆²
    ≤ √p_n √(p_n/n) (p_n/n) n^α ∆² K = O_p(p_n² n^{α−3/2})

Using a Taylor expansion, we see that max_{ij} |Σ_{ij}(β_n*) − Σ_{ij}(β_{n0})| ≤ max_{ij} |X_{ij}| ‖β_n − β_{n0}‖. (Note 1) and assumption (A7) are used above for some K̃ > 0, and λ_max(Γ̂^{-1}) = 1/λ_min(Γ̂), which is bounded by remark 1 of section 2.4.1.1. Similarly, I_{n2122} = O_p(p_n² n^{α−3/2}). □

Lemma 2.4.5

Proof. Consider I_{n213} = −(β_n − β_{n0})^T [∇_β Û_n(β_n*) − Ĥ(β_n*)](β_n − β_{n0}). By the decomposition of lemma 2.4.1,

  ‖∇_β Û_n(β_n*) − Ĥ(β_n*)‖ ≤ |Ê(β_n*)| + |Ĝ(β_n*)|

In the same spirit as (L. Wang, 2011, Lemmas 3.4, 3.5), we can show that if p_n²/√n = o(1), i.e. p_n⁴ n^{-1} = o(1), then for b_n ∈ R^{p_n},

(a) sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} sup_{‖b_n‖=1} |b_n^T Ĝ(β_n*) b_n| = O_p(√(n p_n))

(b) sup_{‖β_n − β_{n0}‖ ≤ ∆√(p_n/n)} sup_{‖b_n‖=1} |b_n^T Ê(β_n*) b_n| = O_p(√(n p_n))

Let us begin with the proof of (a). It is important to note that the results for Γ and Γ̂ are identical and interchangeable in what follows.
Therefore, without loss of generality, consider the telescopic sum obtained by adding and subtracting intermediate terms in
\[ E(\beta_n^*) = \frac{1}{2\sqrt{n}} X^T \Sigma^{1/2}(\beta_n^*) \Gamma^{-1} \Sigma^{-3/2}(\beta_n^*) C(\beta_n^*) D(\beta_n^*) X, \]
which yields
\[ E(\beta_n^*) = \frac{1}{2\sqrt{n}} X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} \Sigma^{-3/2}(\beta_{n0}) C(\beta_{n0}) D(\beta_{n0}) X + \frac{1}{2\sqrt{n}} X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} \Sigma^{-3/2}(\beta_{n0}) C(\beta_{n0}) [D(\beta_n^*) - D(\beta_{n0})] X + \frac{1}{2\sqrt{n}} X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} \Sigma^{-3/2}(\beta_{n0}) [C(\beta_n^*) - C(\beta_{n0})] D(\beta_n^*) X + \frac{1}{2\sqrt{n}} X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} [\Sigma^{-3/2}(\beta_n^*) - \Sigma^{-3/2}(\beta_{n0})] C(\beta_n^*) D(\beta_n^*) X + \frac{1}{2\sqrt{n}} X^T [\Sigma^{1/2}(\beta_n^*) - \Sigma^{1/2}(\beta_{n0})] \Gamma^{-1} \Sigma^{-3/2}(\beta_n^*) C(\beta_n^*) D(\beta_n^*) X, \]
i.e. $E(\beta_n^*) = E_1(\beta_{n0}) + \sum_{l=2}^{5} E_l(\beta_n^*)$.  (E*)

Now let us examine
\[ E\|E_1(\beta_{n0})\|^2 = \frac{1}{4n} \sum_{j_1=1}^n \sum_{j_2=1}^n D_{j_1} D_{j_2}\, E(r_{j_1} r_{j_2}) \cdot \mathrm{Tr}\{X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} \Sigma^{-1} e_{j_1} e_{j_1}^T X X^T e_{j_2} e_{j_2}^T \Sigma^{-1} \Gamma^{-1} \Sigma^{1/2}(\beta_{n0}) X\}, \]
where $D_j$ is the $j$-th diagonal entry and $e_j$ denotes the unit vector whose $j$-th entry is 1 and all others 0. By assumption (A10), the right side satisfies
\[ E\|E_1(\beta_{n0})\|^2 \leq C \sum_{j_1=1}^n \sum_{j_2=1}^n \frac{1}{4n} |e_{j_1}^T X X^T e_{j_2} e_{j_2}^T \Sigma^{-1} \Gamma^{-1} \Sigma^{1/2} X X^T \Sigma^{1/2} \Gamma^{-1} \Sigma^{-1} e_{j_1}| \leq C \sum_{j_1=1}^n \sum_{j_2=1}^n \frac{1}{4n} \|e_{j_1}^T X\|\, \|e_{j_2}^T \Sigma^{-1} \Gamma^{-1} \Sigma^{1/2} X\|\, \|X^T e_{j_2}\|\, \|X^T \Sigma^{1/2} \Gamma^{-1} \Sigma^{-1} e_{j_1}\|. \]
Since $\|e_{j_2}^T \Sigma^{-1} \Gamma^{-1} \Sigma^{1/2} X\| \leq C^*\, \mathrm{Tr}(XX^T)^{1/2} \lambda^{1/2}_{\max}(\Gamma^{-1}) \lambda^{1/2}_{\max}(\Sigma^{-1/2}) \leq \tilde{C} \sqrt{p_n}$ due to Remark 1 and assumption (A6), we get
\[ E\|E_1(\beta_{n0})\|^2 \leq \tilde{C} p_n \frac{1}{n} \sum_{j_1=1}^n \sum_{j_2=1}^n \frac{1}{n} \|e_{j_1}^T X\|\, \|X^T e_{j_2}\| \leq \tilde{C} p_n \max_{ij}|X_{ij}|^2 = \tilde{C} p_n^2/n \]
due to assumption (A5). Hence $\sup_{\|b_n\|=1} |b_n^T E_1(\beta_{n0}) b_n| = O_p(1)$ if $p_n^2/n = o(1)$ as $n \to \infty$.

Similarly, for
\[ E_2(\beta_n^*) = \frac{1}{2\sqrt{n}} X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} \Sigma^{-3/2}(\beta_{n0}) C(\beta_{n0}) [D(\beta_n^*) - D(\beta_{n0})] X \]
we consider
\[ E\|E_2(\beta_n^*)\|^2 = \frac{1}{4n} \sum_{j_1=1}^n \sum_{j_2=1}^n [D(\beta_n^*) - D(\beta_{n0})]_{j_1} [D(\beta_n^*) - D(\beta_{n0})]_{j_2}\, E(r_{j_1} r_{j_2}) \cdot \mathrm{Tr}[X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} \Sigma^{-1} e_{j_1} e_{j_1}^T X X^T e_{j_2} e_{j_2}^T \Sigma^{-1} \Gamma^{-1} \Sigma^{1/2}(\beta_{n0}) X] \leq K n p_n^2. \]
By assumption (A8) the derivatives of the mean function are uniformly bounded above. Using assumption (A10), with all other trace bounds identical to those obtained for $E_1(\beta_{n0})$, we have $\sup_{\|b_n\|=1} |b_n^T E_2(\beta_n^*) b_n| = O_p(\sqrt{n}\, p_n)$.

Looking further at the terms of equation (E*),
\[ E_3(\beta_n^*) = \frac{1}{2\sqrt{n}} X^T \Sigma^{1/2}(\beta_{n0}) \Gamma^{-1} \Sigma^{-3/2}(\beta_{n0}) [C(\beta_n^*) - C(\beta_{n0})] D(\beta_n^*) X, \]
where $[C(\beta_n^*) - C(\beta_{n0})] = \mathrm{diag}(\mu(\beta_{n0}) - \mu(\beta_n^*)) := P_n$. Then, considering
\[ E(\|E_3(\beta_n^*)\|^2) = \frac{1}{4n} \sum_{j_1=1}^n \sum_{j_2=1}^n D_{j_1}(\beta_n^*) D_{j_2}(\beta_n^*)\, \mathrm{Tr}\{X^T P_n \Sigma^{-3/2} \Gamma^{-1} \Sigma^{1/2} e_{j_1} e_{j_1}^T X X^T e_{j_2} e_{j_2}^T \Gamma^{-1} \Sigma^{-3/2} P_n X\}, \]
a Taylor expansion shows that, for some constant $\tilde{k} > 0$,
\[ \mathrm{Tr}(P_n^2) = \|\mu(\beta_{n0}) - \mu(\beta_n^*)\|^2 \leq \tilde{k}\, \|X\|^2\, \|\beta_n^* - \beta_{n0}\|^2. \]
Thus if $p_n^2 n^{-1} = o(1)$ then
\[ E(\|E_3(\beta_n^*)\|^2) \leq \frac{\tilde{K}}{n} \max_{ij}|X_{ij}|^2\, \lambda_{\max}(\Gamma^{-2})\, \lambda_{\max}(\Sigma^{-2})\, \mathrm{Tr}(X^T X)\, \tilde{k}\, \|X\|^2\, \|\beta_n^* - \beta_{n0}\|^2 \leq \tilde{K} \frac{p_n^4}{n^2} \Delta^2 = o(1). \]
Lastly, consider
\[ |b_n^T E_5(\beta_n^*) b_n| = \Big| \frac{1}{2\sqrt{n}} \sum_{j=1}^n D_j r_j\, b_n^T X [\Sigma^{1/2}(\beta_n^*) - \Sigma^{1/2}(\beta_{n0})] \Sigma^{-1} \Gamma^{-1} e_j e_j^T X b_n \Big| \leq C^* \frac{1}{\sqrt{n}} \sum_{j=1}^n |b_n^T X [\Sigma^{1/2}(\beta_n^*) - \Sigma^{1/2}(\beta_{n0})] \Sigma^{-1} \Gamma^{-1} e_j| \cdot |e_j^T X b_n| \leq C^* \frac{1}{\sqrt{n}}\, n\, \|X b_n\|^2\, \lambda_{\max}(\Gamma^{-1})\, \lambda_{\max}(\Sigma^{-1}) \max_{j} |\Sigma^{1/2}(\beta_n^*) - \Sigma^{1/2}(\beta_{n0})|_j \leq C^* \sqrt{n} \max_{ij}|X_{ij}|\, \|\beta_n^* - \beta_{n0}\|\, \mathrm{Tr}(X^T X)\, \frac{1}{n}\, \sqrt{p_n}\, \sqrt{p_n} = C^* p_n^2. \]
Similar to $E_5$ we can show that $E_4 = O(p_n^2)$.
This completes the proof of (a). The proof of (b) follows from similar arguments, using telescopic sums and investigating the properties of each of their terms as above; it can therefore be verified that (b) is true. Thus $|(\beta_n - \beta_{n0})^T [\nabla_\beta \hat{U}(\beta_n^*) - \hat{H}(\beta_n^*)](\beta_n - \beta_{n0})| = \Delta^2 o_p(p_n)$ under the assumption $n^{-1/2} p_n^2 = o(1)$ as $n \to \infty$.

Theorem 2.4.8 (L. Wang, 2011, proof of Theorem 3.8)

Proof. Here we provide the proof of statements T.1 and T.2. In order to utilize Lemmas 2.4.3, 2.4.4 and 2.4.5, we re-express T.1 as follows:
\[ \sup_{\|\beta_n - \beta_{n0}\| \leq \Delta\sqrt{p_n/n}} \frac{1}{n^{\alpha/2}} |a^T \hat{V}_n^{-1/2} [\nabla_\beta U_n(\beta_n) - \hat{H}_n(\beta_{n0})](\hat{\beta}_n - \beta_{n0})| \leq \sup \frac{1}{n^{\alpha/2}} |a^T \hat{V}_n^{-1/2} [\nabla_\beta U_n(\beta_n) - \nabla_\beta \hat{U}_n(\beta_n)](\hat{\beta}_n - \beta_{n0})| + \sup \frac{1}{n^{\alpha/2}} |a^T \hat{V}_n^{-1/2} [\nabla_\beta \hat{U}_n(\beta_n) - \hat{H}_n(\beta_n)](\hat{\beta}_n - \beta_{n0})| + \sup \frac{1}{n^{\alpha/2}} |a^T \hat{V}_n^{-1/2} [\hat{H}_n(\beta_n) - \hat{H}_n(\beta_{n0})](\hat{\beta}_n - \beta_{n0})| := i_{n1} + i_{n2} + i_{n3}, \]
where each supremum is over $\|\beta_n - \beta_{n0}\| \leq \Delta\sqrt{p_n/n}$. Consider
\[ i_{n1} \leq \sup \frac{1}{n^{\alpha/2}} \big(a^T \hat{V}_n^{-1/2} (\nabla_\beta U_n(\beta_n) - \nabla_\beta \hat{U}_n(\beta_n))^2 \hat{V}_n^{-1/2} a\big)^{1/2} \|\hat{\beta}_n - \beta_{n0}\| \leq \sup \frac{1}{n^{\alpha/2}} \max\big(|\lambda_{\min}(\nabla_\beta U_n(\beta_n) - \nabla_\beta \hat{U}_n(\beta_n))|,\ |\lambda_{\max}(\nabla_\beta U_n(\beta_n) - \nabla_\beta \hat{U}_n(\beta_n))|\big)\, \lambda_{\max}(\hat{V}_n^{-1/2})\, O_p(\sqrt{p_n/n}). \]
Since 2.23 has a positive definite limit, as in assumption (A3) of section 2.3, $\lambda_{\max}(\hat{V}_n^{-1/2})$ has an upper bound. According to Lemma 2.4.3,
\[ i_{n1} \leq \sup \frac{1}{n^{\alpha/2}}\, O_p(n^{\alpha/2})\, O_p(\sqrt{p_n/n}), \]
so for $\alpha \in (0, 1)$ and $p_n^2 n^{-1} = o(1)$ we get $i_{n1} = o(1)$. Now consider the term $i_{n3}$:
\[ i_{n3} \leq \sup \frac{1}{n^{\alpha/2}} \big(a^T \hat{V}_n^{-1/2} [\hat{H}_n(\beta_n) - \hat{H}_n(\beta_{n0})]^2 \hat{V}_n^{-1/2} a\big)^{1/2} \|\hat{\beta}_n - \beta_{n0}\| \leq \sup \frac{1}{n^{\alpha}} \max\big(|\lambda_{\min}(\hat{H}_n(\beta_n) - \hat{H}_n(\beta_{n0}))|,\ |\lambda_{\max}(\hat{H}_n(\beta_n) - \hat{H}_n(\beta_{n0}))|\big)\, \lambda_{\max}(\hat{V}_n^{-1/2})\, O_p(\sqrt{p_n/n}) \leq \sup o_p(p_n n^{\alpha - 1/2})\, \frac{1}{n^{\alpha/2}}\, O_p(\sqrt{p_n/n}). \]
Here, for $p_n^4 n^{-1} = o(1)$,
\[ \alpha = 0.5 \implies i_{n3} = o(p_n^{3/2} n^{-3/4}) = o(1), \qquad \alpha = 1 \implies i_{n3} = o(p_n^{3/2} n^{-1/2}) = o(1). \]
Let us finally consider
\[ i_{n2} \leq \sup \frac{1}{n^{\alpha/2}} \big(a^T \hat{V}_n^{-1/2} [\nabla_\beta \hat{U}_n(\beta_n) - \hat{H}_n(\beta_n)]^2 \hat{V}_n^{-1/2} a\big)^{1/2} \|\hat{\beta}_n - \beta_{n0}\| \leq \sup \frac{1}{n^{\alpha}} \|\nabla_\beta \hat{U}_n(\beta_n) - \hat{H}_n(\beta_n)\|\, \lambda_{\max}(\hat{V}_n^{-1/2})\, O_p(\sqrt{p_n/n}) \leq \frac{1}{n^{\alpha/2}} \sqrt{n p_n}\, O_p(\sqrt{p_n/n}). \]
Therefore only with $\alpha = 1$ do we have $i_{n2} = o(1)$. To complete the proof we look at T.2: for $\|a\| = 1$, $a \in \mathbb{R}^{p_n}$, we have
\[ \Big\| \frac{1}{n^{\alpha/2}} a^T \hat{V}_n^{-1/2} [\hat{U}_n(\beta_{n0}) - U_n(\beta_{n0})] \Big\|^2 = \frac{1}{n^{\alpha}} a^T \hat{V}_n^{-1/2} [\hat{U}_n(\beta_{n0}) - U_n(\beta_{n0})][\hat{U}_n(\beta_{n0}) - U_n(\beta_{n0})]^T \hat{V}_n^{-1/2} a \leq \frac{1}{n^{\alpha}} \lambda_{\max}(\hat{V}_n^{-1})\, \|\hat{U}_n(\beta_{n0}) - U_n(\beta_{n0})\|^2 \leq K p_n^2/n^{\alpha}. \]
Based on relation 2.24, for a bounded positive definite matrix $I_2(\beta_{n0})$ we can find an upper bound for $\lambda_{\max}(V_n^{-1})$ independent of $n$. Therefore the T.2 term is $o(1)$ for any $\alpha \in (0.5, 1)$ under the assumption $p_n^4 n^{-1} = o(1)$. Thus we obtain the asymptotic normality of the estimator $\hat{\beta}_n$.

Theorem 2.4.9 (L. Wang et al., 2012, Properties 1, 2)

Proof. To prove equation 2.26 we already have $\hat{U}_{nj}(\hat{\beta}_n) = 0$ from 2.4.1. Since the penalty function considered now is the SCAD penalty shown in 2.18, it is sufficient to show that $P(|\hat{\beta}_{nj}| \geq a\lambda_n,\ j = 1, \ldots, s_n) \to 1$, since this implies that the penalty portion of equation 2.25 also tends to 0 in probability. It is simple to see that
\[ \min_{1 \leq j \leq s_n} |\hat{\beta}_{nj}| \geq \min_{1 \leq j \leq s_n} |\beta_{n0j}| - \max_{1 \leq j \leq s_n} |\beta_{n0j} - \hat{\beta}_{nj}| \geq \min_{1 \leq j \leq s_n} |\beta_{n0j}| - \|\beta_{n10} - \hat{\beta}_{n10}\|. \]
Thus we have
\[ P\big( \min_{1 \leq j \leq s_n} |\hat{\beta}_{nj}| > a\lambda_n \big) \geq P\big( \min_{1 \leq j \leq s_n} |\beta_{n0j}| - \|\beta_{n10} - \hat{\beta}_{n10}\| \geq a\lambda_n \big) = P\big( \|\beta_{n10} - \hat{\beta}_{n10}\| \leq \min_{1 \leq j \leq s_n} |\beta_{n0j}| - a\lambda_n \big) \to 1, \]
since $\|\beta_{n10} - \hat{\beta}_{n10}\| = o_p(\sqrt{s_n/n})$ and (A11) imposes $\min_{1 \leq j \leq s_n} |\beta_{n0j}|/\lambda_n \to \infty$ as $n \to \infty$. So we have 2.26.
In order to prove 2.27, we note that for $\hat{\beta}_{nk}$, $k = s_n+1, \ldots, p_n$, the derivative of the penalty in 2.18 is indeed 0. Thus we proceed by considering only the first term of equation 2.25 and denote it by
\[ \hat{W}_n(\hat{\beta}_n) := \frac{1}{n} X^T \Sigma^{1/2} \hat{\Gamma}^{-1} \Sigma^{-1/2} (Y - \mu(\hat{\beta}_n)) \qquad (2.28) \]
and $\hat{W}_{nk}(\hat{\beta}_n) = e_k^T \hat{W}_n(\hat{\beta}_n)$, where $e_k$ has the $k$-th element equal to 1. Therefore 2.27 is equivalent to
\[ P\Big( \max_{s_n+1 \leq k \leq p_n} |\hat{W}_{nk}(\hat{\beta}_n)| \leq \frac{\lambda_n}{\log(n)} \Big) \to 1, \]
which can be ensured if
\[ P\Big( \max_{s_n+1 \leq k \leq p_n} |\hat{W}_{nk}(\hat{\beta}_n) - W_{nk}(\hat{\beta}_n)| > \frac{\lambda_n}{2\log(n)} \Big) \to 0, \qquad (2.29) \]
\[ P\Big( \max_{s_n+1 \leq k \leq p_n} |W_{nk}(\hat{\beta}_n)| > \frac{\lambda_n}{2\log(n)} \Big) \to 0. \qquad (2.30) \]
To check 2.29, let us consider
\[ P\Big( \max_{s_n+1 \leq k \leq p_n} n^{-1} |e_k^T X^T \Sigma^{1/2} [\hat{\Gamma}^{-1} - \Gamma^{-1}] \Sigma^{-1/2}(Y - \mu(\hat{\beta}_n))| > \frac{\lambda_n}{2\log(n)} \Big) \leq P\Big( n^{-1} \|e_k^T X^T \Sigma^{1/2}\|\, \|\hat{\Gamma}^{-1} - \Gamma^{-1}\|\, \|\Sigma^{-1/2}(Y - \mu(\hat{\beta}_n))\| > \frac{\lambda_n}{2\log(n)} \Big) \leq P\Big( n^{-1/2} \|\varepsilon\| > \frac{\lambda_n \sqrt{n}}{2\sqrt{s_n}\log(n)} \Big) \leq C\, \frac{\sqrt{s_n}\log(n)}{\lambda_n}\, n^{-1/2} E(\|\varepsilon\|) = O\Big( \frac{\sqrt{s_n}\log(n)}{\lambda_n \sqrt{n}} \Big) = o(1), \]
where $\varepsilon = \Sigma^{-1/2}(Y - \mu(\hat{\beta}_n))$ is an $n \times 1$ vector. It is easy to verify that $\max_{s_n+1 \leq k \leq p_n} \|e_k^T X^T \Sigma^{1/2}\| = O(\sqrt{n})$, and the third inequality follows from assumption (A9) and Markov's inequality. The last inequality comes from (A10) and Jensen's inequality in the following way: there exists $M > 0$ such that
\[ \frac{1}{\sqrt{n}} E\|\varepsilon\| \leq \Big( \frac{1}{n} E\|\varepsilon\|^2 \Big)^{1/2} \leq M. \]
Thus, using assumption (A11), we have 2.29. To check 2.30, let us begin by considering the Taylor expansion
\[ W_{nk}(\hat{\beta}_n) = W_{nk}(\beta_{n0}) + \nabla_{\beta,k} W_{nk}(\beta_n^*)(\hat{\beta}_n - \beta_{n0}) + (\hat{\beta}_n - \beta_{n0})^T \frac{\partial^2 W_{nk}(\beta_n^*)}{\partial\beta_n\, \partial\beta_n^T} (\hat{\beta}_n - \beta_{n0}), \qquad (2.31) \]
where $\beta_n^*$ lies between $\hat{\beta}_n$ and $\beta_{n0}$. Let $D_k(\beta_n) := \partial^2 W_{nk}(\beta_n)/\partial\beta_n \partial\beta_n^T$ denote the $p_n \times p_n$ matrix of second derivatives and let $D_{k1}(\beta_n)$ denote its top-left $s_n \times s_n$ submatrix. Based on the fact that $\hat{\beta}_n - \beta_{n0} = ((\hat{\beta}_{n1} - \beta_{n10})^T, 0^T)^T$, we can rewrite 2.31 as
\[ W_{nk}(\hat{\beta}_n) = W_{nk}(\beta_{n0}) + \nabla_{\beta,k1} W_{nk}(\beta_n^*)(\hat{\beta}_{n1} - \beta_{n10}) + (\hat{\beta}_{n1} - \beta_{n10})^T D_{k1}(\beta_n^*)(\hat{\beta}_{n1} - \beta_{n10}), \qquad (2.32) \]
and similar to the earlier argument we can express
\[ P\Big( \max_{s_n+1 \leq k \leq p_n} |W_{nk}(\hat{\beta}_n)| > \frac{\lambda_n}{2\log(n)} \Big) \leq P\Big( \max_k |W_{nk}(\beta_{n0})| > \frac{\lambda_n}{6\log(n)} \Big) + P\Big( \max_k |\nabla_{\beta,k1}(\beta_n^*)(\hat{\beta}_{n1} - \beta_{n10})| > \frac{\lambda_n}{6\log(n)} \Big) + P\Big( \max_k |(\hat{\beta}_{n1} - \beta_{n10})^T D_{k1}(\beta_n^*)(\hat{\beta}_{n1} - \beta_{n10})| > \frac{\lambda_n}{6\log(n)} \Big) := i_1 + i_2 + i_3. \]
We are required to show that each of $i_1, i_2, i_3 = o(1)$. From Corollary 2.4.7.1 and equation 2.24, which gives a positive definite limit for the variance of the score function, for sufficiently large $n$ we have
\[ P\Big( \max_{s_n+1 \leq k \leq p_n} \Big| \frac{1}{\sqrt{n}} e_k^T \hat{V}_{nk}^{-1/2} X^T \Sigma^{1/2} \hat{\Gamma}^{-1} (Y - \mu(\beta_{n0})) \Big| > \frac{\lambda_n \sqrt{n}}{6\log(n)} \Big) \leq \exp\Big( -\frac{1}{2} K \frac{n\lambda_n^2}{\log(n)^2} \Big). \]
The inequality is obtained from a normal tail bound (e.g. $P(|X| > t) \leq e^{-t^2/2}$ for $X \sim N(0,1)$). Due to 2.24 we can find a positive lower bound on the minimum eigenvalue of $\hat{V}_{nk}$. Therefore, if $n\lambda_n^2/\log(n)^2 \to \infty$ as $n \to \infty$, which follows from assumption (A11), we get $i_1 = o(1)$.

Let us consider
\[ i_2 = P\Big( \max_{s_n+1 \leq k \leq p_n} |\nabla_{\beta,k1}(\beta_n^*)(\hat{\beta}_{n1} - \beta_{n10})| > \frac{\lambda_n}{6\log(n)} \Big) = P\Big( \max_k |\nabla_{\beta,k1}(\beta_n^*)(\hat{\beta}_{n1} - \beta_{n10})| > \frac{\lambda_n}{6\log(n)},\ \|\hat{\beta}_{n1} - \beta_{n10}\| \leq \Delta\sqrt{s_n/n} \Big) + P\big( \|\hat{\beta}_{n1} - \beta_{n10}\| > \Delta\sqrt{s_n/n} \big). \]
From the consistency of $\hat{\beta}_{n1}$ (the consistency theorem of the previous section) and the decomposition of 2.4.1 with the new normalizing factor we have
\[ i_2 \leq P\Big( \max_k \|H_{nk1}(\beta_n^*)\| > \frac{\lambda_n \sqrt{n}}{12\sqrt{s_n}\log(n)} \Big) + P\Big( \max_k \|\nabla_{\beta,k1}(\beta_n^*) - H_{nk1}(\beta_n^*)\| > \frac{\lambda_n \sqrt{n}}{12\sqrt{s_n}\log(n)} \Big) + o(1) := i_{21} + i_{22} + o(1), \]
where $H_{nk1} = (H_{nk1}, \ldots, H_{nks_n})^T$ denotes the subvector of $H_{nk}$ with the first $s_n$ elements, with a normalizing factor of $1/\sqrt{n}$ instead of $1/n$ as in the original lemma. From 2.4.5 the norm in $i_{22}$ is $o(s_n)$, and therefore with (A11) we get $i_{22} = o(1)$.
We can re-express $i_{21}$ as follows:
\[ i_{21} = P\Big( \max_{s_n+1 \leq k \leq p_n} \|H_{nk1}(\beta_n^*) - H_{nk1}(\beta_{n10})\| > \frac{\lambda_n \sqrt{n}}{12\sqrt{s_n}\log(n)} \Big) + P\Big( \max_{s_n+1 \leq k \leq p_n} \|H_{nk1}(\beta_{n10})\| > \frac{\lambda_n \sqrt{n}}{12\sqrt{s_n}\log(n)} \Big). \]
The first term is $o(1)$ if $\lambda_n \sqrt{n}/(s_n^{3/2}\log(n)) \to \infty$ as $n \to \infty$, from Lemma 2.4.4 after modifying the normalizing factor and using assumption (A11). The second term is handled as in the supplementary material of L. Wang (2011). To begin with,
\[ P\Big( \max_k \|H_{nk1}(\beta_{n10})\| > \frac{\lambda_n \sqrt{n}}{12\sqrt{s_n}\log(n)} \Big) \leq P\Big( \max_k \|H_{nk1}(\beta_{n10})\|^2 > C \frac{n\lambda_n^2}{s_n \log^2(n)} \Big) \leq P\Big( \max_k \big| \|H_{nk1}(\beta_{n10})\|^2 - E\|H_{nk1}(\beta_{n10})\|^2 \big| + \max_k E\|H_{nk1}(\beta_{n10})\|^2 > C \frac{n\lambda_n^2}{s_n \log^2(n)} \Big). \]
It can be shown that $|H_{nkj}(\beta_{n0})|$ is uniformly bounded using assumptions (A12), (A9) and (A8). Further, $\max_k E\|H_{nk1}(\beta_{n0})\|^2 = \max_k E(\sum_{j=1}^{s_n} H^2_{nkj}(\beta_{n0})) \leq C s_n$, so from assumption (A11) the corresponding term is $o(1)$. Consider the following for sufficiently large $n$:
\[ P\Big( \max_k \big| \|H_{nk1}(\beta_{n10})\|^2 - E\|H_{nk1}(\beta_{n10})\|^2 \big| \geq \frac{C n\lambda_n^2}{2 s_n \log^2(n)} \Big) \leq \sum_{k=s_n+1}^{p_n} P\Big( \big| \|H_{nk1}(\beta_{n10})\|^2 - E\|H_{nk1}(\beta_{n10})\|^2 \big| \geq \frac{C n\lambda_n^2}{2 s_n \log^2(n)} \Big) \leq C \sum_{k=s_n+1}^{p_n} \frac{E\big( \sum_{j=1}^{s_n} [H^2_{nkj}(\beta_{n0}) - E(H^2_{nkj}(\beta_{n0}))] \big)^2 s_n^2 \log^4(n)}{n^2 \lambda_n^4} = O\big( p_n s_n^3 \log^4(n)/(n^2\lambda_n^4) \big). \]
The last inequality follows from Markov's inequality and assumption (A11). Therefore $i_2 = o(1)$. To complete the proof we now focus on the last term of 2.32 and need to show $i_3 = o(1)$. Similar to the inequalities obtained above,
\[ i_3 \leq P\Big( \max_k |(\hat{\beta}_{n1} - \beta_{n10})^T D_{k1}(\beta_n^*)(\hat{\beta}_{n1} - \beta_{n10})| > \frac{\lambda_n}{6\log(n)},\ \|\hat{\beta}_{n1} - \beta_{n10}\| \leq \Delta\sqrt{s_n/n} \Big) + P\big( \|\hat{\beta}_{n1} - \beta_{n10}\| > \Delta\sqrt{s_n/n} \big) \leq \sum_{k=s_n+1}^{p_n} P\Big( \mathrm{Tr}(D_{k1}) > \frac{n\lambda_n}{s_n\log(n)} \Big) + o(1) \leq K \sum_{k=s_n+1}^{p_n} \frac{E[\mathrm{Tr}(D^2_{k1}(\beta_n^*))]\, s_n^2 \log^2(n)}{n^2\lambda_n^2} + o(1). \]
It can be shown that $E[\mathrm{Tr}(D^2_{k1}(\beta_n^*))] = E \sum_{j=1}^{s_n} \big\| \partial^2 W_{nk}(\beta_n^*)/\partial\beta_{nj}^2 \big\|^2 \leq C s_n^2$. Therefore, with the condition $p_n s_n^3 \log^2(n)/(n^2\lambda_n^2) \to 0$ as $n \to \infty$, as in assumption (A11) and similar to L. Wang et al. (2012), we have $i_3 = o(1)$.

Chapter 3

Voxel Selection using Penalized Least Squares for a Separable Space-Time Covariance Model

3.1 Functional Magnetic Resonance Imaging (fMRI)

Magnetic resonance imaging (MRI) uses a strong magnetic field and pulsating radio waves to capture the structure of the brain by measuring magnetic properties of certain molecules (e.g. water, by exciting hydrogen nuclei) that have varying densities in different parts of the brain, such as white matter, gray matter, the brain stem, tumors and blood vessels. Different pulse sequences of the radio waves can be constructed to study different tissue properties, making MRI an extremely flexible and powerful clinical tool. It is also among the most non-invasive procedures, unlike a computed axial tomography (CAT) scan, which uses radiation, or a PET scan, which requires the subject to be injected with a radioactive label. MRI as an imaging modality takes a static image of the brain and has been successful in identifying structural anomalies related to certain diseases. Functional MRI (fMRI), on the other hand, is used to detect brain function and connectivity.

Primarily interested in understanding neural activity in the brain, fMRI provides indirect insight into neural function by measuring the BOLD (blood-oxygen-level-dependent) signal, the vascular response to high neural activity. The BOLD signal reflects the ratio of oxygenated to de-oxygenated hemoglobin in the blood.
The fMRI scanner is therefore tuned to detect BOLD signal changes with high spatial resolution, in order to identify significantly active parts of the brain responding to an external stimulus. Experiments are conducted using either block designs or event-related designs, in which a subject undergoes a series of active and resting states, either performing a task or responding to an environmental stimulus. During this process, 3D images are taken as quickly as possible so as to detect local increases in blood oxygenation at active areas of the brain in real time. These studies are of tremendous interest to neuroscientists and psychologists alike for studying brain function and behavioral aspects of the human brain. Detailed explanations of these processes and further references may be found in the book by Ashby (2011), which also introduces the standard statistical techniques used in studying fMRI data.

Hemodynamic Response Function: Proof-of-concept experiments indicate that, through a process called the hemodynamic response, blood releases oxygen to active neurons at a greater rate than to inactive neurons, to meet the metabolic demands created by neural activity. Let $s_{i,t}$ denote a box-car $\{0,1\}$ design, the $i$-th stimulus provided to a subject over time $t$. For an active voxel responding to the stimulus, it is observed that although the BOLD signal increases with neural activation, it is also sluggish. It is further hypothesized that once the neural activity stops, an oxygen debt occurs, dropping the BOLD signal below baseline levels. Consequently the design matrix must undergo a transformation, namely a convolution of the box-car sequence $s$ and a hypothesized hemodynamic response function (HRF, $h(\lambda, t)$):
\[ x_{i,t} = \int_0^t h(t - \tau)\, s_{i,\tau}\, d\tau. \]
(Ashby, 2011, Chapter 3) discusses in detail multiple ways of estimating the HRF, such as deconvolution and linear approximations. The parameters of $h(\cdot)$ are usually estimated in a pre-processing step for the hypothesized HRF. The most common HRFs used are the Poisson, Gamma or canonical (double-gamma, i.e. difference of two gammas) forms. Figure 3.1 shows an example of a hypothesized canonical HRF convolved with a boxcar stimulus.

Figure 3.1: A theoretical double-gamma HRF characterizing the BOLD signal of a voxel that may respond to the stimulus with some lag and undershoot over time.

Thus we are interested in locating those regions that are significantly active and responding to stimuli, based on how close the observed BOLD response is to the convolution of this hypothesized behavior.
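As an illustration, a minimal R sketch of this convolution step; the double-gamma form and parameter values below are conventional defaults chosen for illustration, not the specific choices used later in this chapter:

```r
# Construct a stimulus regressor: boxcar convolved with a double-gamma HRF.
# The HRF parameters are illustrative defaults, not the thesis's choices.
canonical_hrf <- function(t, a1 = 6, a2 = 16, b = 1, c = 1/6) {
  dgamma(t, shape = a1, rate = b) - c * dgamma(t, shape = a2, rate = b)
}

TR <- 2.5                              # repetition time in seconds (assumed)
t  <- seq(0, 30, by = TR)              # support of the HRF
h  <- canonical_hrf(t)

s <- rep(c(0, 0, 1, 1, 1, 0, 0, 0), length.out = 50)   # boxcar {0,1} stimulus

# Discrete convolution, truncated to the length of the experiment
x <- convolve(s, rev(h), type = "open")[seq_along(s)]
```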
3.2 Model Description

Single Subject Model: Measurements obtained from an fMRI experiment detect brain activity associated with changes in blood flow when the brain is subjected to a stimulus repeatedly; the BOLD effect is the basis of fMRI. Here we propose a selection technique for a single-subject study to detect those active parts $\beta$ (amplitudes of activation) that significantly relate to changes in blood flow, incorporating both spatial and temporal dependencies and not requiring additional hypothesis testing steps. Let $y_{v,t}$ denote a measurement obtained from a task-based fMRI experiment at voxel $v = 1, \ldots, n$ and time $t = 1, \ldots, T$. Consider the additive model
\[ y_{v,t} = z_t^T \alpha_v + x_t^T \beta_v + \varepsilon_{v,t}, \qquad (3.1) \]
where $Z = \{z_1, \ldots, z_T\}$ and $Z\alpha_v$ is a baseline stimulus-independent trend, $\beta_v = \{\beta_{v,1}, \ldots, \beta_{v,p}\}$ denotes the amplitudes of activation associated with the $p$ stimuli at voxel $v$, $x_t = \{x_{t,1}, \ldots, x_{t,p}\}^T$ is the $t$-th row of the $T \times p$ design matrix $X$ consisting of the convolution of a boxcar stimulus and a hypothesized HRF, and the error term $\varepsilon_{v,t}$, $t = 1, \ldots, T$, $v = 1, \ldots, n$, accounts for the spatio-temporal dependence in the responses. For simplicity and feasibility, we use a separable model for the spatial and temporal dependence of the responses at each voxel over time, i.e.
\[ \mathrm{var}(\varepsilon_{v,t}) = \sigma^2, \qquad \mathrm{cov}(\varepsilon_{v,t}, \varepsilon_{v',t'}) = \nu(|v - v'|)\, \rho(|t - t'|), \]
where $\nu(\cdot)$ is a stationary isotropic spatial correlation model and $\rho(\cdot)$ is a stationary correlation function modeling dependence over time, for $t \neq t'$ and $v \neq v'$. In vector form, let $Y = \{y_1, \ldots, y_n\}$, where $y_v = \{y_{v,1}, \ldots, y_{v,T}\}$ for each voxel $v$, be the stacked vector of dimension $nT \times 1$. The amplitude coefficients $\beta$ form a stacked $pn \times 1$ vector over the stimuli. Let $\Gamma$ and $R$ be correlation matrices corresponding to $\nu(\cdot)$ and $\rho(\cdot)$ respectively, such that $\mathrm{cov}(Y) = \sigma^2 Q^{-1}$, where $Q^{-1} = \Gamma \otimes R$ is of dimension $nT \times nT$. The design matrix is rewritten as $\bar{X} = I \otimes X$, where $I$ is an $n \times n$ identity matrix. The model in vector form is therefore
\[ Y = Z\alpha + \bar{X}\beta + \varepsilon, \qquad (3.2) \]
where $\varepsilon \sim N_{nT}(0, \sigma^2 Q^{-1})$.

3.3 Estimation and Selection

Without loss of generality, assume that the response BOLD signal is adjusted for the stimulus-independent baseline, i.e. $Z\alpha = 0$. In the usual case of a Gaussian error in model 3.2, the parameters of interest $\beta$ are estimated by minimizing the maximum likelihood or least squares objective function
\[ \hat{\beta} = \mathrm{argmin}_\beta\, S(\beta), \qquad S(\beta) := (Y - \bar{X}\beta)^T Q (Y - \bar{X}\beta). \qquad (3.3) \]
To correctly identify activated voxels, we take a regularization approach, introducing penalty terms with the aim of simultaneously selecting and estimating the amplitudes $\beta$ of voxels responding significantly to the stimulus. For simplicity, consider the scenario where only $p = 1$ stimulus is used in a single-subject study. The estimated coefficients are then obtained by minimizing
\[ S(\lambda_1, \lambda_2, \beta) := \frac{1}{2}(Y - \bar{X}\beta)^T Q(Y - \bar{X}\beta) + \lambda_1 \sum_{v=1}^n |\beta_v| + \lambda_2\, \beta^T(I - MWM)\beta, \qquad (3.4) \]
where $W$ is the $n \times n$ adjacency matrix based on the sampling locations, with $w_{ij} = 1$ if voxels $i$ and $j$ are neighbors and 0 otherwise; $\lambda_1 > 0$ and $\lambda_2 > 0$ are tuning parameters associated with the selection (LASSO, $l_1$) and smoothing penalty terms respectively; and $M$ is a diagonal matrix with $M_{ii} = 1/\sqrt{w_{i+}}$, where $w_{i+} = \sum_{j=1}^n w_{ij}$. The matrix $(I - MWM)$ is given entrywise by
\[ (I - MWM)_{ij} = \begin{cases} 1 - w_{ii}/w_{i+}, & \text{if } i = j \text{ and } w_{i+} \neq 0, \\ -1/\sqrt{w_{i+} w_{j+}}, & \text{if } i \sim j, \\ 0, & \text{otherwise.} \end{cases} \]
This representation is in the form of a Laplacian matrix on a graph, ensuring that $(I - MWM)$ is positive semi-definite with smallest eigenvalue 0 and largest eigenvalue at most 2, as stated in (Li & Li, 2010, Section 2.1). Further, this smoothing penalty can be expressed as a sum of differences of $\beta$s:
\[ \beta^T (I - MWM) \beta = \sum_{i \sim j} \Big( \frac{\beta_i}{\sqrt{w_{i+}}} - \frac{\beta_j}{\sqrt{w_{j+}}} \Big)^2. \qquad (3.5) \]
Commonly seen in undirected graph-based semi-supervised learning, as in X. Zhu (2011), this formulation penalizes heavily those amplitudes on neighboring nodes that differ widely and penalizes less the clusters that are more homogeneous, thus incorporating a natural notion of area activation.

Remark 3. The adjacency matrix $W$ in 3.4 provides weights to immediate neighbors or to pre-defined clusters. Examples of adjacency matrices that can be constructed and implemented may be found in (J. Huang et al., 2011, Section 3), with regard to Euclidean distances, dissimilarity measures, power adjacency, cluster analysis, etc. A sketch of the construction of $(I - MWM)$ on a regular grid follows.
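A minimal R sketch constructing $(I - MWM)$ for a small regular grid; the grid size and 4-nearest-neighbor adjacency rule here are illustrative choices, not prescriptions from the text:

```r
# Build the normalized Laplacian penalty I - M W M for an r x c voxel grid
# with 4-nearest-neighbor adjacency (an illustrative neighborhood choice).
grid_laplacian <- function(r, c) {
  n <- r * c
  W <- matrix(0, n, n)
  idx <- function(i, j) (j - 1) * r + i        # column-major voxel index
  for (i in 1:r) for (j in 1:c) {
    if (i < r) W[idx(i, j), idx(i + 1, j)] <- W[idx(i + 1, j), idx(i, j)] <- 1
    if (j < c) W[idx(i, j), idx(i, j + 1)] <- W[idx(i, j + 1), idx(i, j)] <- 1
  }
  w_plus <- rowSums(W)                         # degrees w_{i+}
  M <- diag(1 / sqrt(w_plus))
  diag(n) - M %*% W %*% M                      # I - M W M
}

L <- grid_laplacian(10, 10)
range(eigen(L, symmetric = TRUE)$values)       # eigenvalues lie in [0, 2]
```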
For a known covariance structure and parameters, we can further simplify 3.4 by using a Cholesky decomposition of $Q$ (the inverse of $\mathrm{Cov}(Y)$), i.e. $Q = LL^T$, where $L$ is a lower-triangular matrix. Define $Y^* = L^T Y$ and $X^* = L^T \bar{X}$. Then the objective function can be rewritten as
\[ S(\beta, \lambda, \alpha) = \frac{1}{2} \sum_{v=1}^n \sum_{t=1}^T (y^*_{vt} - x^*_t \beta_v)^2 + \lambda P_\alpha(\beta), \qquad (3.6) \]
where
\[ P_\alpha(\beta) = (1 - \alpha)\frac{1}{2}\, \beta^T(I - MWM)\beta + \alpha \|\beta\|_1 = (1 - \alpha)\frac{1}{2} \sum_{i \sim j} \Big( \frac{\beta_i}{\sqrt{w_{i+}}} - \frac{\beta_j}{\sqrt{w_{j+}}} \Big)^2 + \alpha \sum_{v=1}^n |\beta_v| \]
is a convex combination of the two penalty functions, with $\alpha = \lambda_1/(\lambda_1 + 2\lambda_2)$ and $\lambda = (\lambda_1 + 2\lambda_2)/(2n)$.

Remark 4. It is important to note that this formulation is based on a single stimulus ($p = 1$). If additional stimuli are added, the formulation changes with respect to the tuning parameters involved: we are at liberty either to restrict the objective function to two distinct parameters associated with the two penalty types, or to free them to $2p$ tuning parameters in total, with no interactions in the penalty between stimuli, or to incorporate such interactions as well. This more liberal representation is not addressed in this dissertation.

3.3.1 Coordinate Descent Algorithm

The coordinate descent algorithm explicated by Friedman et al. (2007) is implemented to solve 3.6. For effective calculation of the derivatives of the objective function, we decompose 3.6 as
\[ S(\beta, \lambda, \alpha) = S_1(\beta) + S_2(\beta), \qquad (3.7) \]
where, for a particular voxel $u$,
\[ S_1(\beta) = \frac{1}{2} \sum_{v \neq u} \sum_{t=1}^T (y^*_{vt} - x^*_t \beta_v)^2 + \frac{1}{2} \sum_{t=1}^T (y^*_{ut} - x^*_t \beta_u)^2, \qquad (3.8) \]
\[ \frac{\partial S_1}{\partial \beta_u} = -\sum_{t=1}^T x^*_t y^*_{ut} + \beta_u \sum_{t=1}^T x^{*2}_t, \qquad (3.9) \]
and
\[ S_2(\beta) = \lambda\alpha \sum_{v \neq u} |\beta_v| + \lambda(1-\alpha)\frac{1}{2} \sum_{z \sim v;\, z, v \neq u} \Big( \frac{\beta_z}{\sqrt{w_{z+}}} - \frac{\beta_v}{\sqrt{w_{v+}}} \Big)^2 + \lambda\alpha|\beta_u| + \lambda(1-\alpha)\frac{1}{2} \sum_{u \sim v} \Big( \frac{\beta_u}{\sqrt{w_{u+}}} - \frac{\beta_v}{\sqrt{w_{v+}}} \Big)^2, \qquad (3.10) \]
\[ \frac{\partial S_2}{\partial \beta_u} = \lambda\alpha\, \partial|\beta_u| - \lambda(1-\alpha) \sum_{v \sim u} \frac{\beta_v}{\sqrt{w_{u+}}\sqrt{w_{v+}}} + \lambda(1-\alpha)\beta_u. \qquad (3.11) \]
Combining the derivatives obtained in 3.9 and 3.11,
\[ \frac{\partial S}{\partial \beta_u} = -\sum_{t=1}^T x^*_t y^*_{ut} - \lambda(1-\alpha) \sum_{v \sim u} \frac{\beta_v}{\sqrt{w_{u+}}\sqrt{w_{v+}}} + \lambda\alpha\, \partial|\beta_u| + \beta_u \Big( \sum_{t=1}^T x^{*2}_t + \lambda(1-\alpha) \Big). \qquad (3.12) \]
We therefore obtain the coordinate-wise update form for $\beta_u$, devised in (Donoho et al., 1994, Section 2.2):
\[ \hat{\beta}_u \leftarrow \frac{ S\Big( \sum_{t=1}^T x^*_t y^*_{ut} + \lambda(1-\alpha) \sum_{v \sim u} \frac{\beta_v}{\sqrt{w_{u+}}\sqrt{w_{v+}}},\ \lambda\alpha \Big) }{ \sum_{t=1}^T x^{*2}_t + \lambda(1-\alpha) }. \qquad (3.13) \]
The soft-thresholding function $S(z, \gamma)$ is defined as
\[ S(z, \gamma) := \mathrm{sign}(z)(|z| - \gamma)_+ = \begin{cases} z - \gamma & \text{if } z > 0 \text{ and } \gamma < |z|, \\ z + \gamma & \text{if } z < 0 \text{ and } \gamma < |z|, \\ 0 & \text{otherwise.} \end{cases} \]
Previous updates of $\beta_v$ are used to estimate the current coordinate $\beta_u$, and this process continues until convergence is attained. In the special case $I - MWM = I$, or more simply the Elastic Net (EN) in 3.6, Friedman et al. (2010) constructed coordinate descent algorithms for generalized linear models. The major difference from the usual soft-thresholding update is that, due to the structure of the data, our update does not include the partial residual terms $y^*_v - \tilde{y}^*_v$ associated with the remaining voxels. Therefore, in order to capture spatial influence of the amplitudes over and above the already infused spatio-temporal covariance, the smoothing penalty structure plays a key role. Another version of the update rule can be written in terms of the original tuning parameters $\lambda_1 > 0$ and $\lambda_2 > 0$:
\[ \hat{\beta}_u \leftarrow \frac{ S\Big( \sum_{t=1}^T x^*_t y^*_{ut} + \frac{2\lambda_2}{n} \sum_{v \sim u} \frac{\beta_v}{\sqrt{w_{u+}}\sqrt{w_{v+}}},\ \frac{\lambda_1}{n} \Big) }{ \sum_{t=1}^T x^{*2}_t + \frac{2\lambda_2}{n} }. \qquad (3.14) \]
A sketch of one sweep of these coordinate-wise updates follows.
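A minimal R sketch of one pass of the coordinate-wise update in 3.13, assuming the whitened $y^*$, $x^*$, the neighbor structure and degrees $w_{i+}$ are already available; all variable names here are illustrative:

```r
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# One full sweep of coordinate-wise updates for beta (eq. 3.13, sketch).
# ystar: n x T matrix of whitened responses; xstar: length-T whitened regressor;
# nbrs: list of neighbor indices per voxel; w_plus: degrees w_{i+}.
update_beta <- function(beta, ystar, xstar, nbrs, w_plus, lambda, alpha) {
  xx <- sum(xstar^2)
  for (u in seq_along(beta)) {
    v <- nbrs[[u]]
    smooth_term <- lambda * (1 - alpha) *
      sum(beta[v] / (sqrt(w_plus[u]) * sqrt(w_plus[v])))
    z <- sum(xstar * ystar[u, ]) + smooth_term
    beta[u] <- soft_threshold(z, lambda * alpha) / (xx + lambda * (1 - alpha))
  }
  beta
}
```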
3.3.2 Algorithm Implementation

Ordinarily, one can evaluate the above update rule for each fixed pair of tuning parameters, $\lambda$ and $\alpha$ in 3.13 or $\lambda_1$ and $\lambda_2$ in 3.14, successively at each coordinate, and continue until convergence is attained.

3.3.2.1 Pathwise Coordinate Descent

Friedman et al. (2007) describe the pathwise coordinate descent algorithm as a procedure that varies the tuning parameter along a path. The solutions are computed on a decreasing sequence of values of $\lambda$ for a given $\alpha \in (0, 1)$, starting at the smallest value $\lambda_{\max}$ such that all of the coefficients $\hat{\beta} = 0$. The scheme also exploits the notion of warm starts, leading to a more stable algorithm: for each smaller $\lambda$ in the sequence, the iterative process is started at the converged estimates obtained for the previous $\lambda$.

It can be shown that $\hat{\beta}_j = 0$, $j = 1, \ldots, n$, if $\frac{1}{nT} |\langle x^*_j, y^*_j \rangle| \leq \lambda\alpha$. Following Friedman et al. (2007) we select $\lambda_{\max} = \max_j |\frac{1}{nT} \langle x^*_j, y^*_j \rangle|$. The strategy is to select a minimum value $\lambda_{\min} = \epsilon\, \lambda_{\max}$ for a typical $\epsilon = 0.0001$, and to construct a sequence of $K$ values from $\lambda_{\min}$ to $\lambda_{\max}$ on the logarithmic scale. $K$ may be any value, depending on computational capabilities; a common choice is $K = 100$.

3.3.2.2 Active Set Convergence

A slightly more observation-driven approach, active set convergence, may speed up the above algorithm significantly. Once the coordinate descent path is established for each combination of $\lambda$ and $\alpha$, an iterative procedure is usually run until the $\beta$s converge. In active set convergence, after the first iteration we obtain an active set, and subsequent iterations are computed only over the non-zero $\beta$s. Continuing with the next $\lambda$ in the sequence, the same procedure is employed. If the active set remains unchanged, the remaining computation is restricted to the first, liberal active set obtained at $\lambda_{\max}$, significantly reducing the number of iterations at coordinates originally estimated to be zero. A strong theoretical basis for the rate of convergence of this device has not been provided, but it has been observed to be successful in the LASSO, EN and group LASSO settings (Meier et al., 2008). A combined sketch of the pathwise scheme with warm starts follows.
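A minimal R sketch of the pathwise scheme with warm starts, reusing the hypothetical update_beta() from the previous sketch; the path length and $\epsilon$ follow the defaults quoted above:

```r
# Pathwise coordinate descent with warm starts (illustrative sketch).
# Assumes update_beta(), ystar, xstar, nbrs, w_plus as in the earlier sketch.
run_path <- function(ystar, xstar, nbrs, w_plus, alpha,
                     K = 100, eps = 1e-4, tol = 1e-6, maxit = 200) {
  n <- nrow(ystar); T <- length(xstar)
  lambda_max <- max(abs(ystar %*% xstar)) / (n * T)
  lambdas <- exp(seq(log(lambda_max), log(eps * lambda_max), length.out = K))
  beta <- rep(0, n)                     # warm start carried along the path
  path <- matrix(NA, n, K)
  for (k in seq_len(K)) {
    for (it in seq_len(maxit)) {
      beta_old <- beta
      beta <- update_beta(beta, ystar, xstar, nbrs, w_plus, lambdas[k], alpha)
      if (max(abs(beta - beta_old)) < tol) break
    }
    path[, k] <- beta
  }
  list(lambda = lambdas, beta = path)
}
```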
3.3.2.3 Cholesky Decomposition Properties

In the current model implementation we assume that the separable spatio-temporal covariance is known. Separability is a rather restrictive supposition, but we exploit its matrix properties in order to make the method feasible. It is important, in general, to understand the scale of such datasets. For fMRI data, Lindquist (2008) explains that the number of slices acquired depends on how quickly the image may be taken while the brain is excited: in approximately 2 seconds, standard scanners can image a whole brain volume of $64 \times 64 \times 30 = 122880$ voxels. In a block design, the number of time points depends on the length of the experiment. Event-related designs have a comparatively small number of time points (e.g. 4 to 7), but block designs may last up to 30 minutes or more, with over 200 time points. In this setting the full covariance matrix for a single subject could exceed dimension $24000000 \times 24000000$. Most high-performance computers would be unable to hold such a massive matrix in memory, especially a dense matrix encoding both spatial and temporal dependence. However, if this dependence is separable, i.e. of the form $\mathrm{cov}(Y) = Q^{-1} = \Gamma \otimes R$, then to obtain the Cholesky decomposition of $Q$ we can use the following properties of Kronecker products of positive definite matrices:

1. $(\Gamma \otimes R)^{-1} = \Gamma^{-1} \otimes R^{-1}$
2. $(\Gamma \otimes R)^{-1} = L_{\Gamma^{-1}} L_{\Gamma^{-1}}^T \otimes L_{R^{-1}} L_{R^{-1}}^T = (L_{\Gamma^{-1}} \otimes L_{R^{-1}})(L_{\Gamma^{-1}} \otimes L_{R^{-1}})^T$
3. $\Gamma^{-1} = (L_\Gamma L_\Gamma^T)^{-1} = (L_\Gamma^T)^{-1}(L_\Gamma)^{-1}$

where $L$ denotes a lower-triangular matrix. These properties enable us to perform matrix operations on the smaller space and time matrices separately, combine them, and vectorize immediately to obtain the weighted least squares quantities without saving the large matrix to memory. The computational cost of working with the space and time matrices is dramatically smaller; using row-wise Kronecker products together with the properties above, and multiplying into the response and design matrices directly, yields the weighted least squares without allocating an $nT \times nT$ matrix. Details of these properties may be reviewed in Schacke (2004).
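A minimal R sketch of this whitening step, computing $Y^* = L^T Y$ factor-by-factor without ever forming the $nT \times nT$ matrix; this is a sketch under the separability assumption, and the exponential spatial and AR(1) temporal models below are placeholders:

```r
# Whiten Y under Q^{-1} = Gamma %x% R factor-by-factor, never forming
# the nT x nT matrix. Gamma: n x n spatial, R: T x T temporal correlation.
n <- 100; T <- 50
coords <- expand.grid(x = 1:10, y = 1:10)
Gamma  <- exp(-0.7 * as.matrix(dist(coords)))   # placeholder spatial model
R      <- 0.5^abs(outer(1:T, 1:T, "-"))         # placeholder AR(1) model

Lg <- t(chol(solve(Gamma)))   # lower-triangular factor of Gamma^{-1}
Lr <- t(chol(solve(R)))       # lower-triangular factor of R^{-1}

Y <- matrix(rnorm(T * n), T, n)   # column v holds the time series of voxel v

# With y = vec(Y) stacked voxel-by-voxel, (Lg %x% Lr)^T y = vec(t(Lr) Y Lg):
Ystar <- t(Lr) %*% Y %*% Lg
```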
3.3.2.4 Tuning Parameter Selection

To consistently identify the true model, it is crucial to choose an optimal tuning parameter that ensures the true model is selected every time. Cross validation as in Chapter 2 of this dissertation is not easily implementable due to the dependence present within a single subject, both in space and in time. Therefore we seek the optimal values of the tuning parameters by optimizing a criterion over a grid. Our first candidate criterion is the usual AIC based on the log-likelihood. Additionally, based on the method developed by Y. Fan & Tang (2013), we adopt the GIC (generalized information criterion). Under a generalized linear model (GLM) with independent responses from the exponential family, they find a sequence $a_n$ such that the minimum GIC value corresponds to tuning parameters that are consistently close to the true model. However, the fMRI data structure is fundamentally different from a regular linear model, since the selection is not over predictors but over locations with activated voxels. In the discussion (section 3.6) we address the need to modify the GIC by finding a sequence $a_n$ suited to the current setup. As a candidate criterion, we investigate whether the GIC for a regular GLM can locate the optimal tuning parameters.

Another attempt is made to use cross validation on block designs, assuming that no carry-over effects are sustained by the single subject while performing an activity repeatedly. We produce the illusion of independent replicates of block designs by splitting the different blocks over time into multiple folds. This drastically reduces the dimensions of the correlation matrices used; permutation matrices and the properties of Cholesky decompositions and Kronecker products also support this formulation. However, there is a computational trade-off depending on how many folds are constructed from the blocks, and the optimality of the solution may further depend on these folds. An example of its simulation results is provided in section 3.4.

The Bayesian LASSO introduced by Park & Casella (2008) provides an empirical Bayes estimate of the tuning parameter in a Bayesian hierarchical setup, exploiting a scale-mixture-of-normals property. Based on the LASSO formulation of the EN by Zou & Hastie (2005), another attempt was made at finding optimal solutions. Certain issues arose immediately in the simulation study and are addressed in the discussion (section 3.6).

3.4 Simulation Study

Below is a series of simulations conducted to examine the proposed method. Critical issues of tuning parameter selection arise with regard to the criteria required to reach optimal solutions; several possibilities are presented for a variety of implementations.

General Setup: Consider $10 \times 10$, $15 \times 15$ and $20 \times 20$ regular grids of 100/225/400 locations, each a single unit apart. For a single boxcar stimulus (i.e. $p = 1$), we construct a hemodynamic response function using the Gamma density over 50 time points; the design matrices are therefore of dimension $5000 \times 1$, $11250 \times 1$ and $20000 \times 1$ respectively. The spatial covariance function imposed on the locations is $\nu(|v - u|) = \exp(-0.7 \cdot |v - u|)$, producing the matrix $\Gamma$. We further introduce additional sparsity via the tapering method of Wendland (1995), which preserves positive definiteness; the Wendland function parameters used are $\theta = 15$ and $k = 2$. The temporal structure $R$ is given two forms: a tri-diagonal matrix with $\rho(0) = 1$ and $\rho(|t - t'|) = 0.5$ if $|t - t'| = 1$, 0 otherwise; and an AR(1) with $\tilde{\rho} = 0.5$. A location is selected in the grid and, along with its closest neighbors, is given an amplitude $\beta_v = 6$, while all other amplitudes are assigned 0, as in the simulation studies of Musgrove et al. (2016). No additional baseline trend is considered. The data are generated from a normal distribution with mean $\bar{X}\beta$ and covariance $Q^{-1} = \Gamma \otimes R$; 100 such datasets are replicated for each setup. A sketch of this data-generating step is given after Table 3.1.

Figure 3.2 shows examples of the simulated data under two different time structures for various grid sizes and 50 time points. The red blocks indicate where the signal lies; despite the high signal, the data are clearly very noisy.

Figure 3.2: Simulated fMRI data with different grid sizes and time structures. (a) $10 \times 10$ grid with tri-diagonal time correlation (5000 data points); (b) $15 \times 15$ grid with AR(1) time correlation (11250 data points).

To begin with, three different algorithms are used to look for optimal tuning parameters under the assumption that the underlying spatio-temporal model is known.

1. Sequential grid for $\lambda_1$ and $\lambda_2$: In the general setup above, we first explore the efficiency of the selection method over an entire grid of values of the tuning parameters $\lambda_1$ and $\lambda_2$ as indicated in equation 3.14. A range from 0.1 to 1000 over 15 points is considered. Image plots of the AIC and GIC obtained from the various values for all 3 grid types, averaged over 100 datasets, are shown below. In Figures 3.3, 3.4 and 3.5 it is distinctly visible that an obvious optimal combination of the tuning parameters is not apparent.

Figure 3.3: AIC for sequences of $\lambda_1, \lambda_2 \in (0.1, 1000)$: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.4: GIC for sequences of $\lambda_1, \lambda_2 \in (0.1, 1000)$: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

However, based on the mean squared error (MSE), since the true parameters of interest $\beta_v$ are known, we obtain the results in Table 3.1 for each of the grids.

Table 3.1: Choice of lambda sequence parameters based on MSE

Grid       Lambda 1   Lambda 2   TP     FP      MSE
10 x 10    142.94     71.52      5.00   31.46   45.57
15 x 15    500.05     142.94     5.00   29.52   45.13
20 x 20    1000.00    214.36     5.00   32.53   47.90
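As flagged above, a minimal R sketch of the data-generating step; the Wendland taper is omitted for brevity, and the active-voxel indices are illustrative:

```r
# Generate one synthetic dataset from the general setup (illustrative sketch).
library(MASS)                                    # for mvrnorm

n <- 100; T <- 50
coords <- expand.grid(x = 1:10, y = 1:10)
Gamma  <- exp(-0.7 * as.matrix(dist(coords)))    # exponential spatial correlation
R      <- 0.5^abs(outer(1:T, 1:T, "-"))          # AR(1) temporal correlation

hrf <- dgamma(seq(0, 30, length.out = T), shape = 6)   # Gamma-density HRF
s   <- rep(c(1, 1, 1, 0, 0), length.out = T)           # boxcar stimulus
x   <- convolve(s, rev(hrf), type = "open")[1:T]       # design column (T x 1)

beta <- rep(0, n)
beta[c(45, 44, 46, 35, 55)] <- 6    # one voxel and its neighbors (illustrative)

# vec stacking is voxel-major with time inner, matching Gamma %x% R
mu <- as.vector(t(outer(beta, x)))
Y  <- mvrnorm(1, mu, Gamma %x% R)
```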
Figure 3.5: MSE for sequences of $\lambda_1, \lambda_2 \in (0.1, 1000)$: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.6: True positives (max 5 active) for sequences of $\lambda_1, \lambda_2 \in (0.1, 1000)$: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.7: False positives (max $n - 5$) for sequences of $\lambda_1, \lambda_2 \in (0.1, 1000)$: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

2. Pathwise Coordinate Descent: For a fixed value of $\alpha$, a decreasing sequence of 20 values of $\lambda$ is considered, where $\lambda_{\max} = \max \frac{1}{n\alpha} |\langle \bar{X}, y \rangle|$ and $\lambda_{\min} = \epsilon \cdot \lambda_{\max}$ with $\epsilon = 0.0001$. For a sequence of $\alpha \in (0, 1)$ varying over 10 values, the selection was run repeatedly over 100 datasets. Image plots of the AIC, GIC and MSE are provided in Figures 3.8, 3.9 and 3.10. With this algorithm the mean squared prediction error (MSPE) (see Figure 3.13) seemed more intuitive with regard to the tuning parameters; the selected $\lambda$ index varied with $\lambda_{\max}$ across datasets but did not exceed 5.

Figure 3.8: AIC over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using pathwise coordinate descent: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.9: GIC over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using pathwise coordinate descent: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.10: MSE (mean squared error) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using pathwise coordinate descent: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.11: True positives (max 5 active) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using pathwise coordinate descent: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Table 3.2: Choice of lambda sequence parameters based on MSE

Grid       Lambda index   Alpha   TP     FP      MSE
10 x 10    4.00           0.55    5.00   24.33   44.89
15 x 15    3.00           0.77    5.00   20.27   46.85
20 x 20    3.00           0.66    5.00   31.10   49.66

Interestingly, the mean squared prediction error for the pathwise coordinate descent algorithm tends to prefer the LASSO solution at $\alpha \approx 1$ (see Figure 3.13).

Figure 3.12: False positives (max $n - 5$) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using pathwise coordinate descent: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.13: MSPE (mean squared prediction error) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using pathwise coordinate descent: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

3. Active Set Convergence: In active set convergence we take an approach similar to the pathwise coordinate descent, except that for consecutive values in the $\lambda$ sequence we provide the solution of the previous iteration as a warm start, and then run coordinate descent only over those voxels/coordinates that have not already been dropped. This results in a relatively speedy algorithm and leads to rather sparse solutions.
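A sketch of this active-set refinement, reusing the hypothetical helpers from the earlier sketches; it simply restricts each sweep after the first to the currently non-zero coordinates:

```r
# Active-set variant of one lambda step (illustrative sketch).
# After a first full sweep, iterate only over non-zero coordinates.
active_step <- function(beta, ystar, xstar, nbrs, w_plus, lambda, alpha,
                        tol = 1e-6, maxit = 200) {
  beta <- update_beta(beta, ystar, xstar, nbrs, w_plus, lambda, alpha)
  active <- which(beta != 0)
  xx <- sum(xstar^2)
  for (it in seq_len(maxit)) {
    beta_old <- beta
    for (u in active) {      # coordinate updates restricted to the active set
      v <- nbrs[[u]]
      z <- sum(xstar * ystar[u, ]) + lambda * (1 - alpha) *
        sum(beta[v] / (sqrt(w_plus[u]) * sqrt(w_plus[v])))
      beta[u] <- soft_threshold(z, lambda * alpha) / (xx + lambda * (1 - alpha))
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}
```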
Figure 3.14: AIC over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using active sets: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.15: GIC over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using active sets: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.16: MSE (mean squared error) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using active sets: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.17: True positives (max 5 active) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using active sets: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Figure 3.18: False positives (max $n - 5$) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using active sets: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Table 3.3: Choice of lambda sequence parameters based on MSE

Grid       Lambda index   Alpha   TP     FP      MSE
10 x 10    5.00           0.55    5.00   23.51   44.91
15 x 15    4.00           0.66    5.00   26.31   0.45
20 x 20    4.00           0.66    5.00   41.89   0.48

Figure 3.19: MSPE (mean squared prediction error) over the indexed $\lambda$ sequence given $\alpha \in (0, 1)$, using active sets: (a) $10 \times 10$, (b) $15 \times 15$, (c) $20 \times 20$.

Cross Validation: Regularization methods use cross validation to obtain optimal tuning parameters in big-data settings with independent replicates. As explained in section 3.3.2.4, we try to exploit the nature of the experimental design to produce an illusion of replicates. Assuming that there are no carry-over effects over time in these fMRI block-design experiments, consider the convolved double-gamma (canonical) HRF with a box-car stimulus shown in Figure 3.20.

Figure 3.20: 4-fold CV on time for the design of the single-subject fMRI study. The solid blue line indicates the boxcar stimulus and the dotted line the convolved HRF.

Each cycle of the experiment is treated as a fold. In Figure 3.20, without loss of generality, we can take the first fold as the test set and the rest as the training set, and continue until each fold has been the test set once. We average the different criteria and the MSPE to find optimal solutions; this is done on a single dataset on the $10 \times 10$ grid. In this analysis it becomes intuitively clear that AIC and GIC favor the LASSO penalty (see Figure 3.21), since these criteria are proportional to the number of voxels in the model, while the MSE and MSPE prefer the modified "Laplacian" ridge (smoothing) penalty, which is expected when the objective is to reduce prediction error, and hence bias. Thus there is no indication that CV may yield optimal solutions. A sketch of the block-fold construction follows.

Figure 3.21: Simulation results of 4-fold cross validation on a $10 \times 10$ grid (AIC, GIC, MSE, MSPE).
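A minimal R sketch of splitting the time points into block folds; the experiment length and block length are illustrative, and fitting/evaluation are indicated only as comments:

```r
# Split the T time points into folds by experimental block (illustrative),
# treating each stimulation cycle as a pseudo-replicate for cross validation.
T <- 200; block_length <- 50
fold_id <- rep(1:(T / block_length), each = block_length)

for (f in unique(fold_id)) {
  test  <- which(fold_id == f)
  train <- which(fold_id != f)
  # fit on ystar[, train] with xstar[train]; evaluate MSPE on ystar[, test]
}
```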
Simulation study investigating an AR(1) time structure: Similar to the general setup of the simulation studies above, we investigate whether the method can be implemented successfully with a more commonly used and better-suited time correlation structure. In the current setup we look at a single pseudo-image grid of size $15 \times 15$. The spatial covariance and the Wendland taper are still used to simulate the data. For the time structure we generate data with a voxel-wise autoregressive AR(1) model with parameter $\rho = 0.5$. The experiment is assumed to have 50 time points, with the same HRF used for the design matrix. All three algorithms are considered and compared with each other in the image plots of Figures 3.22-3.27.

Figure 3.22: AIC for the model simulated using the AR(1) time correlation structure: (a) sequence CD, (b) pathwise CD, (c) active set CD.

Figure 3.23: GIC for the model simulated using the AR(1) time correlation structure: (a) sequence CD, (b) pathwise CD, (c) active set CD.

Figure 3.24: MSE for the model simulated using the AR(1) time correlation structure: (a) sequence CD, (b) pathwise CD, (c) active set CD.

Figure 3.25: MSPE for the model simulated using the AR(1) time correlation structure: (a) sequence CD, (b) pathwise CD, (c) active set CD.

Figure 3.26: True positives (max 5) for the model simulated using the AR(1) time correlation: (a) sequence CD, (b) pathwise CD, (c) active set CD.

Figure 3.27: False positives (max $n - 5$) for the model simulated using the AR(1) time correlation: (a) sequence CD, (b) pathwise CD, (c) active set CD.

Table 3.4: Consolidated results for the AR(1) simulation for all 3 algorithms

Algorithm       Lambda1   Lambda2   Lambda index   Alpha   TP     FP      MSE     MSPE
Sequence CD     500.05    71.52     -              -       4.99   21.54   52.16   -
Pathwise CD     -         -         3.00           0.77    4.99   17.39   53.61   113.74
Active Set CD   -         -         4.00           0.77    4.99   23.08   52.04   113.01

The use of an AR(1) model instead of a tri-diagonal time structure provides very similar results. A noticeable increase in computational time is experienced due to the increased density of the matrix. A steady 10% false positive rate is observed, with an almost 100% true positive selection rate.

3.5 Real Data Analysis

Data Acquisition: A healthy college student from Michigan State University volunteered to participate in a study of a visual stimulation condition with a scene-object fMRI paradigm. Signed consent forms approved by the Michigan State University Institutional Review Board (IRB) were obtained from the individual. The experiment was conducted on a 3T GE Signa HDx MR scanner (GE Healthcare, Waukesha, WI) with an 8-channel head coil. The fMRI scan parameters were: gradient-echo EPI, 36 contiguous 3-mm axial slices in an interleaved order, time of echo (TE) = 27.7 ms, time of repetition (TR) = 2500 ms, flip angle = 80 degrees, field of view (FOV) = 22 cm, matrix size = 64 x 64, ramp sampling, with the first four data points discarded. Each volume of images was acquired 192 times (8 min) while the subject was presented with 12 blocks of visual stimulation after an initial 10-second "resting" period. In a predetermined randomized order, scenery pictures were presented in 6 blocks and object pictures in the other 6 blocks. All pictures were unique. In each block, 10 pictures were presented continuously for 25 seconds (2.5 s per picture), followed by a 15-second baseline condition in which the subject viewed a white screen with a black fixation cross at the center. The subject pressed the right index finger once when the screen switched from the baseline to the picture condition. Stimuli were displayed in color on a full-screen 1024 x 768 32-inch LCD monitor (Salvagione Design, Sausalito, CA) placed at the back of the magnet room; the LCD subtended a visual angle of 10.2 x 13.1 degrees. After the functional data acquisition, high-resolution volumetric T1-weighted spoiled gradient-recalled (SPGR) images with cerebrospinal fluid suppressed were obtained to cover the whole brain with 120 1.5-mm sagittal slices, an 8-degree flip angle and 24 cm FOV. These images were used to identify anatomical locations.

fMRI Data Pre-processing: All stimulus fMRI data pre-processing was conducted with AFNI software (Cox, 1996) as described in Henderson et al. (2011).
The data were detrended for artifacts and the stimulus-independent baseline; specifically, slice-timing correction and rigid-body motion correction were carried out. Spatial blurring with a full width at half maximum of 4 mm was applied to reduce random noise. Multiple linear regressions (the "3dDeconvolve" routine in AFNI) were applied on a voxel-wise basis to find the magnitude change when each picture condition was presented, followed by general linear tests to determine statistical significance between stimulus conditions. Once this was completed, a mask was created that discarded all of the potentially irrelevant voxels of the brain with regard to the stimuli. This significantly reduced the size of the data from 64 x 64 x 31 x 192 (approximately 25 million data points) to 6118 voxels at the original 192 time points (approximately 1 million data points).

The proposed model assumes that the underlying spatio-temporal covariance is known. In real data studies this never holds, and the covariance needs to be estimated. This need is addressed in the discussion (section 3.6); to showcase the method, we proceed by eye-fitting a model to a spatio-temporal variogram. The R package spacetime is used to fit separable exponential models for both space and time, and the model that minimized the MSE was eventually chosen.

Data Analysis: Once the preprocessing was done and an underlying spatio-temporal model was chosen, we used the Kronecker product and Cholesky decomposition properties of section 3.3.2.3 in a linear algorithm that avoids saving the Kronecker product to memory. Using this vectorized formulation, we could directly calculate the transformed $y^*$ and $x^*$ as explained in section 3.3. As described above, the design consists of two stimuli, scene and object, presented at random. Following Remark 4, we take the most conservative approach and restrict the tuning parameters to be identical across stimuli. It is important to note that, since a single image series was recorded under both stimuli, they must be studied simultaneously. Using the pathwise coordinate descent method of section 3.3.2.1, we vary $\alpha \in (0, 1)$ over 10 points and obtain the sequence of $\lambda$ between $\lambda_{\min} = \epsilon\, \lambda_{\max}$ and $\lambda_{\max}$ for $\epsilon = 0.0001$, selecting
\[ \lambda_{\max} = \max_j \Big\{ \Big| \frac{1}{n\alpha} \langle x^*_{\mathrm{object},j}, y^*_j \rangle \Big|,\ \Big| \frac{1}{n\alpha} \langle x^*_{\mathrm{scene},j}, y^*_j \rangle \Big| \Big\}. \]
Based on all the simulation studies above and the computational feasibility of pathwise coordinate descent, we provide an application of the method to a single-subject study with two visual stimuli ($p = 2$). The simulation results above suggest a range of $\alpha \in (0.5, 0.7)$ whenever the criterion for optimality is the MSE for known true coefficients. Although in practice the coefficients are never known, selection based on this criterion provides solutions that take advantage of both penalties. We therefore crudely set $\alpha = 0.6$ and look at a decreasing sequence of $\lambda$s. Below are images of the 36 slices of the axial view and 64 slices of the coronal and sagittal views, showing the selected voxels for 3 choices of $\lambda$ for both stimuli (object and scene). Alternatively, fixing $\alpha$ and varying $\lambda$ by stimulus has no novel effect, as the intensity of both the scene and object stimuli on the subject is identical, so we proceed as described. The results appear to coincide with the study conducted by Henderson et al. (2011); most significantly, we see activity in the visual cortex and frontal lobe of the brain.
Further, the method is currently unable to clearly distinguish activated regions responding differently to the two stimuli, so in the figures below we see a fair amount of overlap.

Figure 3.28: Axial view, $\lambda = 0.0019$ (object and scene stimuli).

Figure 3.29: Coronal view, $\lambda = 0.0019$ (object and scene stimuli).

Figure 3.30: Sagittal view, $\lambda = 0.0019$ (object and scene stimuli).

Figure 3.31: Axial view, $\lambda = 0.19$ (object and scene stimuli).

Figure 3.32: Coronal view, $\lambda = 0.19$ (object and scene stimuli).

Figure 3.33: Sagittal view, $\lambda = 0.19$ (object and scene stimuli).

3.6 Discussion

In chapter 3 of this dissertation we empirically investigate a method proposed for the selection of activated brain areas in a single-subject fMRI study. As an application of variable selection and the study of regularization methods on densely dependent data, this is an area that requires keen exploration. Image analysis is structurally different from most data under statistical investigation. Medical imaging has vastly improved in the last decade, and resources are being pooled to organize images taken as a regular diagnostic tool and to improve our understanding of human anatomy. With fMRI data in general, the massive acquisition is perhaps why statisticians refrain from applying bold methodologies and submit to crude, simplistic methods. The formulation of coordinate descent algorithms has indubitably eased the computational burden, and this may be an opportunity to explore regularized methods in neuroimaging.

Under the setup of a known spatio-temporal covariance and the properties of the Cholesky decomposition, the method of chapter 3 sets up a variable selection routine with two penalty terms. The vectorized, usually 4D structure of the data (3D space plus a time series) creates a matrix far too massive for a Cholesky decomposition of order $O((nT)^3)$ to be computationally feasible, but the separability assumption reduces the burden considerably, to $O(n^3) + O(T^3)$. This computation need not be repeated, except under the cross validation of section 3.3.2.4: the permutation matrices there are not square, so the operations must be repeated for each fold.

Careful consideration must also be given to the implementation of coordinate descent methods for separable penalties, as stated in Friedman et al. (2007). Although this is not addressed in J. Huang et al. (2011) and Li & Li (2010), if we look at the derivative of the objective function in 3.12 we see that the combination of penalties does not possess separability. Based on Tseng (1988), techniques that update groups of coordinates together, as in the case of the fused LASSO, may yield better and more robust results, unless it can be shown otherwise.

A major aspect of the method that is currently overlooked is the estimation of the underlying spatio-temporal covariance, together with the separability assumption. In general there exist well-developed likelihood-based methods for covariance estimation in the spatio-temporal paradigm, and there is scope in the future to formalize a methodology. To address the latter issue of separability, we must first consider feasibility in general. Separability assumptions simplify methods, making them feasible (e.g. George & Aban (2015)), and dramatically reduce the number of parameters to be estimated.
Although misspecification in the presence of non-separable relations among voxels may lead to inaccurate results, we could use ideas such as separable approximations of non-separable spatio-temporal models (Genton, 2007). This aspect is not within the scope of this dissertation and offers significant future scope.

Tuning parameter selection is by far the most puzzling aspect of this setup. The observation that "for $\alpha \in (0, 1)$ the method tends to be dominated by LASSO but may be an improvement over the simple lasso as it reduces any degeneracy and erratic behaviors caused by high correlation in the design" is made by Friedman et al. (2010) about the EN, and we can definitely sense the LASSO domination here. The coordinate descent solution for the EN using cross validation is given by Kooij et al. (2007) and is widely popular. In general, due to these tendencies, the free grid search may be a suboptimal method and the pathwise coordinate descent method is preferable.

An interesting approach to obtaining optimal tuning parameters, other than sequential methods or cross validation, is to estimate them. Park & Casella (2008) were able to do so in the Bayesian paradigm for the LASSO problem. Below is a conjecture following this notion, though some obvious limitations exist.

3.6.1 Empirical Bayes Estimate using the Bayesian LASSO

The proposed method is a generalized version that yields the Elastic Net (EN) as a special case. Similar to (Zou & Hastie, 2005, Section 2.2), consider reformulating the two-penalty weighted least squares in 3.4 as a single-tuning-parameter LASSO objective function. Given the set $(Y, \bar{X})$ and $\lambda_1, \lambda_2$, define the reformulated $(\mathbf{Y}, \mathbf{X})$ by
\[ \mathbf{Y}_{(nT+n)\times 1} = \begin{pmatrix} Y \\ 0 \end{pmatrix}, \qquad \mathbf{X}_{(nT+n)\times n} = \frac{1}{\sqrt{1+\lambda_2}} \begin{pmatrix} \bar{X}_{nT \times n} \\ \sqrt{\lambda_2}\, (I - MWM)^{1/2} \end{pmatrix}, \qquad Q = LL^T, \qquad \tilde{L} = \begin{pmatrix} L & 0 \\ 0 & I \end{pmatrix}, \]
with $L$ as described in section 3.3, such that for $\beta^* = \sqrt{1+\lambda_2}\, \beta$ and $\gamma = \lambda_1/\sqrt{1+\lambda_2}$ the objective function, denoted $S^*$, is
\[ S^*(\gamma, \beta^*) := (\mathbf{Y} - \mathbf{X}\beta^*)^T \tilde{L}\tilde{L}^T (\mathbf{Y} - \mathbf{X}\beta^*) + \gamma\, \|\beta^*\|_1. \qquad (3.15) \]
Thus $\hat{\beta}^* = \mathrm{argmin}_{\beta^*} S^*(\gamma, \beta^*)$, and subsequently $\hat{\beta} = \hat{\beta}^*/\sqrt{1+\lambda_2}$.

Tibshirani (1996) interpreted the LASSO solution as the posterior mode estimate when the regression coefficients have independent and identical Laplace priors. Park & Casella (2008) exploit the fact that the Laplace distribution can be represented as a scale mixture of normals to obtain a conditionally inverse-Gaussian distribution for $1/\tau_j^2$, with parameters $\mu' = \sqrt{\gamma^2\sigma^2/\beta_j^2}$ and $\lambda' = \gamma^2$, where $\beta \mid \sigma^2, \tau_1^2, \ldots, \tau_n^2 \sim N_n(0, \sigma^2 D_\tau)$ and the variances $\tau^2$ have a Laplace prior with hyper-parameter $\gamma$. Based on this hierarchical setup, an empirical Bayes estimate is obtained of the form
\[ \gamma^{(k)} = \frac{2n}{\sum_{j=1}^n E_{\gamma^{(k-1)}}[\tau_j^2 \mid y]}. \qquad (3.16) \]
In the Bayesian paradigm a Gibbs sampler is used to evaluate the expectation in 3.16. Our primary objective is to find a suitable method that can evaluate this tuning parameter $\gamma$ in every iteration of the coordinate descent algorithm. We proceed by using the relation $\hat{E}(\tau_j^2 \mid y) = \beta_j^2/\gamma^2$ for known $\sigma^2 = 1$ and $\gamma \neq 0$. A simulation study was attempted, but an immediate issue was uncovered in the process: unlike the Bayesian hierarchical setup, in which the parameters of interest $\beta$ have normal priors and so never take the value 0 exactly, the soft-thresholding used here does produce 0 estimates, and a greater number of 0 estimates leads to a higher $\gamma^{(k)}$, which further penalizes the estimates in subsequent iterations. There is thus a tendency for the estimate of $\gamma$ to shoot off to infinity once all $\beta$ estimates are 0, before the estimates actually converge. A sketch of the attempted update is given below.
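A minimal R sketch of this conjectured update, alternating equation 3.16 with coordinate descent; update_beta() and its inputs are the hypothetical helpers from the earlier sketches, and the starting values are illustrative:

```r
# Conjectured empirical-Bayes update for the LASSO tuning parameter gamma,
# using the plug-in relation E(tau_j^2 | y) = beta_j^2 / gamma^2 stated above.
gamma <- 1                              # illustrative starting value
beta  <- rep(0.1, nrow(ystar))          # illustrative warm start
for (k in 1:50) {
  # one coordinate-descent pass at the current gamma (alpha = 1: pure LASSO)
  beta <- update_beta(beta, ystar, xstar, nbrs, w_plus,
                      lambda = gamma, alpha = 1)
  tau2 <- beta^2 / gamma^2              # plug-in posterior expectations
  gamma_new <- 2 * length(beta) / sum(tau2)
  # zero estimates inflate gamma; an all-zero beta drives it to infinity
  if (!is.finite(gamma_new)) break
  if (abs(gamma_new - gamma) < 1e-6) break
  gamma <- gamma_new
}
```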
Further investigation is therefore necessary to find out whether such estimates lead to the selection of optimal tuning parameters.

In conclusion, the coordinate descent method for regularization has a very distinct computational benefit. However, these strategies from machine learning and statistics are yet to be explored when it comes to densely dependent data. With a specific application to fMRI studies, we are able to highlight the need for such methodologies, and we seek to theoretically justify the properties of the estimator obtained from the regularization method proposed in this chapter.

BIBLIOGRAPHY

Adler, R. J. (1981). The geometry of random fields. John Wiley & Sons, Chichester.

Agresti, A., & Kateri, M. (2011). Categorical data analysis. Springer.

Ashby, F. G. (2011). Statistical analysis of fMRI data. MIT Press.

Bachoc, F., & Furrer, R. (2016). On the smallest eigenvalues of covariance matrices of multivariate spatial processes. Stat.

Balan, R. M., Schiopu-Kratina, I., et al. (2005). Asymptotic results with generalized estimating equations for longitudinal data. The Annals of Statistics, 33(2), 522-541.

Banerjee, S., Carlin, B. P., & Gelfand, A. E. (2014). Hierarchical modeling and analysis for spatial data. CRC Press.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (Methodological), 192-236.

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 179-195.

Bickel, P. J., Li, B., Tsybakov, A. B., van de Geer, S. A., Yu, B., Valdés, T., ... van der Vaart, A. (2006). Regularization in statistics. Test, 15(2), 271-344. Retrieved from http://dx.doi.org/10.1007/BF02607055 doi: 10.1007/BF02607055

Biswal, B., Zerrin Yetkin, F., Haughton, V. M., & Hyde, J. S. (1995). Functional connectivity in the motor cortex of resting human brain using echo-planar MRI. Magnetic Resonance in Medicine, 34(4), 537-541.

Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9-25.

Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2313-2351.

Cox, R. W. (1996). AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29(3), 162-173.

Cressie, N. A. (1993). Statistics for spatial data.

Dass, S. C., Lim, C. Y., & Maiti, T. (2012). Default Bayesian analysis for multivariate generalized CAR models. Statistica Sinica, 231-248.

Dean, C., Ugarte, M., & Militino, A. (2004). Penalized quasi-likelihood with spatially correlated data. Computational Statistics and Data Analysis. Retrieved from http://www.sciencedirect.com/science/article/pii/S0167947302003249

Diggle, P. J., & Ribeiro, P. J. (2007). Model-based geostatistics. Springer Series in Statistics. Springer.

Diggle, P. J., Tawn, J., & Moyeed, R. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3), 299-350.

Donoho, D. L., Johnstone, I. M., et al. (1994). Ideal denoising in an orthonormal basis chosen from a library of bases. Comptes rendus de l'Académie des sciences. Série I, Mathématique, 319(12), 1317-1322.
In conclusion, the coordinate descent method for regularization has a very distinct computational benefit, but such strategies from machine learning and statistics are yet to be explored for dense dependent data. With a specific application to fMRI studies, we highlight the need for such methodologies and seek to theoretically justify the properties of the estimator obtained from the regularization method proposed in this chapter.

BIBLIOGRAPHY

Adler, R. J. (1981). The geometry of random fields. John Wiley & Sons, Chichester.
Agresti, A., & Kateri, M. (2011). Categorical data analysis. Springer.
Ashby, F. G. (2011). Statistical analysis of fmri data. MIT Press.
Bachoc, F., & Furrer, R. (2016). On the smallest eigenvalues of covariance matrices of multivariate spatial processes. Stat.
Balan, R. M., Schiopu-Kratina, I., et al. (2005). Asymptotic results with generalized estimating equations for longitudinal data. The Annals of Statistics, 33 (2), 522–541.
Banerjee, S., Carlin, B. P., & Gelfand, A. E. (2014). Hierarchical modeling and analysis for spatial data. CRC Press.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 192–236.
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 179–195.
Bickel, P. J., Li, B., Tsybakov, A. B., van de Geer, S. A., Yu, B., Valdés, T., … van der Vaart, A. (2006). Regularization in statistics. Test, 15 (2), 271–344. doi: 10.1007/BF02607055
Biswal, B., Zerrin Yetkin, F., Haughton, V. M., & Hyde, J. S. (1995). Functional connectivity in the motor cortex of resting human brain using echo-planar mri. Magnetic Resonance in Medicine, 34 (4), 537–541.
Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88 (421), 9–25.
Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2313–2351.
Cox, R. W. (1996). AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29 (3), 162–173.
Cressie, N. A. (1993). Statistics for spatial data. Wiley, New York.
Dass, S. C., Lim, C. Y., & Maiti, T. (2012). Default Bayesian analysis for multivariate generalized CAR models. Statistica Sinica, 231–248.
Dean, C., Ugarte, M., & Militino, A. (2004). Penalized quasi-likelihood with spatially correlated data. Computational Statistics and Data Analysis. Retrieved from http://www.sciencedirect.com/science/article/pii/S0167947302003249
Diggle, P. J., & Ribeiro, P. J. (2007). Model-based geostatistics. Springer Series in Statistics. Springer.
Diggle, P. J., Tawn, J., & Moyeed, R. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47 (3), 299–350.
Donoho, D. L., Johnstone, I. M., et al. (1994). Ideal denoising in an orthonormal basis chosen from a library of bases. Comptes rendus de l'Académie des sciences. Série I, Mathématique, 319 (12), 1317–1322.
Emrich, L. J., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate binary variates. The American Statistician, 45 (4), 302–304.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96 (456), 1348–1360.
Fan, J., & Li, R. (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association, 99 (467), 710–723.
Fan, Y., & Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75 (3), 531–552.
Feng, W., Sarkar, A., Lim, C. Y., & Maiti, T. (2016). Variable selection for binary spatial regression: Penalized quasi-likelihood approach. Biometrics, 72 (4), 1164–1172.
Fingleton, B. (1986). Analyzing cross-classified data with inherent spatial dependence. Geographical Analysis, 18 (1), 48–61.
Fitzmaurice, G. M. (1995). A caveat concerning independence estimating equations with multivariate binary data. Biometrics, 309–317.
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R., et al. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1 (2), 302–332.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33 (1), 1.
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J.-P., Frith, C. D., & Frackowiak, R. S. (1994). Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping, 2 (4), 189–210.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6), 721–741.
Genton, M. G. (2007). Separable approximations of space-time covariance matrices. Environmetrics, 18 (7), 681–695.
George, B., & Aban, I. (2015). Selecting a separable parametric spatiotemporal covariance structure for longitudinal imaging data. Statistics in Medicine, 34 (1), 145–161.
Givnish, T. J. (1981). Serotiny, geography, and fire in the pine barrens of New Jersey. Evolution, 101–123.
Gössl, C., Auer, D. P., & Fahrmeir, L. (2001). Bayesian spatiotemporal inference in functional magnetic resonance imaging. Biometrics, 57 (2), 554–562.
Gotway, C. A., & Stroup, W. W. (1997). A generalized linear model approach to spatial data analysis and prediction. Journal of Agricultural, Biological, and Environmental Statistics, 157–178.
Green, P. J. (1987). Penalized likelihood for general semi-parametric regression models. International Statistical Review / Revue Internationale de Statistique, 245–259.
Grosenick, L., Klingenberg, B., Katovich, K., Knutson, B., & Taylor, J. E. (2013). Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72, 304–321.
Guyon, X. (1995). Random fields on a network: Modeling, statistics, and applications. Springer Science & Business Media.
Hammersley, J. M., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.
Heagerty, P. J., & Lele, S. R. (1998). A composite likelihood approach to binary spatial data. Journal of the American Statistical Association, 93 (443), 1099–1111.
Henderson, J. M., Zhu, D. C., & Larson, C. L. (2011). Functions of parahippocampal place area and retrosplenial cortex in real-world scene analysis: An fmri study. Visual Cognition, 19 (7), 910–927.
Huang, H.-C., Hsu, N.-J., Theobald, D. M., & Breidt, F. J. (2010). Spatial lasso with applications to GIS model selection. Journal of Computational and Graphical Statistics, 19 (4), 963–983.
Huang, J., Ma, S., Li, H., & Zhang, C.-H. (2011). The sparse Laplacian shrinkage estimator for high-dimensional regression. The Annals of Statistics, 39 (4), 2021.
Hunter, D. R., & Li, R. (2005). Variable selection using MM algorithms. The Annals of Statistics, 33 (4), 1617.
Hurvich, C. M., & Tsai, C.-L. (1995). Model selection for extended quasi-likelihood models in small samples. Biometrics, 1077–1084.
Inouye, D. I., Yang, E., Allen, G. I., & Ravikumar, P. (2017). A review of multivariate distributions for count data derived from the Poisson distribution. Wiley Interdisciplinary Reviews: Computational Statistics, 9 (3).
Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31 (1), 253–258.
Johnson, B. A., Lin, D., & Zeng, D. (2008). Penalized estimating functions and variable selection in semiparametric regression models. Journal of the American Statistical Association, 103 (482), 672–680.
Kooij, A. J., et al. (2007). Prediction accuracy and stability of regression with optimal scaling transformations. Child & Family Studies and Data Theory (AGP-D), Department of Education and Child Studies, Faculty of Social and Behavioural Sciences, Leiden University.
Lee, K.-J., Jones, G. L., Caffo, B. S., & Bassett, S. S. (2014). Spatial Bayesian variable selection models on functional magnetic resonance imaging time-series data. Bayesian Analysis (Online), 9 (3), 699.
Li, C., & Li, H. (2010). Variable selection and regression analysis for graph-structured covariates with an application to genomics. The Annals of Applied Statistics, 4 (3), 1498.
Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 13–22.
Lin, P.-S. (2008). Estimating equations for spatially correlated data in multi-dimensional space. Biometrika, 847–858.
Lin, P.-S. (2010). A working estimating equation for spatial count data. Journal of Statistical Planning and Inference, 140 (9), 2470–2477.
Lin, P.-S., & Clayton, M. K. (2005). Analysis of binary spatial data by quasi-likelihood estimating equations. The Annals of Statistics, 33 (2), 542–555. doi: 10.1214/009053605000000057
Lindquist, M. A. (2008). The statistical analysis of fmri data. Statistical Science, 439–464.
Matheron, G. (1970). Random functions and their application in geology. In D. F. Merriam (Ed.), Geostatistics: A colloquium (pp. 79–87). Boston, MA: Springer US. doi: 10.1007/978-1-4615-7103-2_7
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (Vol. 37). CRC Press.
McShane, L. M., Albert, P. S., & Palmatier, M. A. (1997). A latent process regression model for spatially correlated count data. Biometrics, 698–706.
Meier, L., Van De Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (1), 53–71.
Musgrove, D. R., Hughes, J., & Eberly, L. E. (2016). Fast, fully Bayesian spatiotemporal inference for fmri data. Biostatistics, 17 (2), 291–303.
Ortega, J., & Rheinboldt, W. (1970). Iterative solution of nonlinear equations in several variables. Academic Press, New York.
Pan, W. (2001). Akaike's information criterion in generalized estimating equations. Biometrics, 57 (1), 120–125.
Pan, W. (2002). Goodness-of-fit tests for GEE with correlated binary data. Scandinavian Journal of Statistics, 29 (1), 101–110.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103 (482), 681–686.
Portnoy, S. (1985). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large; II. Normal approximation. The Annals of Statistics, 1403–1417.
Portnoy, S. (1988). Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. The Annals of Statistics, 16 (1), 356–366.
Ries, L. A., Harkins, D., Krapcho, M., Mariotto, A., Miller, B. A., Feuer, E. J., … others (2006). SEER cancer statistics review, 1975–2003.
Schacke, K. (2004). On the Kronecker product. Master's thesis, University of Waterloo.
Schaetzl, R. J. (1986). Soilscape analysis of contrasting glacial terrains in Wisconsin. Annals of the Association of American Geographers, 76 (3), 414–425.
Smith, M., & Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. Journal of the American Statistical Association, 102 (478), 417–431.
Strawderman, R. L., & Tsiatis, A. A. (1996). On consistency in parameter spaces of expanding dimension: An application of the inverse function theorem. Statistica Sinica, 917–923.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
Tseng, P. (1988). Technical Report LIDS-P-1840. Massachusetts Institute of Technology, Laboratory for Information and Decision Systems.
Wang, H., Li, G., & Tsai, C.-L. (2007). Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69 (1), 63–78.
Wang, L. (2011). GEE analysis of clustered binary data with diverging number of covariates. The Annals of Statistics, 39 (1), 389–417.
Wang, L., & Qu, A. (2009). Consistent model selection and data-driven smooth tests for longitudinal data in the estimating equations approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71 (1), 177–190.
Wang, L., Zhou, J., & Qu, A. (2012). Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics, 68 (2), 353–360.
Wedderburn, R. W. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61 (3), 439–447.
Wendland, H. (1995). Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Advances in Computational Mathematics, 4 (1), 389–396.
Worsley, K. J., Evans, A. C., Marrett, S., & Neelin, P. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow & Metabolism, 12 (6), 900–918.
Xie, M., Yang, Y., et al. (2003). Asymptotics for generalized estimating equations with large cluster sizes. The Annals of Statistics, 31 (1), 310–347.
Xue, L., Zou, H., & Cai, T. (2012). Nonconcave penalized composite conditional likelihood estimation of sparse Ising models. The Annals of Statistics, 1403–1429.
Yahav, I., & Shmueli, G. (2012). On generating multivariate Poisson data in management science applications. Applied Stochastic Models in Business and Industry, 28 (1), 91–102.
Zeger, S. L. (1988). A regression model for time series of counts. Biometrika, 75 (4), 621–629.
Zhang, C.-H., et al. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38 (2), 894–942.
Zhu, J., Huang, H.-C., & Reyes, P. E. (2010). On selection of spatial linear models for lattice data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72 (3), 389–402.
Zhu, X. (2011). Semi-supervised learning. In Encyclopedia of machine learning (pp. 892–897). Springer.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101 (476), 1418–1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2), 301–320.