ADVANCED CLASSIFICATION METHODS FOR LARGE SPATIAL-TEMPORAL DATA: APPLICATIONS TO NEUROIMAGING

By

Rejaul Karim

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics - Doctor of Philosophy

2019

ABSTRACT

ADVANCED CLASSIFICATION METHODS FOR LARGE SPATIAL-TEMPORAL DATA: APPLICATIONS TO NEUROIMAGING

By

Rejaul Karim

Spatial data are characterized by dependence between an observation indexed by a fixed point in space and its "neighbors". Exploiting such dependence leads to improvements in estimation and inference. Because such data are abundant in nature, existing methodologies are being extended to incorporate this proximal information. For example, when the latent model generating the data is spatially dependent, one would like to investigate how such dependence affects variable selection performance. This work is centered on a penalized estimating equation approach for a model with an expanding number (pn) of predictor variables, where the responses are generated from a Poisson model driven by a latent Gaussian model (a log-Gaussian Cox process). In the past this approach has been studied extensively in longitudinal data analysis. For Gaussian random fields with a conditional autoregressive (CAR) structure, we provide theoretical results for the estimator obtained from the penalized estimating equation. The oracle properties of the estimator, as described by Fan & Li (2001), are established.

Pattern detection in imaging data has led to a rise of classification methods that are effective in separating objects and structures in an image. This provides a major impetus in the context of medical imaging. Magnetic resonance images (MRI) collected in four dimensions (3D space and time) may be used to predict different disease phases of a particular patient. Linear discriminant analysis (LDA) is a classical tool for both dimension reduction and classification. However, in the high-dimensional setting where the number of features is much larger than the sample size, the within-class covariance estimate used by LDA is singular, rendering the classification rule unusable. Sparse discriminant methods have therefore been proposed to implement LDA in a high-dimensional setup. These methods do not incorporate dependencies in the feature covariance structure when the data acquired are spatially and temporally correlated. This article proposes a regularized high-dimensional LDA for spatio-temporal imaging data. Theoretically, we show that the proposed method achieves consistent parameter estimation and feature selection at an asymptotically optimal misclassification rate. An extensive simulation study shows a significant improvement in classification performance under spatial-temporal dependence. The method is applied to longitudinal structural MRI data obtained from the ADNI initiative.

The LDA classification rule is restrictive, since this paradigm is based on the strong assumption that the two classes are generated from Normal distributions with a common covariance function. In contrast, the support vector machine is considered a much more robust classifier due to its distribution-free approach. The tensor counterpart of the SVM, known as the support tensor machine, is widely used in the analysis of MRI images, which are tensors in their original format. The tensor structure preserves neighboring spatial information that is lost after vectorization.
In this work, we apply a memory-efficient random projection to tensor data as a dimension reduction method that preserves pairwise distances with high probability. Near-optimal classification consistency is shown, along with a few simulation studies.

To the memory of my family members who have motivated and supported me.

ACKNOWLEDGMENTS

I would like to sincerely thank my chair advisor, Professor Tapabrata (Taps) Maiti, for everything. Few words can do justice to his contribution to my career. His patience and support resuscitated my academic career from time to time. I have the utmost gratitude for Professor Chae Young Lim, who spent a great deal of effort proof-checking and guided me with numerous expert suggestions. Her smart ideas helped me steer through research bottlenecks. I am indebted to my committee member Professor Yimin Xiao, who introduced me to spatial statistics. His mathematical intuition still amazes me. I am grateful to Professor Arun Ross for his generosity in unconditionally agreeing to serve on my committee. He inspires me through his interesting applied approaches, which have revolutionized biometrics. A very special mention goes to my senior and collaborator Abdhi Sarkar, who helped me with her superb coding skills, especially regarding MRI data preprocessing and analysis. I would have been unable to complete this dissertation without her. I am also thankful to my colleague Peide Li (Peter), who responded immediately to distress calls. The third work of my thesis is largely based on an extension of his work. I am grateful to Yingjie Li, who has permitted me to use her material in my second work, which is an extension of her thesis. Lastly, I would like to thank my parents and my sister for their moral support.

TABLE OF CONTENTS

LIST OF TABLES
KEY TO SYMBOLS
Chapter 1  Analysis of Spatial Count Data: Penalized Estimating Equation Approach
    1.1 Introduction
    1.2 Literature review
    1.3 Model
    1.4 Estimation
        1.4.1 Transformation approach by Yasui & Lele (1997)
            1.4.1.1 High dimensional curse
    1.5 In search for transformation function
        1.5.1 Integral equation approach
            1.5.1.1 Charlier Polynomial
        1.5.2 Properties of marginal model
    1.6 Penalized quasi likelihood on Induced model
    1.7 GEE
        1.7.0.1 Short range dependence
    1.8 Penalty
        1.8.1 Smooth Clipped Absolute Deviation Penalty
    1.9 Assumptions
    1.10 Consistency
    1.11 CLT
    1.12 Simulation results
    APPENDIX
Chapter 2  High Dimensional Sparse-LDA for spatio-temporal Data
    2.1 Introduction
    2.2 Review of classical Linear discriminant Analysis (LDA)
    2.3 Spatio-temporal LDA
    2.4 Spatio-temporal LDA
        2.4.1 Spatio-temporal covariance
        2.4.2 Irregular lattice points in space and time
        2.4.3 Well separateness in space and time domain
    2.5 Regularity Conditions
        2.5.1 Asymptotic optimal misclassification rate
    2.6 Penalized Linear Discriminant Analysis (pLDA)
        2.6.1 Across sample independence
        2.6.2 REML estimation
        2.6.3 Validation of REML assumptions
            2.6.3.1 Tapered REML
        2.6.4 Regularity conditions for penalty
    2.7 Algorithm and methodology to obtain optimal solutions
        2.7.1 Asymptotic properties of one-step estimates
    2.8 Computational Complexity
        2.8.1 Covariance Tapering
        2.8.2 One way Tapering vs Two way Tapering
        2.8.3 Tapering range
    2.9 Misclassification Optimality
    2.10 MRI Data Preprocessing
    2.11 Simulation Studies
        2.11.1 Exponential Space Time Covariance (weak correlations)
            2.11.1.1 With ∆
        2.11.2 Exponential Space Time Covariance (strong correlations)
            2.11.2.1 With ∆
        2.11.3 Matern Space Covariance and exponential time (separable)
            2.11.3.1 With ∆
        2.11.4 Non-separable space and time Gneiting covariance
            2.11.4.1 With ∆
    APPENDIX
Chapter 3  Random Projection for Tensor data
    3.1 Introduction
    3.2 Kronecker factors
        3.2.1 Literature reviews
    3.3 Weak dependence
    3.4 Concentration Inequality
        3.4.1 Choice of d
        3.4.2 Memory efficiency
            3.4.2.1 Tensor type data
        3.4.3 Variance reduction through averaging
    3.5 Simulation result
    3.6 Future scope
    3.7 Introduction to Tensors
        3.7.1 Limitation for Gaussian assumptions
    3.8 Preliminaries
        3.8.1 Mathematical Background for Tensor
    3.9 Kernelized Support Tensor Machine
        3.9.1 Framework of the Classification Problem
        3.9.2 Support Tensor Machine
        3.9.3 STM with random projection
        3.9.4 Solving the STM
        3.9.5 Estimation with Complete Tensor Data
    3.10 Statistical Property of STM
    APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Bias of absolute value, Sample Standard Deviation, Empirical Coverage probability of β
Table 2.1: Exponential separable space and time covariance estimation when ∆ is the mean vector
Table 2.2: Estimation and selection of ∆ using the estimated covariance
Table 2.3: Estimation and selection of ∆ under independence
Table 2.4: Exponential separable space and time covariance estimation when ∆small is the mean vector
Table 2.5: Estimation and selection of ∆small using the estimated covariance
Table 2.6: Estimation and selection of ∆small under independence
Table 2.7: Exponential separable space and time covariance estimation when ∆ is the mean vector
Table 2.8: Estimation and selection of ∆ using the estimated covariance
Table 2.9: Estimation and selection of ∆ under independence
Table 2.10: Exponential separable space and time covariance estimation when ∆small is the mean vector
Table 2.11: Estimation and selection of ∆small using the estimated covariance
Table 2.12: Estimation and selection of ∆small under independence
Table 2.13: Matern covariance with separable exponential time when ∆ is the mean vector
Table 2.14: Estimation and selection of ∆ using the estimated covariance
Table 2.15: Estimation and selection of ∆ under independence
Table 2.16: Gneiting non-separable covariance when ∆ is the mean vector
Table 2.17: Estimation and selection of ∆ using the estimated covariance
Table 3.1: Average total deviation from 1 of the ratios of pairwise distances between projected and actual data (variability)

KEY TO SYMBOLS

1. ≤ denotes the positive semidefinite ordering, i.e., we write A ≤ B if the matrix B − A is positive semidefinite.
2. | · | denotes the cardinality of a set.
3. ∇_β K is the gradient of a vector K with respect to β.
4. ◦ signifies the Schur or Hadamard product of two matrices.
5. ⊗ signifies the Kronecker product of two matrices.
6. μ̈ denotes the second derivative of the function μ; similarly, the triple-dot notation denotes the third derivative, and so on.
7. := signifies assignment and is read as "denoted by".
8. Tr A denotes the trace of a matrix A.
9. LHS stands for the Left Hand Side of an equation.
10. ∼ denotes neighbors, i.e., u ∼ v implies u and v are adjacent voxels.
11. ⟨·, ·⟩ signifies the inner product of two vectors.
12. sign(a) signifies −1 or +1 depending on whether a < 0 or a > 0, respectively.
13. I(·) denotes the indicator operator.
14. TP (True Positives) represents the average number of correctly detected nonzero coefficients.
15. FP (False Positives) represents the average number of incorrectly detected nonzero coefficients.
16. CP gives the average empirical Coverage Probability of the 95% confidence intervals.
17. OSE stands for One-Step Estimation.

Chapter 1
Analysis of Spatial Count Data: Penalized Estimating Equation Approach

1.1 Introduction

Geographical factors play a significant role in epidemiology. Poisson regression is popularly used for the analysis of disease rates, plant growth, and similar counts; it assumes that the rates in nearby regions are independent and that the variance of the response equals its mean (Yasui & Lele, 1997). Hierarchical models have been proposed to utilize spatial locations and neighborhood structure in the analysis of disease rates. Under a hierarchical model the marginal likelihood becomes intractable, and this issue can be addressed using estimating equations.

1.2 Literature review

Mardia and Marshall (stationary setup, CAR, low-dimensional estimation) produced consistent estimators in the Gaussian CAR model using likelihood methods. The likelihood, however, is not tractable for the Poisson log-normal distribution. The mean field method cannot be used here, since it assumes the latent variables are independent. Following the method of Yasui & Lele (1997), we need the exact form of a function Y* that is an unbiased estimator of log Y in order to prove theoretical results. One way to achieve this is to solve a Fredholm equation of the first kind; however, this equation is not solvable, since log Y does not have a finite expectation when Y ∼ Poisson(λ). Liang and Zeger (1986) used the idea of estimating equations for longitudinal data. Wang L. (2011) extended this to a diverging number of covariates. Wang L., Zhou J., and Qu A. (2011) used penalized estimating equations for high-dimensional inference. CLS-type estimates, MM estimates, and the estimating equation approach are used here. The central limit theorem for these statistics can be obtained from Peligrad's results on ρ-mixing. A correlation upper bound is obtained for banded precision matrices; for the general CAR model the problem remains open.
Some progress can be made using the partitioned matrix inverse

(Σ^{−1})_{ii} = ( Σ_{ii} − Σ_{i,−i} (Σ_{−i,−i})^{−1} Σ_{−i,i} )^{−1}.

Lin and Clayton (2005) extended the use of estimating equations to spatial binary data, referring to the work of Zeger (1988) on time series of counts. These works show that the estimating equation approach can be used for spatially dependent data under appropriate mixing conditions. These papers estimated the regression parameters from a quasi-likelihood score function. The asymptotic covariance of such estimators depends on unknown nuisance parameters, such as the scale and correlation parameters (σ², γ). Lele (1991) used jackknifing tools to obtain reduced-bias estimators of the nuisance parameters. Heagerty and Lumley (2000) proposed nonparametric estimation of the covariance matrix using subsampling windows for time series and general lattice data. Prentice (1988) provided a consistent estimate of the scale and correlation parameters using a second quasi-likelihood score function; there, the precision matrix in the quasi-score requires knowledge of third and fourth moments, which is replaced by a "working" precision matrix. This work generalizes the idea of joint optimal estimation of (β_n, σ², γ) of Godambe and Thompson (1989) for independent data when the skewness and kurtosis of the distribution are known in advance.

1.3 Model

For county i ∈ {1, 2, .., n}, let
Y_i = observed disease cases in county i,
E_i = expected disease cases in county i (the E_i are known),
Ψ_i = logarithm of the ratio of the disease rate in county i to some reference rate.

Y_i | Ψ_i ~ind Poisson(e^{Ψ_i} E_i),
Ψ_{n×1} ~ N( X_{n×p_n} β_{p_n×1}, σ² V_n(γ) ),
V_n(γ) = (I − γ M_n W_n)^{−1} M_n,   γ ∈ (−1, 1),  σ² ∈ (0, ∞).

Define the n×n adjacency matrix W by W_ij = 1 if i and j are neighbours and W_ij = 0 otherwise, and let W_{i+} = Σ_{j∈N(i)} W_ij and M_{n×n} = Diag(1/W_{i+}).

The matrix W_n can be interpreted as the adjacency matrix of a graph whose vertex set is the index set {1, 2, ..., n} of the random variables; (W_n^r)_{i,j} is the number of paths of length r from vertex i to vertex j. W_n is irreducible if it is the adjacency matrix of a connected graph; mathematically, (W_n^r)_{i,j} > 0 for some r and all i, j ∈ {1, 2, .., n}. If W_n is reducible, then it can be arranged as a block diagonal matrix of irreducible matrices,

W = Diag( W_(1), W_(2), ..., W_(c) ),

where each block W_(j) is the irreducible adjacency matrix of the graph restricted to the j-th connected component; the block diagonal structure of W_n reflects the isolated connected components of the corresponding graph. The corresponding V_{1,n} = (I − γ M_n W_n)^{−1} M_n is then also block diagonal. Without loss of generality we prove all results assuming W_n is irreducible, since they extend to block diagonal W_n in the same way.

1.4 Estimation

The objective is to estimate β, σ², and γ from the observations {Y_i}. The most straightforward method would be maximum likelihood estimation (MLE), but the likelihood is intractable, so Yasui & Lele (1997) used a transformation followed by MLE.

1.4.1 Transformation approach by Yasui & Lele (1997)

Define
Y* = f_1(Y) such that E(Y* | Ψ) = Ψ + b*(Ψ),
Y** = f_2(Y) such that E(Y** | Ψ) = Ψ² + b**(Ψ),
with b*(·) and b**(·) satisfying some regularity conditions.
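Before writing down the estimating equations, the following minimal numpy sketch (added here for illustration; it is not code from the thesis) simulates the hierarchical model of Section 1.3 on a ring lattice and checks empirically that a natural candidate transformation, Y* = log(Y + 1/2), recovers Ψ only up to a bias b*(Ψ). The ring adjacency, the sample sizes, and the specific choice of f_1 are assumptions made purely for this illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes (not the thesis settings): n counties on a ring, p covariates.
    n, p = 200, 5
    gamma, sigma2 = 0.6, 0.5

    # Adjacency W: ring graph, so each county has two neighbours; M = Diag(1 / w_{i+}).
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0
    M = np.diag(1.0 / W.sum(axis=1))

    # Latent CAR field: Psi ~ N(X beta, sigma2 * (I - gamma M W)^{-1} M).
    X = rng.normal(size=(n, p))
    beta = np.concatenate([[1.0, -0.5], np.zeros(p - 2)])      # sparse truth
    V1 = sigma2 * np.linalg.inv(np.eye(n) - gamma * M @ W) @ M
    V1 = (V1 + V1.T) / 2.0                                      # symmetrize numerically
    Psi = rng.multivariate_normal(X @ beta, V1)

    # Observed counts: Y_i | Psi_i ~ Poisson(E_i * exp(Psi_i)), with E_i = 1.
    Y = rng.poisson(np.exp(Psi))

    # Candidate transformation Y* = log(Y + 1/2): E(Y*|Psi) = Psi + b*(Psi),
    # and the bias b* is clearly visible at small intensities.
    Ystar = np.log(Y + 0.5)
    print("average of Y* - Psi over counties:", np.mean(Ystar - Psi))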
With q_{λ_n} denoting the SCAD penalty function with parameter λ_n, the following set of estimating equations is solved:

U_n(β_n) = (1/n) X' [σ² V_{3,n}(γ)]^{−1} (Y* − Xβ) − q_{λ_n}(β_n),   where S_n = (1/n) X' [σ² V_{3,n}(γ)]^{−1} (Y* − Xβ) is the unpenalized score,
F_{1,n}(γ) = (Y* − Xβ)' M_n W_n (I − γ M_n W_n)(Y* − Xβ) = 0,
F_{2,n}(σ²) = (Y* − Xβ)' (M_n^{−1} − γ W_n)(Y* − Xβ) − nσ² = 0.

1.4.1.1 High dimensional curse

These equations are derived assuming asymptotic unbiasedness of Y* for large values of Ψ and a plug-in method-of-moments step, i.e., Y* is used in place of Ψ in the estimating equations. But when p_n exceeds n, the above method breaks down due to the accumulation of bias of order p_n/n, as shown in the appendix.

1.5 In search for transformation function

Here we considered two approaches to obtain a function that can serve as an unbiased estimate of Xβ. Using these two methods we could only approximate the target function, and in high dimensions such approximations fail badly.

1.5.1 Integral equation approach

We want to solve for a function Y* = f(Y) such that E(Y* | Ψ) = Ψ, or equivalently

Σ_{i=0}^∞ e^{−λ} (λ^i / i!) f(i) = log λ   for all λ ∈ (0, ∞).

This is a Fredholm integral equation of the first kind, of the form Kf = g, where K is the integral (expectation) operator of a Poisson(λ) random variable with kernel K(λ, i) = λ^i/i!, g is the known data function, here g(λ) = log λ, and f is the solution function to be solved for. A natural heuristic is to think of K as a matrix whose rows are indexed by λ and whose columns are indexed by i ∈ {0, 1, ..., ∞}, with f an unknown vector and g a vector depending on λ. Assuming all regularity conditions for the Poisson kernel hold, so that K admits a singular value decomposition (SVD), the equation could in principle be solved. But the data function g(λ) = log(λ) ∉ L¹ with respect to Lebesgue measure on (R, B). Hence log(λ) cannot be expanded with respect to an orthonormal basis, and therefore the equation admits no series solution.

1.5.1.1 Charlier Polynomial

Charlier polynomials {C_m}_{m=0}^∞ are a family of polynomials orthogonal with respect to Poisson weights:

Σ_{i=0}^∞ (λ^i / i!) C_m(λ, i) C_n(λ, i) = λ^{−n} e^{λ} n! δ_{mn},   λ > 0.

Suppose log(λ) = Σ_{n=0}^∞ a_n λ^{−n} n! for some constants a_n; then the function f(y) could be approximated analytically. However, log(λ) does not admit such an expansion on (0, +∞), so the solution f cannot be found using Charlier polynomials. Many numerical approximations are nonetheless possible, using techniques such as quadrature.

1.5.2 Properties of marginal model

Finally, we turned to work that gives the basic properties of the marginal distribution of this hierarchical model, also known as the Poisson log-normal model. We derive the first few moments needed in order to solve the estimating equation. The induced model is a multivariate Poisson log-normal distribution, PΛ(Xβ, σ²V_{1,n}(γ)), with moments provided in the appendix.

1.6 Penalized quasi likelihood on Induced model

Basawa discussed likelihood estimation of the parameter of interest under a mixture model in which the nuisance parameter has a prior. He showed that, under an exponential family setup, conditional likelihood estimation is asymptotically as efficient as estimation under the mixture distribution. However, our problem does not fit this paradigm, since here the prior is placed on the latent random variable itself. Extending the idea of conditional least squares, it can be shown that there are two cases for estimation. In Case 1 we can estimate all parameters β, γ, σ² jointly by differentiating the same pseudo-likelihood with respect to the different parameters.
In Case 2 we can use two different pseudo-likelihoods, namely

h_i = ( Y_i − E(Y_i | F_{−i}) )²   and   h'_i = ( ( Y_i − E(Y_i | F_{−i}) )² − Var(Y_i | F_{−i}) )²,

for i = 1, ..., n, each multiplied by appropriate weights for optimality. Here F_{−i} = σ(Y_1, ..., Y_{i−1}, Y_{i+1}, ..., Y_n) is the sigma field generated by all samples other than Y_i. In our setup the conditional expectation E(Y_i | F_{−i}) is intractable, so we work with the simplified case: estimate all parameters through h_i = (Y_i − μ_i(β_n, σ², γ))², then multiply by appropriate weights for optimality.

1.7 GEE

Since we have the first few moments of the marginal distribution, we can formulate the estimating equation. Let X = (X_1, X_2, ..., X_n)' be the coefficient matrix for β, so that X_i' is the i-th row of X. Then

μ_i(β_n, σ², γ) = E(Y_i) = exp(−X_i'β_n + σ² V_{1,n}(γ)_{ii}/2) = θ_{n,i} R_{ii},

V_n(σ², γ, β_n) = Diag(μ_i) + Diag(μ_i) [ R^{◦2}(σ², γ) − J_n ] Diag(μ_i),

where exp(−X_i'β_n) = θ_{n,i}, exp(σ² V_{1,n}(γ)_{ii}/2) = R_{ii},

R^{◦2}(σ², γ) − J_n = ( exp(σ² V_{1,n i,j}) − 1 )_{i,j},

R^{◦2} denotes the Hadamard product R ◦ R, and J_n = 1_n 1_n'.

The equations are derived from stationarity of the KKT conditions. The following score functions arise when we treat β, γ, and σ² as mean parameters under the constraints:
• a sparsity-inducing penalty on β,
• σ² ≥ 0,
• |γ| ≤ 1.

U_n(β_n, σ², γ) = −(1/n) (∂μ/∂β)' [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + q_{λ_n}(|β_n|) ◦ sgn(β_n)
               = −(1/n) X' Diag(μ_i) [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + q_{λ_n}(|β_n|) ◦ sgn(β_n) = 0,   λ_n ≥ 0,

where the first term is denoted S_n and q_{λ_n} is the derivative of the SCAD penalty with respect to β with penalty parameter λ_n.

F'_{3,n}(γ, β_n, σ²) = (1/n) (∂μ/∂γ)' [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + v sgn(γ)
                    = (σ²/2n) μ' Diag( V_{1,n}^{−1} W_n V_{1,n}^{−1}(γ) ) [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + v sgn(γ) = 0,   v ≥ 0,

F'_{2,n}(σ², γ, β_n) = (1/n) (∂μ/∂σ²)' [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) − u
                    = (1/2n) μ' Diag( V_{1,n}(γ) ) [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) − u = 0,   u ≥ 0.

1.7.0.1 Short range dependence

We need a short range dependence property, which guarantees variable selection consistency: the variance-covariance matrix V_n(σ², γ, β_n) of Y defined above must have row (and column) sums bounded from above and below. Here the entries of V_n are obtained from those of V_{1,n} through an entrywise (Hadamard) function f : R → R.

Lemma 1.7.1. ||V_n||_∞ ≥ sup_i ( μ_i + μ_i² σ² / (w_{i+}(1 + |γ|)) ).

Lemma 1.7.2. ||V_n||_∞ ≤ sup_i ( μ_i + μ_i² · (exp(σ² V_{1,n i,i}) − 1)/(σ² V_{1,n i,i}) · σ²/(w_{i+}(1 − |γ|)) ).

Since ||V_n||_∞ < ∞ for all n, it follows that Σ_{j=1}^∞ Cov(Y_i, Y_j) < ∞, which represents short range dependence.
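As a concrete illustration of the quantities defined in Section 1.7, the sketch below assembles μ, the working covariance V_n, and the unpenalized quasi-score S_n in numpy. It is only a sketch under stated assumptions, not code from the thesis: the function name and its inputs are hypothetical, the sign convention for μ follows the text as printed, and the penalty term is omitted.

    import numpy as np

    def quasi_score(beta, sigma2, gamma, X, Y, W):
        # V_{1,n}(gamma) = (I - gamma M W)^{-1} M, with M = Diag(1 / w_{i+}).
        n = len(Y)
        M = np.diag(1.0 / W.sum(axis=1))
        V1n = np.linalg.inv(np.eye(n) - gamma * M @ W) @ M
        # mu_i = exp(-X_i' beta + sigma2 * V_{1,n,ii} / 2), sign as printed in Section 1.7.
        mu = np.exp(-X @ beta + sigma2 * np.diag(V1n) / 2.0)
        # Working covariance V_n = Diag(mu) + Diag(mu) [exp(sigma2 V_{1,n}) - 1] Diag(mu).
        Vn = np.diag(mu) + np.diag(mu) @ (np.exp(sigma2 * V1n) - 1.0) @ np.diag(mu)
        # Unpenalized quasi-score S_n = (1/n) (d mu / d beta)' V_n^{-1} (Y - mu),
        # where d mu / d beta = -Diag(mu) X under this parametrization.
        dmu = -X * mu[:, None]
        return dmu.T @ np.linalg.solve(Vn, Y - mu) / n

With simulated (X, Y, W) such as those in the earlier sketch, quasi_score(beta, sigma2, gamma, X, Y, W) returns a length-p_n vector whose components should fluctuate around zero when evaluated at the true parameters.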
1.8 Penalty

To prove consistency, the penalized score function needs a Taylor expansion, so an unbiased, sufficiently smooth penalty function is desirable in the sense of Heyde's constrained estimating equations.

1.8.1 Smooth Clipped Absolute Deviation Penalty

A nonconcave penalty with oracle properties, due to Fan and Li (2001), with derivative

q'_λ(|θ|) = λ [ I(|θ| < λ) + (aλ − |θ|)_+ / ((a − 1)λ) · I(|θ| ≥ λ) ],   a > 2;

for computation, a = 3.7 is used. An alternative penalty is the MCP,

q'_λ(|θ|) = λ ( 1 − |θ|/(aλ) )_+,   a > 0,

where the value of a is chosen by cross-validation. Here S_n(β̂_n) is the unpenalized score; we add the penalty term and prove an asymptotic approximate zero-crossing, i.e., U_n = o_p(a_n) with a_n → 0, which is sufficient since U_n may be discontinuous due to the penalty.

1.9 Assumptions

1. E_i = 1 for all i.
2. The true parameter is β_n' (1×s_n) = ( β_n' (1×p_n), 0 (1×(s_n − p_n)) ).
3. The true parameter σ² is bounded, and |γ| ≤ 1.
4. The true parameter β_n lies in the interior of a compact subset B ⊂ R^{p_n}.
5. X is bounded elementwise.
6. 0 < λ_min( (1/n) Σ_{i=1}^n X_i X_i' ) ≤ λ_max( (1/n) Σ_{i=1}^n X_i X_i' ) < ∞.
7. min_{1≤j≤s_n} β_{n0j}/λ_n → ∞. If p_n = o(n^α), then λ_n = O(n^{−η}) with 0 < α < 4/3 and 0 < η < 2 − α.
8. max_i w_{i+,n} = k < ∞.
9. Σ_{i=1}^{s_n} |β_{i,n}| < ∞.
10. min_{1≤i≤p_n} β_{n,0}(i)/λ → ∞.
11. p_n³/n = o(1).
12. λ_n → 0.
13. p_n² (log n)⁴ = o(n λ_n²).

1.10 Consistency

Theorem 1.10.1. There exists an approximate GEE solution at the n-th step, β̂_{n,n}' = (β̂_{n1,n}', β̂_{n2,n}'), such that

P( |U_{nj}(β̂)| = 0, j = 1, 2, ..., p_n ) → 1,
P( |U_{nj}(β̂)| ≤ λ_n/log n, j = p_n + 1, p_n + 2, ..., s_n ) → 1,
P( β̂_{n2} = 0 ) → 1.

Remark 1. The approximate nature of the solution is conveyed through the second item, which says that elements p_n + 1 through s_n of the score equation evaluated at β̂_{n,n} are not exactly 0 but are bounded by λ_n/log n, so by choosing λ_n carefully we can attain a good approximation.

1.11 CLT

Define Ξ_n = (∂μ'/∂β_n) V_n^{−1} (∂μ/∂β_n), and define the sandwich variance estimator

Ĥ_n = Ξ̂_n^{−1} [ (∂μ'/∂β_n) V̂_n^{−1} ( Y − μ̂(β_n) )( Y − μ̂(β_n) )' V̂_n^{−1} (∂μ/∂β_n) ] Ξ̂_n^{−1}.

Then, for all α_n ∈ R^{p_n} with ||α_n|| = 1,

α_n' Ĥ_n(β_{n,0})^{−1/2} ( β̂_n − β_{n,0} ) → N(0, 1).
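Before turning to the simulation results, here is a minimal sketch (an illustration added here, not code from the thesis) of the SCAD and MCP penalty derivatives of Section 1.8.1 that enter the penalized score as q_{λ_n}; a = 3.7 for SCAD follows the text, while a = 3.0 for MCP is only a placeholder default.

    import numpy as np

    def scad_derivative(theta, lam, a=3.7):
        # q'_lambda(|theta|) = lam [ I(|theta| < lam) + (a lam - |theta|)_+ / ((a-1) lam) I(|theta| >= lam) ]
        t = np.abs(theta)
        return lam * ((t < lam) + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t >= lam))

    def mcp_derivative(theta, lam, a=3.0):
        # q'_lambda(|theta|) = lam (1 - |theta| / (a lam))_+
        t = np.abs(theta)
        return lam * np.maximum(1.0 - t / (a * lam), 0.0)

    # The derivative equals lam near zero (promoting sparsity) and vanishes for large
    # coefficients (limiting bias), e.g. scad_derivative(np.array([0.01, 5.0]), lam=0.1).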
1.12 Simulation results

We simulate using X = I and a maximal number of neighbors k = 10. We record the standard errors, and set p = n^1.5 and s = n^0.8. Table 1.1 reports, for each setting, the triple (bias in absolute value, sample standard deviation, empirical coverage probability) of β.

Table 1.1: Bias of absolute value, Sample Standard Deviation, Empirical Coverage probability of β

n = 400, s_n = 30, p_n = 500
 σ²      γ = 0.05              γ = 0.75              γ = 0.9
 0.1     (0.28, 0.81, 0.85)    (0.24, 0.74, 0.84)    (0.21, 0.70, 0.87)
 10      (0.27, 0.73, 0.76)    (0.21, 0.73, 0.85)    (0.22, 0.68, 0.86)
 100     (0.38, 0.85, 0.60)    (0.35, 0.82, 0.72)    (0.32, 0.83, 0.85)

n = 1000, s_n = 30, p_n = 2000
 σ²      γ = 0.05              γ = 0.75              γ = 0.9
 0.1     (0.22, 0.49, 0.91)    (0.21, 0.27, 0.91)    (0.21, 0.28, 0.93)
 10      (0.25, 0.57, 0.89)    (0.23, 0.59, 0.91)    (34.90, 0.91)
 100     (0.31, 0.61, 0.83)    (0.23, 0.61, 0.84)    (0.30, 0.58, 0.84)

APPENDIX

Variance Components

V_{3,n}(β_n, σ²_n, γ_n) = Var(Y*) = Var_Ψ( E_{Y|Ψ}(Y*) ) + E_Ψ( Var_{Y|Ψ}(Y*) ),

where the first term is V_{1,n} and the second is V_{2,n}, with

V_{1,n} = σ² (I − γ M_n W_n)^{−1} M_n,
V_{2,n} = Diag[ (1/E_i) exp(−X_i'β + σ² V_{1,n}(γ)_{ii}/2) + E_Ψ( b*'(Ψ_i)² exp(Ψ_i) + b*(Ψ_i)² ) ].

(The CRLB is attained for the exponential family.)

Var_{Y|Ψ}(Y*) = ( ∂(Ψ + bias(Ψ)) / ∂ exp(Ψ) )² / I( exp(Ψ) ) = exp(−Ψ/E) [1 + b*'(Ψ)]².

By the moment generating function of the Normal distribution,

E_Ψ( exp(−Ψ_i/E_i) ) = exp(−X_i'β + σ² V_{1,n}(γ)_{ii}/2) / E_i.

Assuming negligible bias and a derivative of the bias of order 1/n,

(V_{2,n})_{ii} = E_Ψ Var_{Y|Ψ}(Y*_i) + E_Ψ b*(Ψ_i)²,

which we approximate to get

V_{2,n} = Diag( (1/E_i) exp(−X_i'β + σ² V_{1,n}(γ)_{ii}/2) ).

Inequalities using the function Y*

Proof. We write e^Ψ = λ from here onwards. Since log(x + 0.5) is a concave function, Jensen's inequality gives E log(Y + 0.5) ≤ log(E Y + 0.5) = log(λ + 0.5), using Y | λ ~ind Poisson(λ). By the Stein-Chen identity, E(Y g(Y)) = λ E(g(Y + 1)) whenever E|Y g(Y)| and E|g(Y + 1)| exist. Hence the Taylor approximation is valid, and central moments can be derived using Stirling numbers.
From Log Sovolev inequality in Poisson Measure for any function f : R → (0,∞) (cid:20) f (Y )log(cid:0)f (Y )(cid:1) − E[f (Y )] log(cid:0) E[f (Y )](cid:1)(cid:21) Eλ ≤ λ E (cid:18)(cid:0) f (Y + 1) − f (Y )(cid:1)2 (cid:19) f (Y ) Here E denotes Eλ Using f (Y ) = Y + 1 2 we obtain using Log sovolev inequality (cid:20) E (Y + ≤ λ E ≤ λ E E + 1 2 1 2 1 2 1 2 1 2 λ E ) log(cid:0)Y + (cid:18) 1 (cid:19) (cid:20) (cid:18) 1 (cid:19) (cid:20) log(cid:0)Y + Y + 1 2 Y + 1 2 (cid:1) − (λ + log(cid:0)Y + (cid:20) (cid:1)(cid:21) ) log(cid:0) λ + + 1(cid:1) − log(cid:0) λ + (cid:18) 1 (cid:19) (cid:1) + log(cid:0)Y + (cid:18) 1 (cid:1)(cid:21) (cid:1) − log(cid:0) λ + ≤ λ E Y + 1 2 λ E 1 2 1 2 1 2 1 2 1 2 Y + 1 2 (cid:20) E 1 2 (cid:1) − log(cid:0) λ + log(cid:0)Y + (cid:1)(cid:21) (cid:19)2 − log(cid:0) λ + 1 2 1 2 1 2 1 Y + 1 2 + b (cid:1)(cid:21) 1 2 −, (cid:19) + (cid:18) (cid:1)(cid:21) (cid:4) 17 Moments of Log Normal Poisson E(Y i) = exp (X (cid:124) i β + σ2V 1,n(γ)ii/2) = µi (cid:0)exp(σ2V 1,n(γ)ii) − 1(cid:1) (cid:0)exp(σ2V 1,n(γ)ij) − 1(cid:1) V ar(Y i) = µi + µ2 i Cov(Y i, Y j) = µiµj Corr(Yi, Yj) = exp(σ2V 1,n(γ)ij − 1 [exp(σ2V 1,n(γ)ii) − 1 + µi] 1 2 [exp(σ2V 1,n(γ)jj) − 1 + µj] 1 2 (cid:18) Y i − µi(β) E (cid:19)3 = exp(3X +2 exp(3X (cid:124) i β + 9σ2V 1,n(γ)ii/2) − 3 exp(3X (cid:124) i β + 3σ2V 1,n(γ)ii/2) − 3 exp(2X (cid:124) i β + 2σ2V 1,n(γ)ii/2) + exp(1X (cid:124) i β + 5σ2V 1,n(γ)ii/2) (cid:124) i β + 4σ2V 1,n(γ)ii/2) (cid:124) i β + 1σ2V 1,n(γ)ii/2) −3 exp(2X E(Y i − E(Y i))4 = exp(4X (cid:124) i β + 16σ2V 1,n(γ)ii/2) − 4 exp(4X (cid:124) i β + 6σ2V 1,n(γ)ii/2) − 3 exp(4X (cid:124) i β + 9σ2V 1,n(γ)ii/2) − 12 exp(3X (cid:124) i β + 3σ2V 1,n(γ)ii/2) + 7 exp(2X (cid:124) i β + 2σ2V 1,n(γ)ii/2) + exp(1X (cid:124) i β + 10σ2V 1,n(γ)ii/2) (cid:124) i β + 4σ2V 1,n(γ)ii/2) (cid:124) i β + 5σ2V 1,n(γ)ii/2) (cid:124) i β + 4σ2V 1,n(γ)ii/2) (cid:124) i β + 1σ2V 1,n(γ)ii/2) +6 exp(4X +6 exp(3X +6 exp(3X −4 exp(2X Lemma 1.12.1. Corr(Yi, Yj) ≤ Corr(Ψi, Ψj) Proof. Standardize ( Ψi σ2V 1,n(γ)ii , Ψj σ2V 1,n(γ)jj ) Let resulting marginal model be (Zi , Zj) 18 Standardize ( Yi V ar(Yi) , Yj V ar(Yj ) ) Use the inequality Corr(Yi, Yj) = b > 0 , c > 0 (cid:0)ea−1+b(cid:1)0.5(cid:0)ea−1+c(cid:1)0.5 < a = Corr(Zi, Zj) for |a| < 1 and ea−1 (cid:4) Lemma 1.12.2. Log normal distribution is dependent on first two moments of the latent model Proof. E(eit (cid:124)Y ) = EΨEY |Ψ(eit (cid:80)n j=1 Ψj (eitj−1) = exp EΨ e (cid:0) n(cid:89) j=1 j Y )(cid:1) = EΨ (cid:124) n(cid:89) j=1 EY j|Ψj (eit eΨj (eitj−1) (cid:124)Xβ + (cid:124) σ2 (cid:94) (eit − 1) V 1,n(γ) (cid:94) (eit − 1) 2 (cid:124)Y ) = EΨ (cid:16) (eit − 1) (cid:17) (cid:4) (cid:94) (eit − 1) denotes vector here (cid:94) (eit − 1)j = (eitj − 1) Relationship between induced and latent Model co variance (cid:18) 1 (cid:18) 1 (cid:18) 1 Ei Ei (V 2,n)ii = ≤ ≥ exp (−X iβ + σ2V 1,n(γ)ii/2) exp (−X iβ + σ2 2(1 − |γ|)wi+ ) 1 (cid:19) exp (−X iβ + σ2 1 2wi+ ) Ei (cid:19) (cid:19) Range of γ For the matrix V 1,n(γ) to be positive definite matrix , it is sufficient for γ ∈ (−1, 1) Since iff |γ| ≤ 1 V −1 1,n(γ) = M−1 n − γWn 19 Define strictly diagonal dominant matrix ∆(A)i = |aii| −(cid:88) i(cid:54)=j |aij| ≥ 0for all i Proof. Symmetric strictly diagonal dominant matrix with positive diagonal elements is positive definite from Varah (1975) . 
V −1 1,n(γ) = M−1 n − γWn satisfies this condition for all n iff |γ| ≤ 1 (cid:4) Useful bounds about V 1,n Define spectral norm (cid:107)A(cid:107)∗ = max i |λi(A)| Form the theory of diagonal dominant matrices we conclude following bounds. When γ > 0 V −1 1,n is M Matrix. We can exploit various results on M Matrix and extend them to the case γ < 0 as ∆ and (cid:107)(cid:107)∞ operator operates on absolute value of matrix entries. (1 − |γ|) min wi+ 1 i 1 2 3 (cid:18) ≤ (cid:107) I − γ M nW n (cid:19)−1 Mn(cid:107)∗ ≤ wi+ 1 (1 − |γ|) max i ≤ V 1,n(γ)ii ≤ 1 wi+ 1 (1 − |γ|)wi+ |V 1,n(γ)ij| ≤ |γ||V 1,n(γ)ii| 20 V 1,n(γ)ii ≥ 1 V −1 1,n(γ)ii (cid:18) Structure of V 1,n V 1,n = I − γ M nW n (cid:19)−1 Mn can be expanded by(cid:80)∞ m=0 γm (M nW n)m Mn since (cid:107)(MnWn)(cid:107)∗ < 1 where (cid:107)(cid:107)∗ represents spectral norm or maximum of absolute eigenvalue. Lemma 1.12.3. It can be proved by induction that (MnWn)r n i,j ≤ maxi 1 wi+ for all ≥ 1 and all n Proof. Note that (MnWn)i,m = wi,m wi+ Diag ( 1 Wi+ )ii where wi,m = wm,i is 0 or 1(cid:80) j∈N (i) wij = wi+. Let M n×n = (MnWn)r n i,m = n(cid:88) j=1 (MnWn)1 n i,m = wi,m wi+ ≤ maxi 1 wi+ (MnWn)1 i,j (MnWn)r−1 (MnWn)i,j maxi ≤ n(cid:88) j=0 1 wi+ wi,j wi+ 1 ≤ maxi 1 wi+ (cid:4) n − γWn)−1 = V 1,n are positive. The row sum can be written as Lemma 1.12.4. (cid:107)V −1 n (cid:107)∞ ≥ inf i Proof. Case γ > 0: Here all entries of matrix (M−1  = (1 − γ)  1 1 ··· 1 V −1 1,n n j,m ≤ n(cid:88) j=0 µi + inf i µ2 i σ2 wi+(1−γ)   w1+ w2+ ··· wi+ ··· wn+   1 1 ··· 1 ≤ max i wi+ 21 (cid:26) exp(σ2V 1,n i,j) − 1 (cid:27)  1 1 ··· 1  ≥ σ2V 1,n  1 1 ··· 1  ≥ max i σ2 wi+(1 − γ)  1 1 ··· 1 (cid:107)Vn(cid:107)∞ ≥ inf i µi + inf i µ2 i σ2 wi+(1 − γ)  (cid:4) Lemma 1.12.5. If 1 − γ − γ 2 1−γ 2 (1 − 2 1 k ) > 0 is strictly diagonally dominant matrix Proof. Since (M nW n) is non negative irreducible matrix by Perron Frobenius therom: for all k  = 1   1 1 ··· 1  1 1 ··· 1 (M nW n)k . using this result geometric expansion of V 1,n = (cid:80)∞ (cid:18) necessary and sufficient condition 1 − γ + min (cid:80)∞ m=2 γm i m=0 γm (M nW n)m Mn we can arrive at (cid:19) 2(M nW n)m i,i − 1 > 0 for strict diag- onal dominance of V 1,n Consider wort scenario where (M nW n)2k+1 equivalent to 1 − γ − γ 2 k ) > 0 which is cubic in gamma. 1−γ 2 (1 − 2 1 i,i = 0 then above condition is (cid:4) 22 Lemma 1.12.6. If V 1,n is strictly diagonally dominant matrix then matrix (Vn)i,j = f (V 1,n)i,j is strictly diagonally dominant matrix or equivalently if |V1,n i,i| ≥(cid:88) j(cid:54)=i |V1,n i,j| then |f (V1,n i,i)| ≥(cid:88) j(cid:54)=i |f (V1,n i,j)| where f (x) = exp(σ2 x) − 1 Proof. We need to show that if from the results 1.12 it is known that CW,γ > Vn i,i > 0 and Vn i,i > |γ||Vn i,j| for all i (cid:54)= j. Using expansion exp((x) − 1 = ∞(cid:88) i=0 xk K! We prove that for all n and each k by induction n,i,i = V k−1 V k n,i,i Vn,i,i ≥ V k−1 n,i,i |V1,n i,j| (cid:88) n,i,j |(cid:88) j(cid:54)=i |V k−1 j(cid:54)=i |V1,n i,j| ≥(cid:88) j(cid:54)=i ≥ max j(cid:54)=i |V k 1,n i,j| (cid:4) The above result can hold large class smooth function f : R → R with domain being variance matrix elements. Lemma 1.12.7. (cid:107)V −1 n (cid:107)∗ ≥ inf i wi+(1+|γ|) wi+(1+|γ|)µi+µ2 i σ2 23 Proof. min i (cid:107)Vn(cid:107)∗ ≤ (cid:107)Vn(cid:107)∞ n ) ≥ (cid:107)Vn(cid:107)∞ 1 λi(V −1 Lemma 1.12.8. 
If A, B, C, A − B be symmetric positive semidefinite matrix then λ(i)(A + C) ≥ min i min i λ(i)(B + C) Proof. Using results in Hiai & Lin (2017) take k = n to obtain λ(i)(AC−1) ≥ min i min i λ(i)(BC−1) Since all matrices commute with I ] − 1 w+i min λ(i)(AC−1 + I) ≥ min λ(i)(cid:0)(AC−1 + I)C(cid:1) ≥ min λ(i)(BC−1 + I) λ(i)(cid:0)(BC−1 + I)C(cid:1) i i i min i λ(i)(A + C) ≥ min i min i λ(i)(B + C) Lemma 1.12.9. (cid:107)V −1 n (cid:107)∗ ≤ 1 λ(i)(V1,n)] − 1 ≤ exp[ exp[min i Proof. Since Vn = exp((V1,n) − 1 = ∞(cid:88) i=0 V ◦k 1,n K! 1 1 i (1−γ) min 24 (cid:4) (cid:4) where ◦ represents hadamard power We can obtain trivial inequality λi(Vn) ≥ λi(V1,n) But we can obtain bounds dependent upon function form f We use results λ(i)(A ◦ B) ≥ min i min i λ(i)(A) min i λ(i)(B) and using induction and 1.12.8 we conclude that λ(i)[exp(V1,n) − 1] ≥ exp[min i min i λ(i)(V1,n)] − 1 Lemma 1.12.10. (cid:107)Vn(cid:107)∞ ≥ sup (µi + µ2 i σ2 i 1 wi+(1+|γ|) ) (cid:4) Proof. Vn(σ2, γ, βn) = Diag(µi) + Diag(µi)[R(cid:12)2(σ2, γ) − J n]Diag(µi) (cid:18) (cid:19) R(cid:12)2(σ2, γ) − J n = exp(σ2V 1,n i,j) − 1 σ2 V 1,n i,j = ˜σ2 i,j Using the expansion to obtain exp(˜σ2x) − 1 − σ2x = σ2 exp(˜σ2x) − 1 ≥ ˜σ2x max i,j (cid:18) x2/2! + x3/3! + ... + x2n/2n! + x2n+1/(2n + 1)! (cid:19) ≥ 0 25 for all x ∈ [−1, 1] Therefore (cid:107)Vn(cid:107)∞ ≥ sup (µi + µ2 i(cid:107)V1,n(cid:107)∞) ≥ sup (µi + i µ2 i σ2 wi+(1 + |γ|) i ) (cid:4) (cid:0) exp(σ2V 1,n i,i)−1(cid:1) (cid:18) (cid:19) σ2V 1,n i,i (cid:107)∞ ≤ (cid:107)Diag 1 wi+(1−|γ|) (cid:19) var−1(Ψi) Lemma 1.12.11. (cid:107)Vn(cid:107)∞ ≤ sup i µi+µ2 i (cid:107)Diag var−1(Yi) µi+µ2 i (cid:107)Vn(cid:107)∞ ≤ sup (cid:18) (cid:19) (cid:0) exp(σ2V 1,n i,i)−1(cid:1) Vn Diag var−1(Yi) V 1,n i,i 1 wi+(1−|γ|) (cid:18) i Proof. We use the previous result Corr(Yi, Yj) ≤ Corr(Ψi, Ψj) for all i, j hence (cid:18) var−1(Ψi) (cid:19) (cid:107)∞ (cid:4) Derivation of estimating equations 1 derivation of Un is straightforward. 2 V1,n Diag = = ∂µ ∂γ βn + σ2 Diag(V 1,n(γ))n×1/2) ∂ exp (−X (cid:124)  µ1 σ2/2  Diag( ∂γ µn . . . . . . µ1(V −1 1,nW nV −1 1,n)1,1 ∂V 1,n(γ) ∂γ ) =  =  σ2/2 µn(V −1 1,nW nV −1 1,n)n,n 26 σ2/2 Diag(V −1 1,nW nV −1 1,n) µ 3 Derivation of F(cid:48) 2,n is also similar Stationary Case Joint estimate of parameters may be difficult to compute. We solve for simplest case first where (Wn)i,j = f (|i − j|)I(|i − j| ≤ k) for some fixed k. Therefore yielding matrix V 1,n(γ) = (M−1 n − γWn)−1 stationary co variance matrix. 
It can be seen by Cramer’s rule (V 1,n)i,j = −1 det[(V 1,n)−(i,j)] det[(V 1,n)−1] Re parametrization µi(βn, σ2, γ) = exp (−X (cid:124) i βn + σ2V 1,n(γ)ii/2)  βn σ2 V0 2 ) = exp(−[X (cid:124) i , 1] βn ˜σ2 ) = exp(−[X (cid:124) i , 1] = exp(− ˜X (cid:124) i ˜β) = µi( ˜βn) since V 1,n(γ)ii = V0 1×1 is same for all i due to stationarity Vn(σ2, γ, βn) = Diag(µi) + Diag(µi) (cid:18) (cid:19) (cid:18) (cid:124) [R(cid:12)2( ˜σ2, γ) − J n]Diag(µi) (cid:19) R(cid:12)2( ˜σ2, γ) − J n = exp i,j ˜σ2 V 1,n i,j − 1 V0 27 Here V 1,n(γ)ii = V0 1×1 is same for all i due to stationarity GEE in stationary setup 1 Score equation for mean (cid:1) (cid:0)˜βn Un (pn+1)×1 = (cid:124)(cid:2)Vn(˜σ2, γ, ˜βn)(cid:3)−1 (cid:123)(cid:122) ˜Sn ∂µ ∂ ˜β 1 n (cid:124) (cid:18) (cid:19) (cid:125) Y − µ(˜βn) + q(λn)(|˜βn ,−n|) (cid:12) sgn(˜βn ,−n) 0  under λn ≥ 0 and ˜βn ,−n pn×1 denote sub vector such that ˜βn ,−n ˜βn ,n  = ˜βn q(λn) is derivative of SCAD penalty function wrt β with penalty parameter λn 2 Method of Moment (cid:19) (cid:18) ˜σ2 Vt(γ) V0(γ) exp (cid:20)(cid:80)n − 1 = E i=t+1 (cid:18) (cid:19)(cid:18) (cid:80)n Y i − µi( ˜βn) (cid:19) Y i+t − µi+t( ˜βn) (cid:21) i=t+1 µi( ˜βn)µi+t( ˜βn) 2.1 Score for scale parameter (cid:18) (cid:80)n i=1 ˆexp( ˜σ2) = (cid:18) (cid:20)(cid:80)n i=1 ˆ˜σ2 = log (cid:80)n Y i − ˆµi( ˜βn) i=1 ˆµi( ˜βn)2 (cid:80)n Y i − ˆµi( ˜βn) i=1 ˆµi( ˜βn)2 (cid:19)2 − ˆµi( ˜βn) (cid:19)2 − ˆµi( ˜βn) + 1 (cid:21) + 1 28 ˆ˜σ2 can be negative by using this method. 2.2 Score for correlation parameter (cid:20)(cid:80)n i=t+1 (cid:18) (cid:19)(cid:18) (cid:80)n Y i − ˆµi( ˜βn) (cid:19) Y i+t − ˆµi+t( ˜βn) i=t+1 ˆµi( ˜βn) ˆµi+t( ˜βn) (cid:21) /ˆ˜σ2 + 1 ˆVt(γ) ˆV0(γ) = log ˆγ can be outside (-1,1) Vt(γ) = V|i−j|=t(γ) due to stationarity time series of counts. 3 Cressie variogram estimate (cid:20)(cid:80)n i=t+1(Yi − EYi)(Yi − EYi+t) (cid:21) E n − t (cid:18) (cid:20)(cid:80)n i=t+1 Thus we obtain E exp( ˜σ2) − exp( ˜σ2 Vt(γ)) = V AR(Yi) + V AR(Yi+t) − 2COV (Yi, Yi+t) Y i ˜βn) µi( − Y i+t µi+t( ˜βn) − ˆµi( ˜βn) −1 2 − − ˆµi+t( ˜βn) −1 2 (cid:21) = (cid:19)2 2(n−t) We take t = 1 find ˆγ form ˆV1(γ) when V1(γ) is known function of γ. Cressie showed that when dimension of β is fixed given σ2 |ˆγ − γ| = Op( 1√ n ) Matrix Inverse approximation V 1,n is diagonal matrix (V 3,n)−1 = (V 1,n + σ2 V 2,n)−1 = (V 2,n)−1 − ( σ2 V 2,n)−1 (cid:18) (V 1,n)−1 + ( σ2 V 2,n)−1 ( σ2 V 2,n)−1 (cid:18) (cid:19)−1 (cid:19)−1 = ( σ2 V 2,n)−1 − ( σ2 V 2,n)−1 M−1 n + ( σ2 V 2,n)−1 − γW n ( σ2 V 2,n)−1 (cid:18) σ2 V 2,n)−1 − ( σ2 V 2,n)−1( σ2 V 2,n + M−1 n ) I − γ( σ2 V 2,n + M−1 n ) −1 2 W n σ2 V 2,n + M−1 n ) −1 2 (cid:19)−1 −1 2 29 ( σ2 V 2,n + M−1 n ) −1 2 ( σ2 V 2,n)−1 −1 2 Lemma 1.12.12. The spectral norm of −1 2 W n( σ2 V 2,n + M−1 ( σ2 V 2,n + M−1 n ) n ) is same as spectral norm ( σ2 V 2,n + M−1 n )−1Wn which is less than max row sum of ( σ2 V 2,n + M−1 Hence the spectral norm of ( σ2 V 2,n + M−1 n ) I − γ( σ2 V 2,n + M−1 n ) −1 2 W n( σ2 V 2,n + M−1 n ) can be approximated by −1 2 W n( σ2 V 2,n + M−1 n ) −1 2 n )−1Wn (cid:18) (cid:19) −1 2 ≤ 1 γm(cid:0)( σ2 V 2,n + M−1 n ) n(cid:88) m=0 −1 2 W n( σ2 V 2,n + M−1 n ) 2 (cid:1)m + O(γn) −1 Remark 2. Hence forward Y ∗ will be denoted as Y Bound on Frobenius norm of score Further E(cid:13)(cid:13)(cid:13)X(cid:124)(cid:2)V3,n(γ)(cid:3)−1(Y − Xβ) (cid:13)(cid:13)(cid:13)2 2 ≤ E(cid:13)(cid:13)X(cid:124) Dn(γ)−1(Z − Xβ)(cid:13)(cid:13)2 2 where is Dn is diagonal positive definite matrix.So Z have mutually independent random elements. 
Using Since (V 2,n)−1 (cid:18) (cid:19)−1 (V 1,n)−1 + (V 2,n)−1 (V 2,n)−1is Postive definite Matrix E(cid:13)(cid:13)(cid:13)X(cid:124)(cid:2)V3,n(γ)(cid:3)−1(Y − Xβ) (cid:13)(cid:13)(cid:13)2 2 = TR(X (cid:124) V −1 3,nX) 30 3,n)tr(X (cid:124) X) ≤ λmax(V −1 ≤ λmax(V −1 (cid:124) 2,n)tr(X ≤ E(cid:13)(cid:13)(cid:13)X(cid:124)(cid:2)Dn(γ)(cid:3)−1(Z − Xβ) X) (cid:13)(cid:13)(cid:13)2 2 where Dn = 1 min Since λmax(V −1 i 2,n) ≤ Ei exp (−X (cid:18) 1 min i Ei (cid:124) i β) − σ2 exp (−X 2 (cid:19) i 1 max wi+ (cid:124) i β) − σ2 1 2 max i wi+ Convergence of scale and correlation parameters n − σ2| = Op((cid:107)ˆβn,n − βn(cid:107)2) | ˆσ2 |(ˆγn − γ)| = Op((cid:107)ˆβn,n − βn(cid:107)2) Convergence of variance parameter In Yasui and Lele [2012] estimation of σ2 are done using both hierarchical and marginal methods of conditional least squares Hierarchical method ˆσ2 n = 1 n (Y ∗ − Xˆβn,n) (cid:124) (M−1 n − ˆγnW n)(Y ∗ − Xˆβn,n) − 1 n (Y ∗∗ − Y ∗) (cid:124) M−1 n (Y ∗∗ − Y ∗) Marginal method ˆσ2 n = 1 n (Y ∗−Xˆβn,n) (cid:124) (M−1 n − ˆγnW n)(Y ∗−Xˆβn,n)− 1 n (cid:18) TR ˆV 2,n−1(ˆβn,n, ˆσ2 n−1, ˆγn)M−1 n (cid:19) (1.1) For both of these methods, it can be shown that rates are same 31 Proof. ˆσ2 n − σ2 1 n (Y ∗ − Xˆβn,n) (cid:124) (cid:18) = − Xˆβn,n) − 1 n − 1 n TR (M−1 (Y ∗ − Xβn) (cid:124) n − γnW n)(Y ∗ − Xβn) (M−1 n−1, ˆγn)M−1 n ) n − ˆγnW n)(Y ∗ (cid:18) (cid:19) V 2,n(βn, σ2, γ)M−1 n ) (cid:1)(Y ∗ − Xˆβn,n) (cid:124)(cid:0) (ˆγn − γ)W n TR 1 n + ˆV 2,n−1(ˆβn,n, ˆσ2 = 1 n (Y ∗ − Xˆβn,n) (cid:19) 1 + n − βn) 1 n + (cid:18)(cid:0) TR 2 (Y ∗(cid:124) − Xβn)(M−1 (cid:124)X(cid:124) (M−1 n − γW n)X(ˆβn,n − βn) n − γW n)X(ˆβn,n − βn) + 1 n (ˆβn,n n−1, ˆγn) − V 2,n(βn, σ2, γ) (cid:1)M−1 n (cid:19) ˆV 2,n−1(ˆβn,n, ˆσ2 (cid:19) (cid:1)(Y ∗ − Xˆβn,n)(cid:107)2 (cid:1)(Y ∗ − Xˆβn,n)(cid:107)2 (cid:124)(cid:0) W n (cid:1)(Y ∗ − Xˆβn,n)) (cid:124)(cid:0) W n (cid:18) (cid:107)(Y ∗ − Xˆβn,n) E (cid:124)(cid:0) (ˆγn − γ)W n ≤ E(cid:107)(ˆγn − γ)(cid:107)2 E(cid:107)(Y ∗ − Xˆβn,n) = (cid:107)(ˆγn − γ)(cid:107)2TR( (Y ∗ − Xˆβn,n) ≤ n (cid:107)(ˆγn − γ)(cid:107)2 λmax(V 3,nW n) ≤ n (cid:107)(ˆγn − γ)(cid:107)2 max (V 3,n)iiλmax(W n) i (cid:107) (Y ∗(cid:124) − Xβn)(M−1 ≤ (cid:107) (Y ∗(cid:124) − Xβn)(M−1 ≤ maxi λi (M−1 (cid:18) n − γW n)X(ˆβn,n − βn)(cid:107)2 n − γW n)X(cid:107)2 (cid:107)(ˆβn,n − βn)(cid:107)2 n − γW n)V 3(M−1 n − γW n) (cid:19) (cid:112)TR(X(cid:124)X) (cid:107)(ˆβn,n − βn)(cid:107)2 32 (cid:18) (M−1 n − γW n)V 3(M−1 n − γW n) (cid:19) (cid:112)TR(XX(cid:124)) = (cid:107)(ˆβn,n − βn)(cid:107)2 (cid:114) 2 maxi λi ) O(cid:112)(n pn) ( = Op pn n = Op(pn) (M−1 n − γW n)X(ˆβn,n − βn) 2 maxiλi (X(cid:124) (M−1 n − γW n) X) (cid:124)X(cid:124) (ˆβn,n − βn) ≤ (cid:107)(ˆβn,n − βn)(cid:107)2 ≤ Op n ( ) pn n = Op (pn) By recursion when n is large ˆσ2 n−1 − σ2 is same order of ˆσ2 n − σ2 (cid:19) V 1,n(γ)ii) V 2,n(βn, σ2, γ) (cid:18) 1 n n Ei σ2 2 exp (−X iβ + = Diag ˆV 2,n−1(ˆβn,n, ˆσ2 − V 2,n(βn, σ2, γ) M−1 n−1, ˆγn)M−1 (cid:1) TR(cid:0) ≤ TR(cid:0) − V 2,n(βn, σ2, γ) (cid:1) (cid:107)λmax(cid:107)(cid:0)M−1 (cid:18) 1 n(cid:88) (cid:18) exp (−X iβ + ˆV 2,n−1(ˆβn,n, ˆσ2 n−1, ˆγn) σ2 2 Ei i=1 = n ≤ n Op |X i| (cid:107)ˆβn,n − βn(cid:107)2 (cid:1) (cid:19) V 1,n(γ)ii/2) (cid:19) |V 1,n(γ)ii − ˆV 1,n(ˆγn)ii| + | ˆσ2 n − σ2| sup V 1,n(γ)ii + σ2 sup i i (cid:19) sup |X i| (cid:107)ˆβn,n − βn(cid:107)2 sup i (cid:18) (cid:114) i ( = n Op = Op n pn n ) 33 Observing that order of last term dominates hence (cid:107) ˆσ2 n− σ2 (cid:107)2 = Op due to contribution from last term. 
(cid:113) n ) contrary to Op ( pn ( pn n ) (cid:4) the marginal method do not have the last term contributed through bias correction. Therfore the convergence rate for σ2 n follows classical rate of Op ( pn n ) Convergence of γn γ = (Y ∗ − Xβn)(cid:124)M nW nM nW n(Y ∗ − Xβn) + (Y ∗ − Y ∗∗)(cid:124)M nW nM nW n(Y ∗ − Y ∗∗) (Y ∗ − Xβn) (cid:124) M nW n(Y ∗ − Xβn) = f1(βn) f2(βn) + Cn ˆγn = = = = (Y ∗ − Xˆβn,n)(cid:124)M nW nM nW n(Y ∗ − Xˆβn,n) + (Y ∗ − Y ∗∗)(cid:124)M nW nM nW n(Y ∗ − Y ∗∗) (Y ∗ − Xˆβn,n) (cid:124) M nW n(Y ∗ − Xˆβn,n) f1(ˆβn,n) f2(ˆβn,n) + Cn f1(βn) + 2(ˆβn,n − βn) An(Y ∗ − Xβn) + (ˆβn,n − βn) (cid:124)X(cid:124) f2(βn) + 2(ˆβn,n − βn)(cid:124)X(cid:124)AnA n(Y ∗ − Xβn) + (ˆβn,n − βn)(cid:124)X(cid:124)AnA (cid:124) (cid:124)X(cid:124) AnX(ˆβn,n − βn) (cid:124) n X(ˆβn,n − βn) + Cn f1(βn) + g1(βn, ˆβn,n) f2(βn) + g2(βn, ˆβn,n) + Cn Therefore (ˆγn − γ) can be written as f1(βn) + g1(βn, ˆβn,n) f2(βn) + g2(βn, ˆβn,n) + Cn − f1(βn) f2(βn) + Cn It is proven that f2g1 + Cng1 − f1g2 (f2 + Cn)(f2 + g2 + Cn) g1 f2 + g2 + Cn − g2 f1 f2 + g2 + Cn f2 + Cn = = g1(βn, ˆβn,n) ≤ Op(pn) 34 g2(βn, ˆβn,n) ≥ Op(pn) f1(βn) = Op(n) f2(βn) = Op(n) Cn = Op(n) (cid:107)(ˆγn − γ)(cid:107)2 = Op(pn) Op(n) = Op( pn n ) By Slutsky’s theoreom Convergence of precision matrix Lemma 1.12.13. Let A and An be sequences of invertible matrices of same order belong to Ba- nach space over(cid:0)Rn×n ,(cid:107) (cid:107)2 (cid:1) such that (cid:107)An − A(cid:107)2 = Op(rn) then (cid:107)A−1 n − A−1(cid:107)2 = Op(rn) for large n Proof. (cid:107)A−1 n − A−1(cid:107)2 = (cid:107)A−1 n (An − A) A−1(cid:107)2 (cid:107)A(cid:107)2 = (cid:113)(cid:80) i λ2 i where λi is ith singular value of A (cid:107)A−1(cid:107)2 = (cid:113)(cid:80) i 1 λ2 i λi (cid:54)= 0 for any i due to non singularity of A (cid:4) Using above lemma on A = σ2 An = ˆσ2 n The Frobenius norm inequality (cid:107)A(cid:107)2 ≤ √ spectral norm of A n (cid:107)A(cid:107)∗ here A is n × n matrix and (cid:107)A(cid:107)∗ denotes (cid:107) ˆσ−2 n ˆV −1 1,n − σ−2V −1 n − σ−2(cid:107)2 (cid:107)(M n − γW n)(cid:107)2 n) (1 − γ) max i wi+ n 1,n(cid:107)2 ≤ ˆσ−2 ||(ˆγn − γ)||2 (cid:107)W n(cid:107)2 + (cid:107) ˆσ−2 √ √ ≤ Op( pn ) O( ) O( n √ n) + Op pn n ( (cid:114) ≤ Op ( pn) 35 √ Hence (cid:107) ˆV 1,n−V 1,n(cid:107)2 = Op ( pn) implying that (cid:107) ( ˆV 1,n)i,i −(V 1,n)i,i (cid:107)2 = Op ( all diagonal elements ˆV 1,n and V 1,n are of same order. (cid:113) pn n ) provided Now all diagonal elements of ˆV −1 1,n and V −1 1,n are of same order as well as off- diagonal elements due to assumption 8 on wi,js . 
This property of same order for inverse matrices can be verified using law (V −1)i,i = det(V {i,i}c ) det(V ) V {i,i}c represents sub matrix of A obtained by deleting ith row and ith colomn from A (cid:20) 1 Ei V 2,n = Diag exp (−X iβ + σ2V 1,n(γ)ii/2) (cid:21) By Mean Value Theorem (multivariate) (cid:107) ˆV 2,n − V 2,n(cid:107)2 ≤ n max i + n max i i + n max √ ≤ Op ( pn) exp (−X iβ∗ + σ2V 1,n(γ)ii/2) max exp (−X iβ∗ + σ2∗V 1,n(γ)ii/2) max exp (−X iβ∗ + σ2∗V 1,n(γ)ii/2) (cid:107)σ2∗ F (M n, W n, γ∗) ii/2(cid:107)2 (cid:107)(ˆγn − γ)(cid:107)2 (cid:107)X i(cid:107)2(cid:107)(ˆβn,n − βn)(cid:107)2 (V 1,n(γ)ii/2) | ˆσ2 n − σ2|2 i i 1 Ei 1 Ei 1 Ei (cid:107) ˆV 3,n − V 3,n(cid:107)2 ≤ (cid:107) ˆσ2 n − σ2 (cid:107)2 (cid:107)V 2,n(cid:107)2 + (cid:107) ˆσ2 n(cid:107)2 (cid:107) ˆV 2,n − V 2,n(cid:107)2 + (cid:107) ˆV 1,n − V 1,n(cid:107)2 Using lemma 1.12.13 finally it is established that (cid:107) ˆV 3,n −1 − V −1 3,n(cid:107)2 = Op ( √ pn) 36 derivative wrt γ ∂V 1,n(γ) ∂γ V −1 1,n = −V −1 1,n ∂V −1 1,n(γ) ∂γ 1,n W n V −1 n W nM−1 1,n = V −1 = M−1 n − γ M−1 n W 2 n − γ W 2 nM−1 n + γ2 W 3 n Since gradient is a linear operator ∂(cid:0)V 1,n (cid:1) ∂γ i,i(γ) = (cid:18) ∂V 1,n(γ) (cid:19) ∂γ i,i (V 3,n) = ( σ2 V 1,n + V 2,n) Joint Convexity of Quasilikelihood To estabilish convexity of quasilikelihood, we need to show the directional derivative for every direction t ∈ R2 < t,∇(γ ,σ2)Sn ≥ 0 It can be established that the derivative of score function is monotonic ∂Sn ∂γ = X(cid:124) ∂(cid:2)V −1 3,n(γ)(cid:3) = X(cid:124) ∂(cid:2) σ2 V 1,n + V 2,n ∂γ (Y − Xβ) (cid:3)−1 ∂γ (Y − Xβ) 37 Derivative computation 3,n ∂V −1 ∂γ = −V −1 ∂V 3,n V −1 3,n 3,n ∂γ = −σ2 V −1 3,n ∂V 1,n ∂γ = −σ2 V −1 3,n − σ2 V −1 3,n Diag 2 3,n ∂γ ∂V 1,n (cid:18)(cid:0)V 2,n (cid:18) σ2 V 1,n i,i (cid:18) 2 (cid:20) (cid:20) = −σ2 V −1 3,n Diag exp ≤ −σ2 V −1 3,n (cid:18) (cid:18) Diag exp ≤ −σ2 V −1 σ2 2(V −1 1,n)i,i 3,nL(W, γ)V −1 3,n Remark 3. The Quasi Likelihood(cid:82) Y 3,n 3,n − V −1 V −1 (cid:1) V −1 ∂(cid:0)V 1,n i,i ∂V 2,n ∂γ V −1 3,n (cid:1) i,i (cid:19) V −1 3,n ∂γ (cid:19) (cid:1) (cid:0)V −1 (cid:0)V −1 1,n 1,nW V −1 (cid:1) 1,nW V −1 1,n i,i (cid:19) i,i (cid:21) V −1 3,n + V −1 1,n 1,nW V −1 (cid:21) + V −1 1,nW V −1 1,n V −1 3,n β Sn(β)dβ depends on convexity of matrix L(W, γ). However function of γ since adjacency matrix W is not positive semi definite because by Perron Frobeius theorem it has at least one positive eigenvalue and some eigenvalues are negative since trace(Wn) = 0 Joint convexity Lemma 1.12.14. The unpenalized quasi likelihood function to be jointly convex function of Xβ σ2 γ. The sufficient condition that score function is monotonic function of Xβ, σ2, γ Proof. Results that will be used in this proof are: • If V is non negative definite matrix then the function f : V → V is monotone 38 • The composition of two convex function is convex i.e f, g are both monotone implies f (g) is convex • If f, g are both monotone and f(x) is semi definite for all x f (cid:23) 0 and g (cid:23) 0 the productf g is monotone. V −1 1,n(γ) = (M−1 (cid:19) n −γW n) is monotonic function of γ. So V 1,n is monotonic function of γ by result 1 exp (−X iβ+σ2V 1,n(γ)ii/2) σ2V 1,n is jointly convex function of and σ result 3. V 2,n = Diag (cid:18) 1 Ei is jointly convex function of Xβ ,γ and σ by result 2 and 3. V 1,n + V 2,n is jointly convex function of Xβ ,γ andσ. V −1 3,n = (V 1,n + σ2 V 2,n)−1 is jointly convex function of Xβ ,γ and σ . 
1 (cid:4) n X(cid:124)(cid:2)σ2V3,n(γ)(cid:3)−1(Y − Xβ) (cid:125) is jointly convex function of Xβ ,γ and σ (cid:123)(cid:122) (cid:124) Sn Central Limit Theorem 1 We have link function such that E(Y i) = exp (X (cid:124) i β + σ2V 1,n(γ)ii/2) = µi = h−1(X (cid:124) i β) < ∞ and ∂2h−1 ∂β2 < ∞ for all β in parameter space The ∂h−1 ∂β These conditions are satisfied since we assume (cid:107)X h−1(X (cid:124) i β) = exp(σ2V 1,n(γ)ii/2)exp(X (cid:124) i β) (cid:124) i β(cid:107),∞ , max i,j (Xi,j) is bounded ?? and 2 there exists smooth variance function V V AR(Yi|X) = V (h−1(X (cid:124) i β)) This condition is sat- isfied since here Vi(r) = r + r2(eσ2V 1,n(γ)ii − 1) (cid:124)(cid:2)Vn(˜σ2, γ, ˜βn)(cid:3)−1 is finite 3 Each element of vector of ∂µ ˜β ∂ i.e (cid:107)X(cid:124) Diag(µi)V AR−1(Y )(cid:107)∞ = O(1) this is proved by following lemma. Lemma 1.12.15. (cid:107)X(cid:124) Diag(µi)V AR−1(Y )(cid:107)∞ < ∞ 39 Proof. Using assumptions which gives bounds on elements of X and parameters. Write D as diagonal matrix with Di = V ar(Yi) Vn(σ2, γ, βn) = D(cid:48) CORR(Y ) D using Woodbury matrix identity (A+A(cid:48)BA)−1 = A−1−(B−1 +A)−1 and taking A = Diag(µi) with diagonal entries 0 > µi > ∞ it is sufficient to prove that max of row sum of inverse of correlation matrix (cid:107)CORR(Y )−1(cid:107)1,∞ < C bounded by some constant using lemma below (cid:107)CORR(Y )(cid:107)1,∞ < (cid:4) 1 Peligrad’s result on CLT Define (Y (cid:48) i ) = Yi − µi(β) . F m 1 = Sigma field(Y (cid:48) i : 1 ≤ i ≤ m) Let Tn =(cid:80)n i=1 Y (cid:48) i and ν2 n = E(Tn)2 (cid:18) (cid:19) ρ(n) = sup m≥0 X∈L2(F m 1 ) Y ∈L2(F∞ m+n) |CORR(X, Y )| α(n) = sup m≥0 A∈F m B∈F∞ 1 m+n |P (A ∩ B) − P (A)P (B)| Peligard showed that for non stationary sequences {Y (cid:48) i }∞ i=1 CLT is obtainable provided some suf- ficient conditions and and lim n→∞ ρ(n) < 1 and lim n→∞ α(n) = 0 Using the methods we can show CLT holds under these weak sufficient condition. 1 E(Yi) = 0 E(Y 2 2 max1≤i≤n E|Y (cid:48) i ) < ∞ i |2+δ < ∞ Lyapunov condition which suffices for Linderberg Conditions (cid:19) (cid:18) 3 ν2 n = O nh(n) h(n) is slow varying function i.e lim n→∞ h(an) n = 1 for all a > 0 1.1 In our setup sup i V −1 1,n i,i+n = 0 implies under Gaussian setup α(n) → 0 thus making Ψi and Ψi+n ae independent hence Yi and Yi+n are independent as as n tends to infinity. hence α(n) → 0 when under probability space corresponds to process Yi 40 1.2 Corr(Yi, Yj) ≤ Corr(Ψi, Ψj) = √ V1,n i,j V1,n i,i V1,n j,j ≤ |γ| for all i, j so lim n→∞ ρ(n) < 1 2 δ = 1 is satisfied here 3 sup (µi + µ2 i σ2 i wi+(1+|γ|) ) ≤ h(n) ≤ sup 1 i (cid:0) exp(σ2V 1,n i,i)−1(cid:1) σ2V 1,n i,i µi+µ2 i 1 wi+(1−|γ|) h(n) is independent of n thus regular varying function of n. Implication of CLT The unpenalized score equation Sn(βn, σ2, γ) converges to Normal distribution when parameters assume values of true model. Thus if can show that solution of score equation i.e Sn( ˆβn, σ2, γ) = 0 is consistent to βn,0 then ( ˆβn − βn,0) converges weakly to Normal distribution. Further when conditions are satisfied Liang and Zeger the solutions of Sn( ˆβ∗ Zeger (1986) follows CLT i.e ( ˆβn − βn,0) −→ N n, ˆσ2, ˆγ) = 0Liang & Exact structure Of V If the process is stationary the exact form of V −1 is known through paper Toeplitz and Circulant Matrices: A review . When not stationary can we know analytical structure? 
For example, let Vₙ(γ) = (Mₙ⁻¹ − γWₙ)⁻¹, where the number of 1's in each row of Wₙ is at most k. We can perform the update
Vₙ(γₜ₊₁) = Vₙ(γₜ) + (γₜ₊₁ − γₜ) ∂V₁,ₙ(γ)/∂γ |_{γₜ},
with
∂V₁,ₙ(γ)/∂γ = −V₁,ₙ (∂V₁,ₙ⁻¹(γ)/∂γ) V₁,ₙ = V₁,ₙ Wₙ V₁,ₙ,
and
V₁,ₙ⁻¹ Wₙ V₁,ₙ⁻¹ = Mₙ⁻¹WₙMₙ⁻¹ − γMₙ⁻¹Wₙ² − γWₙ²Mₙ⁻¹ + γ²Wₙ³.
Since Wₙ has zero diagonal, the i-th diagonal element of this expansion is −2γ(Wₙ²)ᵢᵢ + γ²(Wₙ³)ᵢᵢ = −2γwᵢ₊ + γ²(Wₙ³)ᵢᵢ, which is bounded in magnitude by 2|γ|k² + γ²k(k − 1)/2 because each row of Wₙ has at most k nonzero entries.

Chapter 2
High Dimensional Sparse LDA for Spatio-temporal Data

2.1 Introduction
The mathematical clarity and simplicity of Fisher's linear discriminant analysis (LDA) has led to extensive applications in multimedia information retrieval such as speech and pattern recognition, as explained by Yu & Yang (2001). Used both for classification (Cox & Savoy, 2003; Gutman et al., 2013) and for dimension reduction (Mourao-Miranda et al., 2005; Rathi & Palani, 2012) in the context of image analysis and feature extraction, LDA has often encountered difficulties in high-dimensional problems. Technological advancements in medical imaging, specifically magnetic resonance imaging (MRI), have led to high-resolution images with dimensions as large as 256 × 256 × 198 ≈ 12M volumetric pixels, or voxels. Since the number of subjects available for such studies cannot feasibly exceed these dimensions, the within-class covariance matrix becomes singular. An early proposal, the two-step PCA–LDA algorithm, adds a preliminary dimension-reduction step that first reduces the input dimension using principal component analysis (PCA) (Wang et al., 2010; Ghosh, 2001). Although this technique constructs orthogonal features of lower dimension, the components produced by the PCA and LDA steps may be incompatible with each other, resulting in the loss of important information. The review by Fan & Lv (2010, Section 4.2) provides insight into the issues classical LDA encounters in high-dimensional problems; they show that dimension reduction is important for reducing the misclassification rate. Some proposals considered remedies such as imposing independence assumptions on the covariance structure, significantly lowering the number of parameters to be estimated and thereby circumventing the singularity (Bickel et al., 2004; Tibshirani et al., 2002; Fan & Fan, 2008). In their genetics applications, however, these techniques did not involve selecting features to facilitate the classification. Fan & Fan (2008) proposed the Features Annealed Independence Rule (FAIR), an improvement over the nearest shrunken centroid rule (Tibshirani et al., 2003), which is essentially equivalent to two-sample t-tests, in that FAIR imposes a relative importance ordering on the features that yields a better selection. This procedure keeps in check the noise accumulation previously unaccounted for, so that faint features are not subverted. To incorporate and account for significant correlation among the genes, Fan et al. (2012) proposed the regularized optimal affine discriminant (ROAD) and a few variations in the assumptions made. For the same microarray data, J. Shao et al. (2011) propose a sparse LDA (SLDA) that uses thresholding to obtain a sparse estimate of the covariance matrix. Another important contribution to LDA classification in genetics was made by Witten & Tibshirani (2011).
These authors penalize the discriminant vectors in Fisher's discriminant problem, which amounts to constraining the within-class covariance while penalizing the between-class component. Instead of separately estimating or penalizing the covariance or the means, Cai & Liu (2011) propose a sparse estimate, obtained by ℓ1 minimization, of the product of the inverse covariance and the difference in class means — a key ingredient of the classification rule. In other work, Mai et al. (2012) propose a penalized least-squares estimator with a lasso penalty to solve the high-dimensional LDA problem.

In longitudinal neuroimaging studies the data have naturally structured dependencies across space and time that may significantly inform the classification rule. This article explores statistical methodology for the simultaneous selection of regions and classification of disease status using structural brain magnetic resonance images (MRIs). The objective is to assist researchers, through region-of-interest (ROI) based analysis, in identifying specific voxels and brain regions that are susceptible to Alzheimer's disease. Yingjie & Maiti (2019) explored penalized LDA imposing a parametric spatial covariance on the within-subject images. We extend this work to a spatio-temporal covariance for longitudinal studies and explore estimation under a variety of situations, such as space–time separability and non-separability in Section 2.4 and covariance tapering in Section 2.8.1. All of the methodology in this article is developed under the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001). The algorithm used to obtain the penalized estimates is the one-step sparse estimation procedure of Zou & Li (2008), and the consistency and selection properties are studied under this one-step setup. We present a series of simulation results under a variety of setups in Section 2.11. A dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI; Petersen et al., 2010) is used to demonstrate the method; the results point to areas of the hippocampus that may play a vital role in the classification of healthy brains versus patients diagnosed with AD.

2.2 Review of classical linear discriminant analysis (LDA)
Consider a pT-dimensional discriminant problem between two classes C1 and C2. Let Y_{k1}, ..., Y_{k n_k} be observations from class C_k, where k ∈ {1, 2} and Y_{kj} ∈ R^{pT}. We further assume the Y_{kj} ∼ N_{pT}(μ_k, Σ(θ)) are independent and identically distributed for j = 1, ..., n_k. The mean vectors μ_k differ between the classes, but the classes share a common covariance Σ(θ) with parameter θ. Estimation is based on samples of sizes n1 and n2 from the two classes, with total sample size n = n1 + n2. Membership of a new test sample X in class C1 is determined by the LDA classifier
δ̂(X) = ( X − ½(μ̂1 + μ̂2) )ᵀ Σ̂⁻¹ (μ̂1 − μ̂2) > 0.   (2.1)
Alternatively, the classifier can be expressed in terms of the mean difference ∆ = μ1 − μ2, which provides insight into the discriminant vector:
δ̂(X) = ( X − ½(μ̂1 + μ̂2) )ᵀ Σ̂⁻¹ ∆̂ > 0,   (2.2)
where μ̂_k and Σ̂ denote the estimated means and covariance.
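To make the classification rule (2.1)–(2.2) concrete, the following is a minimal sketch (not the implementation used in this chapter) of the plug-in LDA rule with sample means and a pooled covariance estimate; the function and variable names are illustrative only.

```python
import numpy as np

def pooled_estimates(Y1, Y2):
    """Sample means and pooled covariance from n1 x pT and n2 x pT class samples."""
    mu1, mu2 = Y1.mean(axis=0), Y2.mean(axis=0)
    R1, R2 = Y1 - mu1, Y2 - mu2
    Sigma = (R1.T @ R1 + R2.T @ R2) / (len(Y1) + len(Y2))   # pooled covariance
    return mu1, mu2, Sigma

def lda_classify(X, mu1, mu2, Sigma):
    """Assign each row of X to class 1 if delta_hat(X) > 0, as in (2.1), else class 2."""
    w = np.linalg.solve(Sigma, mu1 - mu2)          # Sigma^{-1}(mu1 - mu2)
    scores = (X - 0.5 * (mu1 + mu2)) @ w
    return np.where(scores > 0, 1, 2)
```

When pT exceeds n the pooled Σ̂ is singular and the linear solve above fails, which is precisely the difficulty addressed by the penalized approach of Section 2.6.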
The maximum likelihood estimates (MLEs), which yield the sample means and covariance, are obtained by maximizing the log-likelihood over μ_k and θ,
(μ̂_k, θ̂) = argmax_{μ_k, θ} L(θ, μ1, μ2; Y) = −(n pT/2) log(2π) − (n/2) log|Σ(θ)| − ½ Σ_{k=1}^{2} Σ_{i=1}^{n_k} (Y_{k,i} − μ_k)ᵀ Σ⁻¹(θ) (Y_{k,i} − μ_k).   (2.3)
All the estimates obtained in this setup are consistent when pT < n. The Bayes classification rule is studied within the parameter space
Γ = { (∆, Σ(θ)) : ∆ᵀΣ⁻¹(θ)∆ > C_{pT}, l1 ≤ λ_min(Σ(θ)) ≤ λ_max(Σ(θ)) ≤ l2 },   (2.4)
where l1 and l2 are positive constants. If the new observation X belongs to C1, the conditional misclassification rate of δ̂(X) for unknown parameters Θ = (μ1, μ2, θ) is
W1(δ̂, Θ) = P(δ̂(X) ≤ 0 | X ∈ C1) = 1 − Φ(ψ1),   (2.5)
where
ψ1 = ∆̂ᵀΣ̂⁻¹∆̂ / ( 2 √(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂) ).   (2.6)
Similarly, for X belonging to C2 we obtain W2(δ̂, Θ) = 1 − Φ(ψ2). The overall (worst-case) misclassification rate is therefore
W(δ̂) = max_{Θ∈Γ} ½ ( W1(δ̂, Θ) + W2(δ̂, Θ) ).   (2.7)
Under the assumption that X is normally distributed, with Φ(·) denoting the standard Gaussian distribution function, the worst-case conditional classification error rate of the Bayes rule is
W(δ) = max_{Θ∈Γ} [ 1 − Φ( √(∆ᵀΣ⁻¹∆)/2 ) ] = 1 − Φ( √C_{pT}/2 ).   (2.8)
In the neuroimaging setup with longitudinal structural MRIs, as discussed in Section 2.1, we encounter a high-dimensional scenario with pT ≫ n, which renders the covariance estimate Σ̂(θ) singular. In what follows, we impose a spatio-temporal dependence between the registered images of each subject.

2.3 Spatio-temporal LDA
For the purposes of classification, the voxels in the MRI images of a particular subject are modeled as a Gaussian random field (GRF). These random fields exhibit decaying spatial covariance within each image at a given time point; hence the longitudinal MRIs of a given subject can be regarded as possessing spatio-temporal dependence. To begin with, we consider a separable covariance structure in which both the spatial covariance within each image and the temporal covariance across images of the same subject are assumed to be second-order stationary, isotropic and identical across subjects.
Consider a p·T-dimensional discriminant problem for a spatio-temporal process between two classes C1 and C2. Let {Y_{ki}(s, t) : (s, t) ∈ D × T}, for i = 1, ..., n_k in each class k ∈ {1, 2} of sizes n1 and n2 with n1 + n2 = n, denote a spatio-temporal process on the domain D × T ⊂ R^d × [0, ∞), where d is the spatial dimension. Also assume n1/n → π for some 0 < π < 1 as n → ∞. We express the process as
Y_{ki}(s, t) = μ_k(s, t) + ε_{ki}(s, t),   (2.9)
where μ_k is the mean effect of class k and the error term {ε_{ki}(s, t) : (s, t) ∈ D × T} is a Gaussian process with mean 0 and covariance Σ(θ)_{(s,t),(s′,t′)} = γ[(s, t), (s′, t′); θ] such that
γ[(s, t), (s′, t′); θ] = cov[Y(s, t), Y(s′, t′)] = γ[ ‖s − s′‖₂, |t − t′| ; θ ].   (2.10)
We additionally take the spatio-temporal sites to be non-random, with the distance between any pair of sites bounded below by a fixed number, ‖(s, t) − (s′, t′)‖ ≥ η > 0. Hence all of the statistical methodology described below is investigated under the increasing-domain framework.
Alternatively, the vectorized form of the model can be expressed as
Y_{k,i,j,l} = μ_{k,j,l} + ε_{k,i,j,l},   (2.11)
where j = 1, 2, ..., p indexes the spatial site s_j and l = 1, 2, ..., T indexes the time point t_l. Specifically, (Y_{k,i})_{pT×1} = (Y_{k,i,1,1}, ..., Y_{k,i,p,T})′, (μ_k)_{pT×1} = (μ_{k,1,1}, ..., μ_{k,p,T})′ and ε_{k,i} = (ε_{k,i,1,1}, ..., ε_{k,i,p,T})′ has a multivariate Gaussian distribution. Thus Y_{k,i} ∼ N_{pT}(μ_k, Σ(θ)), where Σ(θ) is the pT × pT covariance matrix.

2.4 Spatio-temporal covariance structure

2.4.1 Spatio-temporal covariance
In this section we begin by reviewing the block structure that any such covariance matrix retains. Let Σ_{PT×PT} be the covariance matrix formed by unfolding the covariance tensor Σ in the following ordered way:
Σ_{P×T×P×T}(a, a′, c, c′) = Σ_{PT×PT}(Pa + c, Pa′ + c′).
Let q = (P − 1)a + c and q′ = (P − 1)a′ + c′; then Σ(q, q′) = Cov(Y(s_c, t_a), Y(s_{c′}, t_{a′})). Therefore a fully separable model gives
Σ(q, q′)_{PT×PT} = γ1(s_c − s_{c′}; θ) γ2(t_a − t_{a′}; θ),
whereas a non-separable model results in
Σ(q, q′) = γ( √((s_c − s_{c′})² + (t_a − t_{a′})²) ; θ ).
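As an illustration of the separable construction above, the sketch below builds a small pT × pT covariance as a Kronecker product of a temporal and a spatial covariance, together with a non-separable alternative based on the combined space–time distance. The exponential form, grid size and range values are illustrative only, and the code assumes the vectorized data stack all spatial sites of the first time point, then the second, and so on.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Spatial sites on a small grid and T time points (illustrative values).
xx, yy = np.meshgrid(np.arange(5), np.arange(5))
space_coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)  # p = 25 sites
time_coords = np.array([[0.0], [1.0], [2.0], [3.0]])                    # T = 4 visits

def exp_cov(coords, rng, sigma2=1.0):
    # Isotropic exponential covariance gamma(d) = sigma2 * exp(-d / rng)
    return sigma2 * np.exp(-cdist(coords, coords) / rng)

Sigma_S = exp_cov(space_coords, rng=3.0)   # p x p spatial covariance
Sigma_T = exp_cov(time_coords, rng=1.0)    # T x T temporal covariance

# Fully separable model: Sigma = Sigma_T kron Sigma_S (pT x pT).
Sigma_sep = np.kron(Sigma_T, Sigma_S)

# Non-separable alternative: one isotropic covariance of the combined
# space-time distance sqrt(||s - s'||^2 + (t - t')^2), same site ordering.
p, T = len(space_coords), len(time_coords)
st = np.hstack([np.tile(space_coords, (T, 1)), np.repeat(time_coords, p, axis=0)])
Sigma_nonsep = np.exp(-cdist(st, st) / 3.0)
```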
All covariance models used in this article are assumed to be second-order stationary. Cressie & Huang (1999) describe in detail classes of stationary spatio-temporal covariance functions. Assume γ(·,·) is continuous and its spectral distribution possesses a spectral density f(ω, τ) ≥ 0; then, by Bochner's theorem,
γ(s, t) = ∫_{R^d} ∫_R e^{iω′s + iτt} f(ω, τ) dω dτ.   (2.15)
Additionally, if γ(·,·) is integrable, we get
f(ω, τ) = (2π)^{−(d+1)} ∫∫ e^{−is′ω − itτ} γ(s; t) ds dt = (1/2π) ∫ e^{−itτ} h(ω; t) dt,   (2.16)
where h(ω; t) := (2π)^{−d} ∫ e^{−is′ω} γ(s; t) ds = ∫ e^{itτ} f(ω, τ) dτ. A valid positive definite covariance can be achieved by modeling h(ω; u) = ρ(ω; u) k(ω), where the following are satisfied:
(C1) for each ω ∈ R^d, ρ(ω; ·) is a continuous autocorrelation function with ∫ ρ(ω; u) du < ∞ and k(ω) > 0;
(C2) ∫ k(ω) dω < ∞.
Matérn (1960) defines a class of space–time interaction models through the spectral density
f(ω, τ) = η ( α²β² + β²‖ω‖² + α²τ² + ε‖ω‖²τ² )^{−(ν + (d+1)/2)}   (2.17)
for all η, β, α, ν ≥ 0 belonging to a compact subspace and ε ∈ [0, 1], where ε represents the extent of separability between the time and space domains. Here α⁻¹ and β⁻¹ represent the spatial and temporal decay of the correlation respectively, η is the scale parameter and ν is the smoothness parameter. Additionally, it can be shown that this Matérn class of covariances satisfies the regularity conditions.

2.4.2 Irregular lattice points in space and time
Cressie & Lahiri (1996) give general conditions for uniform convergence of REML estimates on lattices as well as at irregularly spaced points. These conditions are satisfied for a large class of covariance structures, including the Matérn class. We assume, however, that the brain is embedded in an irregular spatial lattice and that the time points occur at irregular lags. Define the indexing set of time points {0, t1, ..., tn} with n → ∞. For each time point t ∈ T_n = {0, t1, ..., tn} we have a spatial domain S_t. Similarly we define the subset of odd indices T_{1n} = {0, t1, ..., t_{2k+1} : 0 ≤ k ≤ ⌊n/2⌋}. Here d = 3, and p_{i,t}, for i ∈ {1, 2, ..., d}, denotes the number of points in each direction, so the cardinality |S_t| = p_{1,t} p_{2,t} p_{3,t} = P_t. We take the index set containing the origin and its neighbors at unit distance to be L = {(l1, ..., ld) : Σ_i |l_i| ≤ 1}, and assume the spacing of the spatial lattice is fixed, denoted H = (h1, h2, h3), independent of n. Let Z denote the set of all integers (positive, negative and zero) and Z2 = {2i : i ∈ Z} the set of even integers. Let the generator of the indexing set be Z_{n,t} = {(i1, ..., i_j, ..., i_d) : 0 ≤ i_j ≤ p_i, 1 ≤ j ≤ d}, and let Z_{1n,t} = {(k1, ..., k_j, ..., k_d) : k_j = 2x_j + 1, 0 ≤ x_j ≤ 2⌊(p_i − 1)/2⌋ − 1, 1 ≤ j ≤ d} be the odd-integer subset of Z_{n,t}. The regular spatial lattice is therefore generated by S_t = L ∘ Z_{n,t} ∘ H and S_{1t} = L ∘ Z_{1n,t} ∘ H.

2.4.3 Well-separateness in the space and time domains
Although stationarity, isotropy and monotone decay as a function of distance are satisfied by Matérn covariance functions, these properties allow us to validate our results even when the spatial sites do not form a lattice. Henceforth we assume the spatial points lie in an irregular domain S_t for all t ∈ {0, t1, ..., t_{T_n}}, with only the restriction a ≤ inf_{s′∈S_t} ‖s′ − s‖₂ ≤ A for some constants a, A > 0 independent of time. Similarly, for the irregular time domain we require b ≤ inf_{t′∈T_n} ‖t′ − t‖ ≤ B for some constants b, B > 0. Therefore we have an increasing-domain setup.
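The space–time Matérn class above is specified through its spectral density; its purely spatial building block is the familiar Matérn correlation, which is also the parametrization used in the simulations of Section 2.11. A minimal sketch of evaluating that correlation, with illustrative parameter values, is given below.

```python
import numpy as np
from scipy.special import gamma as gamma_fn, kv

def matern_corr(d, rho, nu):
    """Matern correlation (2^{1-nu}/Gamma(nu)) (sqrt(2 nu) d/rho)^nu K_nu(sqrt(2 nu) d/rho)."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    out = np.ones_like(d)                     # correlation equals 1 at distance 0
    nz = d > 0
    x = np.sqrt(2.0 * nu) * d[nz] / rho
    out[nz] = (2.0 ** (1.0 - nu) / gamma_fn(nu)) * x ** nu * kv(nu, x)
    return out

# Example with the simulation values nu = 2, rho = 5 (see Section 2.11).
print(matern_corr([0.0, 1.0, 5.0, 10.0], rho=5.0, nu=2.0))
```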
2.5 Regularity Conditions
Denote Σ_{t=1}^T P_t = P̃_T and Σ_{t=1}^T s_t = s̃_T. Σ(θ) is assumed to be second-order stationary, isotropic and twice differentiable over all dimensions of space and time, for θ ∈ Ξ and all (s, t) ∈ D × T, where Ξ is the parameter space of θ. Define the projection covariance matrix
Π̃(θ) = Σ̃⁻¹(θ) − Σ̃⁻¹(θ) X̃1 ( X̃1ᵀ Σ̃⁻¹(θ) X̃1 )⁻ X̃1ᵀ Σ̃⁻¹(θ).
In general, consider any covariance matrix Σ(θ), θ ∈ Ξ, constructed from a covariance function γ(θ) with true parameter θ0. Denote the derivative of Σ(θ) with respect to θ_m by Σ_m(θ) = ∂Σ(θ)/∂θ_m and the second derivative by Σ_{mm′}(θ) = ∂²Σ(θ)/∂θ_m∂θ_{m′}, where θ = (θ1, ..., θ_m, ..., θ_r)′. We require the following assumptions:
(A1) n1/n → π.
(A2) Σ_{t=1}^T s_t = o(n).
(A3) Let V denote the r × r matrix with elements v_{m,m′} = trace( Π̃(θ) ∂Σ̃(θ)/∂θ_m Π̃(θ) ∂Σ̃(θ)/∂θ_{m′} ); then lim_{p∧T→∞} v_{m,m′}/(√v_{mm} √v_{m′m′}) exists and V is non-singular.
(A4) For any compact subset K ⊂ Θ, any θ ∈ K, finite nonzero constants ζ_{1,K}, ζ_{2,K}, and all m, m′ ∈ {1, 2, ..., r}:
(A4).1 lim_{p∧T→∞} λ_max(Σ(θ)) = O(1);
(A4).2 lim_{p∧T→∞} λ_max(Σ_m(θ0)) < ∞, where Σ_m(θ0) = ∂Σ(θ)/∂θ_m |_{θ=θ0};
(A4).3 lim_{p∧T→∞} λ_max(Σ_{mm′}(θ*)) < ∞, where Σ_{mm′}(θ*) = ∂²Σ(θ)/∂θ_m∂θ_{m′} |_{θ=θ*};
(A4).4 ‖Σ_m(θ*)‖_F² = O( (Σ_{t=1}^T P_t)^{1/2+δ} ) for some δ > 0;
(A4).5 lim_{p∧T→∞} λ_min(Σ(θ)) > ζ_{2,K} > 0;
(A4).6 for some subsequence u_n and δ > 0 with lim sup_n u_n/n ≥ 1 − δ, lim inf_{n→∞} λ_{r_n}(Σ_m(θ)) > ζ_{2,K} > 0 for all m, where λ1 ≥ λ2 ≥ ... ≥ λ_{P̃_T}.

2.5.1 Asymptotically optimal misclassification rate
Below, under the usual setup for classical LDA, we show the behavior of the misclassification error rate. Lemma 2.5.2 first provides the asymptotic matrix property required.

Theorem 2.5.1. Let θ̂_n be the restricted maximum likelihood (REML) estimate of θ0. Then ‖θ̂_n − θ0‖₂ = O_p( 1/√(Σ_{t=1}^T P_t n) ).
Proof. Assuming conditions (A1)–(A4) hold, Theorem 3.2 of Cressie & Lahiri (1996) shows that V^{−1/2}(θ̂_{REML} − θ0) ⇒ N(0, I). Since V_{i,j} = trace( Π̃(θ) ∂Σ̃(θ)/∂θ_m Π̃(θ) ∂Σ̃(θ)/∂θ_{m′} ) ≥ O(Σ_{t=1}^T P_t n) inf_i λ_i( Π̃(θ) ∂Σ̃(θ)/∂θ_m Π̃(θ) ∂Σ̃(θ)/∂θ_{m′} ), we infer that ‖θ̂_n − θ0‖₂ = O_p(1/√(Σ_{t=1}^T P_t n)), even in an irregular space–time domain. Later we show that the above assumptions hold for the Matérn covariance class. □

Lemma 2.5.2.
∆̂ᵀΣ⁻¹∆̂ / ( 2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂) ) − √(∆ᵀΣ⁻¹∆)/2 = O_P(PT/n) √(∆ᵀΣ⁻¹∆) + O_P(PT/n).
Proof. From assumptions (A4) we get Σ̂ = Σ + εE, where ε = O_p(√(1/(PT n))) and ‖E‖₂ < ∞, so the geometric series expansion is valid for large n. Thus Σ̂⁻¹ = Σ⁻¹ + O_p(ε)Σ⁻¹EΣ⁻¹ + o_p(ε) for large n. The result follows using ‖Σ⁻¹‖₂ < ∞ and ∆̂ ∼ N_{PT}(∆, (1/n1 + 1/n2)Σ). □

Theorem 2.5.3. Under assumptions (A1)–(A4), with Σ_{t=1}^T P_t = P̃_T = o(n) and C_{P̃_T} → C_∞, we have W(δ̂_{MLE}) → 1 − Φ(√C_∞/2).
Proof. By a Taylor series expansion,
Φ( ∆̂ᵀΣ⁻¹∆̂ / (2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) ) = Φ( √(∆ᵀΣ⁻¹∆)/2 ) + ( ∆̂ᵀΣ⁻¹∆̂/(2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) − √(∆ᵀΣ⁻¹∆)/2 ) φ( √(∆ᵀΣ⁻¹∆)/2 )
  + ½ ( ∆̂ᵀΣ⁻¹∆̂/(2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) − √(∆ᵀΣ⁻¹∆)/2 )² φ′( √(∆*ᵀΣ*⁻¹∆*)/2 )
for some intermediate point √(∆*ᵀΣ*⁻¹∆*)/2 between ∆̂ᵀΣ⁻¹∆̂/(2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) and √(∆ᵀΣ⁻¹∆)/2. Now, using Lemma 2.5.2 and the boundedness of φ and φ′,
we obtain
Φ( ∆̂ᵀΣ⁻¹∆̂ / (2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) ) = Φ( lim_n √(∆ᵀΣ⁻¹∆)/2 ) + o_p(1). □

Lemma 2.5.4.
∆̂ᵀΣ⁻¹∆̂ / ( 2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂) ) − √(∆ᵀΣ⁻¹∆)/2 = O_P(pT/n) √(∆ᵀΣ⁻¹∆) + O_P(pT/n).
Proof. As in Lemma 2.5.2, Σ̂ = Σ + εE, where ε = O_p(√(1/(pT n))) and ‖E‖₂ < ∞, so the geometric series expansion is valid for large n; thus Σ̂⁻¹ = Σ⁻¹ + O_p(ε)Σ⁻¹EΣ⁻¹ + o_p(ε) for large n. The result follows using ‖Σ⁻¹‖₂ < ∞ and ∆̂ ∼ N_{pT}(∆, (1/n1 + 1/n2)Σ). □

Theorem 2.5.5. Under assumption (A3), if pT = o(n) and C_{pT} → C_∞ as n → ∞, then W(δ̂_{MLE}) → 1 − Φ(√C_∞/2).
Proof. By the same Taylor expansion as in Theorem 2.5.3, together with Lemma 2.5.4 and the boundedness of φ and φ′,
Φ( ∆̂ᵀΣ⁻¹∆̂ / (2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) ) = Φ( lim_{n→∞} √(∆ᵀΣ⁻¹∆)/2 ) + o_p(1),
and the result follows. □

With the application to neuroimaging in mind, we next investigate the behavior of the misclassification error rate when pT ≫ n. If we assume the covariance matrix Σ is known, with the unknown means of the LDA classifier estimated by MLE, the following theorem provides the motivation for modifying the current method.

Theorem 2.5.6. Under a known true covariance Σ, define the LDA classifier
δ̂_μ(X) = (X − μ̂)′ Σ⁻¹ (μ̂1 − μ̂2),   (2.18)
where μ̂1, μ̂2 are the MLEs and μ̂ = (μ̂1 + μ̂2)/2. Assume the high-dimensional setup pT/n → ∞ and C_{pT} → C_∞ with 0 < C_∞ < ∞. Then:
(1) for n1 ≠ n2 but n_k > n/4, k ∈ {1, 2}:
 (i) if √C_{pT}/(pT/n) → c for some c > 0, then W(δ̂_μ) → 0 but W(δ̂_μ)/(1 − Φ(√C_{pT}/2)) → ∞;
 (ii) if √C_{pT}/(pT/n) → 0, then W(δ̂_μ) → 1/2;
(2) for n1 = n2:
 (i) if √C_{pT}/(pT/n) → ∞, then W(δ̂_μ) → 0 but W(δ̂_μ)/(1 − Φ(√C_{pT}/2)) → ∞;
 (ii) if √C_{pT}/(pT/n) → c for some c > 0, then W(δ̂_μ) → 1 − Φ(c/4), a constant in (0, 1/2);
 (iii) if √C_{pT}/(pT/n) → 0, then W(δ̂_μ) → 1/2.
Proof. The key step is
(μ1 − μ̂)ᵀ Σ̂⁻¹ ∆̂ = ∆ᵀΣ⁻¹∆ (1 + O_p(1/√(npT))) + (p/(n1 n2))(n1 − n2)(1 + O_p(1/√(npT))),
which follows from the decomposition
(μ1 − μ̂)ᵀ Σ̂⁻¹ ∆̂ = ½ [ ∆ᵀΣ̂⁻¹∆̂ + (μ̂1 − μ1)ᵀΣ̂⁻¹(μ̂1 − μ1) − (μ̂2 − μ2)ᵀΣ̂⁻¹(μ̂2 − μ2) − 2∆ᵀΣ̂⁻¹(μ̂2 − μ2) ]
together with Σ̂⁻¹ = Σ⁻¹(1 + O_p(1/√(npT))) from the previous results; substituting these expansions into Φ(ψ1) gives the stated limits. □

In Section 2.1 we noted that a variety of approaches have been proposed to overcome this singularity. Fan & Lv (2010) further emphasized how sparsity assumptions and sparse estimation can lead to optimal misclassification rates. Similar to the spatial approach adopted by Yingjie & Maiti (2019), we extend these methods to the spatio-temporal setup and incorporate non-separability in the theoretical results that follow.

2.6 Penalized Linear Discriminant Analysis (pLDA)
In most imaging studies with relatively high-resolution information, the number of spatial points (pixels and/or voxels) is almost exponentially larger than the number of subjects available for classification. Thus pT < n is a rather unrealistic scenario.
The LDA classifier under investigation is therefore unsuitable when data acquisition results in pT ≫ n, since Σ̂ is singular. As for the misclassification rate, we established above that when pT/n → ∞ the misclassification rate is equivalent to random guessing even when the true covariance is known, owing to the accumulation of variance in the estimates μ̂_k, k ∈ {1, 2}. There is therefore a need for methods that accommodate the much more realistic scenario pT > n. Although under our setup we can reduce the number of estimated parameters to s + 2 ≪ n, for which an MLE solution exists, those estimates are unstable; we therefore propose a penalized LDA.

To describe the method, we extend the notation of Sections 2.3 and 2.4 as follows. Let ∆ = μ1 − μ2 = (∆1, ..., ∆_{pT})′ be the pT × 1 vector of differences in mean effect between classes C1 and C2. Define the signal set S = {ν : ∆_ν ≠ 0}, the set of true nonzero differences between the classes, and let s = |S| denote its cardinality; we assume s < n ≪ pT. To formalize the vector and matrix representation of the model and incorporate the penalty term, for Y_{k,i} ∼ N_{pT}(μ_k, Σ(θ)) define Z_i = Y_{1,i} − Ȳ for i = 1, ..., n1 and Z_i = Y_{2,i} − Ȳ for i = n1 + 1, ..., n − 1, where Ȳ = Σ_{k=1,2} Σ_{i=1}^{n_k} Y_{k,i} / n and n = n1 + n2. Let τ1 = n1/n and τ2 = n2/n. Then
Z_i ∼ N_{pT}( −τ2∆, ((n−1)/n) Σ(θ) ) for i = 1, ..., n1,
Z_i ∼ N_{pT}( τ1∆, ((n−1)/n) Σ(θ) ) for i = n1 + 1, ..., n − 1,
and Cov(Z_i, Z_j) = −(1/n) Σ(θ) for i ≠ j. Equivalently,
(Z1, Z2, ..., Z_{n−1})ᵀ ∼ N( X̃β, Σ̃(θ) ), with X̃ = X ⊗ I_{pT}, β = ∆, Σ̃(θ) = (I_{n−1} − (1/n) J_{n−1}) ⊗ Σ(θ),   (2.19)
where I_b denotes the b × b identity matrix, J_b the b × b matrix of ones and 1_b the b × 1 vector of ones. We also have Σ̃⁻¹(θ) = (I_{n−1} + J_{n−1}) ⊗ Σ⁻¹(θ) and X = ( −τ2 1_{n1} ; τ1 1_{n2−1} ), so that X̃ = X ⊗ I_{pT}. The joint log-likelihood of Z is
L(β, θ; Z) = −(npT/2) log(2π) − ½ (Z − X̃β)ᵀ Σ̃⁻¹ (Z − X̃β) − ½ log|det(Σ̃)|.   (2.20)
We seek a solution for a sparse true β0 = (β′_{1,0}, β′_{2,0})′ = (β′_{1,0}, 0)′, where β_{1,0} is an s-dimensional vector, β_{2,0} is a (pT − s)-dimensional vector and β0 is pT × 1. In this high-dimensional setup we estimate θ and β by maximizing the penalized log-likelihood
Q(β, θ; Z) = L(β, θ; Z) − n Σ_{t=1}^T Σ_{u=1}^p p_{λ_{t,n}}(|β_u|),   (2.21)
where p_{λ_{t,n}} denotes the penalty function with tuning parameter λ_{t,n} depending on n. To obtain well-behaved sparse estimators we consider the smoothly clipped absolute deviation (SCAD; Fan & Li, 2001) penalty,
p_{λ_{t,n}}(β) = λ_{t,n}|β| if |β| ≤ λ_{t,n};
p_{λ_{t,n}}(β) = −( |β|² − 2αλ_{t,n}|β| + λ_{t,n}² ) / (2(α − 1)) if λ_{t,n} < |β| ≤ αλ_{t,n};
p_{λ_{t,n}}(β) = (α + 1)λ_{t,n}²/2 if |β| > αλ_{t,n}.

2.6.1 Across-sample independence
Notice that the samples Z_i are not uncorrelated, owing to the covariance matrix (I_{n−1} − (1/n)J_{n−1}) ⊗ Σ(θ), but we can apply a linear transformation Z′_n = (H_n ⊗ I_{pT}) Z_n that removes the correlation across the samples in Z and hence, by Gaussianity, yields independence.
Take H_n = I_{n−1} + (1/(√n + 1)) J_{n−1}, so that H_n ( I_{n−1} − (1/n) J_{n−1} ) H_nᵀ = I_{n−1}, and the resulting design matrix is
X′ = H_n X = ( (−τ2 − τ1/(√n + 1)) 1_{n1} ; (τ1 − τ1/(√n + 1)) 1_{n2−1} ).
Note that X̃ᵀΣ̃⁻¹X̃ = X̃′ᵀΣ̃′⁻¹X̃′. Henceforth we refer to the transformed likelihood as our original equation and, with a slight abuse of notation, use the original symbols for the transformed variables. The joint log-likelihood of Z is then
L(β, θ; Z) = −(npT/2) log(2π) − ½ (Z − X̃β)ᵀ Σ̃⁻¹ (Z − X̃β) − ½ log|det(Σ̃)|.   (2.22)
Since we seek a sparse true β0 = (β′_{1,0}, 0)′, where β_{1,0} is s-dimensional and β0 is pT × 1, in this high-dimensional setup we estimate θ and β by maximizing the penalized log-likelihood
Q(β, θ; Z) = L(β, θ; Z) − n Σ_{t=1}^T Σ_{u=1}^p p_{λ_t}(|β_u|),   (2.23)
where p_λ denotes the penalty with tuning parameter λ_n depending on n. As before we use the SCAD penalty (Fan & Li, 2001),
p_{λ_n}(β) = λ_n|β| if |β| ≤ λ_n;
p_{λ_n}(β) = −( |β|² − 2αλ_n|β| + λ_n² ) / (2(α − 1)) if λ_n < |β| ≤ αλ_n;
p_{λ_n}(β) = (α + 1)λ_n²/2 if |β| > αλ_n.

2.6.2 REML estimation
Without loss of generality we can expand β0_{PT×1} = (β0_{1,P1×1}, β0_{2,P2×1}, ..., β0_{t,Pt×1}, ..., β0_{T,PT×1})ᵀ, and for each t we can further expand β0_{t,Pt×1} = (β0_{1,t,st×1}, 0_{(Pt−st)×1})ᵀ. Because of this zero pattern,
(X ⊗ I_{PT}) β0 = (X ⊗ Q) β0, where Q_{PT×PT} = ⊕_{t=1}^T diag( I_{st}, 0_{Pt−st} )
is a block-diagonal selector. In the context of this result let us redefine X̃ = X ⊗ Q; clearly rank(X̃) = nPT − Σ_{t=1}^T s_t. Suppose we obtain a matrix B_n with orthonormal columns such that B_n B_nᵀ = I − X̃ (X̃ᵀX̃)⁻ X̃ᵀ. The REML estimate is defined by
θ̂_{n,REML} = argmax_θ { −½ log det(Σ̃(θ)) − ½ log det(X̃ᵀ Σ̃(θ)⁻¹ X̃) − ½ Zᵀ Π̃(θ) Z }.
Define the oracle estimator β̂_t^{orc}, obtained when the set of zeros in β_{t,0} is already known: under that setup,
β̂_{1,t}^{orc} | θ̂ = ( X̃_{1,t}ᵀ Σ̃⁻¹(θ̂) X̃_{1,t} )⁻ X̃_{1,t}ᵀ Σ̃⁻¹(θ̂) Z_n,
where X̃_{1,t} collects the columns of X̃ corresponding to the nonzero part (generated by I_{st}). This leads to the estimating equation
trace( Π̃(θ̂_{n,REML}) ∂Σ̃(θ̂_{n,REML})/∂θ ) + Zᵀ (∂Π̃(θ̂_{n,REML})/∂θ) Z = 0,   (2.24)
where Π̃(θ) = Σ̃⁻¹(θ) − Σ̃⁻¹(θ) X̃1 (X̃1ᵀ Σ̃⁻¹(θ) X̃1)⁻ X̃1ᵀ Σ̃⁻¹(θ). The attractive features of REML are that the estimating equation (2.24) is unbiased and that the covariance estimate θ̂_{REML} and the mean estimate β̂ are asymptotically independent.

Theorem 2.6.1 (Oracle REML; Cressie & Lahiri, 1996). ‖θ̂_{n,REML} − θ0‖₂ = O_p( 1/√(Σ_{t=1}^T P_t n) ).
Proof. From the result Pr(β̂_t|θ̂ = β̂_t^{orc}|θ̂) → 1 and the assumption that {j : β̂_{t,j}^{orc} = 0} = {j : β_{t,j} = 0}, we know, with probability tending to one, the indices at which β0 is zero, and hence the linear subspace of the design matrix generated by X̃_{1,t} that is needed to construct the projected covariance matrix Π̃. So, with probability converging to 1, the result follows as stated in Cressie & Lahiri (1996). □

2.6.3 Validation of the REML assumptions
From Cressie & Lahiri (1996) we observe that assumptions (A1)–(A4) are sufficient to guarantee convergence of the covariance parameter estimate θ̂. In this section we prove that these assumptions hold for the Matérn covariance class over a suitably chosen parameter space.
Lemma 2.6.2. For every isotropic, stationary covariance function γ(h, t), the matrix Σ with entries Σ_{i,j} = γ(s_i − s_j, t_i − t_j), for fixed spatial points s_i, s_j ∈ S (P of them) and time points t_i, t_j ∈ T (T of them), is positive definite.
Proof. For any vector a, aᵀΣa = Var(aᵀX) ≥ 0, since cov(X_i, X_j) = γ(s_i − s_j, t_i − t_j). □

The conditions listed below are sufficient for assumption (A4), which guarantees convergence of the REML estimators even in an irregular time domain. Since we assume our covariance belongs to the Matérn class, which is isotropic and second-order stationary, we can establish the following lemma.

Lemma 2.6.3. Conditions (A4).1–(A4).3, (A4).5 and (A4).6 are satisfied provided the following hold for any isotropic, second-order stationary process, for any θ in a compact subset K and finite nonzero constants ζ_{1,K}, ζ_{2,K}, where γ(·) stands for γ0, γ_i and γ_{i,j}, denoting the covariance and its first and second derivatives respectively:
• lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′∈S_t, t′≤T_n} |γ(·)(s − s′, t − t′; θ)| < ζ_{1,K};
• lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′≠s∈S_t, t′≠t≤T_n} |γ0(s − s′, t − t′; θ)| < ζ_{2,K} γ0(0, 0; θ);
• for all i, lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′∈S_t, t′≤T_n} γ_i²(s − s′, t − t′; θ) > ζ_{2,K};
• for all i, with N(s) = {s′ : s + s′ ∈ S_t} and N(t) = {t′ : t + t′ ∈ T_n},
lim_n Σ_{s∈S_t, t≤T_n} |γ_i(s, t; θ)| Σ_{s′∈N(s), t′∈N(t)} |γ_i(s − s′, t − t′; θ)| < ζ_{2,K} lim_n Σ_{s′≠s∈S_{1t}, t′≠t∈T_{1n}} γ_i²(s′, t′; θ) for each fixed s ∈ S_{1,t}, t ≤ T_{1,n}.

Proof. Since a closed form of γ(·) is not available for all values of the separability parameter ε, we use the fact that the eigenvalues of Σ and ∂Σ/∂θ_i are continuous in ε, owing to the continuity of f(ε) and ∂f/∂θ_i as functions of ε; note also that the spectral density f(ε) is continuously differentiable in ε. Our strategy is therefore to prove the results for ε = 0 and ε = 1, for which closed forms are available:
γ0(‖s‖, t) = η π^{d/2} Γ(ν + d/2) (α‖s‖)^ν K_ν(α‖s‖) / (2^{ν−1} α^{2ν}) · π^{1/2} Γ(ν + 1/2) (β|t|)^ν K_ν(β|t|) / (2^{ν−1} β^{2ν}),  ε = 1;
γ0(‖s‖, t) = η π^{(d+1)/2} Γ(ν + (d+1)/2) ( √(α²‖s‖² + β²t²) )^ν K_ν( √(α²‖s‖² + β²t²) ) / (2^{ν−1} (α² + β²)^ν),  ε = 0.
Let k ≥ 0 be an integer and define E_k = {s′ : k < ‖s′ − s‖ ≤ k + 1} and N_k = |E_k|, the number of points in E_k. To each point s′ ∈ E_k we associate a disjoint ‖·‖₂-ball of volume (π^{d/2}/Γ(1 + d/2)) (a/2)^d, so the total volume occupied is N_k (π^{d/2}/Γ(1 + d/2)) (a/2)^d, while the volume of E_k is at most (π^{d/2}/Γ(1 + d/2)) 2((k+1)^d − k^d) ≤ (π^{d/2}/Γ(1 + d/2)) 2d(k+1)^{d−1}. Therefore N_k ≤ d(k+1)^{d−1} 2^d / a^d and, since γ0 is non-increasing in the distance between its arguments, sup_{s′∈E_k} γ0(s − s′, t − t′) ≤ γ0(k, t − t′). Hence
lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′∈S_t, t′≤T_n} |γ0(s − s′, t − t′; θ)| ≤ Σ_{k∈Z₊, k2∈Z₊} ( d(k+1)^{d−1} 2^d / a^d ) γ0(k, b k2; θ) ≤ (C_K/b) ∫_{R₊} ∫_{R₊} ( d(s+1)^{d−1} 2^d / a^d ) γ0(s, t) ds dt = O(1),
owing to the finite moments of the modified Bessel function of the second kind. The remaining conditions follow analogously from the uniform integrability of γ0 and its derivatives. □
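As a quick numerical check of Lemma 2.6.2 (and of assumption (A4).5), one can build the covariance matrix implied by an isotropic, stationary correlation on a set of irregular sites and verify that its smallest eigenvalue is positive. The sketch below uses an exponential correlation and arbitrary site coordinates purely for illustration.

```python
import numpy as np

def min_eigenvalue(coords, corr_fn, sigma2=1.0):
    """Smallest eigenvalue of Sigma_{ij} = sigma2 * corr_fn(||s_i - s_j||) on given sites."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    Sigma = sigma2 * corr_fn(d)
    return np.linalg.eigvalsh(Sigma).min()    # should be > 0 for a valid model

rng = np.random.default_rng(0)
sites = rng.uniform(0.0, 10.0, size=(50, 3))          # irregular spatial sites in R^3
print(min_eigenvalue(sites, lambda d: np.exp(-d / 3.0)))
```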
2.6.3.1 Tapered REML
Suppose we taper each spatial covariance matrix with a fixed taper range r_t = K_t √P_t, where the constant K_t is chosen by cross-validation. We solve the modified REML equation
trace( Π_tap(θ̂_{n,REML,tapered}) ∂Σ̃(θ̂_{n,REML})/∂θ ) + Zᵀ ( ∂Π_tap(θ̂_{n,REML,tapered})/∂θ ) Z = 0,   (2.25)
where
Π_tap = (Σ̃ ∘ K_Tap)⁻¹ − (Σ̃ ∘ K_Tap)⁻¹ X̃1 ( X̃1ᵀ (Σ̃ ∘ K_Tap)⁻¹ X̃1 )⁻ X̃1ᵀ (Σ̃ ∘ K_Tap)⁻¹.
In order to show convergence of the tapered REML estimator, i.e. ‖θ̂_{n,REML,tapered} − θ0‖₂ = O_p(1/√(Σ_{t=1}^T P_t n)), we must ensure that, even after tapering, all the conditions of Cressie & Lahiri (1996) remain intact. Hence we prove the following lemma, which guarantees that the tapered matrix is not "far" from the original covariance matrix.

Lemma 2.6.4. Assuming general condition 13 holds, we have, for each t:
• ‖Σ_t − Σ_{t,Taper}‖₁ = O_p(1/√P_t);
• ‖Σ_{k,t} − Σ_{k,t,Taper}‖₁ = O_p(1/√P_t);
• ‖Σ_{k,j,t} − Σ_{k,j,t,Taper}‖₁ = O_p(1/√P_t).
Proof.
‖Σ_t − Σ_{t,Taper}‖₁ = max_i Σ_{j : h_{ij} ≤ w_{p_t}} |γ(s − s′, t − t′; θ) − γ_taper(s − s′, t − t′; θ)| + max_i Σ_{j : h_{ij} ≥ w_{p_t}} |γ(s − s′, t − t′; θ)|
 ≤ (3/(K_t ρ)) Σ_{m=⌈w(p_t)/δ⌉}^{∞} Σ_{j∈B_i^m} |γ_taper(s − s′, t − t′; θ)| ≤ ( K_t ρ / w(p_t) ) ∫_{w_{p_t}}^{∞} x^d |γ(x; θ)| dx = O_p(1/√P_t),
and the first- and second-derivative terms are bounded in the same way. □

Therefore, based on equation (2.23), in order to estimate β̂ and θ̂ we take derivatives of the penalized likelihood with respect to each unknown parameter and solve iteratively until convergence.

2.6.4 Regularity conditions for the penalty
In order to capture properties of the subject-level images over time, we introduce β_{t,0,j}, the unknown true parameters of dimension p × 1 for each fixed time point t ∈ {1, 2, ..., T}. Analogously, under the sparsity assumption we let s_t denote the number of nonzero components at time point t, so that Σ_{t=1}^T s_t = s. Without loss of generality — a voxelwise analysis with subject-wise registration to a single template space is performed — each image contains the same number of spatial sites p, and for every subject images are acquired at T time points. Clearly, for separable covariances this provides the convenience of the Kronecker product. Σ(θ) is assumed to be second-order stationary, isotropic and twice differentiable over all dimensions of space and time for θ ∈ Ξ and (s, t) ∈ D × T, where Ξ is the parameter space of θ. As before, write Σ_m(θ) = ∂Σ(θ)/∂θ_m and Σ_{mm′}(θ) = ∂²Σ(θ)/∂θ_m∂θ_{m′}, with θ = (θ1, ..., θ_m, ..., θ_r)′ and true parameter θ0.
Listed below are the conditions under which consistent pLDA estimates are obtained.
(A3) a_{t,n} = max_{1≤j≤p} { p′_{λ_{t,n}}(|β_{t,0,j}|) : β_{0,j} ≠ 0 } = O(√n).
(A4) b_{t,n} = max_{1≤j≤p} { p′′_{λ_{t,n}}(|β_{t,0,j}|) : β_{0,j} ≠ 0 } = o(1).
(A5) √n λ_{t,n} / √s → ∞.
(A6) λ_{t,n} = o(1) as n → ∞.
(A7) s⁴/n → 0 as n → ∞.
(A8) min_{1≤i≤s} |β_{t,0,i}| / λ_{t,n} → ∞ as n → ∞.
(A9) lim inf_{n→∞} lim_{θ→0+} p′_{λ_{t,n}}(|θ|)/λ_{t,n} > 0.
(A10) ( Σ_{t=1}^T d_{p_t} )( Σ_{t=1}^T s_t ) / (n C_{pT}) → 0, where d_{p_t} = max_{i≤s_t} Σ_{k=s_t+1}^{p_t} σ²_{k,t}.
We can rearrange β0 and re-express it as β0_{pT×1} = (β0_{1,p×1}, β0_{2,p×1}, ..., β_{t,0,p×1}, ..., β_{T,0,p×1})ᵀ. For all the methods below, the only constraint required for identifiability is Σ_{t=1}^T s_t = o(n). Taking λ_{t,n} = λ_n equal across time groups, the results below hold in this special case; a small numerical sketch of the SCAD penalty used throughout is given next.
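For reference, the following sketch evaluates the SCAD penalty and its first derivative p′_λ — the quantities entering conditions (A3), (A4) and (A9). The value a = 3.7 is the choice commonly suggested by Fan & Li (2001) and is used here only for illustration.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001), evaluated elementwise on |beta|."""
    b = np.abs(beta)
    p1 = lam * b
    p2 = -(b ** 2 - 2.0 * a * lam * b + lam ** 2) / (2.0 * (a - 1.0))
    p3 = (a + 1.0) * lam ** 2 / 2.0
    return np.where(b <= lam, p1, np.where(b <= a * lam, p2, p3))

def scad_derivative(beta, lam, a=3.7):
    """p'_lambda(|b|) = lam * { 1(|b|<=lam) + (a*lam - |b|)_+ / ((a-1)*lam) * 1(|b|>lam) }."""
    b = np.abs(beta)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam) * (b > lam))
```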
Lemma 2.6.5. Assuming ‖θ̂ − θ‖₂ = O_p(√(1/(PT n))), for each t = 1, 2, ..., T we have ‖β̂_t − β_{t,0}‖₂ = O_p(√(s_t/n)) and β̂_{t,2} = 0_{P×1} with probability tending to 1, where β̂_t|θ̂ = argmax_{β_t} Q(β, θ̂; Z).
Proof. For each fixed t the proof consists of two steps. In the first part we show that ‖β̂_t − β_{t,0}‖₂ = O_p(√(s_t/n)); in the second part we show that β̂_{t,2} = 0_{P×1} with probability tending to 1. Without loss of generality write β_{0,t} = (β_{0,t,1}, 0)ᵀ, and suppose an oracle has already told us the positions of the zeros, so that we can construct an oracle estimator that has zeros in those positions:
β̂_t^{orc}|θ̂ = argmax_{β_{t,1} : β_{t,1}^c = 0} L(β, θ̂; Z).
By this construction we search for β̂_t^{orc} only in the space of vectors whose nonzero elements correspond to the nonzero part of the true β_{t,0}. In the second part we show that, for any maximizer of the penalized likelihood β̂_t|θ̂ = argmax_β Q(β, θ̂; Z), we have Pr(β̂_t|θ̂ = β̂_t^{orc}|θ̂) → 1. Furthermore, if the likelihood is concave, as in our Gaussian case, the maximizer β̂_t is unique. To prove the second part we use the stationarity condition of the KKT conditions referred to in Kwon & Kim (2012).

The first part is established by the following theorem.
Theorem 2.6.6. Assuming ‖θ̂ − θ‖₂ = O_p(√(1/(PT n))), for each t = 1, 2, ..., T, ‖β̂_t^{orc}|θ̂ − β_{t,0}‖₂ = O_p(√(s_t/n)).
Proof. Denote by u_t ∈ R^{p_n} the vector with entries 1 at the indices j such that β_{t,j} ≠ 0 and 0 elsewhere; without loss of generality write u for u_t. Similarly, X̃_t denotes the columns of the design matrix X̃ corresponding to β_{t,j} ≠ 0, so that X̃_t u_t = X̃ (u, 0, ..., 0)ᵀ. It is sufficient to show that for any ε > 0, with Σ_{t=1}^T s_t = s, ξ_{n,t} = O_p(√(s_t/n)) and ξ_n = O_p(√(s/n)),
Pr( sup_{‖u‖₂=C} L(β_{t,0} + u ξ_{n,t}, θ̂) − L(β_{t,0}, θ̂) < 0 ) > 1 − ε
for some C > 0, under the condition that for r ∈ {1, 2, ..., T} \ {t} we have ‖β̂_r − β_r‖₂ = O_p(√(s_r/n)); equivalently, we can prove that
Pr( sup_{‖u‖₂=C} L(β_{0,t} + u ξ_n, θ̂) − L(β_{0,t}, θ̂) < 0 ) > 1 − ε.
For term (1), with ψ_i denoting the i-th eigenvalue, we decompose
(1) = ½ uᵀ X̃_tᵀ Σ̃(θ̂)⁻¹ X̃_t u ξ_n² = ½ uᵀ X̃_tᵀ Σ̃(θ0)⁻¹ X̃_t u ξ_n²  (1a)  + ½ Σ_{r=1}^q uᵀ X̃_tᵀ Σ̃_r(θ*) (θ̂_r − θ_{0,r}) X̃_t u ξ_n²  (1b),
and by the eigenvalue inequality
(1a) ≤ O( ‖u‖₂² min_i ψ_i( X̃_tᵀ (I_{n−1} + J_{n−1}) X̃_t ) min_i ψ_i( Σ(θ̂)⁻¹ ) ) ξ_n² = O(n ξ_{n,t}²).
Similarly, using the Cauchy–Schwarz inequality, (1b) = O(n ξ_{n,t}² η_n).
For term (2) we use a Taylor expansion of Σ̃(θ̂)⁻¹ around θ0, writing θ̂ − θ0 = v η_n:
(2) = −(Z − X̃β)ᵀ Σ̃(θ0)⁻¹ X̃_t u_t ξ_n  (2a)  − Σ_{r=1}^q (Z − X̃β_t)ᵀ Σ̃_r(θ*) (θ̂_r − θ_{0,r}) X̃ u ξ_n  (2b).
For term (2a), using the Markov inequality (for any random variable Z, Z = O_p(√E(Z²))),
−(Z − X̃β)ᵀ Σ̃(θ0)⁻¹ X̃_t u_t ξ_n = O_p( √( min_i ψ_i(Σ(θ0)⁻¹) trace(X̃_tᵀ X̃_t) ) ‖u‖₂ ξ_n ) = O_p( √( (n1 n2/n) s_t ) ‖u‖₂ ξ_n ) = O_p( √(n s_t) ‖u‖₂ ξ_n ).
Similarly, for term (2b),
Σ_{r=1}^q (Z − X̃β)ᵀ Σ̃_r(θ*) X̃ u ξ_n η_n = O_p( √( min_i ψ_i( Σ̃_r(θ*) Σ̃(θ0) Σ̃_r(θ*) ) trace(X̃ᵀ X̃) ) ‖u‖₂ ‖v‖₂ ξ_n η_n ) = O_p( √(n s_t) ‖u‖₂ ξ_n η_n ) = o_p( √(n s_t) ‖u‖₂ ξ_n ).
We observe that term (1) dominates all the other terms; hence, for an appropriate constant C, the difference above is negative, which proves the claim. □

Lemma 2.6.7. Pr( min_{j≤q_n} |β̂_{t,j}|θ̂| > aλ_n ) → 1.
Proof. min_{j≤q_n} |β̂_{t,j}|θ̂| ≥ min_{j≤q_n} |β_{0,t,j}| − max_{j≤q_n} |β̂_{t,j}|θ̂ − β_{0,t,j}|. Using the result ‖β̂_{t,j}|θ̂ − β_{0,t,j}‖₂ = O_p(√(s_t/n)) and assumption (A6), we establish the result. □

Lemma 2.6.8. Pr( max_{q_n+1≤j≤p_n} |∂L(β̂_t^{orc}|θ̂, θ̂)/∂β_j| < nλ_n for all t ) → 1.
Proof. E_{β,θ}[ ∂L(β0, θ = θ̂)/∂β ] = 0 by conditional expectation. Using Lemma 9.1, we can write
∂L(β̂_t^{orc}|θ̂, θ̂)/∂β = −(Z − X̃ β̂_t^{orc}|θ̂)ᵀ Σ̃⁻¹(θ̂) X̃_{1,t} = −(Z − X̃β0)ᵀ Σ̃⁻¹(θ̂) X̃_{1,t} − (β̂_t^{orc}|θ̂ − β0)ᵀ X̃ᵀ Σ̃⁻¹(θ̂) X̃_{1,t}.
By the Gaussian concentration inequality, the probability that any of these quantities exceeds nλ_n is of order PT · O(e^{−λ²_{n,t}}). Hence proved. □

Lemma 2.6.9. If ‖β̂ − β0‖ = O_p( √( Σ_{t=1}^T s_t / n ) ), then ‖θ̂ − θ0‖ = O_p( 1/√( Σ_{t=1}^T P_t n ) ).
Proof. We will show that for arbitrary ε > 0,
Pr( sup_{‖v‖₂=C} Q(β̂, θ0 + v η_n) − Q(β̂, θ0) < 0 ) > 1 − ε, where η_n = O_p( 1/√(Σ_t P_t n) ).
Decompose
Q(β̂, θ0 + v η_n) − Q(β̂, θ0) = [ Q(β0, θ0 + v η_n) − Q(β0, θ0) ]  (1)  + [ Q(β̂, θ0 + v η_n) − Q(β0, θ0 + v η_n) − Q(β̂, θ0) + Q(β0, θ0) ]  (2).
Expanding term (1) by a Taylor series we obtain
(1) = η_n Σ_{r=1}^q v_r ∂Q(β0, θ0)/∂θ_r  (11)  + η_n² Σ_{r,r′=1}^{q,q} v_{r′} [ −t̃_{r,r′}(θ0) ] v_r  (12)  + η_n² Σ_{r,r′=1}^{q,q} v_{r′} [ ∂²Q(β0, θ*)/∂θ_r∂θ_{r′} + t̃_{r,r′}(θ0) ] v_r  (13),
where Σ̃_r(θ)⁻¹ denotes ∂Σ̃⁻¹(θ)/∂θ_r and
∂Q(β0, θ0)/∂θ_r = −(Z − X̃β0)ᵀ Σ̃_r(θ0)⁻¹ (Z − X̃β0) + trace( Σ̃(θ0) Σ̃_r(θ0)⁻¹ ).
Using the Markov inequality,
∂Q(β0, θ0)/∂θ_r = O_p( √( trace( Σ̃(θ0) Σ̃_r(θ0)⁻¹ Σ̃(θ0) Σ̃_r(θ0)⁻¹ ) ) ) = O_p( ‖Σ̃(θ0) Σ̃_r(θ0)⁻¹‖_F ) ≤ O_p(√(nPT)) λ_max(Σ(θ0)) λ_max(Σ_r(θ0)⁻¹) = O_p(√(nPT)),
so that (11) = O_p(1) ‖v‖₂.
Let t̃_{i,j}(θ) = trace( I_n ⊗ Σ⁻¹(θ0) Σ_i(θ0) Σ⁻¹(θ0) Σ_j(θ0) ) = n t_{i,j}(θ0), and write t_{i,j}(θ0) = t_{i,i}^{1/2} a_{i,j} t_{j,j}^{1/2} with a_{i,j} = t_{i,j}/√(t_{i,i} t_{j,j}). By assumption, lim_n A = (a_{i,j}) exists with lim_n λ_max(A) = O_p(1), and t_{i,i}(θ0) ≥ PT λ²_min(Σ⁻¹(θ0)) λ²_min(Σ_i(θ0)). Hence, for some constant K_n,
(12) ≤ −( nPT η_n² ‖v‖₂² / (2K_n) ) λ²_min(Σ⁻¹(θ0)) min_{i≤q} λ²_min(Σ_i(θ0)),
a negative term of exact order nPT η_n² ‖v‖₂².
For term (13), write
∂²Q(β0, θ*)/∂θ_r∂θ_{r′}  (131) = trace( Σ̃(θ*) Σ̃_{r,r′}(θ*)⁻¹ ) − (Z − X̃β)ᵀ Σ̃_{r,r′}(θ*)⁻¹ (Z − X̃β) + trace( Σ̃_r(θ*) Σ̃_{r′}(θ*)⁻¹ ).
Similar to expression (11),
(131) = O_p( ‖Σ̃(θ*) Σ̃_{r,r′}(θ*)⁻¹‖_F ) + trace( Σ̃_{r,r′}(θ*)⁻¹ ( Σ̃(θ*) − Σ̃(θ0) ) )
  ≤ O_p(√(nPT)) λ_max(Σ_{r,r′}(θ*)⁻¹) λ_max(Σ(θ0)) + n λ_max(Σ_{r,r′}(θ*)⁻¹) trace( Σ(θ0) − Σ(θ*) )
  ≤ O_p(√(nPT)) + n O_p(1) PT max_{i≤PT} ⟨ ∂γ_i/∂θ |_{θ**}, θ0 − θ* ⟩ ≤ O_p(√(nPT)) + n O_p(1) PT max_{i≤PT} ‖∂γ_i/∂θ |_{θ**}‖∞ ‖θ0 − θ*‖₂ = O_p(√(nPT)).
Using the identity Σ̃_r(θ) Σ̃_{r′}(θ)⁻¹ = −Σ̃_r(θ) Σ̃⁻¹(θ) Σ̃_{r′}(θ) Σ̃⁻¹(θ), we obtain
trace( Σ̃_r(θ0) Σ̃_{r′}(θ0)⁻¹ ) − trace( Σ̃_r(θ*) Σ̃_{r′}(θ*)⁻¹ ) = trace( Σ̃_r(θ0) ( Σ̃_{r′}(θ0)⁻¹ − Σ̃_{r′}(θ*)⁻¹ ) ) + trace( Σ̃_{r′}(θ*)⁻¹ ( Σ̃_r(θ0) − Σ̃_r(θ*) ) )
 ≤ n λ_max(Σ_r(θ0)) trace( Σ_{r′}(θ0)⁻¹ − Σ_{r′}(θ*)⁻¹ ) + n λ_max(Σ_{r′}(θ*)⁻¹) trace( Σ_r(θ0) − Σ_r(θ*) )
 = n O_p(1) PT max_{i≤PT} ⟨ ∂γ_i^{(r′,−1)}/∂θ |_{θ**}, θ0 − θ* ⟩ + n O_p(1) PT max_{i≤PT} ⟨ ∂γ_i^{(r)}/∂θ |_{θ**}, θ0 − θ* ⟩ = O_p(√(nPT)),
similarly to the second part of (131). Thus (13) = O_p(√(nPT)) η_n² ‖v‖₂² = O_p(1/√(nPT)) ‖v‖₂², and term (1) is dominated by (12).
As for expression (2), it is easy to check that the penalty terms cancel and we are left with
(β̂ − β0)ᵀ X̃ᵀ [ Σ̃⁻¹(θ0 + v η_n) − Σ̃⁻¹(θ0) ] (Z − X̃β0) + (β̂ − β0)ᵀ X̃ᵀ [ Σ̃⁻¹(θ0 + v η_n) − Σ̃⁻¹(θ0) ] X̃ (β̂ − β0)
 = η_n (β̂ − β0)ᵀ X̃ᵀ Σ_{r=1}^q Σ̃_r(θ*) v_r (Z − X̃β) + η_n (β̂ − β0)ᵀ X̃ᵀ Σ_{r=1}^q Σ̃_r(θ*) v_r X̃ (β̂ − β0)
 = O_p( ‖β̂ − β0‖₂ ‖θ̂ − θ0‖₂ ) √( n trace( Σ⁻¹(θ0) Σ_r(θ0)⁻¹ ) ) + O_p( n ‖β̂ − β0‖₂² ‖θ̂ − θ0‖₂ λ_max(Σ_r(θ*)⁻¹) ).
Hence (12) dominates all the remaining terms, and choosing ‖v‖ sufficiently large proves the statement. □

Theorem 2.6.10. min_{1≤j≤q_n} β̂_j^{orc}|θ̂ = O(n^{2c}).
Proof. This follows from min_{1≤j≤q_n} β̂_j^{orc}|θ̂ ≤ min_{1≤j≤q_n} β_j|θ + O_p( ‖β̂^{orc}|θ̂ − β|θ‖₂ ), together with the assumption on the signal and the result that O_p( ‖β̂^{orc}|θ̂ − β|θ‖₂ ) = o_P(n^{2c}). □
2.7 Algorithm and methodology to obtain optimal solutions
Based on the objective function (2.23), in order to estimate β̂ and θ̂ we take the derivatives of the penalized likelihood with respect to each of the unknown parameters and solve iteratively until convergence. Following the one-step sparse estimates of Zou & Li (2008), as a first-step initialization we replace Σ(θ̂) with the identity matrix I_{pT} and obtain the solution for β, denoted β̂^{(ini)}. Given β̂^{(ini)}, we obtain a solution θ̂^{(0)}, which is then used to obtain β̂^{(0)}, which in turn is used to obtain θ̂^{(1)}; finally we obtain β̂^{(1)} given θ̂^{(1)}. As for already established MLE-based methods under the SCAD penalty, it has been shown that consistency can be achieved with these one-step estimates, so there is no apparent need to iterate further to full convergence. Along similar lines, the theoretical consistency properties are established for each estimate under the assumption that the remaining parameters are held fixed, at any given iteration, at the solution obtained in the previous step.
Step (1): Set Σ = I_{pT} and solve β̂_t^{(ini)} = argmax_{β_t} Q(β_t, Σ = I_{pT}; Z).
Step (2): Set β_t = β̂_t^{(ini)} and solve θ̂^{(0)} = argmax_θ L(β̂_t^{(ini)}, θ; Z).
Step (3): Solve β̂_t^{(0)} = argmax_{β_t} Q(β, θ = θ̂^{(0)}; Z).
Step (4): Solve θ̂^{(1)} = argmax_θ L(β_t = β̂_t^{(0)}, θ; Z).
Step (5): Solve β̂_t^{(1)} = argmax_{β_t} Q(β, θ = θ̂^{(1)}; Z).
The final estimates of the spatio-temporal covariance and of the mean-difference parameters are therefore θ̂^{(1)} and β̂_t^{(1)} respectively; a schematic sketch of these five steps is given at the end of this section.

2.7.1 Asymptotic properties of the one-step estimates
In order to show consistency of all the estimators, we show that consistency is obtained under the assumption that the previous estimate of the parameter held fixed in the next iteration is itself consistent. Based on these notions, consider the following theorem.
Theorem 2.7.1. Assuming conditions (A1)–(A10) hold, for each fixed t ∈ {1, ..., T}:
(i) (Consistency) there exists an estimate β̂_t = (β̂_{1,t}, β̂_{2,t})ᵀ such that ‖β̂_t − β_{0,t}‖ = O_p(√(s_t/n));
(ii) (Sparsity) further, if (i) holds, then Q((β̂_1, 0)′, θ; Z) > max_{‖β̂_2‖₂ = C√(s/n)} Q((β̂_1, β̂_2)′, θ; Z).
Proof. See the Appendix.
In a similar way, we are able to establish existence and consistency results for θ̂ as well.
Theorem 2.7.2. Assuming conditions (A1)–(A10) hold, ‖θ̂ − θ0‖ = O_p(√(1/(pTn))) as p, T, n → ∞.
Since the asymptotic properties hold for every time component, these results combine to give the following.
Theorem 2.7.3. For β_{0,pT×1} = (β_{0,1,p×1}, β_{0,2,p×1}, ..., β_{0,t,p×1}, ..., β_{0,T,p×1})ᵀ and β̂_{pT×1} = (β̂_{1,p×1}, β̂_{2,p×1}, ..., β̂_{t,p×1}, ..., β̂_{T,p×1})ᵀ:
(i) (Consistency) there exists an estimate β̂ such that ‖β̂ − β0‖ = O_p(√(s/n)) as n → ∞;
(ii) (Sparsity) P(β̂_{t,2} = 0 for all t) → 1 as n → ∞.
All proofs of Theorems 2.7.1, 2.7.2 and 2.7.3 are provided in the Appendix. The covariance function is assumed to be stationary, and the results are based on the SCAD penalty, which yields nearly unbiased estimators.
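The following is a schematic sketch of Steps (1)–(5). The callables fit_beta (the SCAD-penalized mean update), fit_theta (the covariance-parameter update, e.g. tapered REML) and Sigma_of (which builds Σ(θ) from fitted parameters) are placeholders standing in for the estimators described above, not actual implementations.

```python
import numpy as np

def one_step_estimates(Z, X, fit_beta, fit_theta, Sigma_of, pT):
    """One-step alternating scheme of Section 2.7 (Zou & Li, 2008 style)."""
    beta_ini = fit_beta(Z, X, np.eye(pT))          # Step 1: beta with identity working covariance
    theta_0 = fit_theta(Z, X, beta_ini)            # Step 2: covariance parameters given beta_ini
    beta_0 = fit_beta(Z, X, Sigma_of(theta_0))     # Step 3: beta given Sigma(theta_0)
    theta_1 = fit_theta(Z, X, beta_0)              # Step 4: updated covariance parameters
    beta_1 = fit_beta(Z, X, Sigma_of(theta_1))     # Step 5: final one-step beta
    return beta_1, theta_1
```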
2.8 Computational Complexity
Thus far we have described a penalized LDA technique that simultaneously estimates the parameters of an underlying spatio-temporal covariance model and performs selection and estimation of the mean differences. However, once the vector Z is constructed, the corresponding matrix Σ̃(θ) has dimension (n−1)pT × (n−1)pT. Even assuming the subjects are independent, we still need the inverse of a pT × pT matrix, at a computational cost of O((pT)³). To ease this burden, we turn to tapering techniques.

2.8.1 Covariance Tapering
In spatial statistics, Kaufman et al. (2008) introduced covariance tapering to approximate the likelihood by replacing the spatial covariance with a positive-definite tapered version of the covariance matrix, and established the computational gain while maintaining the underlying theoretical properties. For very large datasets, Furrer et al. (2006) delineate the use of tapered covariance matrices. As the density of sites in an image is exponentially larger than the number of time points in our longitudinal imaging study, we taper only the spatial covariance of a separable spatio-temporal covariance matrix. Suppose we use a tapering function K_Tap(h, w) for the spatial domain only, with spectral density
f_K(ω) = (2π)^{−d} ∫_{R^d} exp(−iwᵀx) K_Tap(x, ω) dx
subject to the restriction
f_K(ω) ≤ M_ψ / (1 + ‖ω‖²/α²)^{ν + d/2 + ψ},
where ψ > max(1 − ν, d/4). The Wendland taper function of degree k satisfies this condition for k > max{1/2, ν + (d − 2)/4 + δ} for some δ > 0. We consider the Wendland-type taper for each time-fixed covariance matrix Σ_t, K_Tap(h, r_t) = [(1 − h/r_t)₊]², with k = 2 and d = 3.

2.8.2 One-way versus two-way tapering
One-way tapering yields a biased score function:
l1(θ) = −(nPT/2) log(2π) − (Z − X̃β)ᵀ (Σ̃ ∘ K_Tap)⁻¹ (Z − X̃β) − log(det(Σ̃ ∘ K_Tap)) − Σ_{t=1}^T Σ_{j=1}^P p_{λ_{t,n}}(|β0_{t,j}|).
Two-way tapering yields an unbiased score function:
l2(θ) = −(nPT/2) log(2π) − (Z − X̃β)ᵀ [ (Σ̃ ∘ K_Tap)⁻¹ ∘ K_Tap ] (Z − X̃β) − log(det(Σ̃ ∘ K_Tap)) − Σ_{t=1}^T Σ_{j=1}^P p_{λ_{t,n}}(|β0_{t,j}|).
It is established that, under certain assumptions, the two methods give asymptotically close estimates, ‖θ̂_{1,taper} − θ̂_{2,taper}‖₂ = o_p( 1/√(Σ_{t=1}^T P_t n) ).

2.8.3 Tapering range
We have established that the chosen tapering range guarantees estimation consistency for θ0 as n → ∞ under the increasing-domain setup. Furrer et al. (2016) and Chu et al. (2011) establish estimation consistency under sufficient conditions that are satisfied by our Matérn class of covariances, mainly because, for appropriate true parameters,
γ(s, t; θ) < A / (1 + ‖s‖^{3+α1} + |t|^{1+α2}) and ∂γ(s, t; θ)/∂θ < A / (1 + ‖s‖^{3+α1} + |t|^{1+α2})
for some α1 > 0 and α2 > 0. Similar results hold for non-separable covariances as well. Additionally, ∫_0^∞ ∫_0^∞ s^{d+1} t² γ(s, t) ds dt < ∞, since the spectral density is infinitely differentiable. Thus, with tapering range r_t = O(√P_t),
sup_{θ0∈Θ} |L_{θ0} − L_{θ_tap}| = (1/n) log( |Σ̃(θ0) (Σ̃(θ0) ∘ K_taper)⁻¹| ) + (1/n) (Z − Xβ0)ᵀ ( Σ̃⁻¹(θ0) − (Σ̃(θ0) ∘ K_taper)⁻¹ ) (Z − Xβ0) = o_p(1).
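A minimal sketch of the one-way spatial taper described above: the taper matrix K_Tap(h, r_t) = [(1 − h/r_t)₊]² is applied elementwise (a Hadamard product) to the spatial covariance, producing exact zeros beyond the taper range; in the two-way version the resulting inverse is tapered again inside the quadratic form. The coordinates and taper range below are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import cdist

def taper_matrix(coords, taper_range):
    """K_Tap(h, r) = [(1 - h/r)_+]^2, zero beyond the taper range (yields sparsity)."""
    h = cdist(coords, coords)
    return np.clip(1.0 - h / taper_range, 0.0, None) ** 2

def one_way_taper(Sigma, coords, taper_range):
    """One-way tapering: elementwise product Sigma o K_Tap of the spatial covariance."""
    return Sigma * taper_matrix(coords, taper_range)

# Illustration on a 10 x 10 spatial grid with exponential covariance, range r = 3.
xx, yy = np.meshgrid(np.arange(10), np.arange(10))
coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)
Sigma = np.exp(-cdist(coords, coords) / 3.0)
Sigma_tap = one_way_taper(Sigma, coords, taper_range=4.0)
print(np.mean(Sigma_tap == 0.0))    # fraction of entries zeroed out by the taper
```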
2.9 Misclassification Optimality
This theorem shows that, after penalized maximum likelihood estimation under tapering, the plug-in classifier attains the optimal rate.
Theorem 2.9.1. Suppose s_t/(n C_p) → 0. Then the worst-case classification error rate satisfies W(δ̂) → 1 − Φ(√C_0/2).
Proof. Refer to the appendix. □

2.10 MRI Data Preprocessing
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). For up-to-date information, see www.adni-info.org. To demonstrate the methodology we consider MRI images obtained from a two-year study using a 3T scanner. A total of 51 subjects (31 controls, 18 AD) were collected, and on average each subject has 4 visits. The preprocessing protocol from ADNI allows us to download N3 (nonparametric nonuniformity normalization) bias-corrected NIfTI images. For visualizing these images we use the freely available toolkit ITK-SNAP developed by Yushkevich et al. (2016). An affine registration with the skull on was performed for every subject, mapping its later visits to the first: visits two, three and four were affine-registered to visit one using ANTs registration (Avants et al., 2011), implemented in R through the ANTsR and extrantsr packages (Muschelli et al., 2018).
After this first set of registrations on the longitudinal dataset, skull-stripping was performed for each subject on the first visit only, extracting the brain tissue using the MASS tool developed by Doshi et al. (2013). This step was run in parallel, as it is known to be computationally heavy; if it is hard to implement, one may consider the simpler FSL BET tool (Popescu et al., 2012). The brain mask obtained from skull-stripping the first visit of each subject is then applied to all other visits of that subject. To perform voxel-wise analyses, a deformable registration to a template is performed; for the 3D 3T MRI scans we chose the SRI24 atlas (Rohlfing et al., 2008). This deformable registration was again performed with ANTs, with the transformations calculated from the first visit of each subject to the template. A forward transformation consists of a .mat file (for the initial affine registration) and a forward warp image (.nii.gz); these transformations may then be applied to the corresponding time points. A quality-control check may then be done in ITK-SNAP to ensure that all images have been registered to the same space. ROI labels can be obtained from www.nitrc.org/projects/sri24, which provides the segmentation of the ROIs and the corresponding label text file; for AD studies the hippocampus is labeled 37 (left) and 38 (right).

2.11 Simulation Studies
In the following simulation study we investigate a variety of conditions under which the proposed methods are employed. Multivariate Gaussian observations are generated on a lattice grid of p × p ∈ R² locations, where p ∈ {10, 15, 20}, over t = 1, ..., 4 time points. The mean of this multivariate Gaussian distribution is characterized in two ways. The first setting assumes a strong signal indicating a significant difference between the two classes C1 and C2, i.e. ∆ = (1, ..., 1, 2, ..., 2, 0, ..., 0)ᵀ, where s = 20 is the number of nonzero entries, the first 10 of which equal 1 and the next 10 equal 2. Similarly, a smaller difference between the two classes was considered with ∆_small = (0.75, ..., 0.75, 0.5, ..., 0.5, 0, ..., 0)ᵀ. A sketch of this data-generating design is given below.
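The sketch below generates one such simulated dataset: a sparse mean-difference vector with s = 20 nonzero entries, Gaussian samples for the two classes under a user-supplied covariance matrix, and a 50% training/validation split. Taking μ2 = 0 and μ1 = ∆ is one convenient (assumed) parametrization; all names are illustrative.

```python
import numpy as np

def simulate_two_classes(Sigma, n_per_class=50, small=False, seed=0):
    """Class C1 has mean delta, class C2 has mean 0, both with covariance Sigma."""
    rng = np.random.default_rng(seed)
    pT = Sigma.shape[0]
    delta = np.zeros(pT)
    delta[:10] = 0.75 if small else 1.0        # first 10 nonzero entries
    delta[10:20] = 0.5 if small else 2.0       # next 10 nonzero entries
    L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(pT))   # small jitter for stability
    Y1 = delta + rng.standard_normal((n_per_class, pT)) @ L.T
    Y2 = rng.standard_normal((n_per_class, pT)) @ L.T
    return Y1, Y2, delta

def split_half(Y, seed=0):
    """50% training / 50% validation split of one class sample."""
    idx = np.random.default_rng(seed).permutation(len(Y))
    half = len(Y) // 2
    return Y[idx[:half]], Y[idx[half:]]
```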
Under space and time separable assumption multiple spatial covariance functions are assumed such as: • Exponential covariance: The spatial dependence of the error terms are γ(d) = σ2(1 − c) exp(−d/r) where γ(d) = σ2 when d = 0 and d denotes the euclidean distance between two spatial points. r = {3, 10}, c = {0.2, 0.5} and σ = 1. • Polynomial correlation: The covariance function is γ(d) = ρd where ρ = 0.9. • Matern covariance: The covariance function is given by, (cid:0)(cid:113) (cid:0)(cid:113) (cid:1)νKν 2ν d ρ γ(d) = σ2 21−ν Γ(ν) 2ν d ρ (cid:1) where Kν(·) is the modified Bessel function of the second kind, ρ, ν > 0 and Γ is the gamma function. In this study, ν = 2 and ρ = 5. The covariance assumed for the four time points was exponential where γ(dt) = exp(−dt/rt) where dt denotes the euclidean distance between pairs of time points and rt = {0.5, 0.7, 1}. For the non-separable case we use the following covariance function, (cid:0) (cid:1) γ(d, dt) = e −dν t )0.5γµ (1+dλ 1+dλ t • Gneiting covariance: This non-separable space time covariance is, , where ν, λ ∈ [0, 2] and γ ∈ (0, 1). If γ = 0 then the model is separable. In the simulation study the values considered were ν = 0.8, λ = 1 and γ = 0.6. For all of the simulation studies we have 100, 225 and 400 locations on 4 time points. Therefore the total number of observations per subject is 400, 900 and 1600 observations respectively. The total number of non-zero components in the mean vector s = 20. So best case scenario for selection is when true positives (TP) is 20 and false positives (FP) is 0. A total of 100 subjects per study is generated with a 50% training and validation split. 78 2.11.1 Exponential Space Time Covariance (weak correlations) 2.11.1.1 With ∆ p × p 100 225 400 r=3 3.009 (0.0979339536) 2.996 (0.0736011373) 3.002 (0.0516560637) c=0.2 0.231 (0.0228625958) 0.234 (0.0175467671) 0.233 (0.0125129651) σ = 1 1 (4.234e-07) 1 (5.015e-07) 1 (2.954e-07) rt = 1 0.992 (0.0103475934) 0.995 (0.0069044814) 0.996 (0.0048776816) Table 2.1: Exponential separable space and time covariance estimation when ∆ is the mean vector p × p 100 225 400 TP 20.00 20.00 19.99 FP 47.14 111.02 189.68 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00202 0.00124 0.00097 Table 2.2: Estimation and selection of ∆ using the estimated covariance p × p 100 225 400 TP 19.87 19.83 19.70 FP 45.38 64.56 78.59 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00851 0.00467 0.00310 Table 2.3: Estimation and selection of ∆ under independence p × p 100 225 400 r=3 2.992 (0.0940290558) 3.005 (0.0735975377) 3 (0.0567520396) c=0.2 0.234 (0.0215993179) 0.232 (0.0166009694) 0.233 (0.0131230581) σ = 1 1 (5e-07) 1 (4.172e-07) 1 (4.156e-07) rt = 1 0.993 (0.0109421777) 0.993 (0.0080306034) 0.995 (0.0060978283) Table 2.4: Exponential separable space and time covariance estimation when ∆small is the mean vector 79 p × p 100 225 400 TP 17.66 16.13 12.99 FP 119.99 231.17 311.30 MSE Train1 Train2 Test1 Test2 0.79 0.73 0.66 0.79 0.73 0.67 0.79 0.73 0.67 0.79 0.73 0.67 0.00839 0.00506 0.00385 Table 2.5: Estimation and selection of ∆small using the estimated covariance p × p 100 225 400 TP 15.70 14.41 12.68 FP 45.50 55.40 54.21 MSE Train1 Train2 Test1 Test2 0.75 0.72 1.00 0.75 1.00 0.69 0.75 1.00 0.69 0.75 0.71 0.69 0.01081 0.00561 0.00345 Table 2.6: Estimation and selection of ∆small under independence 2.11.2 Exponential Space Time Covariance (strong correlations) 2.11.2.1 With ∆ p × p 100 225 400 
r=10 10.024 (0.6592026826) 9.918 (0.4420207073) 9.952 (0.3472843583) c=0.5 0.519 (0.0310983865) 0.523 (0.021304777) 0.522 (0.0165295756) σ = 1 1 (3.3601e-06) 1 (2.4351e-06) 1 (3.5861e-06) rt=0.5 0.47 (0.1196661211) 0.466 (0.1284748148) 0.475 (0.1094737503) Table 2.7: Exponential separable space and time covariance estimation when ∆ is the mean vector p × p 100 225 400 TP 20.00 16.13 12.99 FP 0.79 231.17 311.30 MSE Train1 Train2 Test1 Test2 1.00 0.73 0.66 0.79 0.73 0.67 1.00 0.73 0.67 1.00 0.73 0.67 0.00029 0.00506 0.00385 Table 2.8: Estimation and selection of ∆ using the estimated covariance p × p 10x10 15x15 20x20 TP 19.98 19.98 19.99 FP 47.08 61.14 78.91 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00487 0.00229 0.00154 Table 2.9: Estimation and selection of ∆ under independence 80 p × p 100 225 400 r= 10 10.056 (0.5445091578) 10.021 (0.415936012) 9.952 (0.3829308584) c=0.5 0.517 (0.0251724131) 0.519 (0.0190434918) 0.522 (0.0185292498) σ = 1 1 (2.5151e-06) 1 (5.481e-07) 1 (2.2068e-06) rt=0.5 0.466 (0.1292439526) 0.46 (0.136410963) 0.475 (0.1505265743) Table 2.10: Exponential separable space and time covariance estimation when ∆small is the mean vector p × p 100 225 400 TP 20.00 19.99 19.99 FP 29.41 231.17 102.89 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00036 0.00021 0.00015 Table 2.11: Estimation and selection of ∆small using the estimated covariance p × p 100 225 400 TP 19.98 19.98 19.99 FP 62.76 61.14 78.91 MSE Train1 Train2 Test1 Test2 0.99 0.98 0.98 1.00 0.97 0.98 0.00630 0.00375 0.00225 1.00 0.97 0.98 1.00 0.97 0.98 Table 2.12: Estimation and selection of ∆small under independence 2.11.3 Matern Space Covariance and exponential time (separable) 2.11.3.1 With ∆ p × p 100 225 400 ν = 2 1.998 (0.0143793046) 2.001 (0.0091730859) 2 (0.0070390667) rt=0.7 0.698 (0.0175455886) 0.701 (0.0108351911) 0.7 (0.0074068311) φ = 5 4.795 (0.1068378951) 4.808 (0.06846612) 4.799 (0.0494941786) Table 2.13: Matern covariance with separable exponential time when ∆ is the mean vector p × p 100 225 400 TP 15.18 14.11 12.73 FP MSE Train1 Train2 Test1 Test2 0.98 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 1.00 0.03 0.01 0.01 86.51 155.46 214.40 Table 2.14: Estimation and selection of ∆ using the estimated covariance 81 p × p 100 225 400 TP 14.49 17.32 10.19 FP MSE Train1 Train2 Test1 Test2 0.93 0.98 0.84 0.93 0.97 0.83 0.93 0.97 0.83 0.93 0.97 0.98 0.06 0.00 0.02 62.76 61.14 51.06 Table 2.15: Estimation and selection of ∆ under independence 2.11.4 Non separable space and time Gneiting covariance (separable) 2.11.4.1 With ∆ p × p 100 225 400 ν = 0.8 0.831 (0.0147602785) 0.832 (0.0108536628) 0.831 (0.0078036784) λ = 1 1.059 (0.0263549226) 1.057 (0.0162489845) 1.057 (0.0130021963) γ = 0.6 0.982 (0.0063041298) 0.983 (0.0039539343) 0.983 (0.0031399539) Table 2.16: Gneiting covariance with non-separable when ∆ is the mean vector p × p 100 225 400 TP 19.08 17.22 14.40 FP MSE Train1 Train2 Test1 Test2 0.96 0.96 0.95 0.97 0.96 0.95 0.97 0.96 0.95 0.03 0.01 0.01 0.97 0.96 0.95 111.03 178.08 178.88 Table 2.17: Estimation and Selection of ∆ using the estimated covariance 82 APPENDIX 83 APPENDIX Useful lemmas Lemma 2.11.1. If X ∼ N (0, 1) then, X = Op(1) Proof. Using Gaussian concentration inequality P (|X| > exists a δ = √−log(/2) . 2 √−log(/2) 2 ) ≤  for every  > 0 there (cid:4) Lemma 2.11.2. If X ∼ χ2(K) then, X = Op(K) Proof. 
Using the sub-exponential concentration inequality X−K is subexponential with parameters (4K, 2). Similarly for every  > 0 there exist a δ which implies P ( K > 0. |X−K| 4K > δ) <  for some constant (cid:4) (cid:113) 1 Theorem 2.11.3 (2.7.2). Assuming (cid:107)ˆθ − θ(cid:107)2 = Op( t = 1, 2..., T (cid:107) ˆβt − βt,0(cid:107)2 = Op((cid:112) st P T n ) , we want to prove that for each n ) and ˆβt 2 = 0P×1 with probability close to 1 where ˆβ = arg maxβ L(β, ˆθ; Z) Proof. For each fixed t the proof consists of two steps In first part we show that (cid:107) ˆβt − βt,0(cid:107)2 = Op((cid:112) st n ) . In the next part we show that ˆβt 2 = 0P×1 with probability close to 1 . To prove this second part we need an additional lemma 2.11.4 from the paper The proof of first part Denote vector ut ∈ Rpn with entries 1 corresponding to index j such that βt,j (cid:54)= 0 and 0 elsewhere, but without loss of generality , we denote ut as u.Similarly, Xt denotes the colomns of design matrix X corresponding to βt,j (cid:54)= 0. Hence we get identity that Xtut = X Ist×st 0 n ) , ξn = Op((cid:112) s show that for any  > 0 ,(cid:80)T the condition that for r ∈ {1, 2, ....T} − {t} we have (cid:107) ˆβr − βr(cid:107)2 = Op((cid:112) sr t=1 st = s, ξn,t = Op((cid:112) st 0 0 n )  u It is sufficient to n ) and for some C > 0 under 84 P r(cid:0) sup (cid:107)u(cid:107)2=C Q(βt,0 + uξn,t, ˆθ) − Q(βt,0, ˆθ)(cid:1) > 1 −  or equivalently we can prove that P r(cid:0) sup (cid:107)u(cid:107)2=C Q(β0,t + uξn, ˆθ) − Q(β0,t, ˆθ)(cid:1) > 1 −  Q(β0,t + uξn, ˆθ) − Q(β0,t, ˆθ) = L(β0,t + uξn, ˆθ) − L(β0,t, ˆθ) − n = L(β0,t + uξn, ˆθ) − L(β0,t, ˆθ) − n Pt(cid:88) st(cid:88) j=1 j=1 (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) − n (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) pt(cid:88) st(cid:88) j=1 j=st+1 ≤ L(β0,t + uξn, ˆθ) − L(β0,t, ˆθ) − n forj = st + 1, ., pt pλn,t(|βt,0,j| = 0) = 0 pλn,t(|βt,0,j − uξn| > 0) = 0 = 1 2 − n st(cid:88) ˜Σ(ˆθ)−1Xtutξn −T ˜Σ(ˆθ)−1Xtut t X T uT t (pλn,t(|βt,0,j + uξn|) − pλn(|βt,0,j|)) j=1 = 1 2 − n = 1 2 (cid:124) ˜Σ(ˆθ)−1Xtutξn − (Z − Xβ)T ˜Σ(ˆθ)−1Xtut st(cid:88) j=1 t X T uT t (pλn(|βt,0,j + uξn|) − pλn(|βt,0,j|)) (cid:123)(cid:122) ˜Σ(ˆθ)−1Xtutξ2 uT t X T t (cid:124) − (Z − Xβ)T ˜Σ(ˆθ)−1Xtutξn,t (cid:125) (cid:123)(cid:122) 1 (cid:125) n,t 2 85 (cid:0)p(cid:48) st(cid:88) j=1 −n (cid:124) λn,t(|β0, j|)sgn(β0, j)utξn,t (cid:125) (cid:123)(cid:122) 3a n,t(1 + op(1)(cid:1) (cid:123)(cid:122) (cid:125) λn,t(|β0,j|)u2 t ξ2 (cid:124) + p(cid:48)(cid:48) 3b For term (1) we get using eigenvalue inequality result that ψi is i th eigenvalue uT t X T t ˜Σ(ˆθ)−1Xtutξn uT X T ˜Σ(θ0)−1Xtutξn (cid:125) = 1 2 (cid:124) (cid:125) + uT t X T t uT X T ˙˜Σ(θ∗)r(ˆθr − θ0,r)uT X T ξn (cid:123)(cid:122) 1a (cid:125) Proof. 1 2 (cid:124) 1 2 (cid:124) (cid:123)(cid:122) 1 q(cid:88) r=1 (cid:123)(cid:122) 1b i 1(a) ≤ O((cid:107)u(cid:107)2X T X min ψi( ˜Σ(ˆθ)−1))ξ2 n = O(min i ψi( ˜X T (In−1 + Jn−1)) ˜X) min i ψi( ˜Σ(ˆθ)−1))ξ2 n = O(nξ2 n,t)by eigenvalue inequality Similarly using CS inequality1(b) = O(nξ2 n,tηn) (cid:4) 86 ψi( ˜Σ(θ0)−1) trace( ˜X T ˜X)(cid:107)u(cid:107)2ξn) −τ21n1 τ11n2−1 (cid:107)u(cid:107)2ξn) (In−1 + Jn−1) (cid:4) (cid:4) For term (2) we use Taylor expansion of ˜Σ(ˆθ)−1 around θ0 such that ˆθ − θ0 = vηn Proof. 
−(Z − Xβ)T ˜Σ(ˆθ)−1Xtut = −(Z − Xβ)T ˜Σ(θ0)−1Xuξn (cid:124) (cid:123)(cid:122) 2a (cid:125) − q(cid:88) (cid:124) r=1 min i i = Op( (cid:123)(cid:122) 2b Z = Op ψi(Σ(θ0)−1) trace(X T term (2a) using Markov inequality that for any r.v (Zt − Xβt)T ˙˜Σ(θ∗)r(ˆθr − θ0,r)Xuξn (cid:125) (cid:113) (cid:112)E(Z2) −(Z − Xβ)T ˜Σ(θ0)−1Xtutξn = Op( (cid:118)(cid:117)(cid:117)(cid:117)(cid:117)(cid:117)(cid:116)min T −τ21n1 = Op((cid:112) n1n2 Similarly term (2b)(cid:80)q (cid:113) −n(cid:80)s r=1(Z − Xβ)T ˜Σ(θ∗)rXuξn ( ˜X T ˜X)(cid:107)u(cid:107)2(cid:107)v(cid:107)2ξnηn) = Op(q (|β0, j|)sgn(β0, j)uξn ≤ n n st(cid:107)u(cid:107)ξn) = Op( √ nst(cid:107)u(cid:107)ξn) ψi( ˜Σ(θ∗)r) min τ11n2−1 = Op( t Xt) s max √ min i j=1(p(cid:48) λn i For term (3a) using CS inequality and using assumption √ n1n2n) Term (3b)using CS inequality and using assumption p(cid:48)(cid:48) λn (|β0,j|)u2ξ2 nbn = Op(nξ2 n) p(cid:48) λn (|β0, j|)(cid:107)u(cid:107)2 j 2ξn = O(n sξn an) = Op( √ snξn) √ n ≤ n(cid:112)(s)ξ2 Here we observe that term 1 dominates all other term hence for appropriate constant C the value above is negative, hence proved. Lemma 2.11.4. This lemma proves sparsity of the estimator.SCAD penalized estimator demon- strates this oracle property which means as if number of zeros in parameter in known initially.Now, let ˆβt = ( ˆβ1 t , ˆβ2 t )T .For any given ˆβt satisfying (cid:107) ˆβ1 t − β1 n ) and assumptions A(1) to t,0(cid:107)2 = Op((cid:112) st A(12) then with high probability  ˆβ1 t 0  = max t (cid:107)≤C (cid:107) ˆβ2 Q √ st n Q  ˆβ1 t ˆβ2 t  87 Proof. It is sufficient to prove that for j = st, st + 1, ....pt ∂Q(βt) ∂βt,j < 0 for 0 < βt,j < C and ∂Q(βt) ∂βt,j > 0 for 0 > βt,j > −C (cid:114) st n (cid:114) st n (cid:113) ∂Q(βt; ˆθ) = −(Z − Xβ)T ˜Σ(ˆθ)−1Xj − nP (cid:48) λ(|βt,j|)sgn(βt,j) ∂βt,j −(Z − Xβ)T ˜Σ(ˆθ)−1Xj = −(Z − Xβ)T ˜Σ(θ)−1Xj − Q(cid:88) (Z − Xβ)T ˜Σk(θ∗)−1ukηn −(Z − Xβ)T ˜Σ(θ)−1Xj = Op( k=1 Xj ˜Σ−1Xj) = Op( √ n) by concentration inequality similarly, Q(cid:88) k=1 √ (Z − Xβ)T ˜Σk(θ∗)−1ukηn = Op( nηn) Collecting all terms we achieve that ∂Q(βt; ˆθ) ∂βt,j = nλn,t (cid:0)Op( √ st√ n λn,t) + sgn(βj)(cid:1) p(cid:48) λn,t (|βj|) λn,t √ n λn,t → 0 sgn(βj) dominates , hence proved st√ Since (cid:113)(cid:80)T st Lemma 2.11.5. (cid:107) ˆβ − β0(cid:107) = Op( n ) then (cid:107)ˆθ − θ0(cid:107) = Op( P r(cid:0) sup Proof. We will show that for arbitrary  > 0 Q( ˆβ, θ0 + vηn) − Q( ˆβ, θ0)(cid:1) > 1 −  here η = Op( (cid:107)v(cid:107)2=C By Taylor Series expansion around ˆθ 88 1 Ptn (cid:113) 1(cid:80)T (cid:113) 1(cid:80)T 1 Ptn ) (cid:4) ) Q( ˆβ, θ0 + vηn) − Q( ˆβ, θ0) = (Z − X ˆβ)T ˙˜Σ(θ0)r(Z − X ˆβ)vη2 + (cid:125) vT (Z − X ˆβ)T [ ¨˜Σ(θ∗)r,r(cid:48) − D](Z − X ˆβ)vη2 + 1 (cid:123)(cid:122) (cid:125) n q(cid:88) (cid:124) r=1 q(cid:88) (cid:124) (cid:124) r,r(cid:48)=1 (cid:123)(cid:122) 2 (cid:125) n (cid:123)(cid:122) 3 vT (Z − X ˆβ)T D(Z − X ˆβ)vη2 Lemma 2.11.6. limp∧T→∞λmaxΣ(θ∗) = O(1) Proof. limP,T→∞,∞λmaxΣ(θ∗) P T(cid:88) ≤ max ≤ ≤ ≤ |Σi,j| i j=1 j=1 |Σi,j| ∞(cid:88) (cid:90) ∞ (cid:90) ∞ (cid:90) ∞ (cid:90) ∞ (cid:90) (cid:90) ∞ (cid:90) ∞ ω=0 τ =0 s=0 t=0 | (cid:107)ω(cid:107)=0 τ =0 ≤ 2 |γ(s, t; θ)| (cid:90) exp(ω(cid:48)s + τ(cid:48)t)f (ω, τ )| R Rd η(α2β2 + β2(cid:107)ω(cid:107)2 + α2τ 2 + (cid:107)ω(cid:107)2τ 2)−(ν+ d+1 2 )dωdτ = O(1) for d > 2 ν > 0 Lemma 2.11.7. limP,T→∞,∞λmaxΣ(θ∗) ◦ KT aper = O(1) and limP,T→∞,∞λminΣ(θ∗) ◦ KT aper Proof. 
We use the result that min KT aper i,iλminΣ(θ∗) ≤ λmaxΣ(θ∗) ◦ KT aper ≤ max KT aper i,iλmaxΣ(θ∗) 89 (cid:4) (cid:4) (cid:4) Proof. By Taylor series expansion, Φ(X) = Φ(y) + (X − y)φ(y) + op(X − y) holds for almost surely a random variable X √ ˆ∆T Σ−1 ˆ∆ ˆ∆T ˆΣ−1Σ ˆΣ−1 ˆ∆ √ ∆T Σ−1∆ . 2 Now taking X = Notice that ˆ∆T Σ−1 ˆ∆ ∼ F ( ˜PT , n − 2) with non central parameter since ˆ∆ and ˆΣ are independent (cid:4) and y = 2 If we use weighted mixture of Matern Covariance function,we can show that all of the above conditions hold for covariance matrix ΣP T×P T (θ) Lemma 2.11.8. The expression of derivative of covariance function with respect to parameter θ is given below Proof. By Fubini’s theorem, we know that the differentiation and integral can be exchanged. ∂γ(s, t; θ) ∂f (ω, τ ; θ) ∂θ dωdτ exp(ω(cid:48)s + τ(cid:48)t) (cid:90) (cid:90) d + 1 2 ) Rd R exp(ω(cid:48)s + τ(cid:48)t)g(ω, τ )−(1+ν+ d+1 2 ) = K(θ) ∂θ (cid:90) (cid:90) Rd R = −(ν + =  2αη(β2 + τ 2) 2βη(α2 + τ 2) η((cid:107)ω(cid:107)2τ 2) (α2β2 + β2(cid:107)ω(cid:107)2 + α2τ 2 + (cid:107)ω(cid:107)2τ 2) log(α2β2+β2(cid:107)ω(cid:107)2+α2τ 2+(cid:107)ω(cid:107)2τ 2) ν+ d+1 2  dωdτ We can verify that each of the element of K(θ) has finite integral multiplied with a continuous (cid:4) function of θ. Since Θ is compact space in Rq, K(θ) is bounded . Lemma 2.11.9. The maximum eigenvalue of matrix ∂Σ ∂θ is bounded from above 90 Proof. We will show that (cid:107) ∂Σ ∂θ (cid:107)2 ≤ (cid:107) ∂Σ ∂θ (cid:107)1 =(cid:82) Rd (cid:82) ∂γ(s,t;θ) ∂θ < ∞ R Lemma 2.11.10. The maximum eigenvalue of matrix ∂2Σ ∂θ2 is bounded from above Proof. Similar to proof above either we can differentiate again and apply Fubini Theorem Lemma 2.11.11. (cid:107)Σ(cid:107)2 F = OpP T >> Op(P T ) 1 2 Proof. Since covarince function is isotropic and stationary (cid:107)Σ(cid:107)2 P T(cid:80) γ2(hij; θ ≤(cid:80) Using the fact that(cid:80) i min j i,j γ2(hij; θ) ≤ P T(cid:80) γ2(hij; θ) ≤(cid:82) ∞ i max γ2(hij; θ) 0 γ2(x; θ)dx < ∞ j i max j F =(cid:80) i,j γ2(hij; θ) Lemma 2.11.12. If λ1 ≥ λ2... ≥ λn be eigenvalues of positive semi definite matrix the con- √ (n−1) √ m(A)−v(A) 2v(A) (n−1) and lower bound 1 + (cid:4) (cid:4) (cid:4) (cid:4) (cid:4) can be have upper bound 1 + 2v(A)/ dition number kn = λmax λmin √ (n−1) √ m(A)−v(A)/ where m(A) = tr(A)/n and v(A) = tr(A2)/n − m2 (n−1) Proof. From lemma in Wolkowicz & Styan (1980) Lemma 2.11.13. Assuming general condition 13 holds, we have the results • (cid:107)Σt − Σt,T aper(cid:107)1 = Op( 1√ Pt ) • (cid:107)Σk,t − Σk,t,T aper(cid:107)1 = Op( 1√ Pt ) • (cid:107)Σk,j,t − Σk,j,t,T aper(cid:107)1 = Op( 1√ ) Pt Proof. (cid:107)Σt − Σt,T aper(cid:107)1 ≤ 2 Ktρ wp xd|γ(x : θ)|dx (cid:82) ∞ wp Similar result holds for derivative and double derivative of Σ(θ) proof for theorem 7.2 Proof. By using Lemma 8.1, 8.3 ,7.1 we have established all regularity conditions for tapered matrix ˜ΣT apered W ( ˆβP M LE) = 1 − Φ( n (n1−n2)(1+Op( ˜PT ∆T Σ−1∆)(1+Op( (cid:115) )+ )) ∆T Σ−1∆)(1+Op( 2 )+ ˜PT n (n1+n2)(1+Op( 1√ n ˜PT 1√ n ˜PT )) (cid:4) 1√ n ˜PT 1√ n ˜PT 91 Chapter 3 Random Projection for Tensor data 3.1 Introduction Given a vector x ∈ Rp×1, we can project it to lower dimensional space through linear map f : Rp×1 → Rd×1 defined by f (x) = ˜A(K)x 1√ k d where ˜Ap×d is random matrix formed with element sampled independently from special classes of distributions. iid∼D a ˜ai,j This linear projection also preserves pairwise distances with high probability. 
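As a quick illustration of this projection and the distance preservation it is meant to deliver, the sketch below (ours, with arbitrary dimensions) applies a Gaussian random map f(x) = (1/√d) Ãx to a handful of points and reports the average deviation of the pairwise squared-distance ratios from 1.

```python
# Minimal sketch of a Gaussian random projection and a check of distance preservation.
# Dimensions and the number of points are our choices, for illustration only.
import numpy as np

rng = np.random.default_rng(1)

p, d, n = 40_000, 500, 10                 # ambient dim, projected dim, number of points
X = rng.standard_normal((n, p))           # n points in R^p

A = rng.standard_normal((d, p))           # entries a_ij ~ N(0, 1)
Y = X @ A.T / np.sqrt(d)                  # f(x) = (1/sqrt(d)) A x for every point

# ratio of projected to original pairwise squared distances (should concentrate near 1)
ratios = []
for i in range(n):
    for j in range(i + 1, n):
        num = np.sum((Y[i] - Y[j]) ** 2)
        den = np.sum((X[i] - X[j]) ** 2)
        ratios.append(num / den)
print(np.mean(np.abs(np.array(ratios) - 1.0)))   # average deviation from isometry
```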
This phenomenon is due to JL lemma for some  > 0) and constant C independent of dimension.For k = 1 P(1 −  ≤ (cid:107)f (x) − f (x(cid:48))(cid:107)2 2 (cid:107)x − x(cid:48)(cid:107)2 2 ≤ 1 − ) ≤ 2 exp(− d 8 (2/2 − 3/3)) 92 Lemma 3.1.1. P(< f (x), f (y) > − < x, y >≥  < x, y >) ≤ 2 sup x∈Rp P((cid:107)f (x)(cid:107)2 − (cid:107)x(cid:107)2 ≥ (cid:107)x(cid:107)2) Proof. Sun et al. (2018) lemma A.1 (cid:4) Suppose there are n points in Rp×1 corresponding to data matrix X = [X1..Xi..Xn]1≤i≤n. We can chose d ≥ 2 log(n)−2 such that all pairwise distances belonging to each of n(n−1) 2 pairs are preserved with high probability. There an be various choices of distributions for random variable a. For example, a ∼ N(0, 1) or in sparse RP s = 3 and very sparse RP s = d respectively √ ai,j = 2 2 +1 with probability 1  −1 with probability 1 √ + √ + P. Li et al. (2006) ai,j = 0 s with probability 1 2s with probability 1 − 1 s s with probability 1 2s 3.2 Kronecker factors Supposedly ˜A(K)p×d could be decomposed as Kronecker product of k matrices ˜A(K) = A(1) ⊗ ..A(l).. ⊗ A(k) 93 where matrix A(l) is of order dl × pl for each l ∈ {1, 2, .., k}.Trivially p = (cid:81)k l=1 pl and dk. The elements of product matrix are given by ˜ai,j = a(1) i1,j1 a(2) i2,j2 ....a(l) il,jl ..a(k) ik,jk l=1(jl − 1)(cid:81)k where j = 1 +(cid:80)k For l = k the expressions(cid:81)k r=l+1 pr and i = 1 +(cid:80)k r=l+1 pr and(cid:81)k l=1(il − 1)dk−l−1 r=l+1 dr assumed to be 1. If we take each elements of the matrices {Al}1≤l≤k to be independently and identically dis- tributed of each other, the resulting elements of product matrices ˜Ai,j are no longer mutually independent but product of independent random variables, thus weekly dependent in the sense it is ρ mixing, and martingale with respect to filtration In this work, we show that even under this dependence structure we can achieve JL kind of for finite p and d. We further assume that E(a(l) il,jl ) = 0 and E[(a(l) il,jl )]2 = 1 for all l ∈ [k], j ∈ [pl], i ∈ [dl]. 3.2.1 Literature reviews Concentration inequalities for quadratic forms like ours is known as Hansen Wright inequality.For dependent variables these type of inequality are discussed in Adamczak et al. (2015) which certain concentration property for 1 Lipschitz function which is hard to verify in our cases.Samson et al. (2000) discussed the concentration property for 1 Lipschitz function for strong mixing sequence and Markov chain.However strong mixing conditions like φ mixing is also difficult to verify Another ap- proach would be use Herbst argument applicable to Log Sobolev measure as mentioned in Ledoux (1999).But proving such inequality is much more tedious in our case 94 Define a random variable Yi(x) =< ˜AK i , x > ,our reader may notice that for fixed i, Yi = (cid:80)(p1,p2,...pK ) given j = 1 +(cid:80)k j1=1,j2=1,...jK =1 a(1) i1,j1 a(2) i2,j2 l=1(jl − 1)(cid:81)k ....a(l) il,jl ..a(k) ik,jk xj r=l+1 pr form a martingale w.r.t filtration Fl and 1 ≤ l ≤ K. But concentration inequalities for martingale differences provides very poor bounds. Readers may also notice stationarity a random variables Yi although actual distribution is in- tractable. Yi is popularly known as Polynomial Gaussian Chaos in probability literatures. Statis- ticians recognize this expression as general U statistics. Its moments and tail bounds are widely studied in Adamczak et al. (2015) and Lata(cid:32)la et al. (2006). 
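A small sketch of the Kronecker construction of Section 3.2 may help fix ideas. It is ours, with factor sizes chosen arbitrarily; it checks numerically that applying the factors mode by mode reproduces the action of the full matrix Ã = A(1) ⊗ ... ⊗ A(k) while storing only Σ_l d_l p_l entries instead of (Π_l d_l)(Π_l p_l).

```python
# Sketch of a Kronecker-factored random projection (Section 3.2).  Illustrative only:
# the factor sizes and the Gaussian entries are our choices.
import numpy as np
from functools import reduce

rng = np.random.default_rng(2)

p_l, d_l = [10, 10, 10, 10], [4, 4, 4, 4]          # p = 10000, d = 256
factors = [rng.standard_normal((dl, pl)) for dl, pl in zip(d_l, p_l)]

x = rng.standard_normal(int(np.prod(p_l)))

# naive version: form the full Kronecker matrix (memory-heavy, shown only for comparison)
A_full = reduce(np.kron, factors)                  # shape (256, 10000)
y_full = A_full @ x / np.sqrt(A_full.shape[0])

# memory-efficient version: apply the factors mode by mode without forming A_full
Xt = x.reshape(p_l)                                # view x as a p1 x ... x pk tensor
for mode, A in enumerate(factors):
    Xt = np.tensordot(A, Xt, axes=([1], [mode]))   # contract the current mode with A
    Xt = np.moveaxis(Xt, 0, mode)                  # put the reduced mode back in place
y_kron = Xt.reshape(-1) / np.sqrt(np.prod(d_l))

print(np.allclose(y_full, y_kron))                 # the two embeddings coincide
# storage: sum(d_l[i] * p_l[i]) factor entries vs prod(p_l) * prod(d_l) for A_full
```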
But in above literatures,moment bound involve supreme of empirical processes which are hard to estimate. Also, exponential bounds given in these literature is depended upon various semi norms of vector x based partition of subset of {1, 2, ...K}. But it is desirable to derive inequalities in terms of L2 norm of x only. Taking all the above bottleneck into consideration, we follow approach by Schudy & Sviridenko (2012) where moment bounds are calculated through brute force combinatorial argument. To our aid we have result known as hyper- contractivity which provides bound of (cid:107)Y (cid:107)r through ((cid:107)Y (cid:107)2)r in Janson et al. (1997) theorem 5.10 and theorem 6.7. Below is the result stated 3.3 Weak dependence By convention, for a random variable X ∈ (R,BR, P), (cid:107)X(cid:107)p = (E(|X|)). Define filtration sigma filed FK = σ(A(l) il,jl : d ≤ il ≤ 1, pl ≤ jl ≤ 1; K ≤ l ≤ 1) Suppose there is a sequence of random variables (˜ai)n : 1 ≤ i ≤ h) rho ρ(n) = sup k≥1 ˜a∈Gh ˜b∈G∞ 1 h+n Gk 1 = σ(˜ai n→∞ ρ(n) → 0 lim i=1 and ˜an ∈ (R,BR, P),define sigma field |Cov(˜a, ˜b)|/(cid:107)˜a(cid:107)2(cid:107)˜b(cid:107)2 a sequence is ρ mixing if 95 0 for a ∼ N(0, 1) 0 for a ∼ very sparse with s = √ d In our case, ρ(|i − i(cid:48)|) = for i (cid:54)= i(cid:48) Proof. Define Yi(x) =< ˜AK i , x >. for i (cid:54)= i(cid:48) but sometimes j = j(cid:48) (cid:88) (cid:88) ˜ai,j(cid:48) ˜ai(cid:48),jxjxj(cid:48)) E(Yi, Y (cid:48) i ) = E( j j(cid:48) Now, for i (cid:54)= i(cid:48) with il = i(cid:48) = E(a(1) i1,j1 = E(a2 (1) i1,j1 ....a(l) il,jl ....a2 (l) il,jl a2 (2) i2,j2 a(2) i2,j2 ..a(k) ik,jk ..a(k) ik,jk l but ik = i(cid:48) k without loss of generality and j = j(cid:48) E(˜ai,j ˜ai(cid:48),j(cid:48)) a(2) ....a(l) i(cid:48) i(cid:48) 2,j2 l,jl ) = 1 E(a(k) ik,jk ..a(k) i(cid:48) k,jk )E(a(k) i(cid:48) k,jk a(1) i(cid:48) 1,j1 a(k) i(cid:48) k,jk ) = 0 ) Due to independence. Hence whole expression has expectation 0. Therefore covariance is 0. (cid:4) 3.4 Concentration Inequality For random variables following Gaussian distribution and very sparse distribution P. Li et al. (2006), we are able to prove this bound. We require some concentration bound for expression P(| 1 dk xT ˜A(K),T ˜A(K)x (cid:107)x(cid:107)2 2 − 1| > t) to decay at exponential else we would not attain the efficiency like JL lemma meaning the lower projected dimension will be much more than standard result for k=1,i.e we want d = O(log n). Theorem 3.4.1. JL lemma for Gaussian or Raedmacher For some constant m = m(K) depending on K P(| 1 dk Sdk(cid:107)x(cid:107)2 2 − 1| > t) < − d e m(K) t 2 K m(K) 96 Proof. Since (cid:107) Sdk CE(et N1 N2 d .. d (cid:107)x(cid:107)2dk(cid:107)2r 2r = O((cid:107) N1 d d .. Nk N2 d (cid:107)2r 2r) where Nq iid∼ N(0, 1) . Then it follows that E(e t S dk (cid:107)x(cid:107)2 dk ) < Nk d ) And using the result stated in Achlioptas (2003), we complete the proof. Also note (cid:4) that this inequality is consistent with Lata(cid:32)la et al. (2006). 3.4.1 Choice of d We choose d > m(K)logn t2/K using union bound so that distance between n points are preserved. 3.4.2 Memory efficiency Under Kronecker decomposition, we need to store(cid:80)k (cid:81)k l=1 pldk, this is huge reduction even after considering trade-off in the probabilistic bound. l=1 pld elements as compared to original matrix 3.4.2.1 Tensor type data Another application can be for tensor data x ∈ Rp1×p2×..pk we can also obtain lower dimensional embedding which preserves distance with high probability. 
yd1×d2×..dk = (A(1) ⊗ ..A(l).. ⊗ A(k))T xp1×p2×..pk V ec(y)d×1 = (A(1) ⊗ ..A(l).. ⊗ A(k))T V ec(x)p×1 1(cid:113)(cid:81)k l=1 dl 1√ d 3.4.3 Variance reduction through averaging Since all random projection are non adaptive methods. In most of the literature, iit is recommended that we generate several independent RPs and take their average. This ensemble technique also provide reduction in variance. Let { ˜Ah}1≤h≤H be H independent copies of RP big Kronecker matrices. WE can define new 97 ensemble RP as f∗,H (x) = 1√ H H(cid:88) h=1 ˜Ahx 3.5 Simulation result Our first experiment evaluates the quality of the isometry for maps We generate n = 10 independent vectors x1; ..; xn of sizes d = 2500; 10000; 40000 .We consider the following three RPs: 1. Gaussian RP; 2. Sparse RP ; 3. Very Sparse RP . For each, we compare the performance of RP, KRP with order 4 and d1 = d2=d3=d4. We evaluate the methods by repeatedly generating a RP and computing the reduced vector, and plot the ratio of the pairwise distance 1. Method Gaussian RP KRP(4) 0.1409 (0.0015) 0.1431 (0.0016) Sparse 0.1407 (0.0013) 0.1431 (0.0015) Very Sparse 0.1412 (00.0014) 0.1520 (0.0033) Table 3.1: Average of total deviation of ratios of pairwise distance between projected and actual data from 1 . (Variability) 3.6 Future scope We believe that such bounds can be achieved for wider class of random variables following identity (cid:107)Y (cid:107)r r−1 for all r as shown in Schudy & Sviridenko (2012). r ≤ (r)M(cid:107)Y (cid:107)r−1 3.7 Introduction to Tensors With the advancement of information and engineering technology, modern days data science prob- lems often come with gigantic size and increased complexity. On popular technique of storing 98 these data is use of multi-dimensional arrays, which preserves the data anatomy and contain multi- dimensional spatial and spatio-temporal correlations. A tensor is a multi-dimensional or d-way array, which is a generalization of data matrix in a higher dimensional situation. More formally, according to Kolda & Bader (2009) and Hackbusch (2012), a d-way tensor is an element of the tensor space generated by the tensor products of d vector spaces. Similar to the traditional vector based machine learning literature, the learning of tensor can also be divided into supervised learning and unsupervised learning. Unsupervised tensor learning generally involves the tensor decomposition and feature selection. Some theoretical results and applications about unsupervised tensor learning can be found in Kolda & Bader (2009), X. He et al. (2006), Chi & Kolda (2012), De Lathauwer et al. (2000), and Lu et al. (2008). The framework of supervised tensor learning has been proposed by Tao et al. (2005), in which one can learn a tensor based rule from training data for classification and regression. The tensor regression problem has been widely studied. Such examples include Zhou et al. (2013), Wimalawarne et al. (2016), L. Li & Zhang (2017), Lock (2018), and Raskutti et al. (2019). As an indispensable part of the supervised tensor learning problem, however, the tensor type data classification is under developed. Research on tensor classification includes Zhou et al. (2013), Pan et al. (2018), Signoretto et al. (2011), L. He et al. (2014), Q. Li & Schonfeld (2014), and Tan et al. (2012). 3.7.1 Limitation for Gaussian assumptions There are several major deficiencies under the current development. Firstly, the assumption about data distribution. The Gaussian assumption for tensor type of data, e.g., Pan et al. 
(2018) may not be adequate in many applications. Since the probability theory for tensor data has not been well established, probabilistic discriminant such as LDA and Bayes classifier may have limitations from theoretical foundation perspective. Secondly, the distance based methods, e.g., Lu et al. (2008) and Q. Li & Schonfeld (2014). These methods train classifier based on tensor Frobenius norm, which is 99 not suitable for high dimensional tensor data. It is known that the high dimensional data has issues with L2 norm and Frobenius norm. For example Domingos (2012) wrote “ If we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere”. Beyer et al. (1999) showed that the difference between the maximum and minimum distances to a given query point does not increase as fast as the nearest distance to any point in high dimensional space. As data dimension increases, these distance based method may fail. Signoretto et al. (2011) and L. He et al. (2014) performed classification with kernels however, there is no theoretical results about classification error established. 3.8 Preliminaries 3.8.1 Mathematical Background for Tensor We first introduce standard tensor notation and operations (e.g. Kolda & Bader (2009)) that are used in this paper. Numbers and scalars are denoted by lowercase letters such as x, y. Vectors are denoted by boldface lowercase letters, e.g. a. Matrices are denoted by boldface capital letters, e.g. A, B. A higher-dimensional tensor is a generalization of vector and matrix representation for higher order data, which are denoted by boldface Euler script letters such as X, Y. Notations for vector and tensor spaces will be sepcified when necessary. The order of a tensor is the number of dimensions, also known as ways or modes. For example, a scalar can be regarded as a order zero tensor, a vector can be a order one tensor, and a matrix can be a order two tensor. In general, a tensor can have d modes as long as d is an integer. The way of indicating entries of tensors is same as we do for vectors and matrices. The i-th entry of a vector x is xi, the (i, j)-th element of a matrix X is xi,j, and the (i1, ..., id)-th element of a d-way tensor X is xi1,...,id. The indices of a tensor i1, ..., id range from 1 to their captial version, e.g. ik = 1, ...., Ik for every mode k = 1, ...d. 100 Sub-arrays of a tensor are formed when a subset of the indices are fixed. Similar to matrices that have rows and columns, high-dimensional tensors have various types of sub-arrays. For example, by fixing every index but one in a d-way tensor, we can get one of its fibers, which are analogue of matrix rows and columns. Another type of frequently used tensor sub-arrays is slice, which is a two dimensional section of a tensor. A slice of a tensor can be defined by fixing all but two indices. We will use x:i2...id to denote one fiber of a d-way tensor, and use X::i3...id to denote one of its slices. 3.9 Kernelized Support Tensor Machine 3.9.1 Framework of the Classification Problem The classification problem for tensor data is a problem of learning a tensor from the training data. Let T = {(X1, y1), ...., (Xn, yn)} be the training set, where Xi ∈ RI1×I2...×Id are d-mode tensors, yi are labels. 
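The following toy numpy snippet (ours; sizes chosen only for illustration) spells out the indexing conventions of Section 3.8.1 and the shape of the training set T just defined.

```python
# Toy illustration of the tensor notation in Section 3.8.1 and of the training set T.
import numpy as np

rng = np.random.default_rng(5)

X = np.arange(3 * 4 * 5).reshape(3, 4, 5)   # a d = 3 way tensor with I1, I2, I3 = 3, 4, 5
print(X.ndim)                               # the order (number of modes) is 3
print(X[0, 1, 2])                           # one entry x_{i1, i2, i3} (numpy indices start at 0)

fiber = X[:, 1, 2]                          # a fiber x_{:, i2, i3}: every index fixed but one
slice_ = X[:, :, 2]                         # a slice X_{:, :, i3}: every index fixed but two

# a toy training set of n labelled 3-mode tensors, as in T = {(X_1, y_1), ..., (X_n, y_n)}
n = 6
train = [(rng.standard_normal((3, 4, 5)), 1 if i % 2 == 0 else -1) for i in range(n)]
```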
If we assume the training risk of a classifier f ∈ X ∗ is ˆRn(f ) = 1 I(f (Xi) (cid:54)= yi), the n(cid:80) n i=1 problem will be looking for a ˆf ˆf = {f : ˆRn(f ) = min ˆRn(f ), f ∈ H∗} (3.1) To solve this problem, we need to search all functions in the functional tensor space, which is challenging. However, the problem will be simplified if the function is an universal reproducing kernel. This will let us find the solution only on the Hilbert space embedded by continuous functions. We shall prove this claim by presenting a represent theorem for kernelized support tensor machine. Further, we would be able to prove the consistency of our classifier. 101 3.9.2 Support Tensor Machine Generalizing the support vector machine to tensor version is quite straightforward. The kernelized support tensor machine classifier is sign(f (X)) The function f is the optimal of the following objective function min f λ||f||∗ + 1 n L(yi, f (Xi)) n(cid:88) i=1 (3.2) (3.3) where ||f||∗ is the norm of the functional tensor space H∗, and L is a measurable loss function defined on (X ∗ × Y ). X ∗ is an algebraic tensor space. The optimal function is a measureable function f : X ∗ → R, where X ∗ = RI1 ⊗ RI2... ⊗ RId. The representer theorem in support vector machine says the solution of support vector machine can be written as a linear combination of kernel functions. As a result, one only needs to learn the coefficients in the linear combination instead of learning the whole function. Similarly, we can also propose a representer theorem for this tensor learning problem. Theorem 3.9.1. (Tensor Representer Theorem) Let K(·,·) be a fixed kernel coming from the K(X ∗ ×X ∗, R), H∗ be the corresponding Reproducing Kernel Hilbert Space. Let L be an arbitrary loss function. If the optimization function (3.3) has optimal solutions, then all the solutions can be written in the following way: n(cid:88) ˆf (X) = ˆβiK(Xi, X) i=1 102 The proof of this theorem is attached in our appendix. As a direct benefit of this result, one just needs to estimate those parameters ˆβi in order to get a optimal solution. 3.9.3 STM with random projection We can choose K(j) = A(j) where A(j) is random projection matrix thus r(cid:88) d(cid:89) r(cid:88) d(cid:89) K = ( A(j)(x(j) k,1, x(j) l ), ..., A(j)(x(j) k,n, x(j) l ))T k,l=1 j=1 k,l=1 j=1 . r is the CP rank of the algebraic tensor space, x(j) ik 3.9.4 Solving the STM Now we start discussing the way of estimating our Support tensor machine from a group of training tensor and their corresponding labels. We are going to consider only the cumulant-based tensor kernel functions introduced in the previous section in this part. The situation of naive tensor kernels are straightforward, and one can use traditional SVM method to estimate with only a slight modification. According to the representer theorem and the definition of cumulant-based kernel function, we assume that the solution of support tensor machine has the following explicit form. n(cid:88) r(cid:88) d(cid:89) ˆf (X) = βi i=1 k,l=1 j=1 = KT β K(j)(x(j) k,i , x(j) l ) (3.4) where β = (β1, ..., βn)T and K = ( r(cid:80) k,l=1 (cid:81)d j=1 K(j)(x(j) k,1, x(j) l ), ..., r(cid:80) k,l=1 (cid:81)d j=1 K(j)(x(j) k,n, x(j) l ))T . r is the CP rank of the algebraic tensor space, x(j) ik and x(j) k are components of tensor CP decomposition of the corresponding training and testing tensors. If the data do not come in CP form, we may need to perform a CP decomposition at first. 
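As a hedged illustration of this step, the sketch below CP-decomposes a toy training set and assembles the tensor-kernel Gram matrix of Section 3.9.2. The ALS step is delegated to tensorly's parafac; the Gaussian (RBF) choice for the component kernels K^(j), and all names and sizes, are our assumptions rather than the authors' implementation.

```python
# Hedged sketch: CP-decompose each training tensor and build the kernel
# kappa(X, X') = sum_{k,l=1}^r prod_{j=1}^d K_j(x_k^(j), x'_l^(j)).
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(3)

def cp_factors(tensor, rank):
    """Columns of the CP factor matrices, estimated by ALS via tensorly's parafac."""
    _, factors = parafac(tl.tensor(tensor), rank=rank)
    return factors                          # list of d matrices, each I_j x rank

def rbf(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def tensor_kernel(fac_a, fac_b, rank):
    """kappa(X, X') computed from the CP factors of the two tensors."""
    val = 0.0
    for k in range(rank):
        for l in range(rank):
            prod = 1.0
            for A, B in zip(fac_a, fac_b):  # product over the d modes
                prod *= rbf(A[:, k], B[:, l])
            val += prod
    return val

# toy training set: n random 3-way tensors, each decomposed at rank r
n, r, shape = 8, 2, (6, 5, 4)
train = [rng.standard_normal(shape) for _ in range(n)]
fac = [cp_factors(X, r) for X in train]

K = np.array([[tensor_kernel(fac[i], fac[j], r) for j in range(n)] for i in range(n)])
print(K.shape)                              # (n, n) Gram matrix used in (3.5)
```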
Plugging the solution (3.4) into the objective function (3.3), we get
$$ \min_{\beta}\ \lambda \beta^T K \beta + \frac{1}{n}\sum_{i=1}^{n} L\big(y_i,\ K(:, i)^T \beta\big) \tag{3.5} $$
where K(:, i) is the i-th column of the matrix K = [K_1, ..., K_n]. This problem can be solved directly with gradient descent, since all quantities except the primal vector β can be evaluated from the training set. The derivative with respect to β is
$$ 2\lambda K\beta + \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L}{\partial \beta}. \tag{3.6} $$
Setting equation (3.6) equal to zero and solving yields the estimate. In our application, we use the squared hinge loss L(y, X) = [max(0, 1 − y f(X))]². The training and prediction algorithms are described below as algorithm ?? and algorithm ??, respectively. In algorithm ??, the complexity of the training process is O(n² r² Σ_{j=1}^{d} I_j), which, under the low-rank assumption, is much smaller than the complexity O(n² Π_{j=1}^{d} I_j) of any vectorized method. In addition, we suggest saving the decomposed list of the training data when handling high-dimensional data in algorithm ??. The decomposed list is much smaller than the original data, which saves memory; keeping it instead of the original data also makes prediction faster, since one does not need to repeat the decomposition.

3.9.5 Estimation with Complete Tensor Data

As mentioned above, a CP decomposition is required to prepare the data for our algorithm. We can estimate the components of a rank-r tensor with this procedure and feed the estimates to our model. The estimated tensor decomposition is a unique approximation of the original data in most situations. Regarding this uniqueness, we take the proposition from Kolda & Bader (2009):

Proposition 1. Let X be a d-way tensor with rank r that can be expressed as X = Σ_{k=1}^{r} x_k^{(1)} ⊗ x_k^{(2)} ⊗ ... ⊗ x_k^{(d)} = [X^{(1)}, X^{(2)}, ..., X^{(d)}], where the columns of each X^{(j)} ∈ R^{I_j × r} are x_k^{(j)}, k = 1, ..., r; j = 1, ..., d. The CP decomposition of this tensor is unique if
$$ \sum_{j=1}^{d} R\big(X^{(j)}\big) \ \ge\ 2r + d - 1 \tag{3.7} $$
where R(X^{(j)}) is the column rank of the matrix X^{(j)}.

As a result, the probability of misclassification is identical, i.e.
$$ P(\hat{y} \ne y \mid X) = P(\hat{y} \ne y \mid \hat{X}), \tag{3.8} $$
where $\hat{X}$ is the tensor with the estimated CP decomposition. For more details on the uniqueness of the decomposition, we refer to Kolda & Bader (2009) and Sidiropoulos & Bro (2000). In our application, we use Alternating Least Squares (ALS) to estimate the decomposition.

The second issue about our model is the rank assumption. We assume all tensor data ideally come from the same tensor space with rank r. In most situations we do not know the rank r a priori; we prescribe the rank r when pre-processing the data for training, by finding the one that provides the best approximation in the tensor decomposition.

3.10 Statistical Property of STM

In the last part of our theoretical results, we want to highlight the performance and generalization ability of our classifier for tensors. In the general evaluation of a decision rule, one is interested in bounding its classification risk. For example, assume our data for classification are {(x, y) ∈ X* × Y}. For a decision rule f on data generated from X* × Y, the classification risk of this rule is defined as
$$ R(f) = \int_{X^* \times Y} \Pr\big(f(x) \ne y \mid x\big)\, \mu(dx)\, \mu(dy), \tag{3.9} $$
where µ are measures defined on X* and Y. The lower bound of this risk is the Bayes risk, which we denote by R*.
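Before turning to the risk analysis, here is a minimal sketch of the gradient-descent solver for (3.5)–(3.6) with the squared hinge loss described in Section 3.9.4. The Gram matrix K is assumed precomputed (for example as in the earlier kernel sketch); the step size, iteration count and toy data are our choices, not the authors' settings.

```python
# Minimal sketch of the gradient-descent solver for (3.5) with the squared hinge loss
# L(y, X) = [max(0, 1 - y f(X))]^2, where f(X_i) = K(:, i)^T beta.
import numpy as np

def fit_stm(K, y, lam=0.1, lr=0.01, n_iter=500):
    n = K.shape[0]
    beta = np.zeros(n)
    for _ in range(n_iter):
        margins = y * (K @ beta)                  # y_i * f(X_i)
        slack = np.maximum(0.0, 1.0 - margins)    # active part of the squared hinge loss
        grad = 2 * lam * (K @ beta) - (2.0 / n) * (K @ (y * slack))   # equation (3.6)
        beta -= lr * grad
    return beta

def predict(K_test_train, beta):
    """Labels for test points given their kernel values against the training set."""
    return np.sign(K_test_train @ beta)

# toy usage with a random positive semi-definite Gram matrix and random labels
rng = np.random.default_rng(4)
B = rng.standard_normal((10, 10))
K = B @ B.T
y = np.where(rng.standard_normal(10) > 0, 1.0, -1.0)
beta = fit_stm(K, y)
print(np.mean(predict(K, beta) == y))             # training accuracy of the sketch
```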
For simplicity, we consider the binary classification case where Y = {−1, 1}. In addition, we assume there is no noise in the problem such that ∀X ∈ X ∗, Pr(f (X = 1)|X) (cid:54)= Pr(f (X = −1)|X) (3.10) In other words, we will not consider the situation where posterior distribution does not provide a decision and one can only guess randomly. However, a classification rule learned directly from a given training set usually will not be able to reach the Bayes risk. In fact, we can only estimate the empirical risk for a rule when an observation with length n is given, which is n(cid:88) i=1 ˆR(f ) = 1 n I{f (xi) (cid:54)= yi} (3.11) {(xi, yi), i = 1, ..., n} is our training set. We always try to minimize this empirical risk with some methods, and find out the optimal classifier ˆf for this risk. This is what we called Empirical Risk Minimization(ERM) in general statistical learning problems. The evaluation of the method we followed to find ˆf depends on the bound δn = |R∗ − R( ˆf )| (3.12) If this quantity δ converges to zero as the sample size increase, then the classifier is consistent and 106 the method we follow is statistically sound. This quantity can be bounded by δn (cid:54) |R∗ − R∗(f )| + |R∗(f ) − R( ˆf )| (3.13) where R∗(f ) = inf f∈H R(f ) is the minimal risk of a collection of classifiers H. According to the result from Steinwart & Christmann (2008), we have the following theorem: Theorem 3.10.1. If K is a universal kernel on a compact subset of tensor space X ∗. The loss function L is Lipschitz continuous. Then R∗ = R∗(f ). One can see Steinwart & Christmann (2008) for the proof of this theorem. The intuition of this theorem is pretty simple since we have shown the universal approximation property of tensor based kernel functions. This result will always hold in the classification problems since the loss functions such as hinge loss are always Lipschitz continuous. The consistency problem turns out to be finding bounds for |R∗(f ) − R( ˆf )|. Our next result shows the convergence of this part. Theorem 3.10.2. Let V be a compact subspace of the algebraic tensor space X ∗. Let K be a cumulant-based kernel function for tensor that is universal on the V, and |K|∞ (cid:54) 1. If we assume the tensor space has dimension p = I1 × I2 × ... × Id < ∞. Then for all Borel probaility measure Pr on (X ∗ × Y) satisfying Pr(f (X = 1)|X) (cid:54)= Pr(f (X = −1)|X), we have R( ˆfn) → R∗(f ) n → ∞ (3.14) in probability. 107 APPENDIX 108 a(2) i2,j2 ....a(l) il,jl ..a(k) ik,jk = ˜ai,j (cid:107)(< ˜AK i , x >)(cid:107)2 j=1 x2 j (˜ai,j)2 2 = E(cid:80)P + (cid:80)P (cid:80)P Proof. Define a(1) i1,j1 j=1 xjx(cid:48) j(cid:48)=1 j(cid:54)=j(cid:48) j ˜ai,j ˜ai,j(cid:48) APPENDIX Main lemmas Lemma 3.10.3. E(< ˜AK i , x >) = 0 for any row i Lemma 3.10.4. (cid:107)(< ˜AK i , x >)(cid:107)2 = (cid:107)x(cid:107)2 for any row i Second term is zero due to conditional expectation property. First we try to estimate the r th moment of S2 d(x) = xT ˜A(K),T ˜A(K)x dK = (cid:80)d i=1 (cid:107) ˜A(K) i x(cid:107)2 2 dK (cid:4) in terms 1 = ( ˜A(K) Y 2 1 x)2, since {Yi}dk i=1 are exchangeable, stationary, weak ρ mixing sequence. Unfortunately, Lemma 3.10.5. Rosenthal Inequality for rho mixing sequences For distribution following sparse distribution P. Li et al. (2006) Assume E(Y ) = 0 and (cid:107)Y (cid:107)2r 2r < ∞ then there exists a positive constant C = Cr,ρ such that E max 1≤i≤dk (|Sdk|)2r ≤ C(dr(cid:107)x(cid:107)2r 2 + d(2r − 1)rK(cid:107)x(cid:107)2r 2 ) Proof. Q.-M. 
Shao (1995) theorem 1, using ρ = 0 and (cid:107)Y1(cid:107)2 using result that (cid:107)Y1(cid:107)2r 2r ≤ (2r − 1)rK(cid:107)Y1(cid:107)2r 2 (cid:107) 2 = (cid:107)x(cid:107)2 2 and exchangeability of Yi now (cid:4) Lemma 3.10.6. Hypercontractivity for Gaussian distribution Suppose Y is k degree Gaussian polynomial chaos in some Gaussian Hilbert space, then (cid:107)Sd(cid:107)2r 2r ≤ (2r − 1)Kr(cid:107)Sd(cid:107)2r 2 = (r − 1)K/2d(cid:107)Y (cid:107)2 2 = (cid:107)x(cid:107)2 2 Proof. Janson et al. (1997) theorem 5.10 and theorem 6.7 (cid:4) 109 Lemma 3.10.7. Hypercontractivity for Gaussian distribution Suppose Y is k degree Gaussian polynomial chaos in some Gaussian Hilbert space, then 2r ≤ (2r − 1)Kr(cid:107)Sdk(cid:107)2r 2 = (r − 1)K/2d(cid:107)Y (cid:107)2 2 = (cid:107)x(cid:107)2 (cid:107)Sdk(cid:107)2r 2 Proof. Janson et al. (1997) theorem 5.10 and theorem 6.7 (cid:4) Lemma 3.10.8. Hypercontractivity for very sparse distribution (cid:107)Sdk(cid:107)r ≤ Cr,K(cid:107)Sdk(cid:107)2 Proof. Janson et al. (1997) lemma 5.2 proves hypercontractivity for Raedmacher variable Z. Ob- serve that for any constant H , same will hold for Z(cid:48) = HZ. Define a new random variable V , a ∼ V Z(cid:48) any two point independent of Z(cid:48)such that P r(V = d1/4) = 1√ and P r(V = 0) = 1 − 1√ d d distribution is hyper contractive as well Janson et al. (1997) lemma 5.2. So, due to independence their product V Z(cid:48) is hyper contractive (cid:4) 110 BIBLIOGRAPHY 111 BIBLIOGRAPHY Achlioptas, D. (2003). Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66 (4), 671–687. Adamczak, R., et al. (2015). A note on the hanson-wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20 . Avants, B. B., Tustison, N. J., Song, G., Cook, P. A., Klein, A., & Gee, J. C. (2011). A reproducible evaluation of ants similarity metric performance in brain image registration. Neuroimage, 54 (3), 2033–2044. Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor” meaningful? In International conference on database theory (pp. 217–235). Bickel, P. J., Levina, E., et al. (2004). Some theory for fisher’s linear discriminant func- tion,naive bayes’, and some alternatives when there are many more variables than obser- vations. Bernoulli, 10 (6), 989–1010. Cai, T., & Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106 (496), 1566–1577. Chi, E. C., & Kolda, T. G. (2012). On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33 (4), 1272–1299. Chu, T., Zhu, J., Wang, H., et al. (2011). Penalized maximum likelihood estimation and variable selection in geostatistics. The Annals of Statistics, 39 (5), 2607–2625. Cox, D. D., & Savoy, R. L. (2003). Functional magnetic resonance imaging (fmri)“brain reading”: detecting and classifying distributed patterns of fmri activity in human visual cortex. Neuroimage, 19 (2), 261–270. Cressie, N., & Huang, H.-C. (1999). Classes of nonseparable, spatio-temporal stationary covariance functions. Journal of the American Statistical Association, 94 (448), 1330– 1339. Cressie, N., & Lahiri, S. N. (1996). Asymptotics for reml estimation of spatial covariance parameters. Journal of Statistical Planning and Inference, 50 (3), 327–341. De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). 
A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21 (4), 1253–1278. Domingos, P. M. (2012). A few useful things to know about machine learning. Commun. acm, 55 (10), 78–87. Doshi, J., Erus, G., Ou, Y., Gaonkar, B., & Davatzikos, C. (2013). Multi-atlas skull-stripping. Academic radiology, 20 (12), 1566–1576. 112 Fan, J., & Fan, Y. (2008). High dimensional classification using features annealed indepen- dence rules. Annals of statistics, 36 (6), 2605. Fan, J., Feng, Y., & Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74 (4), 745–771. Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96 (456), 1348–1360. Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20 (1), 101. Furrer, R., Bachoc, F., & Du, J. (2016). Asymptotic properties of multivariate tapering for estimation and prediction. Journal of Multivariate Analysis, 149 , 177–191. Furrer, R., Genton, M. G., & Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15 (3), 502–523. Ghosh, D. (2001). Singular value decomposition regression models for classification of tumors from microarray experiments. In Biocomputing 2002 (pp. 18–29). World Scientific. Gutman, B. A., Hua, X., Rajagopalan, P., Chou, Y.-Y., Wang, Y., Yanovsky, I., . . . others (2013). Maximizing power to track alzheimer’s disease and mci progression by lda-based weighting of longitudinal ventricular surface features. Neuroimage, 70 , 386–401. Hackbusch, W. (2012). Tensor spaces and numerical tensor calculus (Vol. 42). Springer Science & Business Media. He, L., Kong, X., Yu, P. S., Yang, X., Ragin, A. B., & Hao, Z. (2014). Dusk: A dual structure- In preserving kernel for supervised tensor learning with applications to neuroimages. Proceedings of the 2014 siam international conference on data mining (pp. 127–135). He, X., Cai, D., & Niyogi, P. (2006). Tensor subspace analysis. In Advances in neural information processing systems (pp. 499–506). Hiai, F., & Lin, M. (2017). On an eigenvalue inequality involving the hadamard product. Linear Algebra and its Applications, 515 , 313–320. Janson, S., et al. (1997). Gaussian hilbert spaces (Vol. 129). Cambridge university press. Kaufman, C. G., Schervish, M. J., & Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103 (484), 1545–1555. Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM review , 51 (3), 455–500. Kwon, S., & Kim, Y. (2012). Large sample properties of the scad-penalized maximum likelihood estimation on high dimensions. Statistica Sinica, 629–653. 113 Lata(cid:32)la, R., et al. (2006). Estimates of moments and tails of gaussian chaoses. The Annals of Probability, 34 (6), 2315–2331. Ledoux, M. (1999). Concentration of measure and logarithmic sobolev inequalities. In Seminaire de probabilites xxxiii (pp. 120–216). Springer. Li, L., & Zhang, X. (2017). Parsimonious tensor response regression. Journal of the American Statistical Association, 112 (519), 1131–1146. Li, P., Hastie, T. J., & Church, K. W. (2006). Very sparse random projections. 
In Proceedings of the 12th acm sigkdd international conference on knowledge discovery and data mining (pp. 287–296). Li, Q., & Schonfeld, D. (2014). Multilinear discriminant analysis for higher-order tensor data classification. IEEE transactions on pattern analysis and machine intelligence, 36 (12), 2524–2537. Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 13–22. Lock, E. F. (2018). Tensor-on-tensor regression. Journal of Computational and Graphical Statistics, 27 (3), 638–647. Lu, H., Plataniotis, K. N., & Venetsanopoulos, A. N. (2008). Mpca: Multilinear principal IEEE transactions on Neural Networks, 19 (1), component analysis of tensor objects. 18–39. Mai, Q., Zou, H., & Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99 (1), 29–42. Mat´ern, B. (1960). Spatial variation, volume 36 of. Lecture Notes in Statistics. Mourao-Miranda, J., Bokde, A. L., Born, C., Hampel, H., & Stetter, M. (2005). Classify- ing brain states and determining the discriminating activation patterns: support vector machine on functional mri data. NeuroImage, 28 (4), 980–995. Muschelli, J., Gherman, A., Fortin, J.-P., Avants, B., Whitcher, B., Clayden, J. D., . . . Crainiceanu, C. M. (2018). Neuroconductor: an r platform for medical imaging analysis. Biostatistics. Pan, Y., Mai, Q., & Zhang, X. (2018). Covariate-adjusted tensor classification in high dimensions. Journal of the American Statistical Association, 1–15. Petersen, R. C., Aisen, P., Beckett, L. A., Donohue, M., Gamst, A., Harvey, D. J., . . . others (2010). Alzheimer’s disease neuroimaging initiative (adni): clinical characterization. Neurology, 74 (3), 201–209. Popescu, V., Battaglini, M., Hoogstrate, W., Verfaillie, S. C., Sluimer, I., van Schijndel, R. A., . . . others (2012). Optimizing parameter choice for fsl-brain extraction tool (bet) on 3d t1 images in multiple sclerosis. Neuroimage, 61 (4), 1484–1494. 114 Raskutti, G., Yuan, M., Chen, H., et al. (2019). Convex regularization for high-dimensional multiresponse tensor regression. The Annals of Statistics, 47 (3), 1554–1584. Rathi, V., & Palani, S. (2012). Brain tumor mri image classification with feature selection and extraction using linear discriminant analysis. arXiv preprint arXiv:1208.2128 . Rohlfing, T., Zahr, N. M., Sullivan, E. V., & Pfefferbaum, A. (2008). The sri24 multichannel brain atlas: construction and applications. In Medical imaging 2008: Image processing (Vol. 6914, p. 691409). Samson, P.-M., et al. (2000). Concentration of measure inequalities for markov chains and φ-mixing processes. The Annals of Probability, 28 (1), 416–461. Schudy, W., & Sviridenko, M. (2012). Concentration and moment inequalities for polynomi- als of independent random variables. In Proceedings of the twenty-third annual acm-siam symposium on discrete algorithms (pp. 437–446). Shao, J., Wang, Y., Deng, X., Wang, S., et al. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of statistics, 39 (2), 1241–1265. Shao, Q.-M. (1995). Maximal inequalities for partial sums of ρ-mixing sequences. The Annals of Probability, 948–965. Sidiropoulos, N. D., & Bro, R. (2000). On the uniqueness of multilinear decomposition of n-way arrays. Journal of chemometrics, 14 (3), 229–239. Signoretto, M., De Lathauwer, L., & Suykens, J. A. (2011). A kernel-based framework to tensorial data analysis. Neural networks, 24 (8), 861–874. Steinwart, I., & Christmann, A. (2008). 
Support vector machines. Springer Science & Business Media. Sun, Y., Guo, Y., Tropp, J. A., & Udell, M. (2018). Tensor random projection for low memory dimension reduction. In Neurips workshop on relational representation learning. Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., & Zhuang, Y. (2012). Logistic tensor re- gression for classification. In International conference on intelligent science and intelligent data engineering (pp. 573–581). Tao, D., Li, X., Hu, W., Maybank, S., & Wu, X. (2005). Supervised tensor learning. In Data mining, fifth ieee international conference on (pp. 8–pp). Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99 (10), 6567–6572. Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2003). Class prediction by nearest shrunken centroids, with applications to dna microarrays. Statistical Science, 104–117. 115 Varah, J. M. (1975). A lower bound for the smallest singular value of a matrix. Linear Algebra and its Applications, 11 (1), 3–5. Wang, Y., Fan, Y., Bhatt, P., & Davatzikos, C. (2010). High-dimensional pattern regression using machine learning: from medical images to continuous clinical variables. Neuroimage, 50 (4), 1519–1535. Wimalawarne, K., Tomioka, R., & Sugiyama, M. (2016). Theoretical and experimental analyses of tensor-based regression and classification. Neural computation, 28 (4), 686– 715. Witten, D. M., & Tibshirani, R. (2011). Penalized classification using fisher’s linear discrim- inant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (5), 753–772. Wolkowicz, H., & Styan, G. P. (1980). Bounds for eigenvalues using traces. Linear algebra and its applications, 29 , 471–506. Yasui, Y., & Lele, S. (1997). A regression method for spatial disease rates: an estimating function approach. Journal of the American Statistical Association, 92 (437), 21–32. Yingjie, L., & Maiti, T. (2019). High dimensional discriminant analysis for spatially depen- dent data. In Progress. Yu, H., & Yang, J. (2001). A direct lda algorithm for high-dimensional data—with applica- tion to face recognition. Pattern recognition, 34 (10), 2067–2070. Yushkevich, P. A., Gao, Y., & Gerig, G. (2016). Itk-snap: An interactive tool for semi- automatic segmentation of multi-modality biomedical images. In Engineering in medicine and biology society (embc), 2016 ieee 38th annual international conference of the (pp. 3342–3345). Zhou, H., Li, L., & Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108 (502), 540–552. Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of statistics, 36 (4), 1509. 116