HIGH DIMENSIONAL CLASSIFICATION FOR SPATIALLY DEPENDENT DATA WITH APPLICATION TO NEUROIMAGING

By

Yingjie Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2018

ABSTRACT

HIGH DIMENSIONAL CLASSIFICATION FOR SPATIALLY DEPENDENT DATA WITH APPLICATION TO NEUROIMAGING

By

Yingjie Li

The methods developed in this thesis are motivated by the problem of using structural magnetic resonance imaging (MRI) data to predict Alzheimer's disease (AD) or to discriminate between healthy subjects and AD patients. Imaging data is a typical example of spatially dependent data, where the correlation between data points collected at different voxels (pixels) can be described by proximity. It is also high dimensional data, since the number of voxels is extremely large compared to the number of subjects.

The first piece of work uses longitudinal volumetric MRI data from five regions of interest (ROIs), which are known to be vulnerable to Alzheimer's disease, for prediction. A longitudinal data prediction method based on functional data analysis is applied to identify when an early prediction can reasonably be made and to determine whether one ROI is superior to the others in predicting progression to AD. Using statistically validated procedures, we compared the prediction performance based on individual ROIs as well as their combinations. For all the models, the results show that the overall prediction accuracy one, two, and three years in advance is above 80%. MCI converter subjects can be correctly detected as early as two years prior to conversion.

The second piece of work uses voxel-level MRI data for classification of AD patients and healthy subjects. A supervised learning method based on linear discriminant analysis (LDA) was developed for high dimensional spatially dependent data. The theory shows that the proposed method achieves consistent parameter estimation, consistent feature selection, and an asymptotically optimal misclassification rate. An extensive simulation study showed a significant improvement in classification performance under spatial dependence. We applied the proposed method to voxel-level MRI data for classification, and its classification performance is superior to that of other comparable methods.

Copyright by
YINGJIE LI
2018

I dedicate this dissertation to my family, especially to my dearest grandmother Chen, Dezhen.

ACKNOWLEDGMENTS

I would like to express my sincerest gratitude to my advisor, Professor Tapabrata Maiti, for his invaluable assistance, constructive guidance, and immense knowledge. His vision of statistics and his enthusiasm for research encouraged me all the time. I am fully indebted to him for his understanding, patience, and encouragement. This thesis could not have been completed without his continuous support of me and my family. I would like to thank Professor David Zhu for sharing his precious experience of working on brain imaging data, for introducing many of the interesting practical problems in Alzheimer's disease, and for serving as one of my committee members. I also wish to thank Professor Chae Young Lim and Professor Pingshou Zhong for serving on my dissertation committee. I am extremely grateful for their constructive advice on this thesis work.
I would also thank the entire faculty and staff members in the Department of Statistics and Probability who have helped me during my study in Michigan State University. Last I would like to thank my parents and my sister for loving me and supporting me at every stage of my life. Many thanks go to my husband Yuzhen for standing by my side always. I give my special thanks to my daughter Aurora, who brought a wonderful light to my life. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Spatial statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Linear discriminant analysis (LDA) for high dimensional data . . . . . . . . 1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 2 Early prediction of Alzheimer’s disease using longitudinal vol- umetric MRI data from ADNI . . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Functional Principal Component Analysis with longitudinal data . . . 2.2.2 Early prediction and choosing trajectory . . . . . . . . . . . . . . . . 2.2.3 Prediction with logistic classifier . . . . . . . . . . . . . . . . . . . . . 2.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Prediction using hippocampus . . . . . . . . . . . . . . . . . . . . . . 2.4.2 Prediction results using single ROI . . . . . . . . . . . . . . . . . . . 2.4.3 Prediction results using combinations of ROIs . . . . . . . . . . . . . 2.4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix xi 1 1 2 6 10 13 13 17 17 21 22 24 28 28 32 33 35 37 40 3.3.1 The penalized maximum likelihood estimation (PMLE) Chapter 3 High dimensional discriminant analysis for spatially dependent data and its application in neuroimaging data . . . . . . . . . . 44 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 44 3.2 Classification using maximum likelihood estimation (MLE-LDA) . . . . . . . 50 3.3 Classification using penalized maximum likelihood estimation (PLDA) . . . . 54 . . . . . . . 54 3.3.1.1 Consistency of PMLE . . . . . . . . . . . . . . . . . . . . . 58 3.3.1.2 Covariance tappering and PMLE . . . . . . . . . . . . . . . 60 3.3.2 The penalized maximum likelihood estimation LDA (PLDA) classifier 62 3.4 Simulation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.6 Proofs of the main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.6.1 Proofs for classification using MLE . . . . . . . . . . . . . . . . . . . 75 3.6.2 Proofs for consistency of PMLE . . . . . . . . . . . . . . . . . . . . . 93 3.6.3 Proofs for consistency for PMLE with tappering . . . . . . . . . . . . 
103 3.6.4 Proofs for classification using PLDA . . . . . . . . . . . . . . . . . . 108 vii 3.6.5 Remarks on the assumptions . . . . . . . . . . . . . . . . . . . . . . . 116 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 viii LIST OF TABLES Table 2.1 Subjects characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 25 Table 2.2 Data characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Table 2.3 Subjects characteristics by year . . . . . . . . . . . . . . . . . . . . 28 Table 2.4 Parameter estimation (hippocampus for 1y prediction) . . . . . . . . 29 Table 2.5 Prediction accuracy rates (hippocampus) . . . . . . . . . . . . . . . 32 Table 2.6 Prediction performance using single ROI . . . . . . . . . . . . . . . 34 Table 2.7 Prediction performance of combinations of two ROIs . . . . . . . . . 41 Table 2.8 Prediction performance of combinations of three ROIs . . . . . . . 42 Table 2.9 Prediction performance of combinations of four or more ROIs . . . 43 Table 3.1 Classification Accuracy Rate (Exponential covariance, p=36) . . . . 67 Table 3.2 Classification Accuracy Rate (Exponential covariance, p=400) . . . 67 Table 3.3 Classification Accuracy Rate (Exponential covariance, p=1225) . . 67 Table 3.4 Parameter estimation (Exponential covariance) . . . . . . . . . . . 68 Table 3.5 Number of variables selected (Exponential covariance, p=36) . . . . 68 Table 3.6 Number of variables selected (Exponential covariance, p=400) . . . 68 Table 3.7 Number of variables selected (Exponential covariance, p=1225) . . . 69 Table 3.8 Classification Accuracy Rate (Polynomial covariance, p=36) . . . . 69 Table 3.9 Classification Accuracy Rate (Polynomial covariance, p=400) . . . 69 Table 3.10 Classification Accuracy Rate (Polynomial covariance, p=1225) . . . 69 Table 3.11 Parameter estimation (Polynomial covariance) . . . . . . . . . . . . 70 ix Table 3.12 Number of variables selected (Polynomial covariance, p=36) . . . . . 70 Table 3.13 Number of variables selected (Polynomial covariance, p=400) . . . . 70 Table 3.14 Number of variables selected (Polynomial covariance, p=1225) . . . 71 Table 3.15 Classification Accuracy Rate (Mis-specified covariance, p=36) . . . 71 Table 3.16 Classification Accuracy Rate (Mis-specified covariance, p=400) . . . 71 Table 3.17 Classification Accuracy Rate (Mis-specified covariance, p=1225) . . 71 Table 3.18 Subjects characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 73 Table 3.19 Subjects characteristics of training and testing set . . . . . . . . . . 74 Table 3.20 Classification performance for voxel level MRI data. Training and testing samples are of sizes 200 and 174, respectively . . . . . . . . . 74 x LIST OF FIGURES Figure 1.1 left: T1-weighted MRI of AD subject; right: TI-weighted MRI of NL subject. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Figure 2.1 Longitudinal volume of ROIs for MCI subjects. In (a)-(e), the value on Y axis is the normalized volume. In (f), the value on Y axis is the ICV (mm3). The value on X axis is “disease year”. Thin lines are observations for each subject. Blue lines are for MCI-c subjects and red lines are for MCI-nc subjects. Blue and red thick lines are pooled mean curves for MCI-c group and MCI-nc group respectivly. . . . . Figure 2.2 PACE analysis using hippocampus for 1-year prediction. (a)-(c) show the estimations of mean function µ(t), covariance surface G(t) and the first two eigen functions φ1(t) and φ2(t). 
(d) shows the first two eigenfunctions explained 98.907% of the total variance. . . . 30
Figure 2.3 Second versus first FPC scores for hippocampus (1-year prediction). The triangles indicate MCI-nc and the crosses indicate MCI-c. . . . 31
Figure 2.4 ROC curve for prediction using hippocampus. Solid line, dotted line and dot-dash line are the ROC curves for 1-year, 2-year and 3-year prediction respectively. . . . 32
Figure 2.5 Prediction sensitivity (a), specificity (b), accuracy (c), and AUC (d) using single and combinations of ROIs. In each panel, every bar represents a prediction result using a different combination of biomarkers (from left to right): hippocampus (H), whole brain (WB), entorhinal cortex (EC), fusiform gyrus (FG), middle temporal cortex (MTC), H+WB, H+EC, H+FG, H+MTC, WB+EC, WB+FG, WB+MTC, EC+FG, EC+MTC, FG+MTC, H+WB+EC, H+WB+FG, H+WB+MTC, H+EC+FG, H+EC+MTC, H+FG+MTC, WB+EC+FG, WB+EC+MTC, WB+FG+MTC, EC+FG+MTC, H+WB+EC+FG, H+WB+EC+MTC, H+WB+FG+MTC, H+EC+FG+MTC, WB+EC+FG+MTC, and the combination of all five ROIs. . . . 38
Figure 3.1 Two dimensional domain example. Left: 2D domain with p = 4 × 4; middle: µ1; right: µ2. . . . 64

Chapter 1
Introduction

1.1 Motivation

Brain imaging provides a non-invasive way to observe the human central nervous system. A great deal of research is devoted to improving imaging technologies, data analysis, and the application of imaging to investigate neurological disorders, including Alzheimer's disease (AD), Parkinson's disease, and dementia. AD is the most common form of dementia. Much research focuses on finding better ways to treat the disease, delay its onset, and prevent it from developing. There is also great interest in developing objective, biologically based methods to diagnose and predict Alzheimer's disease, track its progression, and monitor the efficacy of treatment. Brain atrophy is a primary pathologic process in AD due to widespread neuronal death (see, e.g., [28]). As a non-invasive, widely available and cost-effective way of obtaining brain imaging, MRI data plays an important role as a potential biomarker to monitor atrophy progression and thereby the progression of AD. Figure 1.1 gives an example of the MRI of a subject with AD and that of a normal (NL) subject.

Figure 1.1 Left: T1-weighted MRI of an AD subject; right: T1-weighted MRI of an NL subject.

This work is motivated by how to use structural magnetic resonance imaging (sMRI) data of the brain to discriminate between healthy subjects and AD patients and to predict AD progression. Imaging data is a typical example of spatially dependent data, where the relationship between data points collected at different voxels (pixels) can be described by proximity. It is also high dimensional data, since the number of voxels is extremely large compared to the number of subjects. If we consider the signal from each voxel as a feature, we may have millions of features for classification. Motivated by these characteristics of brain imaging data, the goal of this thesis is to develop classification methods for high dimensional data with spatial dependency.
1.2 Spatial statistical models

Spatial statistical models are widely applied in many scientific disciplines, including geology, agriculture, climatology, ecology, epidemiology, forestry, astronomy, atmospheric science, real estate, brain imaging, and any discipline in which the spatial dependence among observations cannot be ignored in the data analysis. According to [42], there are three types of spatial data.

1. Geostatistical data. Let D be a continuous domain in the d-dimensional Euclidean space R^d. Let Y be the variable of interest in the study, such as the air temperature at different locations in the US. That D is a continuous domain means Y can be observed everywhere within D; that is, between any two locations s_i and s_j there are, theoretically, infinitely many locations at which Y is well defined. We use a spatial process {Y(s), s ∈ D} to model the variable Y. If data are collected at some locations s_1, ..., s_n in D, say Y(s_1), ..., Y(s_n), the data set is called geostatistical data.

2. Lattice data. Lattice data are spatial data where the domain D is fixed and discrete, in other words, non-random and countable. For example, lattice data could be data collected by ZIP code, the crime rate of a county, or remotely sensed data reported by pixel. The distinction between geostatistical data and lattice data is not always clear-cut. For example, imaging data with very low resolution can be considered lattice data, imaging data with very high resolution can be considered geostatistical data, and data with moderate resolution can be considered either one, depending on the goal of the study. For notational convenience, we use {Y(s_i), s_i ∈ D, i = 1, 2, ...} to represent lattice data.

3. Point patterns. Geostatistical and lattice data have in common a fixed, non-stochastic domain. The defining feature of point pattern data is a random domain. Locations of avian flu outbreaks around the world and locations of longleaf pines in a natural forest are examples of point patterns.

Imaging data can be considered geostatistical data. Geostatistical data can be modeled by a spatial random field {y(s), s ∈ D ⊂ R^d} such that

    y(s) = µ(s) + ε(s),

where µ(s) is the mean effect function and ε(s) is the random noise. Assume the error term {ε(s) : s ∈ D} is a Gaussian random field with mean zero and a covariance function γ(s, s′; θ) = Cov(ε(s), ε(s′); θ), where s, s′ ∈ D and θ is a q × 1 vector of covariance function parameters that can be estimated from the data. If we further assume the random field ε(s) is isotropic and stationary, the covariance function can be represented as γ(h; θ) = Cov(ε(s), ε(s′); θ), where h = ‖s − s′‖ is the Euclidean distance between s and s′.

There are many ways to model the covariance function γ(h; θ). A widely used family is the Matérn covariance function, defined as

    γ(h; σ², c, ν, r) := σ²(1 − c) \frac{2^{1−ν}}{Γ(ν)} (h/r)^{ν} K_{ν}(h/r),    (1.2.1)

where K_ν(·) is a modified Bessel function of the second kind, σ² > 0 is the variance, 0 ≤ c ≤ 1 is a nugget effect, ν > 0 is the smoothness parameter, and r > 0 is a scale parameter (see, e.g., [17]). First, the Matérn covariance function is isotropic: the correlation decreases as the distance h increases. Second, as ν increases, the smoothness of the random field increases, and the Matérn covariance function converges to the Gaussian covariance function γ(h; σ², c, r) = σ²(1 − c) exp(−h²/r²) as ν → ∞. Last, if ν = 1/2, (1.2.1) reduces to the well-known exponential covariance function γ(h; σ², c, r) = σ²(1 − c) exp(−h/r). The parameter r is called the range parameter, since it measures the distance at which the correlation has decreased below a certain threshold.
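To make the covariance model concrete, the following sketch builds a Matérn covariance matrix, (1.2.1), over a grid of two-dimensional locations. This is our own illustration, not code from the thesis; the function name matern_cov and the parameter values are assumptions. With ν = 1/2 it reproduces the exponential covariance noted above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv   # kv: modified Bessel function of the second kind

def matern_cov(sites, sigma2=1.0, c=0.0, nu=0.5, r=2.0):
    """Matern covariance matrix for locations `sites` (n x d array), following (1.2.1)."""
    h = cdist(sites, sites)            # pairwise Euclidean distances
    scaled = h / r
    cov = np.empty_like(h)
    pos = scaled > 0
    cov[pos] = (sigma2 * (1 - c) * (2 ** (1 - nu) / gamma(nu))
                * scaled[pos] ** nu * kv(nu, scaled[pos]))
    cov[~pos] = sigma2 * (1 - c)       # limit as h -> 0 (nugget not added to the diagonal here)
    return cov

# Example: a 6 x 6 grid of pixel centers; nu = 0.5 recovers the exponential covariance.
grid = np.array([(i, j) for i in range(6) for j in range(6)], dtype=float)
Sigma = matern_cov(grid, sigma2=1.0, c=0.0, nu=0.5, r=2.0)
print(Sigma.shape, np.allclose(Sigma, np.exp(-cdist(grid, grid) / 2.0)))   # True
```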
Consider the spatial statistical model with replications. Assume there are n independent samples of the spatial random field: y_1(s), y_2(s), ..., y_n(s). Suppose that for each sample of the spatial random field there are observations at p discrete sites s_1, ..., s_p ∈ D. Let Y_{ij} = y_i(s_j) be the observation at the jth site from the ith sample of the spatial random field. For i = 1, 2, ..., n and j = 1, 2, ..., p, the model can be represented by

    Y_{ij} = µ_j + ε_{ij},    (1.2.2)

where µ_j = µ(s_j) is the mean effect at the jth location and ε_{ij} = ε_i(s_j) is the corresponding Gaussian random noise at the jth location for the ith sample. We can write the model in matrix notation. For i = 1, 2, ..., n, let Y_i = (Y_{i1}, ..., Y_{ip})^T be a p-dimensional vector. Then Y_i = µ + ε_i, where µ = (µ_1, ..., µ_p)^T is the mean vector and ε_i = (ε_{i1}, ..., ε_{ip})^T has a multivariate normal distribution N(0, Σ). Since ε(s) has covariance function γ(h; θ), the covariance matrix Σ can be represented by Σ(θ) = [γ(h_{ij}; θ)]_{i,j=1}^{p}, where h_{ij} = ‖s_i − s_j‖, i.e. the (i, j)th entry of Σ(θ) is γ(h_{ij}; θ).

Maximum likelihood estimation (MLE) is often used to estimate such a parametric model (see, e.g., [37]). Under model (1.2.2), Y_i ∼ N(µ, Σ(θ)). Let Y = (Y_1^T, ..., Y_n^T); the MLE of (µ, θ) is obtained by maximizing the log-likelihood function

    log l(µ, θ; Y) = −\frac{np}{2} log(2π) − \frac{n}{2} log|Σ(θ)| − \frac{1}{2} Σ_{i=1}^{n} (Y_i − µ)^T Σ^{−1}(θ)(Y_i − µ).

The unique µ maximizing log l(µ, θ; Y) is µ̂_MLE = (1/n) Σ_{i=1}^{n} Y_i = Ȳ. Plugging µ̂ into log l(µ, θ; Y), we obtain

    L(θ; Y) = −\frac{n}{2} log|Σ(θ)| − \frac{1}{2} Σ_{i=1}^{n} (Y_i − µ̂)^T Σ^{−1}(θ)(Y_i − µ̂).

Then θ̂_MLE can be obtained by maximizing L(θ; Y). Usually there is no closed-form solution, and numerical methods are applied to obtain it.
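As an illustration of this estimation scheme, the sketch below profiles out µ̂ = Ȳ and maximizes L(θ; Y) numerically for the exponential covariance (ν = 1/2). It is a minimal sketch under our own choices — the function names, the simulated data, and the use of a Nelder–Mead optimizer are assumptions, not the thesis code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def exp_cov(sites, sigma2, r):
    # exponential covariance: gamma(h) = sigma2 * exp(-h / r)
    return sigma2 * np.exp(-cdist(sites, sites) / r)

def profile_neg_loglik(log_theta, Y, sites):
    """-L(theta; Y), up to a constant, with mu profiled out by its MLE (the sample mean)."""
    sigma2, r = np.exp(log_theta)               # optimize on the log scale to keep parameters positive
    Sigma = exp_cov(sites, sigma2, r)
    n = Y.shape[0]
    resid = Y - Y.mean(axis=0)                  # plug in mu_hat = Ybar
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.sum(resid * np.linalg.solve(Sigma, resid.T).T)
    return 0.5 * n * logdet + 0.5 * quad

# Simulated example: n = 50 replications observed on a 5 x 5 grid (p = 25).
rng = np.random.default_rng(0)
sites = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
Y = rng.multivariate_normal(np.zeros(25), exp_cov(sites, 1.0, 2.0), size=50)

fit = minimize(profile_neg_loglik, x0=np.log([0.5, 1.0]), args=(Y, sites), method="Nelder-Mead")
print("MLE of (sigma2, r):", np.exp(fit.x))
```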
1.3 Linear discriminant analysis (LDA) for high dimensional data

Consider the p-dimensional discriminant problem between two classes C_1 and C_2. According to some classification rule T(x) : R^p → {1, 2}, a new observation X = x is classified into class C_1 (T(X) = 1) or C_2 (T(X) = 2). If X ∈ C_1, the misclassification rate is the probability that it is classified into class C_2, i.e. P(T(X) = 2 | X ∈ C_1). Similarly, P(T(X) = 1 | X ∈ C_2) is the misclassification rate if X ∈ C_2. Here X is a p-dimensional random variable, and by abuse of notation we write X ∈ C_k to denote that the observation X is from class C_k. It can be shown that the optimal classifier in terms of minimizing the misclassification rate is the Bayes rule, which classifies the new observation into the most probable class using the posterior probability of the observation (see Chapter 2 in [23]). Suppose f_k(x) is the conditional density of an observation X in class C_k (k = 1, 2). Let π_k be the prior probability of class k, with π_1 + π_2 = 1. By Bayes' theorem, the posterior probability that an observation X = x belongs to each class is

    P(X ∈ C_k | X = x) = \frac{f_k(x) π_k}{f_1(x) π_1 + f_2(x)(1 − π_1)}.

There are many ways to model the class densities. Suppose the densities of the two classes are multivariate Gaussian, N(µ_1, Σ) and N(µ_2, Σ) respectively, where µ_k (k = 1, 2) are the class mean vectors and Σ is the common positive definite covariance matrix. Then the density of an observation X = x from C_k is

    f_k(x) = \frac{1}{(2π)^{p/2} |Σ|^{1/2}} exp(−\frac{1}{2}(x − µ_k)^T Σ^{−1}(x − µ_k)).

Under this assumption, the Bayes rule assigns X = x to C_1 if π_1 f_1(x) ≥ π_2 f_2(x). Equivalently, x is assigned to C_1 if

    log(π_1/π_2) + (x − µ̄)^T Σ^{−1}(µ_1 − µ_2) ≥ 0,

where µ̄ = (µ_1 + µ_2)/2. Note that this classifier is linear in x.

In practice, the parameters of the Gaussian distributions are unknown, and we estimate them using training data. Suppose Y_{k1}, ..., Y_{kn_k} are from class C_k, where k ∈ {1, 2} and the Y_{kj} ∈ R^p are independent and identically distributed as N_p(µ_k, Σ(θ_0)), with µ_k = (µ_{k1}, ..., µ_{kp})^T and n_k the sample size for class C_k. Here Σ(θ) is the covariance matrix with parameter θ = θ_0. Let n = n_1 + n_2 be the total sample size, and assume n_1/n → π with 0 < π < 1 as n → ∞; p depends on n. Assume the two classes have equal prior probabilities, i.e. the probability that a new observation comes from class C_1 and the probability that it comes from class C_2 are both 1/2. Consider the classification rule δ̂:

    δ̂(X) = (X − \frac{µ̂_1 + µ̂_2}{2})^T Σ̂^{−1} Δ̂ > 0,    (1.3.1)

where µ̂_1, µ̂_2, Σ̂ and Δ̂ are estimates of µ_1, µ_2, Σ and Δ from the data, and Δ = (Δ_1, ..., Δ_p)^T = µ_1 − µ_2 is the difference of the two class means. A new observation x is classified into class C_1 if δ̂(x) > 0 and into C_2 otherwise. Let Θ = (µ_1, µ_2, θ). If the new observation X comes from C_1, then the conditional misclassification rate of δ̂ is

    W_1(δ̂; Θ) = P(δ̂(X) ≤ 0 | X ∈ C_1, Y_{ki}, i = 1, ..., n_k, k = 1, 2) = 1 − Φ(Ψ_1),    (1.3.2)

where, with µ̂ = (µ̂_1 + µ̂_2)/2,

    Ψ_1 = \frac{(µ_1 − µ̂)^T Σ̂^{−1}(µ̂_1 − µ̂_2)}{\sqrt{(µ̂_1 − µ̂_2)^T Σ̂^{−1} Σ Σ̂^{−1}(µ̂_1 − µ̂_2)}}.    (1.3.3)

Similarly we can define the error rate for observations from C_2. If a new observation X comes from class C_2, the conditional misclassification rate of δ̂ is

    W_2(δ̂; Θ) = P(δ̂(X) > 0 | X ∈ C_2, Y_{ki}, k = 1, 2; i = 1, ..., n_k) = Φ(Ψ_2),    (1.3.4)

where

    Ψ_2 = \frac{(µ_2 − µ̂)^T Σ̂^{−1}(µ̂_1 − µ̂_2)}{\sqrt{(µ̂_1 − µ̂_2)^T Σ̂^{−1} Σ Σ̂^{−1}(µ̂_1 − µ̂_2)}}.    (1.3.5)

Since we assumed equal prior probabilities for the two classes, the overall misclassification rate is

    W(δ̂; Θ) = \frac{1}{2}(W_1(δ̂; Θ) + W_2(δ̂; Θ)).    (1.3.6)

If µ_1, µ_2 and Σ are known, the optimal classification rule is the Bayes rule, which classifies a new observation X = x into class C_1 if

    δ(x) = (x − \frac{µ_1 + µ_2}{2})^T Σ^{−1} Δ > 0.    (1.3.7)

The Bayes rule has the smallest misclassification rate. For a new observation X from class C_1, since X has normal distribution N(µ_1, Σ(θ)), we can calculate that the conditional misclassification rate of the Bayes rule δ is

    W_1(δ; Θ) = W_2(δ; Θ) = 1 − Φ(\frac{\sqrt{C_p}}{2}),

where C_p = Δ^T Σ^{−1}(θ) Δ and Φ(·) is the standard Gaussian distribution function. The overall misclassification rate of the Bayes rule is

    W(δ; Θ) = \frac{1}{2}(W_1(δ; Θ) + W_2(δ; Θ)) = 1 − Φ(\frac{\sqrt{C_p}}{2}).    (1.3.8)

Since the Bayes rule has the smallest misclassification rate, we write W_OPT = 1 − Φ(√C_p/2) for the optimal misclassification rate.

In practice the parameters Σ(θ), µ_1, µ_2 are unknown, and we need to estimate them from data. Linear discriminant analysis (LDA) estimates them with the sample estimates (k = 1, 2):

    µ̂_k = Σ_{i=1}^{n_k} Y_{ki}/n_k = Ȳ_{k·},    (1.3.9)

    Σ̂ = Σ_k Σ_i (Y_{ki} − µ̂_k)(Y_{ki} − µ̂_k)^T / (n_1 + n_2 − 2).    (1.3.10)

Then LDA classifies X into class C_1 if

    δ̂_LDA(X) = (X − \frac{1}{2}(Ȳ_{1·} + Ȳ_{2·}))^T Σ̂^{−1}(Ȳ_{1·} − Ȳ_{2·}) > 0.    (1.3.11)
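For a concrete, low dimensional illustration of the plug-in rule (1.3.11) — before turning to the high dimensional difficulties discussed next — here is a minimal sketch; the function names and the simulated AR(1)-type covariance are our own choices, not part of the thesis.

```python
import numpy as np

def lda_fit(Y1, Y2):
    """Sample estimates used by classical LDA: class means and pooled covariance, (1.3.9)-(1.3.10)."""
    n1, n2 = len(Y1), len(Y2)
    m1, m2 = Y1.mean(axis=0), Y2.mean(axis=0)
    S = ((Y1 - m1).T @ (Y1 - m1) + (Y2 - m2).T @ (Y2 - m2)) / (n1 + n2 - 2)
    return m1, m2, S

def lda_predict(X, m1, m2, S):
    """Classify rows of X into class 1 if delta_hat(x) > 0, following (1.3.11)."""
    w = np.linalg.solve(S, m1 - m2)                 # Sigma_hat^{-1} (mu1_hat - mu2_hat)
    scores = (X - (m1 + m2) / 2.0) @ w
    return np.where(scores > 0, 1, 2)

# Low-dimensional example (p << n), where the pooled covariance is invertible.
rng = np.random.default_rng(1)
p, n1, n2 = 10, 100, 100
mu1, mu2 = np.zeros(p), 0.8 * np.ones(p) / np.sqrt(p)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1)-type correlation
Y1 = rng.multivariate_normal(mu1, Sigma, n1)
Y2 = rng.multivariate_normal(mu2, Sigma, n2)
m1, m2, S = lda_fit(Y1, Y2)

Xtest = np.vstack([rng.multivariate_normal(mu1, Sigma, 500),
                   rng.multivariate_normal(mu2, Sigma, 500)])
labels = np.repeat([1, 2], 500)
print("test misclassification rate:", np.mean(lda_predict(Xtest, m1, m2, S) != labels))
```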
However, LDA is not applicable in the high dimensional setting (p ≫ n), for two reasons. First, the sample covariance matrix Σ̂ is singular, and it is difficult to estimate the precision matrix Ω = Σ^{−1}. Second, even when the true covariance is known, the classification performance for p ≫ n is poor due to the error accumulated in estimating the p-dimensional features. Fan and Fan (2008) [19] and Shao et al. (2011) [46] each show theoretically that variable selection is critical. Although there is a large literature on high dimensional classification seeking better classification performance under a general correlation structure in Σ (see, e.g., [5, 9, 15, 19, 20, 33, 35, 36, 43, 46, 48, 51, 52, 55]), little has been said about high dimensional classification for spatially dependent data. In this thesis we fill in this gap. As defined in [46], a classifier δ̂ is (1) asymptotically optimal if W(δ̂; Θ)/W_OPT → 1 in probability, and (2) asymptotically sub-optimal if W(δ̂; Θ) − W_OPT → 0 in probability. Our goal in this thesis is to develop an asymptotically optimal classifier.

1.4 Overview

The rest of this dissertation is organized as follows. As a pilot study, in Chapter 2 we use longitudinal volumetric MRI data from five regions of interest (ROIs), which are known to be vulnerable to Alzheimer's disease (AD), for prediction. A longitudinal data classification method based on functional data analysis is developed and applied to identify when an early prediction can reasonably be made and to determine whether one ROI is superior to the others in predicting progression to AD. We first recover the longitudinal observations into trajectories using PACE (principal components analysis through conditional expectation; see [56]). PACE is a version of functional principal components (FPC) analysis in which the FPC scores are estimated from conditional expectations. To check the appropriate time period for early prediction, using the idea from [24], we use only part of the observations from the longitudinal data in this step. Logistic regression with the FPC scores as explanatory variables is then applied for classification. We compare the accuracy, sensitivity, and specificity of prediction based on individual ROIs as well as their combinations. For all the models, the results show that the overall prediction accuracy one, two, and three years in advance is above 80%. MCI converter subjects can be correctly detected with a greater-than-chance probability as early as two years prior to conversion.

The data used in Chapter 2 are highly summarized data derived from volumetric MRI, which measures the size of each ROI. Summarized volumetric MRI data is a good way to measure brain atrophy; however, a lot of information is potentially lost because the spatial pattern is ignored. How to do classification incorporating the spatial dependency is of great interest, not only for brain imaging data but also for imaging data from a wide range of sources. In Chapter 3 we use voxel-level MRI data for the classification of AD patients and healthy subjects. These data are characterized by high dimension and spatial dependency. We develop a supervised learning method based on linear discriminant analysis for high dimensional spatially dependent data. To solve the two issues of standard LDA for high dimensional data, we first incorporate the information on spatial dependency among all features using a spatial statistical model. Then a non-singular covariance structure can be derived by maximum likelihood estimation.
Second, penalized maximum likelihood estimation (PMLE) is applied for selecting important features and estimating parameters simultane- ously. To reduce computation load, the tapering technique is applied. Additionally, we develop the theories which show that the method proposed can achieve consistent parameter estimation, consistent feature selection, and asymptotically optimal misclassification rate. Extensive simulation study shows a significant improvement in classification performance under spatial dependence compared to other comparable methods. In the end we apply this method in using brain imaging data for diagnosis of Alzheimer’s disease. 12 Chapter 2 Early prediction of Alzheimer’s disease using longitudinal volumetric MRI data from ADNI 2.1 Introduction The accurate prediction of Alzheimer’s disease (AD) in individuals with Mild Cognitive Im- pairment (MCI) is essential for clinical management and selection of potential interventions. It also plays an important role in improving the efficiency of clinical trials of AD-modifying treatments by enriching the study population with a higher proportion of individuals who will develop the disease during the trial period. Structural MRI has been a primary re- search tool for the development of prognostic markers to aid in disease prediction, taking advantage of brain atrophy, a primary pathologic process in AD due to widespread neuronal death (see, e.g., [28]). Many studies suggest that structural changes in the brain can be detected in the early stages of AD (see, e.g., [2, 22, 29]). Several studies analyzed brain atrophy patterns in different regions of interests (ROIs) and different disease stages (see, e.g., [2, 22, 29, 31, 32, 38]. They found that rates of atrophy are not uniform, but vary in accordance with disease stage and by region. For example, Leung et al. (2010) [31] found a 13 higher rate of hippocampal atrophy in MCI converters (MCI-c) than in MCI nonconverters (MCI-nc). Also, Fennema-Notestine et al. (2009) [22] found that some regions such as the mesial temporal, exhibited a linear rate of atrophy throughout both MCI and AD. Other regions such as lateral temporal, middle gyrus showed accelerated atrophy later in the dis- ease. These results imply that brain atrophy in certain regions can be used to differentiate MCI-c and MCI-nc. Longitudinal data captures atrophy patterns that change with time at different disease stages. Compared to single point brain imaging data, longitudinal data does a better job of describing brain atrophy and likely performs better in predicting conversion from MCI to AD. MRI offers a non-invasive, widely available and cost-effective way for obtaining imaging biomarkers. Image analysis software such as FreeSurfer has been able to provide accurate estimation of regional brain volume, cortical thickness, and even curvature. Regional brain volume has been investigated most as a potential biomarker to monitor atrophy progression and thereby the progression of AD. In this paper we studied longitudinal volumetric data of five regions of interests (ROIs) measured with MRI, including the hippocampus (H), entorhinal cortex (EC), middle temporal cortex (MTC), fusiform gyrus (FG) and whole brain (WB), in order to develop a better potential biomarker strategy for use with structural data. Among the papers which investigate prediction of conversion from MCI to AD using longitudinal MRI data, several stand out as particularly pertinent to our present work. Misra et al. 
(2009) [40] applied high-dimensional pattern classification to longitudinal MRI scans for prediction of conversion within an average period of 15 months. An abnormality score was calculated for classification and achieved an accuracy rate of 81.5%. Zhang et al. (2012) [57] performed predictions on multimodal longitudinal data (i.e., MRI, PET, etc.) using a longitudinal feature selection method. Multi-kernel SVM was used for classification. 14 They achieved an accuracy of 78.4%, sensitivity of 79.0%, and specificity of 78.0% for at least six months ahead of the conversion from MCI to AD. In Li et al. (2012) [34], 262 features were calculated from longitudinal cortical thickness MRI data. They first applied mRMR (minimum redundancy maximum relevance) for feature selection, then an SVM (support vector machine) was trained for classification. Their method can detect 81.7% (AUC = 0.875) of the MCI converters six months ahead of conversion to AD. Lee et al. (2016) [30] used baseline plus 1-year follow-up callosal MRI scans for predicting conversion to AD in the following 2 to 6 years. Logistic regression model with fused lasso regularization was applied. They found the accuracy of prediction was 84% in females and 61% in males. Arco et al. (2016) [3] performed prediction based on baseline and 1-year follow-up MRI data and clinical scores (MMSE and ADAS-Cog). Feature selection based on a two sample t-test and classification based on maximum-uncertainty linear discriminant analysis were applied. They achieved a classification accuracy of 73.95%, with a sensitivity of 72.14% and a specificity of 73.77% for six months before conversion. In Adaszewski et al. (2013) [1], voxel-based longitudinal structure MRI data was used. Weighted based feature selection and SVM were applied for prediction. Both sensitivity and specificity were up to 65% at 1 to 4 years before conversion. Notably they found that conversion could be detected as early as four years before conversion with a sensitivity of 62%. As mentioned in Eskildsen et al. (2013) [18], one factor preventing high predictive power is the heterogeneity of the MRI data due to the variability in underlying disease stage among subjects. Without standardizing the data based on disease stage, it is impossible to establish a specific pattern of atrophy at specific disease stages. Adaszewski et al. (2013) [1] attempted to solve this issue by aligning the data by “time of conversion”. We also aligned the data by “time of conversion” in this paper. This alignment enables us to estimate the pattern of 15 atrophy in accordance with the stages of disease. Moreover, it makes it feasible to compare ahead of how many years we can reasonably predict the conversion. Another source of variability in the longitudinal MRI data is that the subjects have dif- ferent numbers of observations. In the ADNI cohort, some subjects have nine years of data with up to 10 longitudinal observations, but many subjects only have 1 to 3 data points. Due to the sparsity and irregularity of the data, it is not feasible for the traditional longitudinal data analysis method to make predictions and meanwhile, compare the prediction windows. In this study, we applied a technique known as the principal component analysis through conditional expectation (PACE, see [56]) to analyze the realigned longitudinal MRI data. In PACE, the mean curve and covariance structure of the biomarker along with time are obtained based on the pooled data from all individuals. 
In this way, longitudinal observations from each subject can be recovered as a smooth trajectory even if only one or a few observations are available. Finally, once the longitudinal observations are recovered as smooth trajectories, they can be treated as functional data. A functional prediction method from Hall et al. (2012) [24] can then be employed to determine when an early decision can reasonably be made and to identify which ROI is best for prediction, using only part of the trajectory. Moreover, we also examined whether any combination of the ROIs provides a better prediction result.

The remainder of this chapter is organized as follows. Section 2.2 presents the methods: the PACE approach for analyzing longitudinal data and the functional prediction method. Section 2.3 introduces the data used in the analysis. Section 2.4 presents the numerical results and conclusions. The discussion is in Section 2.5. More tables of results can be found in the Appendix.

2.2 Methodology

2.2.1 Functional Principal Component Analysis with longitudinal data

We first recover the longitudinal observations into trajectories using PACE (principal components analysis through conditional expectation; [56]). PACE is a version of functional principal components (FPC) analysis in which the FPC scores are framed as conditional expectations. This method extends the applicability of FPC analysis to longitudinal data analysis, where only a few repeated and irregularly spaced measurements are available for each subject. In PACE, the longitudinal data are modeled as noisy sampled points from a collection of trajectories that are assumed to be independent realizations of a smooth random function X(t), with unknown mean function EX(t) = µ(t) and covariance function cov(X(s), X(t)) = G(s, t), where s, t ∈ T and T is a bounded and closed time interval. The covariance function G of the process has an orthogonal expansion (in the L² sense) in terms of eigenfunctions φ_k and non-increasing eigenvalues λ_k: G(s, t) = Σ_k λ_k φ_k(s) φ_k(t), s, t ∈ T. From classical functional principal component analysis, the ith underlying random curve can be expressed through the Karhunen–Loève expansion as

    X_i(t) = µ(t) + Σ_k ξ_{ik} φ_k(t),  t ∈ T,    (2.2.1)

where the ξ_{ik} are uncorrelated random variables with mean 0 and variance Eξ²_{ik} = λ_k, with Σ_k λ_k < ∞ and λ_1 ≥ λ_2 ≥ · · ·. Let Y_{ij} be the jth observation of the random function X_i(·), observed at a random time T_{ij} ∈ T. Then Y_{ij} can be modeled as

    Y_{ij} = X_i(T_{ij}) + ε_{ij} = µ(T_{ij}) + Σ_{k=1}^{∞} ξ_{ik} φ_k(T_{ij}) + ε_{ij},    (2.2.2)

where the ε_{ij} are additional measurement errors assumed to be independent, identically distributed, and independent of the ξ_{ik}, for i = 1, ..., n and j = 1, ..., N_i, with N_i the number of observations for the ith subject, E(ε_{ij}) = 0, and var(ε_{ij}) = σ². To reflect sparse and irregular designs, the N_i are assumed to be independent and identically distributed random variables, independent of all other random variables.

We now need to estimate the mean function µ(t), the covariance function G(s, t), the eigenfunctions φ_k(t) and eigenvalues λ_k, as well as the functional principal component scores (FPC scores) ξ_{ik}, k = 1, 2, ..., for each subject i = 1, 2, ..., n. Assuming that the mean function, covariance function, and eigenfunctions are smooth, they can be estimated by local linear smoothing.
First, the mean function µ(t) is estimated from the pooled data of all individuals by minimizing

    Σ_{i=1}^{n} Σ_{j=1}^{N_i} κ_1((T_{ij} − t)/h_µ) (Y_{ij} − β_0 − β_1(t − T_{ij}))²

with respect to β_0 and β_1, where κ_1(·) is a kernel function and h_µ is a bandwidth. The estimate of µ(t) is then µ̂(t) = β̂_0(t).

Let G_i(T_{ij}, T_{il}) = (Y_{ij} − µ̂(T_{ij}))(Y_{il} − µ̂(T_{il})) be the "raw" covariances. The local linear surface smoother for G(s, t) is defined by minimizing

    Σ_{i=1}^{n} Σ_{1≤j≠l≤N_i} κ_2((T_{ij} − s)/h_G, (T_{il} − t)/h_G) × (G_i(T_{ij}, T_{il}) − f(β, (s, t), (T_{ij}, T_{il})))²,

where f(β, (s, t), (T_{ij}, T_{il})) = β_0 + β_{11}(s − T_{ij}) + β_{12}(t − T_{il}), κ_2(·,·) is a kernel function and h_G is a bandwidth. Minimization is with respect to β = (β_0, β_{11}, β_{12}). The estimate of G(s, t) is then Ĝ(s, t) = β̂_0(s, t).

The estimates of the eigenfunctions φ̂_k(t) and eigenvalues λ̂_k are the solutions of the eigen-equations

    ∫_T Ĝ(s, t) φ̂_k(s) ds = λ̂_k φ̂_k(t),

where the eigenfunctions can be estimated by discretizing the smoothed covariance.

Traditionally, when the measurements for each subject are densely sampled, the FPC scores ξ_{ik} = ∫ (X_i(t) − µ(t)) φ_k(t) dt are estimated by numerical integration. For longitudinal data, however, the measurement times vary widely across subjects and are sparse, so the FPC scores cannot be well approximated by the usual integration method. Instead, PACE further assumes that the ξ_{ik} and ε_{ij} in (2.2.2) are jointly Gaussian, so the FPC scores for the ith subject can be estimated by the best prediction

    ξ̃_{ik} = E[ξ_{ik} | Y⃗_i] = λ_k φ_{ik}^T Σ_{Y_i}^{−1} (Y⃗_i − µ_i),    (2.2.3)

where Y⃗_i = (Y_{i1}, Y_{i2}, ..., Y_{iN_i})^T, φ_{ik} = (φ_k(T_{i1}), ..., φ_k(T_{iN_i}))^T, µ_i = (µ(T_{i1}), ..., µ(T_{iN_i}))^T, and Σ_{Y_i} = cov(Y⃗_i, Y⃗_i), i.e. the (j, l)th entry of Σ_{Y_i} is (Σ_{Y_i})_{j,l} = G(T_{ij}, T_{il}) + σ² δ_{jl}, with δ_{jl} = 1 if j = l and 0 if j ≠ l. Substituting the estimates of µ_i, λ_k, φ_{ik}, and Σ_{Y_i} into (2.2.3), we obtain the estimated FPC scores

    ξ̂_{ik} = Ê[ξ_{ik} | Y⃗_i] = λ̂_k φ̂_{ik}^T Σ̂_{Y_i}^{−1} (Y⃗_i − µ̂_i),    (2.2.4)

where the (j, l)th element of Σ̂_{Y_i} is Ĝ(T_{ij}, T_{il}) + σ̂² δ_{jl}.

Assume that the infinite-dimensional process in (2.2.1) can be approximated by its projection onto the function space spanned by the first K eigenfunctions. The prediction of the trajectory X_i(t) for the ith subject, using the first K eigenfunctions, is

    X̂_i^K(t) = µ̂(t) + Σ_{k=1}^{K} ξ̂_{ik} φ̂_k(t).    (2.2.5)

According to [56], the number of eigenfunctions K can be selected by cross-validation based on the prediction error or by an AIC criterion. In this study we selected K by setting a threshold: we keep the first K eigenfunctions that together explain more than 95% of the total variation.

In summary, to recover a trajectory for each subject from the longitudinal observations, we first estimate the mean curve µ̂(t) and covariance Ĝ(s, t) from the pooled data of all individuals, from which the eigenvalues λ̂_k and eigenfunctions φ̂_k are estimated. The FPC scores are then estimated from each subject's available observations by conditional expectation, even if only one observation is available. From (2.2.5), the FPC scores ξ̂_{ik} (k = 1, ..., K) characterize each subject i = 1, ..., n and can be used to describe differences between subjects. As a result, we can use the FPC scores for classification or prediction [41]. For details of the PACE methodology, please refer to Yao et al. (2005) [56]. The MATLAB package for PACE is available from http://www.stat.ucdavis.edu/PACE/.
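The conditional-expectation step (2.2.4) is what distinguishes PACE from numerical-integration scoring. The sketch below (our own illustration, not the PACE package) computes the FPC scores of one subject from already-estimated model components evaluated on a time grid; the smoothing steps that produce those components are omitted, and all inputs are assumed given.

```python
import numpy as np

def pace_scores(t_obs, y_obs, mu_hat, phi_hat, lam_hat, sigma2_hat, G_hat):
    """FPC scores for one subject by conditional expectation, as in (2.2.4).

    t_obs      : indices of the subject's observation times on the estimation grid (length N_i)
    y_obs      : the subject's observations (length N_i)
    mu_hat     : estimated mean function on the grid
    phi_hat    : estimated eigenfunctions on the grid, shape (grid, K)
    lam_hat    : estimated eigenvalues, length K
    sigma2_hat : estimated measurement-error variance
    G_hat      : estimated covariance surface on the grid, shape (grid, grid)
    """
    Sigma_Yi = G_hat[np.ix_(t_obs, t_obs)] + sigma2_hat * np.eye(len(t_obs))
    centered = y_obs - mu_hat[t_obs]
    # xi_hat_k = lambda_k * phi_k(T_i)' Sigma_Yi^{-1} (Y_i - mu_i)
    return lam_hat * (phi_hat[t_obs].T @ np.linalg.solve(Sigma_Yi, centered))

# Toy example: a coarse time grid, two eigenfunctions, a subject observed at three grid points.
grid = np.linspace(-9.0, -0.5, 40)
phi = np.column_stack([np.ones_like(grid), np.cos(np.pi * grid / 9.0)])
phi /= np.sqrt(np.trapz(phi ** 2, grid, axis=0))      # roughly orthonormalize on the grid
lam = np.array([2.0, 0.5])
G = (phi * lam) @ phi.T                               # sum_k lambda_k phi_k(s) phi_k(t)
mu = 0.005 - 0.0001 * grid
subj_idx, subj_y = np.array([3, 17, 30]), np.array([0.0052, 0.0047, 0.0044])
print("FPC scores:", pace_scores(subj_idx, subj_y, mu, phi, lam, 1e-4, G))
```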
2.2.2 Early prediction and choosing trajectory

Hall et al. (2012) [24] introduced a methodology for classifying and predicting the future state using functional data. It provides an approach for determining when an early decision can reasonably be made using only part of the trajectory, and it shows how to use the method to choose a better biomarker as a predictor.

Hall et al. (2012) [24] assume there are q types of biomarkers from two classes of the population. Let X^{[k]}_{ji}(t) be the observed data function of the kth biomarker from the ith subject in population Π_j, where j = 1, 2, i = 1, 2, ..., n_j, k = 1, 2, ..., q, t ∈ I. Without loss of generality, assume I = [0, 1]. First, the dimension of the functional data is reduced by discretizing on a grid, i.e. by confining attention to X^{[k]}_{ji}(t_l), where l = 1, 2, ..., p and p denotes the number of grid points. Second, a classifier based on the resulting p-variate data is constructed (linear discriminant analysis and logistic classification were considered in [24]). Then the classifier is applied to each type of biomarker using only a portion of the trajectory, for example on the interval I = [0, t] with t ∈ [0, 1], to predict which class the subject belongs to in the end, i.e. at t = 1. The estimated error rates are denoted by êrr^{[k]}(t). Comparing êrr^{[k]}(t) over a range of values of t of interest (t ∈ [0, 1]) and over k = 1, ..., q gives an idea of when a relatively early prediction can be made and of which biomarker is more reliable. The results are examined both numerically and theoretically by checking the consistency of êrr^{[k]}(t).

In this paper, we use longitudinal data for prediction, as opposed to [24], which used dense functional data. As a result, instead of reducing the dimension by discretizing on a grid, we reduce the dimension by PACE, introduced in Section 2.2.1. There are two main steps in this process. First, we apply PACE to the data to calculate FPC scores for each subject; to check the appropriate time period for early prediction, following the idea of [24], we use only part of the observations from the longitudinal data in this step. Second, we apply the logistic classification method to the resulting FPC scores for prediction. In the numerical study, we compare the one-year, two-year, and three-year early prediction results. We also compare the error rates for different ROIs to choose which one is the most reliable biomarker.

2.2.3 Prediction with logistic classifier

Logistic regression is a special case of the generalized linear model that can be used for classification. In the generalized linear model setting, logistic regression has a response Y with a binomial distribution and a logit link function logit(p) = log[p/(1 − p)], where p is the probability that Y = 1. In the classification framework, the response Y denotes the index of the two groups, say G_1 and G_0, with Y = 1 if the observation comes from class G_1 and Y = 0 if it comes from class G_0. Here we set Y = 1 for MCI-c subjects, who convert from MCI to AD, and Y = 0 for MCI-nc subjects, who stay MCI. Suppose x_{i1}, x_{i2}, ..., x_{iq} are the predictors for subject i. The logistic regression equation is

    logit(p_i) = log \frac{p_i}{1 − p_i} = β_0 + Σ_{k=1}^{q} x_{ik} β_k,  for i = 1, 2, ..., n,

where n is the number of subjects and p_i is the probability that the ith subject belongs to class G_1. The regression coefficients β_k are usually estimated by maximum likelihood estimation (MLE). Then p_i is given by

    p_i = \frac{1}{1 + e^{−X_i β}},    (2.2.6)
If pi is greater than a certain threshold (usually the threshold is set to be 0.5), subject i will be classified into class G1, otherwise, classified into class G0. In this paper, we would use the trajectory of ROIs’ volume as the predictor (we use function X(t), t ∈ [0, T ] to denote the trajectory). As introduced in the previous section, we firstly use PACE to reduce dimension and denote the predicted trajectory for subject i as ˆXi(t) = ˆξi1 ˆφ1(t) + ˆξi2 ˆφ1(t) + ... + ˆξiK ˆφ1(t). Thus we can do logistic regression with FPCSs ˆξi1, ..., ˆξiK as predictors. For example, if two FPC scores are chosen, i.e. K=2, the logistic regression is logit(pi) = β0 + β1 ˆξi1 + β2 ˆξi2, i = 1, 2, ..., n (2.2.7) We can easily include more variables in the logistic regression model. In the numerical study in section 2.4, in addition to FPCSs of certain ROIs, we also included some basic but impor- tant clinical features in the logistic regression model: baseline age, baseline MMSE (Mini- Mental State Examination) and APOE (apolipoprotein E) genotype. The corresponding logistical regression model is: logit(pi) = β0 + β1 ˆξi1 + β2 ˆξi2 + β3Agei + β4AP OEi + β5M M SEi, i = 1, 2, ..., n (2.2.8) 23 2.3 Data Description The longitudinal volume of H, EC, MTC, FG and WB are considered to be potential pre- dictors of conversion from MCI to AD in this paper. Baseline and follow-up volumetric MRI data, if available, were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org/) database along with the corresponding clinical informa- tion. The structural MRI scans were acquired from 1.5T or 3T scanners, manufactured by GE, Siemens or Philips. Regional brain segmentation and volume estimation were carried with FreeSurfer by the UCSF/SF VA Medical Center [25], [49]. From a total of 872 individuals with a baseline diagnosis of MCI who were recruited for ADNI, 66 subjects were excluded due to missing data and 5 subjects were excluded due to reverting from AD to MCI. In the end 801 MCI subjects were included in this study. They were split into two categories. The MCI subjects who converted to AD after some years are labeled as MCI-c (mild cognitive impairment converters; n = 272); the rest of them who did not convert to AD during the follow-up period were labeled as MCI-nc (mild cognitive impairment nonconverters; n = 529). At study entry (baseline), all subjects underwent a comprehensive clinical evaluation, cognitive/functional assessments, and a structural brain MRI scan. Subjects also provided a blood sample for apolipoprotein E (APOE) genotyping and proteomic analysis. The subjects were then followed longitudinally at specific time points (6, 12, 18, 24, 36... months). However, due to the variability of their visits, the data collected is irregular in the number of observations and in time intervals for each subject. Table 2.1 shows the characteristics of the MCI subjects included in this study. 
2.3 Data Description

The longitudinal volumes of H, EC, MTC, FG and WB are considered as potential predictors of conversion from MCI to AD in this paper. Baseline and follow-up volumetric MRI data, where available, were downloaded from the Alzheimer's Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org/) database along with the corresponding clinical information. The structural MRI scans were acquired from 1.5T or 3T scanners manufactured by GE, Siemens or Philips. Regional brain segmentation and volume estimation were carried out with FreeSurfer by the UCSF/SF VA Medical Center [25], [49].

From a total of 872 individuals with a baseline diagnosis of MCI who were recruited for ADNI, 66 subjects were excluded due to missing data and 5 subjects were excluded due to reverting from AD to MCI. In the end, 801 MCI subjects were included in this study. They were split into two categories: the MCI subjects who converted to AD after some years are labeled MCI-c (mild cognitive impairment converters; n = 272), and the rest, who did not convert to AD during the follow-up period, are labeled MCI-nc (mild cognitive impairment nonconverters; n = 529). At study entry (baseline), all subjects underwent a comprehensive clinical evaluation, cognitive/functional assessments, and a structural brain MRI scan. Subjects also provided a blood sample for apolipoprotein E (APOE) genotyping and proteomic analysis. The subjects were then followed longitudinally at specific time points (6, 12, 18, 24, 36, ... months). However, due to the variability of their visits, the collected data are irregular in the number of observations and in the time intervals for each subject.

Table 2.1 shows the characteristics of the MCI subjects included in this study. It shows that, except for gender, the other three features (baseline age, baseline MMSE, and APOE) are significantly different between MCI-c and MCI-nc subjects at a significance level of 0.05, which suggests that these three features might be helpful in prediction.

Table 2.1 Subject characteristics

                      MCI-c          MCI-nc         p-value
  n                   272            529
  Age (mean ± sd)     73.5 ± 7.03    72.3 ± 7.59    0.019
  Gender (F/M)        110/162        220/309        0.813
  MMSE (mean ± sd)    26.9 ± 1.77    28.0 ± 1.70    < 0.001
  APOE (+/−)          182/90         226/303        < 0.001

Key: MCI-c, mild cognitive impairment (converters); MCI-nc, mild cognitive impairment (nonconverters); Age, baseline age; MMSE, baseline Mini-Mental State Examination; APOE, apolipoprotein E genotype.

For all the subjects, the longest observed period is about 9.18 years and the shortest is 0 years (only one data point collected; see Table 2.2). The numbers of observations per subject are also summarized in Table 2.2. The maximum number of observations is 10, the minimum is 1, and the median is about 3-5 in each category.

Table 2.2 Data characteristics

                            n     year    year    year      obsN    obsN    obsN
                                  (max)   (min)   (median)  (max)   (min)   (median)
  All subjects     MCI-c    272   8.15    0       1.65      8       1       3
                   MCI-nc   529   9.18    0       2.07      10      1       3
  All 3y subjects  MCI-c    30    8.10    2.97    3.03      8       4       5
                   MCI-nc   97    8.03    2.52    3.02      10      3       4

Key: MCI-c, mild cognitive impairment (converters); MCI-nc, mild cognitive impairment (nonconverters). "All 3y subjects" are the subjects who have observations at all of the "-3y", "-2y", and "-1y" time points. Column 2 is the number of subjects in each category. Columns 3-5 are the maximum, minimum, and median length of the observed period in years. Columns 6-8 are the maximum, minimum, and median number of observations.

During the data analysis, all regional volumes were normalized by dividing by the intracranial volume (ICV) to correct for individual differences in head size. For example, to normalize the hippocampal volume, we calculate the ratio of the hippocampal volume to the ICV from the same subject at the same time point; the resulting fractions are then used in the analysis. In the spaghetti plot (Figure 2.1), panel (f) shows the longitudinal ICV for each subject. It shows that the ICV for MCI-c and MCI-nc subjects has a similar pattern on average; also, for each subject, the ICV is relatively constant during the observed period, as expected. For the other five normalized regional volumes (panels (a)-(e) in Figure 2.1), on average the MCI-c group has a higher rate of volume atrophy than the MCI-nc group.

Figure 2.1 Longitudinal volume of ROIs for MCI subjects: (a) normalized hippocampus, (b) normalized EC, (c) normalized FG, (d) normalized MTC, (e) normalized WB, (f) ICV (mm³). In (a)-(e) the Y axis is the normalized volume; in (f) it is the ICV (mm³). The X axis is the "disease year". Thin lines are the observations for each subject: blue lines are MCI-c subjects and red lines are MCI-nc subjects. The blue and red thick lines are the pooled mean curves for the MCI-c and MCI-nc groups, respectively.

As shown in the spaghetti plot (Figure 2.1), we first aligned the longitudinal volumetric data of the ROIs by the disease timeline. Specifically, the MCI-c subjects were aligned by "time of conversion", i.e. the time point of conversion is defined as year 0, and year −1 is defined for the time points observed one year prior to conversion.
The MCI-nc subjects were aligned by the last time point of observation, i.e. the time point of the last observation is defined as year 0, the time point observed one year prior to the last observation is defined as year "−1", and so forth. In this way, the data were homogenized by the timeline of disease progression. The goal is to predict whether a subject will convert to AD in the end, i.e. at year 0.

One of the goals of this paper is also to determine a reasonable prediction window, so we need to compare the prediction accuracy across different prediction windows. Here we compare one-year, two-year, and three-year early prediction. To make the three years' prediction results comparable, we pick the subjects who have observations at all three of the most recent years — year "−3", year "−2", and year "−1" — as the testing set (labeled "all 3y" subjects in Tables 2.2 and 2.3). There are 127 "all 3y" subjects in total, of which 30 are MCI-c and 97 are MCI-nc. In Table 2.3, the characteristics of the subjects who have observations at year "−3", "−2", and "−1" are listed in the first three columns, respectively; the last column lists the characteristics of the testing set. All the subjects are considered the training set, and leave-one-out testing is applied to the subjects in the testing set.

Table 2.3 Subject characteristics by year

  MCI-c                -3y            -2y            -1y            all 3y
  n                    71             158            245            30
  Age (mean ± SD)      73.94 ± 7.06   73.61 ± 7.05   73.47 ± 7.20   73.94 ± 5.98
  MMSE (mean ± SD)     27.14 ± 1.82   27.05 ± 1.69   26.94 ± 1.78   27.03 ± 1.63
  Gender (F/M)         28/43          58/100         100/145        10/20
  APOE (+/−)           40/31          103/55         162/83         19/11

  MCI-nc               -3y            -2y            -1y            all 3y
  n                    242            371            473            97
  Age (mean ± SD)      72.09 ± 7.61   72.43 ± 7.36   72.39 ± 7.47   70.22 ± 7.20
  MMSE (mean ± SD)     28.03 ± 1.64   28.08 ± 1.61   27.97 ± 1.70   28.12 ± 1.62
  Gender (F/M)         99/143         154/217        197/276        40/57
  APOE (+/−)           90/152         148/223        203/270        40/57

Key: MCI-c, mild cognitive impairment (converters); MCI-nc, mild cognitive impairment (nonconverters); MMSE, baseline Mini-Mental State Examination; APOE, apolipoprotein E genotype. The "-3y", "-2y", and "-1y" columns give the characteristics of subjects who have observations at year "-3", "-2", and "-1", respectively, after alignment. The "all 3y" column gives the characteristics of the subjects who have observations at all three years: "-3", "-2", and "-1".

2.4 Numerical results

This section presents the numerical results and conclusions. We first take the hippocampus as an example to illustrate the details of the prediction procedure using a single ROI's volume in Section 2.4.1. We then list all prediction results using a single ROI in Section 2.4.2. Finally, we investigate whether any combination of the ROIs improves the prediction performance in Section 2.4.3; there are 26 different combinations of the 5 ROIs, and the prediction results from the 26 combination models are listed in that section.

2.4.1 Prediction using hippocampus

In this section, we take the volume of the hippocampus as an example to describe how to use a single ROI for prediction. We started with the 1-year early prediction.
The first step is the PACE analysis with the partially observed data, calculating the FPC scores for each subject. As outlined in Section 2.3, the longitudinal data were realigned according to disease status (Figure 2.1). We wanted to predict whether or not a subject would convert to AD at year 0. One-year prediction means using all the data observed prior to year "−1" to predict the state at year 0. Since the data points were irregularly collected, we use the data with year t ∈ [−9.18, −0.5) for 1-year prediction (9.18 years is the longest trajectory over all subjects and all ROIs; see Figure 2.1).

Figure 2.2 shows the PACE analysis results for 1-year early prediction using the hippocampus. Panels (a) and (b) show the estimates of the mean curve µ(t) and covariance surface G(s, t) for s, t ∈ [−9.18, −0.5), which were introduced in Section 2.2.1. We selected the number of eigenfunctions used in (2.2.5) by setting the threshold to 95%. Panel (d) shows that the first two principal components explain 98.9% of the variance, so only the first two eigenfunctions and the corresponding FPC scores ξ̂_{i1} and ξ̂_{i2} of each subject i are employed for prediction. Panel (c) shows the estimates of the first two eigenfunctions φ_1(t) and φ_2(t) for t ∈ [−9.18, −0.5).

Figure 2.2 PACE analysis using the hippocampus for 1-year prediction: (a) mean curve µ̂(t) of the hippocampus, (b) covariance surface Ĝ(s, t) of the hippocampus, (c) the first two eigenfunctions, (d) fraction of variance explained by the eigenfunctions. (a)-(c) show the estimates of the mean function µ(t), the covariance surface G(s, t), and the first two eigenfunctions φ_1(t) and φ_2(t); (d) shows that the first two eigenfunctions explain 98.907% of the total variance.

The FPC scores are calculated by (2.2.4) and then used in the logistic regression model (2.2.8). Figure 2.3 shows the scatter plot of the first two FPC scores for all the training subjects.

Figure 2.3 Second versus first FPC scores for the hippocampus (1-year prediction). The triangles indicate MCI-nc and the crosses indicate MCI-c.

Table 2.4 provides sample statistics for the coefficient estimates obtained from the logistic regression of the group indicator ("MCI-c" = 1, "MCI-nc" = 0) on the linear model (2.2.8). It shows that both of the first two FPC scores ξ̂_1 and ξ̂_2 are significant in increasing the odds of being MCI-c.

Table 2.4 Parameter estimation (hippocampus, 1-year prediction)

                 Estimate   Std. Error   z value   p-value
  (Intercept)    6.597      1.779        3.71      2.07e-04
  APOE4          0.8120     0.1802       4.51      6.62e-06
  Age            -0.0225    0.0132       -1.71     8.70e-02
  MMSE           -0.2243    0.0500       -4.48     7.41e-06
  FPC Score 1    5265       1631         3.23      1.25e-03
  FPC Score 2    20364      5776         3.53      4.22e-04

The resulting coefficient estimates from the logistic model (2.2.8) are plugged into (2.2.6) to calculate the conversion probability. Let p_i be the probability of conversion to AD for subject i, and let p_0 be a threshold: if p_i ≥ p_0, subject i is classified as MCI-c; otherwise it is classified as MCI-nc. With p_0 = 0.5, the sensitivity, specificity, and accuracy for 1-year early prediction are 0.70, 0.86, and 0.82. By varying the threshold p_0 from 0 to 1, we can derive a ROC (receiver operating characteristic) curve, with AUC (area under the curve) equal to 0.838.
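The threshold sweep behind the ROC curve and AUC can be written in a few lines. The sketch below is our own illustration with hypothetical held-out probabilities (not the code used for the reported results); it computes sensitivity, specificity, and accuracy at p_0 = 0.5 and approximates the AUC by the trapezoidal rule.

```python
import numpy as np

def confusion_rates(y_true, p_hat, p0=0.5):
    """Sensitivity, specificity, accuracy at threshold p0 (y_true: 1 = MCI-c, 0 = MCI-nc)."""
    pred = (p_hat >= p0).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)

def roc_auc(y_true, p_hat, n_grid=101):
    """Sweep the threshold p0 over [0, 1] and return sensitivities, specificities, and AUC."""
    thresholds = np.linspace(0.0, 1.0, n_grid)
    sens = np.array([confusion_rates(y_true, p_hat, t)[0] for t in thresholds])
    spec = np.array([confusion_rates(y_true, p_hat, t)[1] for t in thresholds])
    fpr = 1.0 - spec
    order = np.argsort(fpr)
    return sens, spec, np.trapz(sens[order], fpr[order])   # AUC by the trapezoidal rule

# Hypothetical held-out labels and fitted probabilities, for illustration only.
rng = np.random.default_rng(3)
y_test = rng.binomial(1, 0.25, size=127)
p_test = np.clip(0.25 + 0.35 * y_test + rng.normal(0, 0.2, size=127), 0.0, 1.0)
sens, spec, auc = roc_auc(y_test, p_test)
print("sensitivity/specificity/accuracy at p0=0.5:", confusion_rates(y_test, p_test))
print("AUC:", round(auc, 3))
```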
We then apply the same procedure to 2-year and 3-year early prediction with the hippocampus volume. With the threshold p0 set to 0.5, the corresponding sensitivity, specificity, and accuracy are summarized in Table 2.5. Varying the threshold p0 from 0 to 1 gives the ROC curves shown in Figure 2.4. From the ROC curves, shorter prediction windows perform better, i.e., 1-year early prediction is the best and 3-year early prediction is the worst. This is expected, and the procedure provides a risk evaluation in early prediction.

Table 2.5 Prediction accuracy rates (hippocampus)

               -1 y    -2 y    -3 y
sensitivity    0.7     0.53    0.23
specificity    0.86    0.89    0.98
accuracy       0.82    0.80    0.80

Key: The probability threshold p0 is set to 0.5, i.e. if p ≥ 0.5 the subject is classified as MCI-c, otherwise as MCI-nc.

Figure 2.4 ROC curves for prediction using the hippocampus. The solid, dotted and dot-dash lines are the ROC curves for 1-year (AUC = 0.838), 2-year (AUC = 0.83) and 3-year (AUC = 0.813) prediction, respectively.

2.4.2 Prediction results using single ROI

In section 2.4.1, we took the hippocampus as an example to illustrate the procedure of 1-year, 2-year and 3-year early prediction using a single ROI. The results shown in section 2.4.1 were based on one training data set. In this section, we apply the same procedure to all five ROIs: H, WB, EC, FG and MTC. To check the robustness of the prediction performance, we repeated the procedure on 100 samples of the training data. That is, we randomly sampled 2/3 of the training data (2/3 of the MCI-c and 2/3 of the MCI-nc subjects) 100 times and repeated the prediction procedure of section 2.4.1 on each sample. The mean and standard deviation of the sensitivity, specificity, and accuracy were calculated with the classification probability threshold p0 set to 0.5. When p0 was varied from 0 to 1, the mean and standard deviation of the AUCs from the ROC curves were also calculated. The results are shown in Table 2.6.

From Table 2.6, all five ROIs have prediction accuracy above 75% and AUC around 0.8 for 1-year, 2-year, and 3-year prediction. Both H and EC can identify MCI-c correctly with greater-than-chance probability as early as two years in advance. EC is the best ROI, with prediction accuracy above 80% and AUC above 0.8 for all three prediction windows; moreover, its 2-year sensitivity is 62% and its 1-year sensitivity reaches 70%.

2.4.3 Prediction results using combinations of ROIs

We then check the prediction performance of combinations of ROIs. Combination prediction is realized by plugging the FPC scores of several ROIs into the logistic regression model. For example, to predict using the combination of H and EC, we first calculate the FPC scores of H (ξ̂i1, ξ̂i2) as well as the FPC scores of EC (η̂i1, η̂i2) for each subject i (suppose the number of eigenfunctions selected for both H and EC is 2). The logistic regression model for prediction using the combination of H and EC is then

logit(pi) = α + β1 ξ̂i1 + β2 ξ̂i2 + β3 η̂i1 + β4 η̂i2 + β5 Agei + β6 APOEi + β7 MMSEi,  for i = 1, 2, ..., n.
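The combination model above can be fit with standard logistic regression software. The sketch below (Python, statsmodels) assembles the design matrix from the FPC scores of two ROIs plus Age, APOE and MMSE; every array is a synthetic placeholder rather than the thesis data, and the fit is only meant to show the structure of the combined model.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative sketch of the combination model (H + EC); placeholder data, not ADNI.
n = 200
rng = np.random.default_rng(0)
xi = rng.normal(size=(n, 2))       # first two FPC scores of the hippocampus volume
eta = rng.normal(size=(n, 2))      # first two FPC scores of the entorhinal cortex volume
age = rng.normal(73, 7, size=n)
apoe = rng.integers(0, 2, size=n)  # APOE4 carrier indicator
mmse = rng.normal(27, 2, size=n)
y = rng.integers(0, 2, size=n)     # 1 = MCI-c, 0 = MCI-nc (synthetic labels here)

# Design matrix: intercept + FPC scores of both ROIs + clinical covariates,
# matching logit(p_i) = alpha + beta1*xi_i1 + ... + beta7*MMSE_i.
X = sm.add_constant(np.column_stack([xi, eta, age, apoe, mmse]))
fit = sm.Logit(y, X).fit(disp=0)
p_hat = fit.predict(X)             # conversion probabilities, thresholded at p0
```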
Table 2.6 Prediction performance using single ROI (Mean ± SD)

ROI    Window    sensitivity    specificity    accuracy      AUC
H      1 y       0.64±0.044     0.85±0.013     0.8±0.013     0.84±0.003
       2 y       0.51±0.029     0.89±0.006     0.8±0.007     0.83±0.004
       3 y       0.27±0.054     0.96±0.018     0.8±0.011     0.81±0.006
WB     1 y       0.54±0.039     0.86±0.015     0.79±0.012    0.8±0.004
       2 y       0.37±0.051     0.88±0.013     0.76±0.01     0.8±0.004
       3 y       0.1±0.027      0.95±0.015     0.75±0.01     0.78±0.005
EC     1 y       0.7±0.011      0.85±0.016     0.82±0.012    0.84±0.003
       2 y       0.62±0.03      0.91±0.012     0.84±0.009    0.84±0.004
       3 y       0.32±0.037     0.95±0.01      0.8±0.009     0.81±0.004
FG     1 y       0.53±0.03      0.88±0.006     0.8±0.006     0.82±0.003
       2 y       0.46±0.026     0.89±0.008     0.79±0.006    0.81±0.003
       3 y       0.23±0.039     0.95±0.015     0.78±0.01     0.79±0.003
MTC    1 y       0.57±0.043     0.88±0.015     0.8±0.011     0.84±0.005
       2 y       0.42±0.027     0.9±0.011      0.79±0.01     0.83±0.006
       3 y       0.24±0.038     0.94±0.012     0.78±0.01     0.81±0.007

Key: H stands for hippocampus, WB for whole brain, EC for entorhinal cortex, FG for fusiform gyrus and MTC for middle temporal cortex. Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of correct predictions. p0 is set to 0.5. AUC is the area under the ROC (receiver operating characteristic) curve. The means and standard deviations of sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets.

There are 26 different combinations in total. Table 2.7 shows the results for combinations of two ROIs, Table 2.8 the results for combinations of three ROIs, and Table 2.9 the results for combinations of more than three ROIs. Please see Tables 2.7, 2.8 and 2.9 in the Appendix.

2.4.4 Conclusion

A graphical comparison of the prediction performance of the 31 different models (5 single-ROI models and 26 combination models) is given in Figure 2.5. From Tables 2.6, 2.7, 2.8 and 2.9 and Figure 2.5, we have the following observations.

First, the longitudinal volumes of the listed brain ROIs measured by MRI have prediction power as early as three years in advance. In 1-year, 2-year, and 3-year prediction for all the models, the overall specificity is above 80%, the accuracy is above 70%, and the AUC is above 0.8. The sensitivity varies from 10% to 70% across prediction windows and models.

Second, short-term prediction performs better, i.e. 1-year early prediction performs best compared to 2-year and 3-year prediction. This is consistent with intuition, because more information is included and less noise is introduced in the short-term prediction procedure. But the error curves over time are clearly helpful for early interventions.

To compare the models using different combinations of ROIs, since the overall specificity, accuracy, and AUC are similar among models, we mainly focus on the sensitivity. Sensitivity is the probability that MCI-c subjects are correctly identified, which is crucial in prognosis and clinical trials. Overall, most of the models achieve a sensitivity greater than 50% for 1-year and 2-year prediction. However, the maximum sensitivity for 3-year-in-advance prediction is only about 35%, which implies it is not easy to correctly identify MCI-c subjects three years in advance using the information in our model.

Among the five single-ROI models, EC performs the best, followed by H, and WB is the worst. All five ROIs have sensitivity higher than 50% for 1-year prediction.
Both H and EC have a sensitivity greater than 50% in 2-year early prediction (Table 2.6). The 3-year prediction sensitivity is about 10% to 30% for all the ROIs.

For the models using combinations of two or more ROIs (see Table 2.7), except for the combinations WB+FG and WB+MTC, all the other combinations have a sensitivity higher than chance for both 1-year and 2-year prediction. The overall sensitivity for 3-year prediction is about 20%-30% for combinations of two ROIs, around 30% for combinations of three ROIs, and around 35% for combinations of more than three ROIs (see Table 2.7 and Table 2.8).

In terms of overall prediction performance across prediction window, sensitivity, specificity, accuracy, and AUC, the models EC (see Table 2.6), H+EC+FG (see Table 2.8), and H+WB+EC+FG and H+WB+EC+FG+MTC (see Table 2.9) are the best. They all have the highest level of prediction sensitivity, specificity, accuracy, and AUC. The highest 1-year sensitivity is about 70%, with specificity about 86%, accuracy about 82%, and AUC about 0.85. For 2-year prediction they also have the highest rates, with sensitivity about 62%-65%, specificity about 90%, accuracy about 84%, and AUC about 0.85. For 3-year early prediction, they all have sensitivity above 32%, specificity above 90%, accuracy above 80%, and AUC above 0.81. Among these four best models, the combination models work slightly better than the single-ROI model, because the three combination models have a higher AUC than the EC model; however, they are not significantly superior to EC. On the other hand, the combinations without EC and H, which are the first and second best single biomarkers, perform relatively worse than the other combinations (WB+FG, WB+MTC, FG+MTC, WB+FG+MTC; see Table 2.7 and Table 2.8). This observation implies that the prediction power of the combination models comes from the prediction power of EC and H.

2.5 Discussion

In this study we use sparsely observed volumes of ROIs quantified by longitudinal volumetric MRI to predict whether or not an MCI subject will convert to AD in a specified time window. A longitudinal data prediction method based on functional data analysis is developed for early prediction in varying time windows, as well as for identifying the most important ROI(s) in the process. The application of existing methods is not obvious due to the complexity of the data structure. To apply the prediction method based on functional data analysis, we first use PACE to extract statistically validated information from the longitudinal data; FPC scores are then calculated and used for prediction. This method is new for analyzing ADNI data. Our best prediction model (1-year prediction using H+WB+EC+FG+MTC, with AUC = 0.86, accuracy = 82%, specificity = 86%, and sensitivity = 69%) performs favorably compared to the existing literature for prediction using the same longitudinal data, even though the comparison is not entirely fair since those studies used different biomarkers and different prediction windows. In fact, our model used fewer predictors (longitudinal ROI volumes and three clinical features) and has a longer prediction window (one year, two years, and three years), which suggests that our model made efficient use of less information and derived comparable prediction results.
Since the primary goal of this paper is to introduce a statistically validated, advanced methodology for prediction using longitudinal data, we do not intend to elaborate the results by including more predictors. However, one could continue experimenting with other available biomarkers using the techniques adopted here.

Figure 2.5 Prediction sensitivity (a), specificity (b), accuracy (c), and AUC (d) using single ROIs and combinations of ROIs. In each panel, each bar represents a prediction result using a different combination of biomarkers (from left to right): hippocampus (H), whole brain (WB), entorhinal cortex (EC), fusiform gyrus (FG), middle temporal cortex (MTC), H+WB, H+EC, H+FG, H+MTC, WB+EC, WB+FG, WB+MTC, EC+FG, EC+MTC, FG+MTC, H+WB+EC, H+WB+FG, H+WB+MTC, H+EC+FG, H+EC+MTC, H+FG+MTC, WB+EC+FG, WB+EC+MTC, WB+FG+MTC, EC+FG+MTC, H+WB+EC+FG, H+WB+EC+MTC, H+WB+FG+MTC, H+EC+FG+MTC, WB+EC+FG+MTC, and the combination of all five ROIs.

Besides, due to the irregularity and sparsity of the longitudinal data, most other studies make predictions in only one prediction window. Our proposed method makes it feasible to predict over more flexible prediction windows, even if only one observation is available. This enables us to compare the prediction performance across prediction windows and helps in understanding the disease risk over time.

Moreover, most of the methods employed in the literature are machine learning techniques, which are not easily accessible for statistical validation and interpretation. Our model is based on two well-established statistical methods, which makes it easy to validate and interpret.

One limitation of the proposed model is the assumption required by the PACE method: we need to assume that the measurement errors εij and the FPC scores ξij are jointly Gaussian. This assumption is not always valid; however, existing simulation studies indicate that the method is robust to some violations of the Gaussian assumption (see [56]). Also, the prediction sensitivity is poor for all combinations of the ROIs, especially for 3-year prediction. This reveals a limitation of using ROI volumes for prediction; more advanced classification techniques will be explored to improve the prediction sensitivity in future work.

In summary, the proposed procedure provides a method to predict conversion from MCI to AD using longitudinal data. The longitudinal prediction curve can be utilized for understanding disease risk over time. The procedure can be easily applied to other longitudinal data sets for prediction in clinical trials, to decide a better prediction window and to choose better biomarker(s).
39 APPENDIX 40 Table 2.7 Prediction performance of combinations of two ROIs H+WB (Mean ± SD) H+EC (Mean ± SD) H+FG (Mean ± SD) H+MTC (Mean ± SD) WB+EC (Mean ± SD) WB+FG (Mean ± SD) WB+MTC (Mean ± SD) EC+FG (Mean ± SD) EC+MTC (Mean ± SD) FG+MTC (Mean ± SD) sensitivity 1 y 0.63±0.044 0.5±0.037 2 y 3 y 0.22±0.065 1 y 0.66±0.031 2 y 0.58±0.044 3 y 0.33±0.029 1 y 0.67±0.032 2 y 0.57±0.035 3 y 0.27±0.049 1 y 0.64±0.039 2 y 0.53±0.033 0.3±0.052 3 y 1 y 0.66±0.025 2 y 0.62±0.022 3 y 0.29±0.026 1 y 0.55±0.052 2 y 0.46±0.026 3 y 0.22±0.048 0.6±0.054 1 y 2 y 0.48±0.047 3 y 0.24±0.043 1 y 0.63±0.032 0.6±0.011 2 y 3 y 0.34±0.038 1 y 0.64±0.016 0.6±0.029 2 y 0.36±0.03 3 y 0.6±0.02 1 y 2 y 0.53±0.032 3 y 0.28±0.055 specificity 0.85±0.012 0.88±0.008 0.94±0.016 0.84±0.011 0.89±0.013 0.94±0.012 0.88±0.016 0.9±0.01 0.95±0.009 0.85±0.014 0.88±0.016 0.94±0.011 0.83±0.015 0.89±0.014 0.94±0.015 0.88±0.007 0.89±0.006 0.93±0.012 0.87±0.014 0.88±0.016 0.94±0.011 0.85±0.014 0.89±0.01 0.95±0.009 0.86±0.012 0.9±0.012 0.94±0.009 0.89±0.009 0.9±0.01 0.94±0.01 accuracy 0.8±0.01 0.79±0.011 0.77±0.013 0.8±0.01 0.82±0.015 0.8±0.009 0.83±0.01 0.82±0.01 0.79±0.01 0.8±0.011 0.79±0.012 0.79±0.013 0.79±0.012 0.82±0.012 0.78±0.009 0.8±0.012 0.79±0.006 0.76±0.011 0.8±0.012 0.79±0.014 0.78±0.011 0.8±0.01 0.82±0.007 0.81±0.009 0.81±0.008 0.83±0.007 0.81±0.008 0.82±0.006 0.81±0.009 0.78±0.012 AUC 0.84±0.004 0.83±0.006 0.81±0.008 0.85±0.003 0.84±0.003 0.82±0.005 0.85±0.003 0.84±0.004 0.82±0.005 0.85±0.004 0.84±0.005 0.83±0.006 0.85±0.004 0.85±0.004 0.82±0.005 0.83±0.006 0.83±0.007 0.8±0.007 0.84±0.005 0.83±0.007 0.81±0.005 0.85±0.003 0.84±0.003 0.82±0.003 0.86±0.003 0.85±0.004 0.83±0.005 0.85±0.004 0.83±0.005 0.82±0.005 Key: H stands for hippocampus, WB stands for whole brain, EC stands for entorhinal cortex, FG stands for fusiform gyrus and MTC for middle temporal cortex. Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of true prediction. p0 is set to be 0.5. AUC is the area under the curve from the ROC (receiver operating characteristic) curve. The mean and standard deviation of the sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets. 
41 Table 2.8 Prediction performance of combinations of three ROIs H+WB+EC (Mean ± SD) H+EC+FG (Mean ± SD) H+EC+MTC (Mean ± SD) sensitivity 1 y 0.66±0.029 2 y 0.61±0.031 0.3±0.03 3 y 0.68±0.03 1 y H+WB+FG 2 y 0.57±0.055 (Mean ± SD) 3 y 0.27±0.052 H+WB+MTC 1 y 0.64±0.048 2 y 0.55±0.045 (Mean ± SD) 0.29±0.05 3 y 0.7±0.013 1 y 0.63±0.03 2 y 3 y 0.33±0.037 1 y 0.66±0.018 2 y 0.59±0.041 0.35±0.03 3 y 1 y 0.67±0.036 H+FG+MTC (Mean ± SD) 2 y 0.61±0.041 0.29±0.05 3 y WB+EC+FG 1 y 0.69±0.023 2 y 0.62±0.032 (Mean ± SD) 3 y 0.34±0.047 WB+EC+MTC 1 y 0.67±0.032 2 y 0.61±0.029 (Mean ± SD) 3 y 0.35±0.034 WB+FG+MTC 1 y 0.66±0.047 2 y 0.54±0.034 (Mean ± SD) 3 y 0.29±0.055 EC+FG+MTC 1 y 0.65±0.029 2 y 0.62±0.017 (Mean ± SD) 3 y 0.34±0.032 specificity 0.84±0.014 0.88±0.011 0.93±0.015 0.87±0.013 0.9±0.011 0.94±0.01 0.84±0.013 0.87±0.013 0.94±0.013 0.86±0.017 0.9±0.012 0.95±0.009 0.86±0.012 0.89±0.012 0.94±0.01 0.86±0.016 0.9±0.011 0.94±0.008 0.86±0.017 0.89±0.009 0.94±0.009 0.85±0.011 0.89±0.012 0.94±0.011 0.88±0.011 0.9±0.013 0.93±0.008 0.87±0.016 0.9±0.011 0.94±0.008 accuracy 0.79±0.011 0.82±0.012 0.78±0.012 0.82±0.009 0.82±0.016 0.78±0.012 0.79±0.012 0.79±0.011 0.78±0.014 0.82±0.013 0.84±0.012 0.8±0.01 0.81±0.009 0.82±0.009 0.8±0.008 0.82±0.01 0.83±0.011 0.79±0.012 0.82±0.014 0.83±0.009 0.8±0.012 0.81±0.01 0.82±0.01 0.8±0.011 0.83±0.011 0.81±0.013 0.78±0.013 0.82±0.01 0.83±0.009 0.8±0.01 AUC 0.85±0.003 0.84±0.005 0.81±0.006 0.85±0.005 0.84±0.007 0.82±0.008 0.85±0.005 0.84±0.006 0.82±0.006 0.86±0.003 0.85±0.004 0.83±0.004 0.86±0.004 0.85±0.003 0.82±0.006 0.86±0.003 0.84±0.005 0.83±0.005 0.85±0.004 0.85±0.006 0.82±0.006 0.86±0.004 0.85±0.005 0.82±0.006 0.85±0.006 0.84±0.009 0.82±0.007 0.86±0.003 0.85±0.004 0.83±0.004 Key: H stands for hippocampus, WB stands for whole brain, EC stands for entorhinal cortex, FG stands for fusiform gyrus and MTC for middle temporal cortex. Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of true prediction. p0 is set to be 0.5. AUC is the area under the curve from the ROC (receiver operating characteristic) curve. The mean and standard deviation of the sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets. 42 Table 2.9 Prediction performance of combinations of four or more ROIs H+WB+FG+MTC (Mean ± SD) H+WB+EC+MTC (Mean ± SD) H+WB+EC+FG (Mean ± SD) sensitivity 0.7±0.011 1 y 2 y 0.64±0.038 3 y 0.33±0.039 1 y 0.68±0.029 2 y 0.61±0.036 3 y 0.34±0.036 0.68±0.04 1 y 2 y 0.63±0.046 3 y 0.31±0.054 1 y 0.68±0.022 2 y 0.64±0.023 3 y 0.33±0.034 1 y 0.68±0.031 2 y 0.64±0.028 3 y 0.35±0.038 H+WB+EC+FG+MTC 1 y 0.69±0.022 2 y 0.65±0.033 3 y 0.35±0.038 (Mean ± SD) H+EC+FG+MTC (Mean ± SD) WB+EC+FG+MTC (Mean ± SD) specificity 0.86±0.015 0.9±0.012 0.95±0.01 0.85±0.012 0.88±0.013 0.94±0.011 0.86±0.017 0.9±0.012 0.94±0.008 0.87±0.014 0.9±0.012 0.94±0.008 0.86±0.013 0.9±0.01 0.94±0.009 0.86±0.012 0.9±0.012 0.94±0.008 accuracy 0.82±0.012 0.84±0.013 0.8±0.011 0.81±0.009 0.82±0.011 0.8±0.012 0.82±0.012 0.83±0.014 0.79±0.013 0.82±0.011 0.84±0.01 0.8±0.009 0.82±0.011 0.84±0.01 0.8±0.011 0.82±0.01 0.84±0.012 0.8±0.011 AUC 0.85±0.005 0.85±0.007 0.82±0.007 0.86±0.004 0.85±0.005 0.82±0.007 0.86±0.005 0.85±0.008 0.82±0.007 0.86±0.003 0.85±0.004 0.83±0.005 0.86±0.004 0.85±0.006 0.82±0.007 0.86±0.005 0.85±0.007 0.82±0.008 Key: H stands for hippocampus, WB stands for whole brain, EC stands for entorhinal cortex, FG stands for fusiform gyrus and MTC for middle temporal cortex. 
Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of correct predictions. p0 is set to 0.5. AUC is the area under the ROC (receiver operating characteristic) curve. The means and standard deviations of the sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets.

Chapter 3

High dimensional discriminant analysis for spatially dependent data and its application in neuroimaging data

3.1 Introduction

Spatial statistics is developing rapidly for analyzing data featuring spatial structure. Spatial statistics is applied in a broad range of disciplines such as agriculture, geology, soil science, oceanography, forestry, meteorology and climatology, as well as to imaging data. With the development of computer technology for image processing, spatial statistics has played an important role in image processing and pattern recognition. Cressie (1992) [17] introduced detailed spatial methodologies for the analysis of images. Motivated by the problem of how to use brain imaging data for diagnosis and classification, we focus on discriminant analysis of data featuring spatial correlation.

Another important characteristic of imaging data is its high dimensionality. For example, in MRI scans signals are collected over a 3D space, and the number of voxels can reach millions while the number of images is usually only in the hundreds. If we use the signals from all voxels as features for classification, the number of features far exceeds the sample size, so it is a high dimensional problem.

LDA is an asymptotically optimal classifier under the traditional large-sample scenario, that is, when the dimension of the variables (p) is fixed and the sample size (n) tends to infinity. However, it is not applicable to high dimensional data: Bickel and Levina (2004) [5] demonstrated that LDA performs asymptotically no better than random guessing if p/n → ∞. As mentioned in section 1.3, there are two main issues in high dimensional classification using LDA. The first issue is that when p > n, the sample covariance Σ̂ is singular. To address this, the independence rule (IR) ignores the correlations among features and replaces Σ̂ by its diagonal, which is applicable for any high dimension. Bickel and Levina (2004) [5] also show theoretically that IR leads to a better classification result than naive LDA, in which the Moore-Penrose inverse is used in place of Σ̂⁻¹. Another similar approach is the nearest shrunken centroid classifier [48]. Fan and Fan (2008) [19] propose the feature annealed independence rule (FAIR), which performs feature selection by t-tests in addition to IR. However, by ignoring the covariance structure of the features, these methods cannot asymptotically achieve the optimal classification rate.

Many other high dimensional LDA methods have been proposed in which the covariance structure in Σ is incorporated. Again, the first challenge for high dimensional LDA is the singularity of Σ̂. Many methods have been proposed for covariance matrix or precision matrix estimation in the high dimensional scenario (see, e.g., [6, 7, 8, 10, 11, 12, 44, 45, 47, 53]), and the covariance matrix estimated by these methods can be used directly in LDA. However, an accurate estimate of Σ does not necessarily lead to better classification. Fan and Fan (2008) [19] and Shao et al.
(2011) [46] show that even when the true covariance matrix is known, the classification can be no better than random guessing because of the noise accumulated from estimating the means. This leads to the second challenge for high dimensional LDA: the classification performance is poor due to the noise brought in by the estimation of many non-informative features. To address this issue, Shao et al. [46] assume sparseness in both ∆, which measures the difference of the two class means, and the covariance matrix Σ; Σ and ∆ are then estimated by hard thresholding for classification. This method was later extended to quadratic discriminant analysis in [33]. One issue with this method is that the sparseness assumption on the covariance matrix is too restrictive. Witten and Tibshirani (2011) [51] proposed penalized LDA by applying penalties on the discriminant vectors. Cai and Liu (2011) [9], Mai and Zhou (2012) [36] and Fan et al. (2012) [20] assume sparsity of the discriminant direction βBayes = Σ⁻¹(µ2 − µ1). These methods borrow the idea of penalization in regression to regularize the estimated discriminant direction βBayes directly, avoiding estimation of Σ⁻¹; the same idea is still used in [13]. Although the discriminant direction plays a direct role in the discriminant function, selecting the informative features is easier to interpret. Xu et al. (2014) [55] proposed a covariance-enhanced method to select informative features for linear discriminant analysis; however, that method does not directly enforce the selection of informative features.

In this research, we take advantage of the spatial dependence among features and assume the features are equipped with a spatial covariance structure. We then develop a penalized maximum likelihood estimation procedure to simultaneously estimate the covariance parameters and the means and to select important features for discriminant analysis.

For a spatial domain of interest D in Rᵈ, we consider two classes of spatial processes {yk(s) : s ∈ D}, k = 1, 2, such that

yk(s) = µk(s) + ε(s),   (3.1.1)

where µk(s) is the mean effect function and ε(s) is the corresponding random noise. Assume that the error process {ε(s) : s ∈ D} is a Gaussian process with mean zero and covariance function

γ(s, s′; θ) = cov(ε(s), ε(s′)),   (3.1.2)

where s, s′ ∈ D and θ is a q × 1 vector of covariance function parameters. We assume the spatial domain is expanding as the number of sampling sites on the domain increases:

A 1. Assume the sample set D ⊂ Rᵈ (d ≥ 1) is predetermined and non-random, with the restriction ‖si − sj‖2 ≥ ϵ > 0 for all pairs si, sj ∈ D, i, j = 1, 2, ..., p, to ensure that the sampling domain increases in extent as p increases.

Assume that for any sample of the spatial processes there are observations at p discrete sites s1, ..., sp ∈ D. Suppose yki(s) (i = 1, 2, ..., nk) is from class Ck (k = 1, 2). Let Ykij = yki(sj) be the observation at the jth site for the ith sample of the spatial process yk(s), where k = 1, 2, i = 1, 2, ..., nk, j = 1, 2, ..., p. Then the jth observation for sample i can be represented by

Ykij = µkj + εkij,   (3.1.3)

where µkj = µk(sj) is the mean effect at the jth location in class Ck and εkij = εki(sj) is the corresponding Gaussian random noise for the ith sample at the jth location.
In matrix notation, the above model can be written as

Yki = µk + εki,   (3.1.4)

where Yki = (Yki1, ..., Ykip)ᵀ, µk = (µk1, ..., µkp)ᵀ is the mean vector of class Ck, and εki = (εki1, ..., εkip)ᵀ has a multivariate normal distribution N(0, Σ). Since ε(s) has covariance function (3.1.2), the covariance matrix Σ can be represented by Σ(θ) = [γ(si, sj; θ)], i, j = 1, ..., p, i.e. γ(si, sj; θ) is its (i, j)th entry. By (3.1.4), we have

Yki ∼ N(µk, Σ(θ)).   (3.1.5)

Let θ0 be the true parameter in (3.1.2); if θ = θ0, we write Σ(θ) as Σ for simplicity. We need some assumptions on the covariance function γ(si, sj; θ):

A 2. (i) Let Ξ be the parameter space for θ. Assume the covariance function γ(s, t; θ) is stationary, isotropic, and twice differentiable with respect to θ for all θ ∈ Ξ and s, t ∈ D. (ii) γ(s, t; θ) is positive-definite in the sense that for every finite subset {s1, s2, ..., sp} of D the covariance matrix Σ = [γ(si, sj; θ)] is positive-definite.

Under the stationarity and isotropy assumption, we can write Σ(θ) = [γ(hij; θ)], i, j = 1, ..., p, where hij = ‖si − sj‖2 is the distance between sites si and sj.

Let Y = (Y11ᵀ, ..., Y1n1ᵀ, Y21ᵀ, ..., Y2n2ᵀ)ᵀ. By (3.1.5), Y has a multivariate normal distribution, and the log-likelihood function for µ1, µ2 and θ is

L(θ, µ1, µ2; Y) = −(p(n1 + n2)/2) log(2π) − ((n1 + n2)/2) log|Σ(θ)| − (1/2) Σ_{k=1}^{2} Σ_{i=1}^{nk} (Yki − µk)ᵀ Σ(θ)⁻¹ (Yki − µk).   (3.1.6)

µ1, µ2 and θ can be estimated by maximum likelihood estimation (MLE) even in high dimensional settings, and under the spatial statistical model the resulting Σ(θ̂) is a positive definite matrix. This resolves the first challenge in high dimensional classification. We denote the resulting estimates by µ̂1MLE, µ̂2MLE and θ̂MLE; they can be plugged into (1.3.1) to obtain the MLE-LDA classifier δ̂MLE:

δ̂MLE(X) = (X − (µ̂1MLE + µ̂2MLE)/2)ᵀ Σ⁻¹(θ̂MLE) (µ̂1MLE − µ̂2MLE).

In section 3.2, we investigate the properties of the MLE-LDA classifier. We first derive the parameter estimation consistency of the MLE for p/n → 0 and for p/n → C > 0. We then show that MLE-LDA is asymptotically optimal if p/n → 0 under some regularity conditions. However, when p/n → C > 0, MLE-LDA can be no better than random guessing even if the true covariance is known, unless the signals are very strong. This indicates that feature selection is still necessary in the high dimensional case even when the covariance parameter can be estimated consistently. In section 3.3, we propose to estimate the parameters by penalized MLE (PMLE), applying a penalty on ∆ = µ2 − µ1, which measures the difference of the two class means. We assume sparseness of ∆, which means that only a fraction of the p features are important for classification. We then derive the parameter estimation consistency and feature selection consistency of the PMLE. Finally, a classifier (the PLDA classifier) is developed using the PMLE, and we show that it is asymptotically optimal even if p/n → C > 0. A simulation study and real data analysis are conducted in sections 3.4 and 3.5. All proofs and remarks about the assumptions are given in section 3.6.
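As a concrete illustration of the MLE-LDA rule just defined, the following Python sketch evaluates δ̂MLE(x) from given mean and covariance estimates. It is a minimal sketch assuming those estimates are already available; the linear-algebra shortcut (solving instead of inverting) is a generic implementation choice, not the thesis code.

```python
import numpy as np

def mle_lda_score(x, mu1_hat, mu2_hat, Sigma_hat):
    """MLE-LDA discriminant: (x - (mu1+mu2)/2)^T Sigma^{-1} (mu1 - mu2).
    Classify x into class 1 if the score is positive, class 2 otherwise."""
    diff = mu1_hat - mu2_hat
    centered = x - 0.5 * (mu1_hat + mu2_hat)
    a = np.linalg.solve(Sigma_hat, diff)   # Sigma_hat^{-1} (mu1_hat - mu2_hat)
    return centered @ a

# Usage (illustrative): with class sample means as the MLE of mu1, mu2 and a fitted
# spatial covariance Sigma_hat = Sigma(theta_hat),
#   score = mle_lda_score(x_new, Y1.mean(axis=0), Y2.mean(axis=0), Sigma_hat)
#   label = 1 if score > 0 else 2
```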
3.2 Classification using maximum likelihood estimation (MLE-LDA)

In this section, we investigate the consistency of the parameter estimation in (3.1.6) as well as the classification performance of δ̂MLE. We first introduce some notation. Let µ1 = (µ11, µ12, ..., µ1p), µ2 = (µ21, µ22, ..., µ2p) and θ0 = (θ01, θ02, ..., θ0q) be the true parameters. Let Σk(θ), k = 1, 2, ..., q, be the partial derivative of the matrix Σ(θ) with respect to θk, i.e. Σk(θ) = ∂Σ(θ)/∂θk, and let Σ^k(θ), k = 1, 2, ..., q, denote the partial derivative of Σ(θ)⁻¹ with respect to θk, i.e. Σ^k(θ) = ∂Σ⁻¹(θ)/∂θk. Similarly, denote the second-order derivatives by Σkj(θ) = ∂²Σ(θ)/∂θk∂θj and Σ^{kj}(θ) = ∂²Σ⁻¹(θ)/∂θk∂θj. When θ = θ0 we simplify the notation: we write Σ(θ0) as Σ, Σ⁻¹(θ0) as Σ⁻¹, Σk(θ0) as Σk and Σ^k(θ0) as Σ^k. Denote the set of all eigenvalues of a square matrix A by λ(A), and denote the maximum and minimum eigenvalues of A by λmax(A) and λmin(A), respectively.

Here are the regularity conditions assumed in Theorem 1.

A 3. lim sup_{p→∞} λmax(Σ) < ∞ and lim inf_{p→∞} λmin(Σ) > 0.

A 4. ‖Σk‖F⁻² = Op(p⁻¹).

A 5. Assume lim_{p→∞} aij exists, where aij = tij/(tii^{1/2} tjj^{1/2}) and tij = tr(Σ⁻¹ Σi Σ⁻¹ Σj).

A 6. There exists an open subset ω containing the true parameter point θ0 such that for all θ* ∈ ω:
(i) −∞ < lim_{p→∞} λmin(Σk(θ*)) < lim_{p→∞} λmax(Σk(θ*)) < ∞;
(ii) −∞ < lim_{p→∞} λmin(Σkj(θ*)) < lim_{p→∞} λmax(Σkj(θ*)) < ∞;
(iii) ‖∂tij(θ*)/∂θ‖² = Op(p), where tij(θ*) = tr(Σ⁻¹(θ*) Σi(θ*) Σ⁻¹(θ*) Σj(θ*)).

Since Σ^k = −Σ⁻¹ Σk Σ⁻¹ and Σ^{kj} = Σ⁻¹(Σk Σ⁻¹ Σj + Σj Σ⁻¹ Σk − Σkj)Σ⁻¹, by A 3 and A 6 we have

−∞ < lim_{p→∞} λmin(Σ^k(θ*)) < lim_{p→∞} λmax(Σ^k(θ*)) < ∞ and −∞ < lim_{p→∞} λmin(Σ^{kj}(θ*)) < lim_{p→∞} λmax(Σ^{kj}(θ*)) < ∞.

Noticing that for any p × p matrix A we have ‖A‖F ≤ √p ‖A‖2 = √p λmax(A), it follows from A 6 that
(1) ‖Σ^k(θ*)‖F = Op(√p);
(2) ‖Σ^{kj}(θ*)‖F = Op(√p).

First we have the following theorem about the MLE consistency of (3.1.6).

Theorem 1. Assume A1-A6 hold, and let (µ1, µ2, θ0) be the true parameters. The maximum likelihood estimates of (3.1.6) are µ̂1MLE = Ȳ1·, µ̂2MLE = Ȳ2· and θ̂MLE, where Ȳk· = Σ_{i=1}^{nk} Yki/nk. Moreover,
(i) if p/n → 0, then ‖θ̂MLE − θ0‖2 = Op(1/√(np));
(ii) if p/n → C with 0 < C ≤ ∞ and √p/n → 0, then ‖θ̂MLE − θ0‖2 = Op(1/√n).
Proof. See the proof in section 3.6.

Theorem 1 shows that under the spatial statistical model, all the parameters (µ1, µ2, θ) can be estimated consistently by the MLE whether p/n → 0 or p/n tends to a positive constant or ∞. Therefore we obtain a positive definite covariance matrix estimate Σ(θ̂). We can plug the MLE µ̂1 = µ̂1MLE, µ̂2 = µ̂2MLE and Σ̂ = Σ(θ̂MLE) into (1.3.7) to build the classification function

δ̂MLE(X) = (X − (µ̂1MLE + µ̂2MLE)/2)ᵀ Σ⁻¹(θ̂MLE) (µ̂1MLE − µ̂2MLE).   (3.2.1)

A new observation x is classified into class C1 if δ̂MLE(x) > 0 and into C2 otherwise. Using the same notation as in section 1.3, the conditional misclassification rates are defined by (1.3.2) and (1.3.4). For simplicity, we write µ̂1 for µ̂1MLE, µ̂2 for µ̂2MLE and θ̂ for θ̂MLE in this section. We will see in the following theorem that the approximately optimal error rate can be achieved when p/n → 0; however, if p/n → C with 0 < C ≤ ∞, the error rate can be no better than random guessing even if we know the true covariance, due to the error accumulated in the estimation of µ1 and µ2.
Theorem 2. Let Cp = ∆ᵀΣ⁻¹∆. Assume p/n → 0, Cp → C0 with 0 ≤ C0 ≤ ∞, and nCp → ∞ as n, p → ∞.
(1) The overall misclassification rate W(δ̂MLE; Θ) is asymptotically sub-optimal; in other words, W(δ̂MLE; Θ) − WOPT →P 0.
(2) Moreover, if Cp → C0 with 0 ≤ C0 < ∞, or if Cp → ∞ and Cp·p/n → 0, then W(δ̂MLE; Θ) is asymptotically optimal, i.e. W(δ̂MLE; Θ)/WOPT →P 1.
Proof. See the proof in section 3.6.

The following theorem shows that when p/n goes to a positive constant or to ∞, even though the true covariance is known, the error accumulated in the estimation of µ1 and µ2 biases the misclassification rate unless the signal level is extremely high. This discovery suggests that even though parameter estimation poses no problem in our model in the high dimensional case, it is still necessary to select important features for classification.

Theorem 3. Assume the true covariance Σ is known, and denote the classifier by

δμ̂(x) = (x − μ̂)ᵀ Σ⁻¹ (μ̂1 − μ̂2),   (3.2.2)

where μ̂1, μ̂2 are the MLE in (3.1.6) and μ̂ = (μ̂1 + μ̂2)/2. Assume p/n → C with 0 < C ≤ ∞ and Cp → C0 with 0 ≤ C0 ≤ ∞. Assume n1 ≠ n2 and nk > n/4 (k = 1, 2). Then
(1) if Cp/(p/n) → ∞, then W(δμ̂) →P 0 and WOPT →P 0, but W(δμ̂)/WOPT →P ∞;
(2) if Cp/(p/n) → c with 0 < c < ∞, then (i) if p/n → C < ∞, lim W(δμ̂) > 1 − Φ(√C0/2) in probability; (ii) if p/n → ∞, then W(δμ̂) →P 0 and WOPT →P 0, but W(δμ̂)/WOPT →P ∞;
(3) if Cp/(p/n) → 0, then W(δμ̂) →P 1/2.
Proof. See the proof in section 3.6.

Corollary 1. Under the same conditions as in Theorem 3, if n1 = n2, then
(1) if Cp/√(p/n) → ∞, then W(δμ̂) →P 0 and WOPT →P 0, but W(δμ̂)/WOPT →P ∞;
(2) if Cp/√(p/n) → c with 0 < c < ∞, then (i) if p/n → C, W(δμ̂) →P 1 − Φ(c/(2√(4 + c/√C))) and WOPT →P 1 − Φ(√(c√C)/2); (ii) if p/n → ∞, W(δμ̂) →P 1 − Φ(c/4) and WOPT →P 0;
(3) if Cp/√(p/n) → 0, then W(δμ̂) →P 1/2.
Proof. See the proof in section 3.6.

Theorem 3 and Corollary 1 show that when p/n → C with 0 < C ≤ ∞, δμ̂ is never asymptotically optimal; it is asymptotically sub-optimal only if Cp → ∞. This reveals that although there is no difficulty in applying LDA to spatially dependent data when the parameters are estimated by MLE, in the high dimensional case (p/n → C with 0 < C ≤ ∞) the discriminant performance may be poor due to the noise accumulated in the estimation of µ1 and µ2 (see, e.g., [19] and [46]). Therefore, feature selection remains critical for classification with high dimensional features. Fan and Fan (2008) [19] extract salient features by two-sample t-tests and prove that the t-test can pick up all important features by choosing an appropriate critical value once the features are assumed to be independent. Shao et al. (2011) [46] propose to select features by thresholding. For spatially correlated features, we use penalized maximum likelihood estimates (PMLE) for feature selection.

3.3 Classification using penalized maximum likelihood estimation (PLDA)

3.3.1 The penalized maximum likelihood estimation (PMLE)

In this section, we consider the high dimensional classification problem (i.e. p/n → C with 0 < C ≤ ∞ as p → ∞ and n → ∞). We continue to use the notation of section 3.1. ∆ = µ2 − µ1 is a p-dimensional vector, so ∆ = (∆1, ..., ∆p) = (µ21 − µ11, ..., µ2p − µ1p) is the difference of the mean effects between classes C1 and C2. Define the signal set S = {j : ∆j ≠ 0} and let s be the number of nonzero elements of ∆; the important features are those in S. Instead of assuming sparsity of the discriminant direction as in [9], [20] and [36], we assume sparsity of ∆, that is, s ≪ n and s/n → 0.
Then, employing the idea of penalization in regression, we develop penalized maximum likelihood estimates of ∆ and θ. As defined in section 3.1, the observations Yki are normally distributed, Yki ∼ N(µk, Σ(θ0)) for k = 1, 2 and i = 1, 2, ..., nk. Let Ip be the p × p identity matrix and Jp the p × p matrix with all entries equal to 1. For a square matrix A, define the block-diagonal matrix diagn(A) with n × n blocks, i.e. diagn(A) has A in each of its n diagonal blocks and 0 elsewhere. Define Ĩn,p = diagn(Ip), let J̃n,p be the np × np block matrix whose n × n blocks are all equal to Ip, and let 1̃n,p = (Ip, Ip, ..., Ip)ᵀ be the np × p matrix that stacks n copies of Ip.

Let Y = (Y11ᵀ, ..., Y1n1ᵀ, Y21ᵀ, ..., Y2n2ᵀ)ᵀ. In order to estimate ∆ = µ2 − µ1, we transform Y: let Z = VY, where V is the (n − 1)p × np matrix made up of the first (n − 1)p rows of Ĩn,p − (1/n)J̃n,p. Then Z = (Z1ᵀ, Z2ᵀ, ..., Zn−1ᵀ)ᵀ, where Zi = Y1i − Ȳ for i = 1, 2, ..., n1, Zi = Y2(i−n1) − Ȳ for i = n1 + 1, n1 + 2, ..., n − 1, and Ȳ = (1/n) Σ_{k=1}^{2} Σ_{i=1}^{nk} Yki. Then

• Zi ∼ N(−τ2∆, ((n−1)/n) Σ(θ0)) for i = 1, 2, ..., n1;
• Zi ∼ N(τ1∆, ((n−1)/n) Σ(θ0)) for i = n1 + 1, ..., n − 1,

where τ1 = n1/n and τ2 = n2/n. Also, cov(Zi, Zj) = −(1/n)Σ for i ≠ j. Define X(1) = −τ2 Ip and X(2) = τ1 Ip, and set Xi = X(1) for i = 1, 2, ..., n1 and Xi = X(2) for i = n1 + 1, ..., n − 1. Write X = (X1ᵀ, X2ᵀ, ..., Xn−1ᵀ)ᵀ and β = ∆. Then Z is a (n − 1)p × 1 vector with multivariate normal distribution N(Xβ, Σ̇), where Σ̇ = (Ĩn−1,p − (1/n)J̃n−1,p) diagn−1(Σ). Denote all the unknown parameters by η = (β, θ) ∈ R^{p+q}. Since

|Σ̇| = |Ĩn−1,p − (1/n)J̃n−1,p| |diagn−1(Σ(θ))| = (1/n)^p |Σ(θ)|^{n−1}

and (Ĩn−1,p − (1/n)J̃n−1,p)⁻¹ = Ĩn−1,p + J̃n−1,p, we can write the penalized log-likelihood function for β and θ as

Q(θ, β; Z) = −((n−1)p/2) log(2π) − (1/2) log|Σ̇| − (1/2)(Z − Xβ)ᵀ Σ̇⁻¹ (Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|)
           = Cn,p − ((n−1)/2) log|Σ| − (1/2)(Z − Xβ)ᵀ diagn−1(Σ⁻¹)(Ĩn−1,p + J̃n−1,p)(Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|),   (3.3.1)

where Cn,p = −((n−1)p/2) log(2π) + (p/2) log n. Here Pλ(x) is a generic sparsity-inducing penalty; it could be the lasso penalty or a folded concave penalty (such as the SCAD or the MCP). We use the SCAD penalty in this research.

The true parameter β0 is a p × 1 vector and θ0 = (θ01, θ02, ..., θ0q) is the q-dimensional parameter of the covariance function. Assume β0 is sparse: without loss of generality we can write β0 = (β0,1ᵀ, β0,2ᵀ)ᵀ, where β0,1 ∈ R^s is the non-zero component, β0,2 = 0_{(p−s)×1} is the zero component of β0, and s/n → 0 as n, p, s → ∞. Also, we can write Xi = (Xi¹, Xi²) for i = 1, 2, ..., n, where Xi¹ is the p × s submatrix of Xi formed by the columns in supp(β0) and Xi² is the p × (p − s) complementary submatrix.
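The transformation from Y to Z can be written in a few lines. The Python sketch below forms Zi = Yki − Ȳ and drops the last centered row, as in the construction above; the array names and shapes are illustrative.

```python
import numpy as np

def transform_to_Z(Y1, Y2):
    """Build Z = V Y: Z_i = Y_1i - Ybar for i <= n1 and Z_i = Y_2(i-n1) - Ybar otherwise,
    dropping the last centered row so that Z has (n - 1) rows of length p."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    Ybar = (Y1.sum(axis=0) + Y2.sum(axis=0)) / (n1 + n2)
    Z = np.vstack([Y1 - Ybar, Y2 - Ybar])[:-1]      # (n - 1) x p
    tau1, tau2 = n1 / (n1 + n2), n2 / (n1 + n2)
    return Z, Ybar, tau1, tau2

# With Delta = mu2 - mu1, rows built from class 1 have mean -tau2 * Delta and rows built
# from class 2 have mean tau1 * Delta, matching the design X used in (3.3.1).
```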
We follow the one-step estimation procedure of [14]. As demonstrated in [58], the one-step method is as efficient as the fully iterative method both empirically and theoretically, provided that the initial estimators are reasonably good. The algorithm is as follows.

Algorithm
1. Initialize β by minimizing R(β) = (Z − Xβ)ᵀ(Z − Xβ) + n Σ_{j=1}^{p} Pλ(|βj|) with respect to β. Denote the result by β̂(0).
2. Fix β = β̂(0) and estimate θ by maximizing Q(θ, β̂(0); Z) in (3.3.1) with respect to θ. Denote the result by θ̂(0).
3. Fix θ = θ̂(0) and update β by maximizing Q(θ̂(0), β; Z) in (3.3.1) with respect to β. Denote the result by β̂(1).
4. Fix β = β̂(1) and estimate θ by maximizing Q(θ, β̂(1); Z) in (3.3.1) with respect to θ. Denote the result by θ̂(1).

Then θ̂PMLE = θ̂(1) and β̂PMLE = β̂(1) are the desired estimates; we call θ̂(1) and β̂(1) the one-step PMLE.

Denote ∆̂PMLE = β̂(1) and θ̂PMLE = θ̂(1) from the algorithm. Then µ1 and µ2 can be estimated by µ̂1PMLE = Ȳ − τ2∆̂PMLE and µ̂2PMLE = Ȳ + τ1∆̂PMLE. We also have the estimated covariance Σ̂ = Σ(θ̂PMLE), whose (i, j)th element is

σ̂ij = γ(hij; θ̂PMLE),   (3.3.2)

where hij = ‖sj − si‖2 is the distance between sites si and sj.

3.3.1.1 Consistency of PMLE

The penalty function largely determines the sampling properties of the penalized likelihood estimators. Some additional assumptions on the penalty function and the tuning parameter λ are needed:

A 7. an = Op(1/√n), where an = max_{1≤j≤p} {P′λn(|β0j|) : β0j ≠ 0}.

A 8. bn → 0 as n → ∞, where bn = max_{1≤j≤p} {P″λn(|β0j|) : β0j ≠ 0}.

A 9. λn → 0 and λn/√(s/n) → ∞.

A 10. lim inf_{n→∞, p→∞} lim inf_{θ→0+} P′λn(|θ|)/λn > 0.

A 7 ensures the unbiasedness property for large parameters and the existence of a consistent penalized likelihood estimator. A 8 ensures that the penalty function does not influence the penalized likelihood estimators more than the likelihood function. A 10 makes the penalized likelihood estimators possess the sparsity property, and A 9 leads to variable selection consistency. The smoothly clipped absolute deviation (SCAD) penalty satisfies all of these assumptions, and we apply SCAD in the simulation and real data analysis in this paper. Fan and Li (2001) [21] proposed the SCAD penalty function and showed that it satisfies three desirable properties: unbiasedness, sparsity and continuity. Unbiasedness means there is no over-penalization of large coefficients, avoiding unnecessary modeling bias. Sparsity means insignificant parameters are set to 0 by a thresholding rule to reduce model complexity. Continuity ensures that the penalized likelihood produces continuous estimators. The SCAD penalty is defined as

pλ(β) = λ|β|                                  if |β| ≤ λ,
        −(β² − 2aλ|β| + λ²)/(2(a − 1))         if λ < |β| ≤ aλ,
        (a + 1)λ²/2                            if |β| > aλ,

for some a > 2. More details can be found in [21].

Theorem 4. Assume conditions A 1-A 10 hold. Assume β0 = (β1,0ᵀ, β2,0ᵀ)ᵀ, where β1,0 ∈ R^s is the non-zero component and β2,0 = 0_{(p−s)×1} is the zero component of β0, with s/n → 0 and p/n → C, 0 < C ≤ ∞, as n, p, s → ∞. Let η̂ = (β̂, θ̂) be the PMLE of (3.3.1) obtained from the algorithm in section 3.3.1, with β̂ = (β̂1ᵀ, β̂2ᵀ)ᵀ and β̂1 the subvector of β̂ formed by the components in supp(β0). Then η̂ satisfies:
a. (consistency) ‖θ̂ − θ0‖2 = Op(1/√(np)) and ‖β̂ − β0‖2 = Op(√(s/n));
b. (sparsity) β̂2 = 0 with probability tending to 1 as n → ∞.
Proof. See the proof in section 3.6.
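The following Python sketch mimics the alternating structure of the algorithm above in a deliberately simplified form: a soft-threshold of the mean difference stands in for the SCAD-penalized least squares of step 1, and the covariance step fits only a single range parameter of an exponential model with unit variance. It is a sketch under these assumptions, not the estimator analyzed in Theorem 4.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik_r(r, Zc, D):
    """Negative Gaussian log-likelihood (up to constants) of centered rows Zc ~ N(0, Sigma(r)),
    with Sigma(r) = exp(-D / r) built from the pairwise site distances D."""
    Sigma = np.exp(-D / r)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->', Zc, np.linalg.inv(Sigma), Zc)
    return 0.5 * (Zc.shape[0] * logdet + quad)

def one_step_sketch(Y1, Y2, D, lam=0.1):
    n1, n2 = Y1.shape[0], Y2.shape[0]
    n = n1 + n2
    Ybar = (Y1.sum(0) + Y2.sum(0)) / n
    # Step 1: initial Delta by penalized least squares (soft threshold as a SCAD stand-in).
    diff = Y2.mean(0) - Y1.mean(0)
    Delta0 = np.sign(diff) * np.maximum(np.abs(diff) - lam, 0.0)
    # Step 2: estimate the covariance parameter with Delta fixed (mu1 = Ybar - tau2*Delta, etc.).
    Zc = np.vstack([Y1 - (Ybar - (n2 / n) * Delta0),
                    Y2 - (Ybar + (n1 / n) * Delta0)])
    r_hat = minimize_scalar(neg_loglik_r, bounds=(0.1, 20.0), args=(Zc, D),
                            method='bounded').x
    # Steps 3-4 would re-update Delta by penalized GLS with Sigma(r_hat) and refit r; omitted.
    return Delta0, r_hat
```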
3.3.1.2 Covariance tapering and PMLE

When the number of sites p is large for each realization of the spatial process, evaluating the likelihood can be computationally infeasible (it requires O(p³) operations). Covariance tapering can be used to approximate the likelihood: when the covariance matrix is replaced by a tapered one, the resulting matrices can be manipulated with efficient sparse matrix algorithms, which reduces the computation substantially.

In section 3.1, the covariance matrix is defined as Σ(θ) = [γ(si, sj; θ)], i, j = 1, ..., p. Under A 2, we can simply write it as Σ = [γ(hij; θ)], where hij = ‖si − sj‖2 is the distance between sites si and sj. Let KT(h, w) denote a tapering function, i.e. an isotropic autocorrelation function for 0 < h < w that equals 0 for h ≥ w, for a given threshold w > 0. Any compactly supported correlation function can be used as the tapering function; we use the tapering function from [50],

KT(h, w) = [(1 − h/w)+]²,   (3.3.3)

where x+ = max(x, 0), so the correlation is 0 at lag distances greater than the threshold distance w. Let K(w) = [KT(hii′, w)], i, i′ = 1, ..., p, denote the p × p tapering matrix. A tapered covariance of Σ is then defined as ΣT = Σ ◦ K(w), where ◦ is the Schur (i.e. elementwise) product. By the properties of the Schur product (see, e.g., [26], chap. 5), the tapered covariance matrix retains positive definiteness and thus is still a valid covariance matrix. When p is large, we approximate the penalized log-likelihood (3.3.1) by replacing Σ with ΣT and obtain the covariance-tapered penalized log-likelihood

QT(θ, β; Z) = −((n−1)p/2) log(2π) − (1/2) log|Σ̇T| − (1/2)(Z − Xβ)ᵀ Σ̇T⁻¹ (Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|)
            = Cn,p − ((n−1)/2) log|ΣT| − (1/2)(Z − Xβ)ᵀ diagn−1(ΣT⁻¹)(Ĩn−1,p + J̃n−1,p)(Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|),   (3.3.4)

where Cn,p = −((n−1)p/2) log(2π) + (p/2) log n. All notation is the same as in (3.3.1), except that Σ is replaced by ΣT.
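A tapered covariance of the form ΣT = Σ ◦ K(w) with the taper (3.3.3) can be built directly. The Python/SciPy sketch below is illustrative only: the exponential covariance, grid layout and taper range are arbitrary choices, not the thesis settings.

```python
import numpy as np
from scipy import sparse

def tapered_cov(Sigma, D, w):
    """Tapered covariance Sigma_T = Sigma ∘ K(w) with taper K_T(h, w) = [(1 - h/w)_+]^2.
    Entries at lag distance h >= w become exactly zero, so Sigma_T is stored sparsely."""
    K = np.maximum(1.0 - D / w, 0.0) ** 2      # tapering matrix K(w)
    return sparse.csr_matrix(Sigma * K)         # Schur (elementwise) product

# Example: exponential covariance on a 20 x 20 grid of unit-spaced sites, taper range w = 3.
u = 20
sites = np.array([(i, j) for i in range(u) for j in range(u)], dtype=float)
D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)   # pairwise distances
Sigma = np.exp(-D / 2.0)                        # gamma(h) = exp(-h/r) with r = 2 (illustrative)
Sigma_T = tapered_cov(Sigma, D, w=3.0)
print("retained nonzeros:", Sigma_T.nnz / Sigma.size)
```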
Let ∆̂PMLE,T = β̂PMLE,T and θ̂PMLE,T be the penalized maximum likelihood estimates with tapered covariance (PMLET). We now check the consistency of PMLET. Let γk(θ, h) = ∂γ(θ, h)/∂θk and γjk(θ, h) = ∂²γ(θ, h)/∂θk∂θj. Two additional regularity assumptions are made here.

A 11. 0 < infp{wp/p^δ0} < supp{wp/p^δ0} < ∞ for some δ0 > 0, where wp is the threshold distance in the tapering function.

A 12. Let d (d ≥ 1) be the dimension of the domain, i.e. D ⊂ Rᵈ. Assume that for all θ ∈ Ξ and 1 ≤ k, j ≤ q, the functions γ(θ, h), γk(θ, h), γjk(θ, h) belong to the function space £ = {f(h) : ∫₀^∞ hᵈ f(h) dh < ∞}.

Let Σ be the covariance matrix and ΣT the tapered covariance matrix, with Σk,T = ∂ΣT/∂θk and Σjk,T = ∂²ΣT/∂θj∂θk. Using the tapering function (3.3.3), we have the following result.

Theorem 5. Assume conditions A 1-A 12 hold. Assume β0 = (β1,0ᵀ, β2,0ᵀ)ᵀ, where β1,0 ∈ R^s is the non-zero component and β2,0 = 0_{(p−s)×1} is the zero component of β0, with s/n → 0 and p/n → C, 0 < C ≤ ∞, as n, p, s → ∞. Let η̂T = (β̂T, θ̂T) be the PMLE of (3.3.4) obtained from the algorithm in section 3.3.1, with β̂T = (β̂1,Tᵀ, β̂2,Tᵀ)ᵀ and β̂1,T the subvector of β̂T formed by the components in supp(β0). Then η̂T satisfies:
a. (consistency) ‖θ̂T − θ0‖2 = Op(1/√(np)) and ‖β̂T − β0‖2 = Op(√(s/n));
b. (sparsity) β̂2,T = 0 with probability tending to 1 as n → ∞.
Proof. See the proof in section 3.6.

3.3.2 The penalized maximum likelihood estimation LDA (PLDA) classifier

Whether we use PMLE or PMLET, we obtain estimates of ∆ and θ, denoted ∆̂ and θ̂. The estimates of µ1 and µ2 are µ̂1 = Ȳ − τ2∆̂ and µ̂2 = Ȳ + τ1∆̂. We also have the estimated covariance Σ̂ = Σ(θ̂), whose (i, j)th element is

σ̂ij = γ(‖sj − si‖; θ̂).   (3.3.5)

Since p > n, the error accumulated in the estimation of each σ̂ij may also cause problems in classification (see, e.g., [6] and [46]). To regularize the covariance matrix, we use the tapered covariance matrix in the classification function. Specifically, we define Σ̃ = ΣT(θ̂) = Σ(θ̂) ◦ K(w), where K(w) is defined in section 3.3.1.2. We then replace µ1, µ2, Σ in the LDA rule (1.3.7) by µ̂1, µ̂2 and Σ̃ for classification. The PLDA function is

δ̂PLDA(X) = (X − Ȳ − ((n1 − n2)/(2n)) ∆̂)ᵀ Σ̃⁻¹ ∆̂,   (3.3.6)

where Ȳ = (1/n) Σ_{k=1}^{2} Σ_{i=1}^{nk} Yki. The conditional misclassification rates for class 1 and class 2 are defined by (1.3.2) and (1.3.4) with Σ̂ replaced by Σ̃, and similarly the overall misclassification rate is defined as in (1.3.6).

A 13. Assume ∫₁^∞ hᵈ r(h; θ) dh < ∞ and ∫₀^1 h^{d−1} r(h; θ) dh < ∞ for θ ∈ Ξ.

A 14. Assume there exists a constant M such that for any h ≥ 0 and θ ∈ Ξ, ‖∂γ(h; θ)/∂θ‖2 ≤ M.

Theorem 6. Assume θ̂, ∆̂ in (3.3.6) are estimated as in Theorems 4 and 5, and suppose assumptions A1-A3 and A13-A14 hold. Assume s/n → 0, p/n → C with 0 < C ≤ ∞, Cp → C0 with 0 ≤ C0 ≤ ∞, and Cp√(s/n) → 0. Also assume w = O((√(np))^{α/d}) with 0 < α < 1 and w⁻¹ = O(p^{−δ}) → 0 with δ > 0, where d is the dimension of the domain. Then the classification error rate of δ̂PLDA is asymptotically sub-optimal, i.e. W(δ̂; Θ) →P 1 − Φ(√C0/2). Moreover,
(1) if Cp → C0 < ∞, then W(δ̂; Θ) is asymptotically optimal, i.e. W(δ̂; Θ)/WOPT →P 1;
(2) if Cp → ∞ and Cpκn,p → 0, then W(δ̂; Θ) is asymptotically optimal, i.e. W(δ̂; Θ)/WOPT →P 1, where κn,p = max(wᵈ/√(np), 1/w, √(s/n)).
Proof. See the proof in section 3.6.
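The PLDA rule (3.3.6) is cheap to evaluate once ∆̂, Ȳ and the tapered covariance are available, since Σ̃ is sparse. The sketch below is a minimal illustration assuming those quantities have already been computed (for example by the PMLE fit and the tapering step above); it is not the thesis implementation.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def plda_score(x, Ybar, Delta_hat, Sigma_tilde, n1, n2):
    """PLDA discriminant (3.3.6): (x - Ybar - (n1-n2)/(2n) * Delta)^T Sigma_tilde^{-1} Delta.
    Sigma_tilde is the tapered covariance Sigma(theta_hat) ∘ K(w), stored as a sparse matrix."""
    n = n1 + n2
    a = spsolve(sparse.csc_matrix(Sigma_tilde), Delta_hat)   # Sigma_tilde^{-1} Delta_hat
    return (x - Ybar - (n1 - n2) / (2 * n) * Delta_hat) @ a

# Classify a new image x into class 1 if plda_score(...) > 0 and class 2 otherwise;
# Ybar, Delta_hat and Sigma_tilde would come from the PMLE fit and the tapering step.
```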
3.4 Simulation Analysis

We conduct the simulation analysis in this section. Assume the spatial domain of interest D in R² is a u × u square and that a signal is observed at each lattice point, so we have p = u × u features for classification. Assume the mean effects of the signal for classes C1 and C2 are µ1 and µ2, with µ1 = (1₁₀, 0_{p−10}) and µ2 = 0p, where 1k is a k-dimensional vector with all elements equal to 1 and 0k is a k-dimensional vector with all elements equal to 0. For example, if u = 4 and hence p = 16, the corresponding spatial domain D, µ1 and µ2 are as in Figure 3.1.

Figure 3.1 Two-dimensional domain example. Left: 2D domain with p = 4 × 4; middle: µ1; right: µ2.

In the simulation setting, we let p = 36, 400, 1225, respectively. For the spatial covariance, we generate the error terms from a stationary and isotropic Gaussian process with zero mean. The Matérn covariance function defined in 1.2.1 is widely used as a spatial covariance function; we consider two special cases of it: the exponential covariance function and the polynomial covariance function. Let h be the Euclidean distance between two sites on the domain D. Specifically, on the domain D ⊂ R², the distance between site i with coordinates si = (xi, yi) and site j with coordinates sj = (xj, yj) is hij = √((xi − xj)² + (yi − yj)²).

• Exponential covariance function. Setting the smoothness parameter ν = 1/2 in 1.2.1, the spatial dependence of the error terms reduces to an exponential covariance function:

γ(h) = σ²(1 − c) exp(−h/r)   if h > 0,
       σ²                     if h = 0,

where σ² is the variance parameter, c is the nugget effect and r is the range parameter. In the simulation, we set σ² = 1, c = 0.2 and r = 1, 2, ..., 9. The classification performance is given in Tables 3.1, 3.2 and 3.3, the parameter estimation results in Table 3.4, and the model selection results in Tables 3.5, 3.6 and 3.7.

• Polynomial covariance function. The polynomial covariance function is γ(h) = σ²ρ^h, where ρ is the correlation parameter, so the covariance matrix Σ (p × p) has σ² on the diagonal and σ²ρ^{hij} as its (i, j)th off-diagonal entry. In the simulation, we let σ² = 1 and ρ = 0, 0.1, 0.2, ..., 0.9. The polynomial correlation function ρ^h is a special case of the exponential correlation function exp(−h/r), since ρ^h = exp(−h log(1/ρ)). The classification performance is given in Tables 3.8, 3.9 and 3.10, the parameter estimation results in Table 3.11, and the model selection results in Tables 3.12, 3.13 and 3.14.

For each setting of µ1, µ2 and Σ(θ0), 100 training sets with n1 = n2 = 30 are generated. For each training set, we estimate the parameters µ1, µ2 and θ0 by MLE, tapered MLE, PMLE and tapered PMLE. 100 testing data sets with n1 = n2 = 100 are generated for evaluating the classification performance, and the average classification error rate is calculated over the 100 testing data sets. All the reported numbers in the tables are means, with their standard errors (calculated over the 100 pairs of training and testing data sets) in parentheses.

We name the classification method proposed in this research PLDA. For each choice of p, we compare the classification performance of PLDA with the oracle classifier, MLE-LDA, PREG-LDA, FAIR (Feature Annealed Independence Rule; [19]) and NB (Naive Bayes; [5]). Specifically, MLE-LDA uses µ̂1MLE, µ̂2MLE and Σ(θ̂MLE) in the LDA function for classification. PREG-LDA uses ∆̂ = β̂(0) and θ̂ = θ̂(0) in the LDA classification function, where β̂(0) and θ̂(0) are the initial estimates from the algorithm in section 3.3.1; this method does not consider spatial correlation in feature selection. NB uses the sample means µ̂1, µ̂2 and the diagonal of the sample covariance Σ̂ in LDA; this method is also known as the independence rule (IR). FAIR assumes independence between variables and uses t-tests for variable selection in IR. The oracle uses the true means µ1, µ2 and the true covariance Σ(θ0). We also compare the average number of variables selected by PMLE, PREG and FAIR. The tuning parameter λ in PREG and PMLE is selected by 10-fold cross-validation, minimizing the classification error rate.

For p = 36, the performance of MLE-LDA, PREG-LDA, PMLE, FAIR and NB is shown. For p = 400, the performance of MLE-LDA, PREG-LDA, PMLE, FAIR and NB is shown, together with the classification performance of MLE-LDA and PLDA with the parameters estimated by tapered MLE and tapered PMLE. For p = 1225, the classification performance of tapered MLE-LDA, tapered PLDA, PREG-LDA, FAIR and NB is shown.
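The two covariance families and the sampling scheme of this simulation can be sketched as follows in Python. The parameter values shown correspond to one of the settings listed above, but the code is illustrative rather than the thesis's simulation scripts.

```python
import numpy as np

def exp_nugget_cov(D, sigma2=1.0, c=0.2, r=3.0):
    """Exponential covariance with nugget: sigma^2 on the diagonal (h = 0),
    sigma^2 * (1 - c) * exp(-h / r) off the diagonal (h > 0)."""
    S = sigma2 * (1.0 - c) * np.exp(-D / r)
    np.fill_diagonal(S, sigma2)
    return S

def poly_cov(D, sigma2=1.0, rho=0.5):
    """Polynomial covariance gamma(h) = sigma^2 * rho^h (note rho^h = exp(-h log(1/rho)))."""
    return sigma2 * rho ** D

# Simulation design sketch: u x u grid, mu1 nonzero in its first 10 coordinates.
u = 6                                            # p = 36; the thesis also uses p = 400, 1225
sites = np.array([(i, j) for i in range(u) for j in range(u)], dtype=float)
D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
Sigma = exp_nugget_cov(D, r=3.0)
mu1 = np.r_[np.ones(10), np.zeros(u * u - 10)]
mu2 = np.zeros(u * u)
rng = np.random.default_rng(0)
train1 = rng.multivariate_normal(mu1, Sigma, size=30)   # n1 = 30 training samples, class 1
train2 = rng.multivariate_normal(mu2, Sigma, size=30)   # n2 = 30 training samples, class 2
```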
66 Table 3.1 Classification Accuracy Rate (Exponential covariance, p=36) TURE p=36 0.88(0.02) 0.88(0.02) 0.89(0.02) 0.90(0.02) 0.91(0.02) 0.92(0.02) 0.93(0.02) 0.93(0.02) 0.94(0.02) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 MLE PREG PMLE FAIR NB 0.84(0.03) 0.84(0.03) 0.85(0.03) 0.87(0.03) 0.88(0.02) 0.89(0.02) 0.90(0.02) 0.91(0.02) 0.91(0.02) 0.84(0.04) 0.84(0.04) 0.85(0.04) 0.87(0.03) 0.88(0.02) 0.89(0.02) 0.90(0.03) 0.91(0.02) 0.92(0.02) 0.84(0.05) 0.84(0.05) 0.85(0.05) 0.87(0.04) 0.88(0.04) 0.89(0.04) 0.90(0.03) 0.91(0.03) 0.92(0.02) 0.81(0.04) 0.76(0.05) 0.74(0.04) 0.73(0.05) 0.72(0.05) 0.72(0.06) 0.71(0.06) 0.71(0.06) 0.71(0.05) 0.84(0.04) 0.78(0.05) 0.77(0.04) 0.76(0.05) 0.75(0.05) 0.75(0.06) 0.75(0.06) 0.75(0.06) 0.75(0.05) Table 3.2 Classification Accuracy Rate (Exponential covariance, p=400) TURE p=400 0.92(0.02) 0.92(0.02) 0.94(0.02) 0.95(0.01) 0.95(0.01) 0.96(0.01) 0.96(0.01) 0.97(0.01) 0.97(0.01) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 MLE MLET PREG PREGT PMLE PMLET FAIR NB 0.74(0.03) 0.75(0.03) 0.78(0.03) 0.80(0.03) 0.81(0.03) 0.83(0.03) 0.84(0.03) 0.85(0.03) 0.86(0.02) 0.75(0.03) 0.76(0.03) 0.78(0.03) 0.80(0.03) 0.82(0.03) 0.84(0.03) 0.85(0.02) 0.86(0.02) 0.87(0.02) 0.84(0.05) 0.85(0.05) 0.86(0.05) 0.88(0.04) 0.90(0.05) 0.91(0.05) 0.91(0.04) 0.92(0.04) 0.93(0.04) 0.84(0.05) 0.85(0.05) 0.87(0.05) 0.89(0.05) 0.90(0.04) 0.91(0.04) 0.92(0.04) 0.93(0.04) 0.93(0.04) 0.82(0.05) 0.85(0.05) 0.88(0.04) 0.91(0.03) 0.92(0.03) 0.93(0.02) 0.95(0.02) 0.95(0.02) 0.96(0.02) 0.84(0.04) 0.86(0.05) 0.89(0.05) 0.91(0.03) 0.93(0.03) 0.94(0.02) 0.94(0.02) 0.95(0.02) 0.95(0.02) 0.83(0.04) 0.79(0.04) 0.77(0.04) 0.75(0.04) 0.74(0.04) 0.73(0.05) 0.72(0.06) 0.72(0.06) 0.72(0.05) 0.74(0.04) 0.67(0.04) 0.63(0.05) 0.61(0.05) 0.60(0.05) 0.59(0.05) 0.58(0.05) 0.58(0.05) 0.58(0.05) Finally, the Table 3.15 and Table 3.16 and Table 3.17 showed the classification perfor- mance when the covariance are misspecified. We generate data using Gaussian covariance function with σ2 = 1, c = 0.2, and r = 1, 2, ..., 9. Then we estimate the parameters using exponential covariance function. It shows the PLDA classification method still yielded the best performance in the misspecified case. 
Table 3.3 Classification Accuracy Rate (Exponential covariance, p=1225) TURE p=1225 0.92(0.02) 0.92(0.02) 0.93(0.02) 0.94(0.02) 0.95(0.02) 0.96(0.01) 0.96(0.01) 0.96(0.01) 0.97(0.01) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 MLET PREGT PMLET FAIR NB 0.67(0.03) 0.68(0.03) 0.70(0.04) 0.72(0.04) 0.73(0.04) 0.75(0.03) 0.76(0.04) 0.77(0.04) 0.77(0.04) 0.83(0.05) 0.83(0.06) 0.85(0.06) 0.87(0.06) 0.88(0.05) 0.89(0.06) 0.89(0.06) 0.90(0.05) 0.91(0.05) 0.80(0.06) 0.82(0.06) 0.86(0.05) 0.89(0.03) 0.91(0.03) 0.92(0.03) 0.93(0.02) 0.93(0.02) 0.94(0.02) 0.83(0.05) 0.78(0.05) 0.76(0.04) 0.75(0.05) 0.74(0.05) 0.73(0.04) 0.73(0.04) 0.72(0.05) 0.72(0.05) 0.66(0.05) 0.61(0.05) 0.58(0.04) 0.57(0.05) 0.56(0.05) 0.55(0.04) 0.55(0.04) 0.54(0.05) 0.54(0.05) 67 Table 3.4 Parameter estimation (Exponential covariance) TURE 1 0.2 1 2 0.2 1 3 0.2 1 4 0.2 1 5 0.2 1 6 0.2 1 7 0.2 1 8 0.2 1 9 0.2 1 p=36 p=400 p=1225 MLE 1.0(0.2) 0.2(0.1) 1.0(0.0) 2.0(0.3) 0.2(0.1) 1.0(0.1) 3.0(0.4) 0.2(0.0) 1.0(0.1) 4.1(0.6) 0.2(0.0) 1.0(0.1) 5.1(0.8) 0.2(0.0) 1.0(0.1) 6.1(0.9) 0.2(0.0) 1.0(0.1) 7.1(1.2) 0.2(0.0) 1.0(0.1) 8.1(1.4) 0.2(0.0) 1.0(0.1) 9.2(1.6) 0.2(0.0) 1.0(0.1) PMLE 1.0(0.2) 0.2(0.1) 1.0(0.0) 2.1(0.3) 0.2(0.1) 1.0(0.1) 3.1(0.4) 0.2(0.0) 1.0(0.1) 4.1(0.6) 0.2(0.0) 1.0(0.1) 5.1(0.8) 0.2(0.0) 1.0(0.1) 6.1(1.0) 0.2(0.0) 1.0(0.1) 7.1(1.2) 0.2(0.0) 1.0(0.1) 8.1(1.4) 0.2(0.0) 1.0(0.1) 9.1(1.6) 0.2(0.0) 1.0(0.1) MLE 1.0(0.1) 0.2(0.0) 1.0(0.0) 2(0.1) 0.2(0.0) 1.0(0.0) 3.0(0.1) 0.2(0.0) 1.0(0.0) 4.0(0.2) 0.2(0.01) 1.0(0.0) 5.0(0.3) 0.2(0.0) 1.0(0.0) 6.1(0.4) 0.2(0.0) 1.0(0.0) 7.1(0.5) 0.2(0.0) 1.0(0.0) 8.1(0.6) 0.2(0.0) 1.0(0.1) 9.1(0.7) 0.2(0.0) 1.0(0.1) PMLE 1.0(0.1) 0.2(0.0) 1(0.0) 2.0(0.1) 0.2(0.0) 1.0(0.0) 3.0(0.1) 0.2(0.0) 1.0(0.0) 4.0(0.2) 0.2(0.01) 1.0(0.0) 5.0(0.3) 0.2(0.0) 1.0(0.0) 6.1(0.4) 0.2(0.0) 1.0(0.0) 7.1(0.5) 0.2(0.0) 1.0(0.1) 8.1(0.6) 0.2(0.0) 1.0(0.1) 9.1(0.7) 0.2(0.01) 1.0(0.1) MLET 1.0(0.1) 0.3(0.1) 1.0(0.0) 2.1(0.1) 0.3(0.0) 1.0(0.0) 3.2(0.2) 0.3(0.0) 1.0(0.0) 4.6(0.4) 0.3(0.0) 1.0(0.0) 6.4(0.8) 0.3(0.0) 1.0(0.0) 8.8(1.4) 0.2(0.0) 1.0(0.0) 12.0(2.5) 0.2(0.0) 1.0(0.1) 16.6(4.6) 0.2(0.0) 1.0(0.1) 24.0(9.6) 0.2(0.0) 1.0(0.1) PMLET 1.0(0.1) 0.3(0.0) 1.0(0.0) 2.1(0.1) 0.3(0.0) 1.0(0.0) 3.2(0.2) 0.3(0.0) 1.0(0.0) 4.6(0.4) 0.3(0.0) 1.0(0.0) 6.4(0.8) 0.2(0.0) 1.0(0.0) 8.8(1.4) 0.2(0.0) 1.0(0.0) 12.0(2.6) 0.2(0.0) 1.0(0.1) 16.8(4.9) 0.2(0.0) 1.0(0.1) 24.3(10.3) 0.2(0.0) 0.99(0.1) MLET 1.2(0.1) 0.5(0.0) 1.0(0.0) 2.5(0.1) 0.5(0.0) 1.0(0.0) 3.9(0.2) 0.5(0.0) 1.0(0.0) 5.6(0.4) 0.5(0.0) 1.0(0.0) 7.6(0.7) 0.5(0.0) 0.9(0.0) 10.1(1.2) 0.5(0.0) 0.9(0.0) 13.3(1.9) 0.5(0.0) 0.9(0.0) 17.6(3.3) 0.4(0.0) 0.9(0.0) 23.6(5.9) 0.4(0.0) 0.9(0.0) PMLET 1.2(0.1) 0.5(0.0) 1.0(0.0) 2.5(0.1) 0.5(0.0) 1.0(0.0) 3.9(0.2) 0.5(0.0) 1.0(0.0) 5.6(0.4) 0.5(0.0) 1.0(0.0) 7.6(0.7) 0.5(0.0) 1.0(0.0) 10.1(1.1) 0.5(0.0) 1.0(0.0) 13.3(1.9) 0.4(0.0) 1.0(0.0) 17.7(3.3) 0.4(0.0) 1.0(0.0) 23.7(5.8) 0.4(0.0) 1.0(0.0) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 r c σ r c σ r c σ r c σ r c σ r c σ r c σ r c σ r c σ Table 3.5 Number of variables selected (Exponential covariance, p=36) PMLE PREG FAIR p=36 r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 N-s 21(6.2) 20(5.9) 19(5.6) 20(5.6) 20(5.5) 20(5.2) 20(5.1) 20(4.8) 20(5.0) N-c 9(1.7) 9(1.7) 9(1.8) 9(1.7) 10(1.5) 10(1.2) 10(1.2) 10(1.0) 10(1.1) N-s 19(7.0) 19(6.8) 20(7.7) 20(7.4) 20(7.1) 20(7.4) 19(7.3) 20(7.7) 19(7.6) N-c 10(0.8) 10(0.9) 10(0.8) 10(0.4) 10(0) 10(0.1) 10(0) 10(0) 10(0) N-s 6(4.0) 5(3.3) 4(2.4) 3(1.6) 3(1.5) 3(1.4) 3(1.4) 3(1.4) 3(1.4) N-c 5(2.4) 4(1.9) 3(1.7) 3(1.5) 3(1.5) 3(1.4) 3(1.4) 3(1.4) 3(1.3) Key: N-s: the number 
of variables selected by the model; N-c: the number of correct variables selected by the model. Table 3.6 Number of variables selected (Exponential covariance, p=400) PMLE PREG PMLET PREGT FAIR p=400 N-s r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 85(55.3) 92(39.3) 82(31.5) 80(35.7) 74(32.2) 67(27.8) 59(20.5) 57(22.3) 53(21.9) N-c 9(1.5) 10(1.4) 10(1.2) 10(0.8) 10(0.6) 10(0.6) 10(0.5) 10(0.5) 10(0.5) N-s 42(55.2) 56(77.8) 50(69.1) 53(68.6) 51(73.3) 64(91.7) 56(91) 61(99) 64(98.5) N-c 9(1.4) 9(1.1) 9(1.2) 9(1.1) 9(1.2) 10(1.1) 10(1.1) 10(1.1) 10(1.1) N-s 82(51.3) 90(38.9) 80(37) 72(29.7) 65(32.2) 58(24.6) 56(35.1) 49(28.9) 47(35.7) N-c 10(1) 10(1.2) 10(1.4) 10(0.7) 10(0.7) 10(0.4) 10(0.5) 10(0.5) 10(0.4) N-s 46(64.7) 44(62.3) 51(76.9) 55(79.9) 65(97.5) 64(96) 64(91.4) 62(85) 65.(91.8) N-c 9(1.4) 9(1.4) 9(1.3) 10(1.1) 10(0.9) 10(1.1) 10(0.5) 10(1.2) 10(0.7) N-s 21(15.1) 20(15.6) 18(14.7) 14(11.7) 11(10.6) 9(9.2) 8(8.5) 8(8.2) 7(7.6) N-c 7(2.1) 7(2.5) 6(2.6) 6(2.7) 5(2.8) 5(2.8) 4(2.7) 4(2.8) 4(2.7) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 68 Table 3.7 Number of variables selected (Exponential covariance, p=1225) PMLTT PREGT FAIR p=1225 N-s r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 116(185.7) 143(192.3) 133(135.1) 104(91.5) 95(77.4) 76(74.0) 66(54.3) 48(46.3) 48(48.1) N-c 8(2.1) 9(2.2) 9(1.8) 9(1.3) 10(1.0) 10(0.8) 10(0.7) 10(0.6) 10(0.7) N-s 52(97.2) 53(128.3) 63(151.9) 56(129.7) 56(115.2) 49(107.9) 49(87.2) 53(110.2) 55(104.3) N-c 9(1.8) 9(1.7) 9(1.7) 9(1.7) 9(1.7) 9(2.0) 9(1.8) 9(1.6) 9(1.7) N-s 31(22.3) 37(22.7) 34(24.5) 28(19.6) 26(20.6) 23(18.1) 21(16.2) 18(15.6) 15(14.0) N-c 7.0(1.7) 7.0(2.1) 6.9(2.4) 6.5(2.8) 6.4(2.8) 6.2(3.0) 6.1(3.1) 5.6(3.3) 5.3(3.4) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 
Table 3.8 Classification Accuracy Rate (Polynomial covariance, p=36) TURE MLE 0.94(0.02) 0.92(0.02) 0.90(0.02) 0.88(0.02) 0.88(0.02) 0.88(0.02) 0.89(0.02) 0.91(0.02) 0.94(0.02) 0.99(0.01) 0.92(0.02) 0.89(0.03) 0.86(0.03) 0.84(0.03) 0.83(0.03) 0.83(0.03) 0.85(0.03) 0.88(0.02) 0.92(0.02) 0.98(0.01) PREG p=36 0.92(0.03) 0.89(0.03) 0.86(0.04) 0.84(0.04) 0.83(0.03) 0.83(0.04) 0.85(0.04) 0.88(0.03) 0.92(0.03) 0.97(0.02) ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 PMLE FAIR NB 0.92(0.02) 0.89(0.03) 0.86(0.04) 0.84(0.05) 0.83(0.05) 0.84(0.04) 0.85(0.04) 0.88(0.04) 0.92(0.03) 0.97(0.02) 0.89(0.04) 0.87(0.04) 0.84(0.04) 0.81(0.04) 0.78(0.04) 0.76(0.05) 0.74(0.04) 0.72(0.05) 0.71(0.05) 0.70(0.05) 0.92(0.04) 0.89(0.04) 0.87(0.04) 0.84(0.04) 0.81(0.04) 0.79(0.05) 0.77(0.04) 0.75(0.05) 0.74(0.05) 0.73(0.05) Table 3.9 Classification Accuracy Rate (Polynomial covariance, p=400) TURE MLE MLET 0.94(0.02) 0.93(0.02) 0.92(0.02) 0.91(0.02) 0.92(0.02) 0.92(0.02) 0.93(0.02) 0.95(0.01) 0.97(0.01) 1.0(0) 0.79(0.03) 0.76(0.03) 0.75(0.03) 0.74(0.03) 0.74(0.03) 0.75(0.03) 0.77(0.03) 0.81(0.03) 0.87(0.02) 0.97(0.01) 0.79(0.03) 0.77(0.03) 0.76(0.03) 0.74(0.03) 0.75(0.03) 0.76(0.03) 0.78(0.03) 0.81(0.03) 0.87(0.02) 0.97(0.01) PREG p=400 0.89(0.04) 0.87(0.03) 0.85(0.04) 0.84(0.05) 0.83(0.05) 0.84(0.05) 0.86(0.05) 0.88(0.05) 0.93(0.04) 0.98(0.02) ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 PREGT PMLE PMLET FAIR NB 0.89(0.04) 0.87(0.04) 0.85(0.04) 0.85(0.05) 0.84(0.05) 0.84(0.05) 0.86(0.04) 0.88(0.05) 0.93(0.04) 0.98(0.02) 0.90(0.03) 0.86(0.04) 0.83(0.05) 0.82(0.05) 0.83(0.05) 0.85(0.05) 0.88(0.04) 0.92(0.03) 0.96(0.03) 0.99(0.02) 0.90(0.03) 0.87(0.04) 0.84(0.04) 0.84(0.04) 0.84(0.04) 0.86(0.05) 0.89(0.04) 0.92(0.03) 0.96(0.03) 0.99(0.02) 0.88(0.04) 0.87(0.04) 0.86(0.04) 0.83(0.04) 0.82(0.04) 0.80(0.04) 0.78(0.04) 0.75(0.04) 0.73(0.04) 0.70(0.06) 0.79(0.03) 0.78(0.03) 0.77(0.03) 0.74(0.04) 0.71(0.04) 0.68(0.04) 0.65(0.04) 0.61(0.05) 0.59(0.05) 0.56(0.05) Table 3.10 Classification Accuracy Rate (Polynomial covariance, p=1225) TURE MLET 0.94(0.02) 0.93(0.02) 0.92(0.02) 0.91(0.02) 0.91(0.02) 0.92(0.02) 0.93(0.02) 0.95(0.02) 0.97(0.01) 1.00(0) 0.70(0.04) 0.68(0.04) 0.67(0.03) 0.67(0.03) 0.67(0.03) 0.68(0.03) 0.69(0.03) 0.71(0.03) 0.75(0.04) 0.85(0.04) PREGT p=1225 0.88(0.04) 0.86(0.05) 0.83(0.05) 0.83(0.04) 0.82(0.05) 0.83(0.06) 0.84(0.05) 0.86(0.06) 0.89(0.07) 0.95(0.04) ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 PMLET FAIR NB 0.88(0.04) 0.85(0.05) 0.81(0.06) 0.80(0.06) 0.80(0.06) 0.83(0.05) 0.86(0.04) 0.89(0.04) 0.94(0.02) 0.98(0.02) 0.88(0.05) 0.86(0.04) 0.85(0.05) 0.83(0.04) 0.81(0.04) 0.79(0.04) 0.76(0.05) 0.74(0.06) 0.71(0.07) 0.69(0.07) 0.70(0.05) 0.69(0.04) 0.68(0.05) 0.66(0.04) 0.64(0.04) 0.61(0.04) 0.59(0.05) 0.57(0.06) 0.55(0.07) 0.53(0.07) 69 Table 3.11 Parameter estimation (Polynomial covariance) p=36 p=400 p=1225 TURE ρ = 0 σ = 1 ρ = 0.1 σ = 1 ρ = 0.2 σ = 1 ρ = 0.3 σ = 1 ρ = 0.4 σ = 1 ρ = 0.5 σ = 1 ρ = 0.6 σ = 1 ρ = 0.7 σ = 1 ρ = 0.8 σ = 1 ρ = 0.9 σ=1 MLE 0.0(0.0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.3(0.0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.1) 0.7(0.0) 1.0(0.1) 0.8(0.0) 1.0(0.1) 0.9(0.0) 1.0(0.1) PMLE 0.0(0.0) 1(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.3(0.0) 1(0.0) 0.4(0.0) 1(0.0) 0.5(0.0) 1(0.0) 0.6(0.0) 1(0.1) 0.7(0.0) 1.0(0.1) 0.8(0.0) 1.0(0.1) 0.9(0.01) 1.1(0.1) MLE 0(0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.3(0.0) 1.0(0.01) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.0) 1.0(0.0) 0.8(0.0) 1.0(0.0) 0.9(0.0) 1.0(0.1) MLET 0(0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 
0.2(0.0) 1(0.0) 0.3(0.0) 1(0.0) 0.4(0.0) 1(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.01) 1.0(0.0) 0.8(0.0) 1.0(0.0) 0.9(0.01) 1.0(0.1) PMLE 0(0) 1.0(0.0) 0.1(0.1) 1.0(0.0) 0.1(0.1) 1.0(0.0) 0.3(0.0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.0) 1.0(0.0) 0.8(0.0) 1.0(0.1) 0.9(0.0) 1.0(0.1) 0(0) PMLET MLET 0(0) 1.0(0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.2(0) 1.0(0.0) 0.3(0.0) 1.0(0.0) 0.3(0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 0.9(0.0) 0.5(0.0) 0.9(0.0) 0.7(0.0) 0.8(0.0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1(0.0) 0.3(0.0) 1.0(0.0) 0.4(0.0) 1(0.0) 0.5(0.0) 1(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.0) 1.0(0.0) 0.8(0.0) 1.0(0.1) 0.9(0.0) 1.1(0.1) PMLET 0(0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.2(0) 1(0.0) 0.3(0.0) 1.0(0.1) 0.3(0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.5(0.0) 0.9(0.0) 0.7(0.0) 0.9(0.0) Table 3.12 Number of variables selected (Polynomial covariance, p=36) p=36 PMLE PREG FAIR ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 N-s 20(5.0) 21(5.4) 20(5.2) 21(6.2) 20(5.9) 21(5.6) 21(5.6) 20(5.9) 19(5.6) 15(5.9) N-c 10(0.4) 10(0.7) 10(1.4) 10(1.8) 10(1.8) 10(1.7) 10(1.6) 10(1.9) 9(2.2) 7(2.8) N-s 17(7.2) 18(6.5) 18(6.4) 19(7.0) 20(7.0) 20(6.9) 21(7.3) 20(7.3) 19(7.7) 15(7.9) N-c 10(0.7) 10(0.6) 10(0.4) 10(0.8) 10(0.8) 10(0.7) 10(0.7) 10(0.2) 10(0) 10(0.3) N-s 9(3.0) 8(3.2) 7(3.7) 7(4.3) 6(4.1) 5(3.9) 4(2.9) 3(2.4) 3(1.5) 2(1.1) N-c 7(2.0) 7(2.3) 6(2.5) 6(2.4) 5(2.4) 4(2.2) 4(2) 3(1.8) 3(1.5) 2(1.0) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. Table 3.13 Number of variables selected (Polynomial covariance, p=400) p=400 PMLE PREG PMLET PREGT FAIR ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 N-s 46(47.1) 51(51.9) 77(57.3) 84(50.5) 97(47.3) 93(42.7) 85(32.4) 78(26.4) 60(27.9) 29(30) N-c 9(1) 9(1.2) 9(1.4) 9(1.5) 10(1.5) 10(1.5) 10(1.3) 10(0.7) 10(1.2) 9(1.5) N-s 38.5(49.8) 41(50.4) 35(54.5) 40(51.9) 59(75.3) 55(78.7) 52(75.2) 54(69.4) 68(95.6) 55(97.1) N-c 9(1.3) 9(1.1) 9(1.5) 9(1.4) 9(1.3) 9(1.1) 9(1.2) 10(1.2) 10(0.5) 10(1.2) N-s 47.4(47.4) 52(57.6) 73(60.6) 82(48.9) 88(43.3) 92(42.6) 83(31.2) 72(24.9) 56(27.5) 29(24.8) N-c 9.4(0.9) 9(1.1) 9(1.3) 9(1.2) 10(0.9) 10(1.2) 10(1) 10(0.7) 10(1.1) 9(1.6) N-s 38(49.6) 36(46.7) 39(58.5) 39(56.4) 52(70.3) 46(57.1) 47(63.2) 59(78.6) 69(96) 56(98.1) N-c 9(1.3) 9.1(1.2) 9(1.5) 9(1.3) 9(1.3) 9(1.2) 9(1.2) 10(1.3) 10(0.5) 10(1.2) N-s 15(12.1) 17(13.1) 18(12.8) 19(13.9) 22(15.7) 22(15.6) 21(15.8) 16(13) 10(9.6) 5(5.8) N-c 7(1.9) 7(1.9) 7(1.9) 7(2.1) 7(2.2) 7(2.4) 7(2.4) 6(2.8) 5(3) 3(2.6) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 70 Table 3.14 Number of variables selected (Polynomial covariance, p=1225) p=1225 PMLTT PREGT FAIR ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 N-s 43(67.9) 53(73.5) 112(136.7) 121(147.5) 166(160.5) 152(99.5) 140(91.2) 113(82.5) 87(77.6) 25(40.2) N-c 9(1.4) 8(1.7) 8(2.2) 8(2.3) 8(2.5) 9(1.6) 9(1.4) 9(1.5) 10(0.7) 10(1.1) N-s 35(48.0) 35(49.1) 45(70.1) 52(87.2) 58(123.7) 49(74.9) 58(96.0) 79(184.8) 82(171.0) 76(173.7) N-c 9(1.6) 8(1.8) 8(1.8) 9(1.6) 9(1.7) 9(2.1) 9(2.1) 9(1.8) 9(2.0) 9(1.8) N-s 17(13.2) 24(17.7) 27(19.0) 31(19.4) 34(19.8) 36(21.6) 38(23.5) 31(22.2) 23(19.7) 12(11.8) N-c 7(1.8) 7(1.7) 7(1.7) 7(1.8) 7(1.9) 7(2.0) 7(2.3) 7(2.8) 6(3.3) 5(3.6) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 
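To make the meaning of the accuracy entries in these simulation tables concrete, the following is a small illustrative sketch in R, not the actual simulation code of Section 3.4: two Gaussian classes sharing an exponential spatial covariance on a regular grid are simulated, and the LDA rule with the true parameters plugged in (our reading of the benchmark reported in the TRUE columns) is evaluated on the draws. The grid size, range, sparse mean difference, and sample size below are arbitrary placeholders rather than the settings used in the thesis.

    ## Illustrative sketch only: two spatially correlated Gaussian classes on a
    ## grid, classified by the LDA rule with the true parameters plugged in.
    set.seed(1)

    side  <- 6; p <- side^2                        # 6 x 6 grid, p = 36
    sites <- expand.grid(x = 1:side, y = 1:side)
    D     <- as.matrix(dist(sites))                # pairwise site distances
    r     <- 3; Sigma <- exp(-D / r)               # exponential correlation, range r

    mu1 <- rep(0, p)
    mu2 <- mu1; mu2[1:10] <- 0.7                   # sparse mean difference (placeholder)

    # draw N(mu, Sigma) samples via the Cholesky factor
    rmvn <- function(n, mu, Sigma) {
      L <- chol(Sigma)                             # upper triangular, t(L) %*% L = Sigma
      sweep(matrix(rnorm(n * length(mu)), n) %*% L, 2, mu, "+")
    }
    n <- 100
    X <- rbind(rmvn(n, mu1, Sigma), rmvn(n, mu2, Sigma))
    y <- rep(1:2, each = n)

    # LDA rule with the true (mu1, mu2, Sigma): assign class 1 when the score is positive
    a     <- solve(Sigma, mu1 - mu2)
    score <- (X - matrix((mu1 + mu2) / 2, nrow(X), p, byrow = TRUE)) %*% a
    mean(ifelse(score > 0, 1, 2) == y)             # empirical accuracy

The empirical accuracy returned by the last line is the kind of quantity a single cell in these tables would correspond to, up to averaging over repeated replications (standard deviation in parentheses).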
Table 3.15 Classification Accuracy Rate (Mis-specified covariance, p=36). Entries are mean accuracy rates with standard deviations in parentheses; methods in rows, r = 1, ..., 9 in columns.

p=36    r=1          r=2          r=3          r=4          r=5          r=6          r=7          r=8          r=9
TRUE    0.887(0.02)  0.879(0.02)  0.904(0.02)  0.925(0.02)  0.939(0.02)  0.951(0.02)  0.958(0.01)  0.963(0.01)  0.969(0.01)
MLE     0.841(0.03)  0.851(0.03)  0.888(0.03)  0.907(0.02)  0.922(0.02)  0.935(0.02)  0.943(0.02)  0.95(0.02)   0.956(0.01)
PREG    0.84(0.04)   0.85(0.04)   0.883(0.03)  0.906(0.03)  0.922(0.02)  0.935(0.02)  0.943(0.02)  0.949(0.02)  0.954(0.02)
PMLE    0.843(0.05)  0.848(0.04)  0.889(0.03)  0.91(0.03)   0.925(0.03)  0.935(0.02)  0.946(0.02)  0.953(0.02)  0.959(0.02)
FAIR    0.826(0.05)  0.755(0.05)  0.727(0.06)  0.713(0.05)  0.709(0.05)  0.701(0.06)  0.7(0.05)    0.701(0.05)  0.705(0.05)
NB      0.854(0.05)  0.771(0.05)  0.746(0.06)  0.737(0.05)  0.735(0.05)  0.734(0.06)  0.735(0.05)  0.736(0.05)  0.736(0.05)

Table 3.16 Classification Accuracy Rate (Mis-specified covariance, p=400)

p=400   r=1          r=2          r=3          r=4          r=5          r=6          r=7          r=8          r=9
TRUE    0.91(0.02)   0.93(0.02)   0.95(0.01)   0.97(0.01)   0.98(0.01)   0.98(0.01)   0.99(0.01)   0.99(0.01)   1.00(0.01)
MLE     0.73(0.03)   0.76(0.03)   0.82(0.03)   0.85(0.02)   0.88(0.02)   0.90(0.02)   0.92(0.02)   0.93(0.02)   0.94(0.02)
MLET    0.74(0.03)   0.78(0.03)   0.82(0.03)   0.86(0.02)   0.88(0.02)   0.90(0.02)   0.87(0.06)   0.69(0.11)   0.61(0.05)
PREG    0.84(0.05)   0.86(0.04)   0.89(0.04)   0.92(0.03)   0.94(0.03)   0.95(0.03)   0.96(0.03)   0.96(0.03)   0.97(0.02)
PREGT   0.84(0.05)   0.87(0.04)   0.90(0.04)   0.93(0.03)   0.94(0.03)   0.95(0.03)   0.94(0.04)   0.86(0.07)   0.84(0.06)
PMLE    0.83(0.05)   0.89(0.04)   0.93(0.03)   0.96(0.02)   0.97(0.02)   0.97(0.02)   0.98(0.01)   0.98(0.01)   0.98(0.01)
PMLET   0.84(0.04)   0.90(0.04)   0.93(0.03)   0.96(0.02)   0.96(0.02)   0.97(0.01)   0.96(0.02)   0.88(0.05)   0.87(0.03)
FAIR    0.85(0.05)   0.79(0.04)   0.76(0.04)   0.74(0.04)   0.73(0.04)   0.71(0.05)   0.71(0.07)   0.70(0.07)   0.70(0.06)
NB      0.75(0.03)   0.67(0.04)   0.62(0.04)   0.60(0.04)   0.59(0.04)   0.58(0.04)   0.57(0.04)   0.56(0.04)   0.56(0.04)

Table 3.17 Classification Accuracy Rate (Mis-specified covariance, p=1225)

p=1225  r=1          r=2          r=3          r=4          r=5          r=6          r=7          r=8          r=9
TRUE    0.91(0.02)   0.93(0.02)   0.95(0.02)   0.97(0.01)   0.98(0.01)   0.98(0.01)   0.99(0.01)   0.99(0.01)   0.99(0.01)
MLET    0.66(0.04)   0.69(0.03)   0.72(0.03)   0.74(0.03)   0.75(0.04)   0.76(0.04)   0.75(0.04)   0.72(0.05)   0.65(0.07)
PREGT   0.83(0.04)   0.84(0.05)   0.87(0.05)   0.89(0.06)   0.90(0.06)   0.91(0.06)   0.91(0.06)   0.90(0.07)   0.86(0.09)
PMLET   0.81(0.05)   0.88(0.04)   0.92(0.04)   0.94(0.02)   0.95(0.02)   0.95(0.02)   0.95(0.02)   0.94(0.02)   0.90(0.05)
FAIR    0.83(0.05)   0.78(0.05)   0.75(0.04)   0.73(0.05)   0.73(0.05)   0.72(0.05)   0.712(0.05)  0.71(0.06)   0.71(0.06)
NB      0.67(0.05)   0.60(0.05)   0.57(0.04)   0.55(0.05)   0.54(0.05)   0.54(0.05)   0.54(0.05)   0.54(0.06)   0.53(0.06)

3.5 Real Data Analysis

Data used in the real data example were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI), which was launched in 2004 with the aim of improving clinical trials for the prevention and treatment of Alzheimer's disease (AD). In the interest of promoting consistency in data analysis, the ADNI Core has created standardized analysis sets of the structural MRI scans, comprising only image data that have passed quality control (QC) assessments conducted at the Aging and Dementia Imaging Research Laboratory at the Mayo Clinic (see [27]). In this study, we used T1-weighted MRI images from the collection of standardized datasets. A description of the standardized MRI imaging from ADNI can be found at http://adni.loni.usc.edu/methods/mri-analysis/adni-standardized-data/ and in [54].
According to [27], the images were obtained using magnetization prepared rapid gradient echo (MPRAGE) or equivalent protocols with varying resolutions (typically 1.0 × 1.0 mm in-plane spatial resolution and 1.2 mm thick sagittal slices with 256 × 256 × 166 voxels). The images were then pre-processed according to a number of steps detailed in [27] and http://adni.loni.usc.edu/methods/mri-analysis/mri-pre-processing/, which corrected gradient non-linearity, intensity inhomogeneity and phantom-based distortion. In addition, the pre-processed images were processed with FreeSurfer for cortical reconstruction and volumetric segmentation by the Center for Imaging of Neurodegenerative Diseases, UCSF. The skull-stripped volumes (brain masks) obtained from the FreeSurfer cross-sectional processing were used in this study.

Only images from ADNI-1 subjects obtained using 1.5 T scanners at screening visits were used in this study, and we used the first time point if there were multiple images of the same subject acquired at different times. The study contains 187 subjects diagnosed with Alzheimer's disease at the screening visit and 227 healthy subjects, for a total of 414 subjects. Details of the subjects can be found in Table 3.18.

Table 3.18 Subjects characteristics

                  AD              NL              p-value
n                 187             227
Age (Mean±sd)     75.28 ± 7.55    75.80 ± 4.98    0.4168
Gender (F/M)      88/99           110/117         0.813
MMSE (Mean±sd)    23.28 ± 2.04    29.11 ± 1.00    < 1e−15

Key: AD, subjects with Alzheimer's disease; NL, healthy subjects; Age, baseline age; MMSE, baseline Mini-Mental State Examination.

After downloading the pre-processed imaging data from ADNI, the R package ANTsR was applied for image registration. We then used the "3dresample" command of the AFNI software ([16]) to adjust the resolution and reduce the total number of voxels per image to 18 × 22 × 18. Taking the x and y axes for the horizontal plane, the x and z axes for the coronal plane, and the y and z axes for the sagittal plane, only the 1100 voxels located in the center of the brain were used as features for classification (i.e., the voxels with x coordinate from 5 to 14, y coordinate from 6 to 16, and z coordinate from 5 to 14). After removing the voxels with zero signal for most of the subjects (more than 409 subjects), 1077 voxels remained in use.

The distance between each pair of voxels can be calculated from their coordinates. For example, for two voxels $s_1$ and $s_2$ with coordinates $s_1 = (x_1, y_1, z_1)$ and $s_2 = (x_2, y_2, z_2)$, the distance between $s_1$ and $s_2$ is defined by
$$d(s_1, s_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}.$$
From the plot of the sample covariance, spatial correlations can be observed among the voxels. (A short computational sketch of these steps is given after Table 3.19 below.)

We randomly sampled 100 of the 187 AD subjects and 100 of the 227 healthy subjects as the training set, leaving 87 AD subjects and 127 healthy subjects. The testing set includes the remaining 87 AD subjects and a random sample of 87 of the 127 remaining healthy subjects. Details of the subjects in the training and testing sets can be found in Table 3.19.

Table 3.19 Subjects characteristics of training and testing set

                         training set     testing set     p-value
AD   n                   100              87
     Age (Mean±sd)       75.64 ± 7.39     74.85 ± 7.75    0.478
     Gender (F/M)        47/53            41/46           0.999
     MMSE (Mean±sd)      23.22 ± 2.08     23.36 ± 2.01    0.649
NL   n                   100              87
     Age (Mean±sd)       75.99 ± 5.39     75.34 ± 4.56    0.3723
     Gender (F/M)        42/58            50/37           0.05
     MMSE (Mean±sd)      29.06 ± 1.04     29.09 ± 1.01    0.8307

Key: AD, subjects with Alzheimer's disease; NL, healthy subjects; Age, baseline age; MMSE, baseline Mini-Mental State Examination.
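To make the voxel-distance and plug-in classification steps above concrete, here is a minimal sketch in R. It is illustrative only and is not the code used in the thesis: the working covariance below uses the simplified exponential form σ² exp(−h/r), whereas the exact nugget/scale parameterization of the spatial model in this chapter may differ; the values of r and σ² are the PMLE estimates quoted below, inserted purely as placeholders; and mu1.hat and mu2.hat stand for class mean vectors estimated from the training images.

    ## Minimal sketch: voxel grid, pairwise distances d(s_i, s_j), an exponential
    ## working covariance, and the plug-in LDA score used for classification.

    # Central voxel block used as features: x in 5:14, y in 6:16, z in 5:14
    coords <- expand.grid(x = 5:14, y = 6:16, z = 5:14)   # 10 * 11 * 10 = 1100 voxels

    # Pairwise Euclidean distances between voxel centers
    D <- as.matrix(dist(coords))                          # 1100 x 1100

    # Exponential working covariance (simplified form; placeholder parameter values)
    r <- 6.59; sigma2 <- 230.38
    Sigma.hat <- sigma2 * exp(-D / r)

    # Plug-in LDA score for a new image z over the same voxels:
    # a positive score classifies the image into class 1 (e.g., AD)
    lda.score <- function(z, mu1.hat, mu2.hat, Sigma.hat) {
      a <- solve(Sigma.hat, mu1.hat - mu2.hat)            # Sigma^{-1} (mu1 - mu2)
      drop(crossprod(z - (mu1.hat + mu2.hat) / 2, a))
    }

In the PLDA fit, the same score is computed only over the voxels retained by the penalty (5 of the 1077 here) and with all quantities replaced by their penalized maximum likelihood estimates.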
Table 3.20 Classification performance for voxel level MRI data. Training and testing samples are of sizes 200 and 174, respectively

                          MLE     PREG    PMLE    FAIR    NB
Accuracy                  0.707   0.747   0.776   0.621   0.689
No. of training err       38      47      46      77      55
No. of testing err        51      44      39      66      54
No. of selected voxels    1077    3       5       7       1077

We assume an exponential correlation among the voxels and apply the PLDA method proposed in this research for classification. First, the parameters are estimated by PMLE from the training data, giving r = 6.59, c = 0.924, σ² = 230.38, with 5 voxels simultaneously selected for classification. We then plug these estimates into the classification function and classify the testing data. The classification accuracy rate of PLDA is listed in Table 3.20, together with the classification accuracy of the other methods. The accuracy rate of our method is about 77.6%, which is superior to the other comparable methods (MLE-LDA: 70.7%, PREG-LDA: 74.7%, FAIR: 62.1%, and NB: 68.9%).

3.6 Proofs of the main results

3.6.1 Proofs for classification using MLE

Lemma 3.6.1. Let $\epsilon$ be a $p$-dimensional random vector with $\epsilon \sim N(0, \Sigma)$, where $\Sigma$ is a $p \times p$ positive definite covariance matrix. For an $m$-dimensional vector $u$ with $\|u\|_2 = \sqrt{u^T u} = C$ and a $p \times m$ matrix $X$, we have
$$\left|\epsilon^T X u\right| = O_p\left(\sqrt{\mathrm{tr}(X^T \Sigma X)}\,\|u\|_2\right). \quad (3.6.1)$$

Proof. Since $E(\epsilon^T X u) = 0$, by the Cauchy–Schwarz inequality,
$$E(\epsilon^T X u)^2 \le E(\epsilon^T X X^T \epsilon)\,\|u\|_2^2 = \mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2. \quad (3.6.2)$$
By Chebyshev's inequality, for any $M$,
$$P\left(\frac{|\epsilon^T X u|}{\sqrt{\mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2}} > M\right) \le \frac{E(\epsilon^T X u)^2}{M^2\,\mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2} \le \frac{1}{M^2}. \quad (3.6.3)$$
Thus for any $\varepsilon > 0$ there exists $M$ large enough such that
$$P\left(\frac{|\epsilon^T X u|}{\sqrt{\mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2}} > M\right) < \varepsilon. \quad (3.6.4)$$
This leads to $\left|\epsilon^T X u\right| = O_p\left(\sqrt{\mathrm{tr}(X^T \Sigma X)}\,\|u\|_2\right)$.

Lemma 3.6.2. Let $\epsilon_i$ ($i = 1, 2, \ldots, n$) be $p$-dimensional random vectors with $\epsilon_i \sim N(0, c(n)\Sigma)$, where $c(n)$ is a function of $n$ and $\Sigma$ is a $p \times p$ positive definite covariance matrix with $\lambda_{\max}(\Sigma) < \infty$. For a $p \times p$ matrix $A$, we have
$$\sum_{i=1}^{n}\left[\epsilon_i^T A \epsilon_i - c(n)\,\mathrm{tr}(A\Sigma)\right] = O_p\left(c(n)\sqrt{n}\,\|A\|_F\right).$$

Proof. Since $E(\epsilon_i^T A \epsilon_i) = c(n)\,\mathrm{tr}(A\Sigma)$, we have
$$E\left(\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right)^2 = \sum_{i=1}^{n} E\left(\epsilon_i^T A \epsilon_i - c(n)\,\mathrm{tr}(A\Sigma)\right)^2 = \sum_{i=1}^{n} E(\epsilon_i^T A \epsilon_i)^2 - n c^2(n)\,\mathrm{tr}^2(A\Sigma). \quad (3.6.5)$$
Let $B = c(n)\,\Sigma^{1/2} A\, \Sigma^{1/2}$; then there exists an orthogonal matrix $Q$ such that $B = Q^T \Lambda Q$, where $\Lambda = \mathrm{diag}(\lambda_j)$ and the $\lambda_j$ are the eigenvalues of $B$. Let $\tilde{\epsilon}_i = Q\,\Sigma^{-1/2}\epsilon_i/\sqrt{c(n)}$; then $\tilde{\epsilon}_i \sim N(0, I_{p\times p})$, where $I_{p\times p}$ is the identity matrix, $\epsilon_i^T A \epsilon_i = \tilde{\epsilon}_i^T \Lambda \tilde{\epsilon}_i$, and
$$E(\epsilon_i^T A \epsilon_i)^2 = E\left(\sum_{j=1}^{p}\lambda_j \tilde{\epsilon}_{ij}^2\right)^2 = 3\sum_{j=1}^{p}\lambda_j^2 + \sum_{j\neq k}\lambda_j\lambda_k = 2\,\mathrm{tr}(B^T B) + \mathrm{tr}^2(B) = c(n)^2\left[2\,\mathrm{tr}(\Sigma A \Sigma A) + \mathrm{tr}^2(A\Sigma)\right]. \quad (3.6.6)$$
Hence
$$E\left(\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right)^2 = 2 n c(n)^2\,\mathrm{tr}(\Sigma A^T \Sigma A) \le 2 n c(n)^2\,\lambda_{\max}^2(\Sigma)\,\mathrm{tr}(A^T A). \quad (3.6.7)$$
By Chebyshev's inequality, for any $M$ we have
$$P\left(\frac{\left|\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right|}{\sqrt{n c(n)^2\,\|A\|_F^2}} > M\right) \le \frac{2 n c(n)^2\,\mathrm{tr}(\Sigma A^T \Sigma A)}{M^2\, n c(n)^2\,\|A\|_F^2} \le \frac{2\lambda_{\max}^2(\Sigma)}{M^2}. \quad (3.6.8)$$
Then for any $\varepsilon > 0$ there exists $M$ large enough such that
$$P\left(\frac{\left|\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right|}{\sqrt{n c(n)^2\,\|A\|_F^2}} > M\right) < \varepsilon, \quad (3.6.9)$$
which means $\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma) = O_p\left(c(n)\sqrt{n}\,\|A\|_F\right)$.

Proof of Theorem 1.
Take derivative with respect to µk (k = 1, 2) with the function L(θ, µ1, µ2) defined in 3.1.6. Considering Σ−1(θ) is nonsingular, we have µk = ˆµkM LE = ¯Y k·. For ˆθM LE, we first consider the case of p/n → 0. It is sufficient to prove that for any given  > 0, there is a large constant C such that for large p and n, the smallest rate of (cid:113) 1 convergence ηn,p is np such that we have L(θ0 + uηn,p, ˆµ1, ˆµ2) < L(θ0, ˆµ1, ˆµ2)) > 1 −  (3.6.10) P ( sup (cid:107)u(cid:107) 2 =C where u ∈ Rq. This implies that there exists a local maximum for the function L in the neighborhood of θ0 with the radius at most proportional to ηn,p. L(θ0 + uηn,p, ˆµ1, ˆµ2) − L(θ0, ˆµ1, ˆµ2) (3.6.11) = ( ∂L ∂θ = −n 2 (θ0))T uηn,p + 1 uT T (θ0)uη2 n,p + ( (θ∗))uη2 ∂2L ∂θ∂θT (θ0))T uηn,p + 1 n,p 2 uT ( ∂L ∂θ 2 uT ( ∂2L ∂θ∂θT (θ∗) + mnT (θ0))uη2 n,p 77 = (I) + (II) + (III) where T (θ0) is a q × q matrix with its (i, j)th element tij(θ0) = tr(Σ−1ΣiΣ−1Σj). From A4 and A5, tij = aij(tii) 1 2 (tjj) 1 2 ≥ aijλ−2 min(Σ)(cid:107)Σi(cid:107) F . There exists a (cid:13)(cid:13)Σj (cid:13)(cid:13) F constant M such that (I) = −n 2 uT T uη2 n,p = −n 2 q(cid:88) i,j=1 tijuiujη2 n,p ≤ −M np 2 n,p (cid:107)u(cid:107)2 η2 2 (3.6.12) nk(cid:88) q(cid:88) (cid:20) i=1 j=1 (Y ki − ˆµk)T Σj(Y ki − ˆµk) − ( (cid:21) )tr(ΣΣj) nk − 1 nk uθj ηn,p (3.6.13) and (II) = − 1 2(cid:88) q(cid:88) k=1 2 j=1 = (1) + (2) + 1 2 tr(ΣΣj)uθj ηn,p n (cid:13)(cid:13)(cid:13)Σj(cid:13)(cid:13)(cid:13) ≤(cid:13)(cid:13)Σi(cid:13)(cid:13)2 2 √ ηn,p) = Op( np(cid:107)u(cid:107) (cid:107)u(cid:107) 2 ηn,p) 2 (3.6.14) F = pλ2 max(Σi) = O(p). Also by A6(i), tr(ΣΣj) = tr(Σ1/2ΣjΣ1/2) ≤ λmax(Σj)tr(Σ). Then noticing that tr(Σ) = nk−1 nk Because Y ki − ˆµk ∼ N (0, 2(cid:88) |(1)| = Op( k=1 nk − 1√ nk (cid:13)(cid:13)(cid:13)Σj(cid:13)(cid:13)(cid:13) Σ), by lemma 3.6.2, √ ηn,p) = Op( (cid:107)u(cid:107) 2 F The last equality is because from A6(i),(cid:13)(cid:13)Σi(cid:13)(cid:13)2 F 78 O(p), and p/n → 0 |(2)| = Thus (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 2 q(cid:88) j=1 tr(ΣΣj)ujηn,p (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = Op(tr(Σ)ηn,p (cid:107)u(cid:107) 2 ) = Op(pηn,p (cid:107)u(cid:107) ) 2 √ (II) = Op(( np + p)ηn,p (cid:107)u(cid:107) ) 2 √ • If p/n → 0, (II) = Op( np)ηn,p. By choosing sufficient large C = (cid:107)u(cid:107) , the minimal 2 rate of ηn,p to have (II) be dominated by (I) is ηn,p = Op( (cid:113) 1 np); • If p/n → C with 0 < C ≤ ∞, (II) = Op(pηn,p). Then the minimal rate of ηn,p to have (II) dominated by (I) is ηn,p = Op( (cid:113) 1 n) Since ∂L ∂θj∂θl (θ) = (cid:104) n 2 tr(Σj(θ)Σ(θ)) + tr(Σj(θ)Σl(θ)) (cid:105) − 1 2 2(cid:88) nk(cid:88) k=1 i=1 (Y ki − ˆµk)T Σjl(θ)(Y ki − ˆµk) )tr(Σ(θ0)Σjl(θ∗))(cid:1)(cid:35) nk − 1 nk uθj uθl η2 n,p (III) can be written as q(cid:88) − 1 2 2n1n2 j,l=1 n q(cid:88) q(cid:88) j,l=1 + + + n 2 n 2 k=1 i=1 nk(cid:88) (cid:0)(Y ki − ˆµk)T Σjl(θ∗)(Y ki − ˆµk) − ( tr(Σjl(θ∗)Σ(θ0))uθj (cid:34) 2(cid:88) q(cid:88) (cid:104) (cid:105) tr(Σjk(θ∗)Σ(θ∗)) − tr(Σjk(θ∗)Σ(θ0)) (cid:104) (cid:105) tr(Σj(θ∗)Σl(θ∗)) − tr(Σj(θ0)Σl(θ0)) η2 n,p uθl j,l=1 uθj uθl η2 n,p uθj η2 n,p uθl j,l=1 79 =(3) + (4) + (5) + (6) By lemma 3.6.2 and A6(ii), √ |(3)| = Op( n For (4), by A6(ii ) (cid:13)(cid:13)(cid:13)Σjl(θ∗) (cid:13)(cid:13)(cid:13) F √ η2 n,p) = Op( npη2 n,p) tr(Σjl(θ∗)Σ(θ0)) = Op(p) thus |(4)| = Op( p n η2 n,p) It is easy to see (3), (4) are dominated by (I). 
For (5), |(5)| = n tr(Σkj(θ∗)(Σ(θ0) − Σ(θ∗)))uθk η2 n,p uθj (3.6.15) Let dil(θ∗) be the i, lth entry of matrix Σkj(θ∗), γil(θ) be the i, lth entry of Σ(θ), then by (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)1 4 q(cid:88) j,k=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . A2 and A6(ii) tr(Σkj(θ∗)(Σ(θ0) − Σ(θ∗))) = ≤ ≤ dil(θ∗)(γli(θ0) − γli(θ∗)) |dil(θ∗)|(cid:107) ∂γil(θ∗) ∂θ (cid:107)2(cid:107) θ0 − θ∗ (cid:107)2 (3.6.16) i,l=1 p(cid:88) p(cid:88) p(cid:88) i,l=1 i,l=1 80 2)M ηn,p (cid:13)(cid:13)(cid:13)Σkj(θ∗) (cid:13)(cid:13)(cid:13) dil(θ∗ (cid:113) F ≤ M ηn,pp = Op( p3ηn,p) Hence |(5)| = OP (n (cid:113) p3η3) = Op((p3/2nηn,p)η2 n,p)) (3.6.17) • If p/n → 0 and ηn,p = Op( 1√ np), (5) is dominated by (I). • If p/n → C with 0 < C ≤ ∞, etan,p = Op( 1 n) and √ p/n → 0, (5) is dominated by (I). For (6), let tij(θ∗) = tr(Σ−1(θ∗)Σi(θ∗)Σ−1(θ∗)Σj(θ∗)), by A6(iii), q(cid:88) k,j=1 (cid:13)(cid:13)(cid:13)(cid:13) ∂tij(θ∗) ∂θ (cid:13)(cid:13)(cid:13)(cid:13) 2 |(6)| ≤ n (cid:107)θ∗ − θ0(cid:107) uθk uθj 2 η2 n,p (3.6.18) = OP (npη3 n,p) While ηn,p = Op( 1√ np) or ηn,p = Op( 1 n), (6) is also dominated by (I). Hence (III) is dominated by (I). This completes the proof. Proof of Theorem 2. We start with W1(ˆδM LE; Θ) = 1 − Φ(Ψ1), where W1(ˆδM LE; Θ) is the conditional misclassification rate defined in 1.3.2 and Ψ1 is defined in 1.3.3. The idea is to prove lim infn,p→0 Ψ1 → (cid:105)p np). Recall that Σ = Σ(θ) =(cid:2)γ(hij; θ)(cid:3)p From Therom 1, we have (cid:107) ˆθ − θ (cid:107)2= Op( 1√ √ C0 2 (cid:104) i,j=1 . . and ˆΣ = Σ(ˆθ) = γ(hij; ˆθ) i,j=1 By A2, we have: |γ(hij; θ) − γ(hij; ˆθ)| ≤ M (cid:107) θ − ˆθ (cid:107)2 max i,j (3.6.19) 81 Thus there exist  > 0 and matrix E =(cid:2)eij (cid:3)p i,j=1 such that ˆΣ = Σ + E (3.6.20) where  = Op( 1√ i.e. (cid:12)(cid:12)eij np) and E is a p × p matrix with absolute values of all entries less than 1, (cid:12)(cid:12) ≤ 1 for any i, j = 1, 2, ..., p. As a result, for large p and n, the inverse of Σ can be written as: ˆΣ−1 = Σ−1 − Σ−1EΣ−1 + O(2)E2 (3.6.21) where E2 is a p × p matrix with all entries less than 1, see [39]. Now we consider the denominator of 1.3.3. We first claim the denominator can be written as: T ˆ∆ ( ˆΣ−1Σ ˆΣ−1) ˆ∆ = ˆ∆ T Σ−1 ˆ∆(1 + op(1)). (3.6.22) Because by 3.6.21, we have ˆΣ−1Σ ˆΣ−1 = (Σ−1 − Σ−1EΣ−1 + O(2)E2)Σ(Σ−1 − Σ−1EΣ−1 + O(2)E2) = Σ−1 − 2A + 2AΣA + O(2)E2 + O(3)EE2Σ−1 + O(4)E2E2 (3.6.23) where A = Σ−1EΣ−1. Also, noticing that  = O( 1√ np), k1 ≤ λmin(Σ) ≤ λmax(Σ) ≤ k2 and λmax(E) ≤ tr(E) ≤ 82 p, we have: T T ˆ∆ ˆ∆ (A) ˆ∆ Σ−1 ˆ∆ = yT Ey yT Σy ≤ λmax(E) λmin(Σ) ≤ p/λmin(Σ) = O( (cid:114) p n ) (3.6.24) where yT = ˆ∆ T Σ−1. Similarly, we have: ˆ∆ T (2AΣA) ˆ∆ T ˆ∆ Σ−1 ˆ∆ ≤ 2 λ2 max(E) λ2 min(Σ) ≤ 2p2/λ2 min(Σ) = O( p n ) (3.6.25) ˆ∆ T (O(2)E2) ˆ∆ T ˆ∆ Σ−1 ˆ∆ ≤ O(2)λmax(E2)λmin(Σ) ≤ O(2)pλmax(Σ) = O( 1 n ) (3.6.26) T ˆ∆ (O(3)EE2Σ−1) ˆ∆ ˆ∆ Σ−1 ˆ∆ T ≤ O(3) λmax(E2)λmax(E)λmax(Σ−1) λmin(Σ) ≤ O(3)p2 λmax(Σ) λmin(Σ) (cid:114) p 1 n ) n = O( (3.6.27) (3.6.28) ≤ O(4)p2λmax(Σ) = Op( 1 n2 ) T ˆ∆ (O(4)E2E2) ˆ∆ ˆ∆ Σ−1 ˆ∆ T Since p n → 0 as n → ∞ and p → ∞, (3.6.22) is derived by combining (3.6.24)- (3.6.28). Now we investigate ˆ∆ T Σ−1 ˆ∆ and claim that: T ˆ∆ Σ−1 ˆ∆ = ∆T Σ−1∆(1 + op(1)) + (cid:80)n1 i=1 Y 1i, which is normally distributed as N (µ1, 1 n1 (1 + op(1)) np n1n2 (3.6.29) Σ). Also, Recall ˆµ1 = ¯Y 1· = 1 n1 (cid:80)n2 ˆµ2 = ¯Y 2· = 1 n2 i=1 Y 2i, which is normally distributed as N (µ2, 1 n2 Σ). Let ˆµ1 = µ1 + ˆ1 83 and ˆµ2 = µ2 + ˆ2 where ˆ1 ∼ N (0, 1 n1 Σ) and ˆ2 ∼ N (0, 1 n2 Σ). 
Then we have: T ˆ∆ Σ−1 ˆ∆ = ∆T Σ−1∆ + 2∆T Σ−1(ˆ1 − ˆ2) + (ˆ1 − ˆ2)T Σ−1(ˆ1 − ˆ2) Noticing ˆ1 − ˆ2 ∼ N (0, n n1n2 Σ), by Chebyshev’s inequality, for any 0 > 0 ∆T Σ−1(ˆ1 − ˆ2) ∆T Σ−1∆ P ( > 0) ≤ E(cid:0)∆T Σ−1(ˆ1 − ˆ2)(cid:1)2 (0∆T Σ−1∆)2 = ≤ n n1n2∆T Σ−1∆ n n1n2Cp = 1 π(1 − π)nCp → 0 It goes to 0 because nCp → ∞. Then (3.6.30) ∆T Σ−1(ˆ1 − ˆ2) = op(∆T Σ−1∆) (cid:113) n1n2 1 2 (ˆ1 − ˆ2). Then ˜ ∼ n Σ Then we consider the third term in 3.6.29. Let ˜ = N (0, Ip×p). Now for any ε0 > 0 P (|(ˆ1 − ˆ2)T Σ−1(ˆ1 − ˆ2) − np/n1n2 np/n1n2 | > 0) p | > 0) = P (|˜T ˜ − p ≤ E(cid:0)˜T ˜(cid:1)2 2 0p2 2 → 0 2 p 1 = as p → ∞. Then (ˆ1 − ˆ2)T Σ−1(ˆ1 − ˆ2) = np n1n2 (1 + op(1)) (3.6.31) 84 Then 3.6.29 followed. Now 3.6.22 and 3.6.29 yield: T ˆ∆ ( ˆΣ−1Σ ˆΣ−1) ˆ∆ = ∆T Σ∆(1 + op(1)) + np n1n2 (1 + op(1)) (3.6.32) Now we consider the nominator of 1.3.3. (cid:17) ˆΣ−1ˆ1 − 2∆T ˆΣ−1ˆ2 (3.6.33) (cid:16) = (µ1 − ˆµ)T ˆΣ ˆ∆ = 1 2 ˆΣ−1ˆ2 − ˆT ∆T ˆΣ−1∆ + ˆT ((1) + (2) − (3) − (4)) 1 2 2 1 (1) = ∆T (Σ−1 − E + O(2)E2)∆ By the assumption that k1 ≤ λmin(Σ) ≤ λmax ≤ k2, λmax(E) ≤ p and λmax(E2) ≤ p, we have and thus ∆T (E)∆ ∆T Σ−1∆ → 0 ∆T (O(2)E2)∆ ∆T Σ−1∆ → 0 (1) = ∆T Σ−1∆(1 + op(1)) (3.6.34) By the same argument, we have (2) = ˆT 2 Σ−1ˆ2(1+op(1)) and (3) = ˆT 1 Σ−1ˆ1(1+op(1)). 85 Since ˆ1 ∼ N (0, 1 n1 Σ) and ˆ1 ∼ N (0, 1 n1 Σ),similar to the proof of (3.6.31), we have: (2) = p n2 (1 + op(1)) and (3) = p n1 (1 + op(1)) (3.6.35) Now we consider term (4) in 3.6.33. (4) =∆T (Σ−1 − Σ−1EΣ−1 + O(2)E2)ˆ2 =∆T Σ−1ˆ2 + ∆T Σ−1EΣ−1ˆ2 + O(2)∆T E2ˆ2 Similar to the proof of 3.6.30, all the three terms in (4) are small order of ∆T Σ−1∆. Thus we have (4) = op(∆T Σ−1∆) (3.6.36) (3.6.37) (cid:17) Now the nominator can be written as: (cid:18) ∆T Σ−1∆(1 + op(1)) + (cid:19) (n1 − n2)(1 + op(1)) p n1n2 (µ1 − ˆµ)T ˆΣ−1 ˆ∆ = 1 2 3.6.32 and 3.6.37 yield W1(ˆδM LE; Θ) = 1 − Φ (cid:16)∆T Σ−1∆(1 + op(1)) + p (cid:113) ∆T Σ−1∆(1 + op(1)) + np n1n2 n1n2 2 (n1 − n2)(1 + op(1)) (1 + op(1)) By the same argument, we have: W2(ˆδM LE; Θ) = Φ (cid:16)−∆T Σ−1∆(1 + op(1)) + p (cid:113) ∆T Σ−1∆(1 + op(1)) + np n1n2 n1n2 2 (n1 − n2)(1 + op(1)) (1 + op(1)) (cid:17) 86 Since p n → 0 and Cp = ∆T Σ−1∆ → C0 with 0 ≤ C0 ≤ ∞, we have W (ˆδM LE; Θ) = 1 2 (cid:16) 1 − Φ Cp(1 + op(1)) + p (cid:113) −Cp(1 + op(1)) + p (cid:113) n1n2 2 + Φ n1n2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) (n1 − n2)(1 + op(1)) 2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) →1 − Φ( √ C0 2 ) (n1 − n2)(1 + op(1))  (cid:17) as p → ∞ and n → ∞. If Cp → C0 < ∞, 1 − Φ( optimal. Now we check the asymptotically optimal when Cp → ∞. From the inequality ) > 0. Thus ˆδM LE is asymptotically √ C0 2 x 1 + x2 e − x2 2 ≤ Φ(−x) ≤ 1 x − x2 2 , x > 0 e (3.6.38) we have xy 1 + x2 e − x2−y2 2 ≤ W (ˆδM LE) ) Φ(− √ Cp 2 ≤ 1 + y2 xy − x2−y2 e 2 (cid:113) Cp(1+op(1))± p n1n2 Cp(1+op(1))+ 2 (n1−n2)(1+op(1)) np n1n2 (1+op(1)) where x = 1 and 1+y2 . It is easy to check that xy xy → 1 as Cp → ∞. Also x2−y2 → 0 if Cp(p/n) → 0. This completes the proof. and y = √ Cp 2 1+x2 → Proof of Theorem 3. The misclassification rate of δ ˆµ is: W (δ ˆµ; Θ) = 1 2(W1(δ ˆµ; Θ) + W2(δ ˆµ; Θ)) 87 where W1(δ ˆµ; Θ) = 1 − Φ(Ψ1) and W2(δ ˆµ; Θ) = Φ(Ψ2) where Ψ1 and Ψ2 is defined by 1.3.3 and 1.3.5 with ˆΣ replaced by Σ. 
We start with (cid:112) (µ1 − ˆµ)T Σ−1( ˆµ1 − ˆµ2) ( ˆµ1 − ˆµ2)T Σ−1( ˆµ1 − ˆµ2) Ψ1 = The denominator is (3.6.29) and it can be represented as: ˆ∆Σ−1 ˆ∆ = ∆T Σ∆(1 + op(1)) + np n1n2 (1 + op(1)) The nominator is: (µ1 − ˆµ)T Σ−1 ˆ∆ = 1 2(∆T Σ−1∆ + ˆT 1 Σ−1ˆ1 − ˆT 2 Σ−1ˆ2 − 2∆T Σ−1ˆ2) By the similar procedure in the proof of (3.6.37), it can be represented by (µ1 − ˆµ)T Σ−1 ˆ∆ = 1 2 (cid:18) ∆T Σ−1∆(1 + op(1)) + (cid:19) (n1 − n2)(1 + op(1)) p n1n2 Thus we have: W1(ˆδ ˆµ; Θ) = 1 − Φ (cid:16)Cp(1 + op(1)) + p (n1 − n2)(1 + op(1)) (cid:17) n1n2 (cid:113) 2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) 88 (3.6.39) Similarly, we have W2(ˆδ ˆµ; Θ) = Φ (cid:16)−Cp(1 + op(1)) + p (n1 − n2)(1 + op(1)) (cid:17) n1n2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) (cid:113) 2 (i) If Cp p/n → ∞. Then (cid:113) ±Cp(1 + op(1)) + p n1n2 (n1 − n2)(1 + op(1)) 2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) p n1n2Cp (n1 − n2)(1 + op(1))) Cp(1 + np (1 + op(1)) n1n2Cp p n1n2Cp (1 + np n1n2Cp (n1 − n2)(1 + op(1))) (1 + op(1)) ±Cp(1 ± (cid:113) ±(cid:112)Cp(1 ± (cid:113) 2 2 = = →±√ C0 2 which yields W (ˆδ ˆµ) → 0 since p → ∞ in probability. that W (ˆδ ˆµ) WOP T n → C with 0 < C < ∞ and Cp → C0 = ∞. Now we show Noticing the fact that we have x 1 + x2 e − x2 2 ≤ Φ(−x) ≤ 1 x − x2 2 , x > 0 e Φ(cid:0)− x WOP T (cid:1) = 2 √ Φ(− Φ(cid:0)− x Cp 2 (cid:1) ≤ 4 + x2 x(cid:112)Cp ) 2 − 1 8 (Cp−x2) e 89 (cid:113) Cp(1+op(1)± p(n1−n2) n1n2 np n1n2 Cp(1+op(1))+ (1+op(1))) (1+op(1)) 4 + x2 x(cid:112)Cp x(cid:112)Cp 4 x(cid:112)Cp + = → a constant where x = because and → 0 1 x(cid:112)Cp  → 1 → a0 if c = ∞, if c < ∞ x(cid:112)Cp = . Also where a0 = √ √ c+1/ √ c+1/(π(1−π)) c → ∞ (3.6.40) Cp − x2 = Thus we have As a result C2 p op(1) + Cp (n±(n1−n2))p n1n2 (1 + op(1)) + p2(n1−n2)2 n2 1n2 2 (1 + op(1)) Cp(1 + op(1)) + np n1n2 Φ(− √ Cp 2 Φ(cid:0)− x 2 (cid:1) → 0 ) W (ˆδ ˆµ) WOP T → ∞ 90 (ii)While Cp p/n → c with 0 < c < ∞ (cid:113) 2 ±Cp(1 + op(1)) + p n1n2 (n1 − n2)(1 + op(1)) Cp(1 + op(1)) + np n1n2 (1 + op(1)) p n1n2Cp (n1 − n2)(1 + op(1))) Cp(1 + np (1 + op(1)) 2 ±Cp(1 ± (cid:113) ±(cid:112)Cp(1 ± (cid:113) ±√ (cid:113) C0(1 ± 1 1 + 1 c 2 2 = = → n1n2Cp p n1n2Cp (1 + np n1n2Cp 2π−1 π(1−π)) c 1 π(1−π) (n1 − n2)(1 + op(1))) (1 + op(1)) Since Cp → C0, and W1(ˆδ ˆµ) → 1 − Φ( W2(ˆδ ˆµ) → 1 − Φ( √ (cid:113) 2π−1 C0(1 + 1 π(1−π)) c 1 + 1 π(1−π) c 1 2 √ (cid:113) 2π−1 C0(1 − 1 π(1−π)) c 1 + 1 π(1−π) 2 c 1 ) ) If p n → C with 0 < C < ∞, then 0 < C0 < ∞. Since Φ(x) is convex function in the sense that 1 2(Φ(x + ) + Φ(x − )) ≤ Φ(x) for any x > 0 and x >  > 0, limP W (ˆδ ˆµ) = limP 1 2(W1(ˆδ ˆµ) + W2(ˆδ ˆµ)) ≥ 1 − Φ( √ (cid:113) 2 C0 1 + 1 π(1−π) c 1 ) > 1 − Φ( √ C0 2 ) where limP means converge in probability with p → ∞ and n → ∞. If p n → ∞, then C0 = ∞. Hence W (ˆδ ˆµ) = 1 2(W1(ˆδ ˆµ) + W2(ˆδ ˆµ)) → 0. By similar argument in (i), we have W (ˆδ ˆµ) WOP T → ∞. 91 (iii)While Cp p/n → 0, (cid:113) 2 ±Cp(1 + op(1)) + p n1n2 (n1 − n2)(1 + op(1)) Cp(1 + op(1)) + np n1n2 (1 + op(1)) =  Which yields W (ˆδ ˆµ) → 1 2. (cid:113) p (cid:114) n(±Cp/( p (Cp/( p 2 n(n1−n2) n) + n1n2 n) + n2 n1n2 (1 + op(1))) (1 + op(1)) → ∞ if n1 > n2, → −∞ if n1 < n2. Proof of Corollary 1. Noticing that when n1 = n2 = n/2, ˆ1 ∼ N (0, 1 n1 N (0, 1 n1 2 ˆi for i = 1, 2. Then ˜i ∼ N (0, Ip) and Σ). 
Let ˜i = niΣ √ 1 Σ) and ˆ2 ∼ E(ˆT 1 Σ−1ˆ1 − ˆT 2 Σ−1ˆ2)2 =E(˜T 1 ˜1 − ˜T 2 ˜2)2/N 2 1 = 1j − ˜2 (˜2 2j)2/n2 1 = 6p/n2 1 p(cid:88) Hence we have: j=1 √ p n ) 1 Σ−1ˆ1 − ˆT ˆT 2 Σ−1ˆ2 = Op( Similar to the proof of (3.6.39) we have: W1(ˆδ ˆµ; Θ) ≤ 1 − Φ (cid:16) Cp(1 + op(1)) + (cid:113) 2 Cp(1 + op(1)) + 4p (cid:17) √ p n (1 + op(1)) n (1 + op(1)) and W2( ˆδ ˆµ; Θ) ≤ Φ (cid:16) −Cp(1 + op(1)) + (cid:113) 2 Cp(1 + op(1)) + 4p √ p n (1 + op(1)) n (1 + op(1)) (cid:17) 92 (cid:113) ±Cp(1 + op(1)) + 2 Cp(1 + op(1)) + 4p √ p n (1 + op(1)) n (1 + op(1)) → ±∞ (cid:113) p (cid:113) ±Cp/ 2 Cp/ p n(1 + op(1)) + 1√ n(1 + op(1)) + 4 + op(1)) (1 + op(1)) n =  (cid:113) 2 → ± c 4 → ± → 0 c √ 4+c/ C if if if if Cp√ p/n Cp√ p/n Cp√ p/n Cp√ p/n → ∞ → c and p/n → ∞ → c and p/n → C < ∞ → 0 W (ˆδ ˆµ) WOP T The proof of the proof. → ∞ is the same as that in the proof of Theorem 3(1). This completes 3.6.2 Proofs for consistency of PMLE Proof of Theorem 4. In the algorithm, we estimate β(0) first. Then θ(0) is estimated by fixing β = ˆβ (0) . Then update β = ˆβ (1) by fixing θ = ˆθ (0) . θ = ˆθ (1) is updated in last step by fixing β = β(1). So the idea is to prove the theorem in the following sequence: (a) The consistency and sparsity of β(0); (b) The consistency of θ(0); (c) The consistency and sparsity of β(1); (d) The consistency of θ(1). (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13) 2 = Op((cid:112) s (a) We first prove (0) − β0 n) and ˆβ (0) 2 = 0 with probability tending to 1, where ˆβ (0) 2 is the p − s dimension sub-vector of ˆβ (0) = ( ˆβ (0)T 1 ˆβ (0)T 2 )T . The proof of (a) is the same as the proof of (c), except the loss function if defined as R(β) which is negative of the penalized MLE function (3.3.1) with covariance matrix ˙Σ replaced by diagn−1(Ip). Then the parameters are estimated by minimize the loss function. We 93 omit the proof here and illustrate the details in (c). (b) Second we prove (0) − θ0 = Op( (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13) 2 np). Write (cid:113) 1 (cid:12)(cid:12)(cid:12) ˙Σ(θ) (cid:12)(cid:12)(cid:12) − 1 F (θ, ˆβ (0) ; Z) = −np 2 log(2π) − 1 2 log 2(Z − X ˆβ (0) )T ˙Σ−1(θ)(Z − X ˆβ (0) It is sufficient to prove for any given  > 0 the smallest convergence rate of ηn,p is such that we have ) (cid:113) 1 np F (θ0 + uηn,p, ˆβ (0) ; Z) < F (θ0, ˆβ (0) ; Z)) > 1 −  P ( sup (cid:107)u(cid:107)=C This implies there exists a local maximum for the function Q(θ, ˆβ (0) ; Z) of θ in the neighborhood of θ0 with the radius at most proportional to ηn,p. ∂ ˙Σ(θ∗) ∂θj ˙Σ(θ0 + uηn,p) − ˙Σ(θ0) = (cid:80)q between θ0 + uηn,p and θ0. Denote ˙Σj(θ∗) = ∂ ˙Σ(θ∗) By Taylor’s expansion, , then j=1 ∂θj ηn,p, where θ∗ is uθj (0) ; Z) − F (θ0, ˆβ F (θ0 + uηn,p, ˆβ =(cid:2)F (θ0 + uηn,p, β0) − F (θ0, β0)(cid:3) − (0) ; Z) q(cid:88) j=1 (Z − Xβ0)T ˙Σj(θ∗)X( ˆβ (0) − β0)uθj ηn,p − 1 2 (0) − β0)T XT ˙Σj(θ∗)X( ˆβ ( ˆβ (0) − β0)uθj ηn,p q(cid:88) j=1 =(I) + (II) + (III) 94 where uT T uη2 n,p + (θ0) uηn,p + uT ( 1 2 ∂2F ∂θ∂θT (θ∗) + (n − 1)T )uη2 n,p (I) =F (θ0 + uηn,p, β0) − F (θ0, β0) (cid:19)T (cid:18) ∂F ∂θ 2 = − n − 1 q(cid:88) =(1) + (2) + (3) (II) = j=1 (III) = − 1 2 ( ˆβ (0) − β0)T XT ˙Σj(θ∗)(Z − Xβ0)ujηn,p q(cid:88) (0) − β0)T XT ˙Σj(θ∗)X( ˆβ ( ˆβ (0) − β0)ujηn,p j=1 We consider (I) first. T in (1) is q × q matrix with its (i, j)th element as tij(θ0), where tij(θ) = tr(Σ−1(θ)Σi(θ)Σ−1(θ)Σj(θ)). 
By the similar argument in proving the bound of (I) in theorem 1, there exist a constant K, such that (1) = − n i,j=1 tij(θ0)uiujη2 −Knpη2 with probability tending to 1. In regarding to (2), (cid:80)q 2 n,p ≤ n,p (cid:107)u(cid:107)2 2 ∂F ∂θj (θ0) = n − 1 2 tr(Σj(θ0)Σ(θ0)) − 1 2(Z − Xβ)T ˙Σj(θ0)(Z − Xβ) Notice that Z − Xβ ∼ N (0, ˙Σ(θ)) and tr( ˙Σj ˙Σ) = (n − 1)tr(ΣjΣ). By lemma 3.6.2, (2) =Op( tr( ˙Σ ˙Σj ˙Σ ˙Σj)ηn,p (cid:107)u(cid:107) (n − 1)tr(ΣΣjΣΣj)ηn,p (cid:107)u(cid:107) ) 2 (cid:113) (cid:114) (n − 1)(cid:13)(cid:13)Σj(cid:13)(cid:13)2 =Op( (cid:113) 2 ) = Op( )(cid:107)u(cid:107) 2 ) 2 ηn,p (cid:107)u(cid:107) F √ By A3, A6(i), (2) = Op( npηn,p). 95 Then we consider (3). For any j, k = 1, 2, ..., q ∂2(F ) ∂θj∂θk (θ∗) = n − 1 2 (tr(Σjk(θ∗)Σ(θ∗) − tjk(θ∗)) − 1 2(Z − Xβ)T ˙Σjk(θ∗)(Z − Xβ) Thus (3) could be written as: j,k=1 q(cid:88) q(cid:88) q(cid:88) + + j,k=1 j,k=1 n − 1 2 n − 1 2 n − 1 2 =(i) + (ii) + (iii) (3) = (tr(Σjk(θ∗)Σ(θ0)) − 1 2(Z − Xβ)T Σjk(θ∗)(Z − Xβ))uθj η2 n,p uθk (tr(Σjk(θ∗)Σ(θ∗)) − tr(Σjk(θ∗)Σ(θ0)))uθj η2 n,p uθk (tjk(θ0) − tjk(θ∗))uθj η2 n,p uθk By lemma 3.6.2 and A6(ii ), (i) = Op( (n − 1)tr(ΣΣjkΣΣjk)η2 √ n,p) = Op( npη2 n,p). Similar to the deriving the order of (5) and (6) in the proof of Theorem1, (ii) = n,p). By choosing large C = (cid:107)u(cid:107) , the minimal rate 2 (cid:112) np). p3η3 n,p) and (iii) = Op(npη3 Op(n that (2) and (3) are dominated by (1) is ηn,p = Op( 1√ Now we consider (II). Denote B = (cid:80)n−1 (cid:113) (0) − β0)T XT ˙Σj(θ∗)(Z − Xβ0) = Op( 1, 2, ..., q, ( ˆβ =Op( (cid:118)(cid:117)(cid:117)(cid:116)tr( (cid:13)(cid:13)(cid:13)(cid:13) ˆβ n−1(cid:88) j Σj∗Σ∗Σj∗Xi) + tr(BT Σj∗Σ∗Σj∗B) XT where Σ∗ = Σ(θ∗) and Σj∗ = Σj(θ∗). Notice that(cid:80)n−1 i=1 (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (0) − β0 (cid:13)(cid:13)(cid:13)(cid:13) 2 ) tr(XT ˙Σj(θ∗) ˙Σ(θ∗) ˙Σj(θ∗)X) (cid:13)(cid:13)(cid:13)(cid:13) ) 2 (0) − β0 i=1 XT i Xi = ( n − n2 n1n2 1 n2 )Ip×p i=1 Xi. Then by lemma 3.6.1, for any j = (cid:113) 96 i=1 Xi)T ((cid:80)n−1 and ((cid:80)n−1 n−1(cid:88) tr( i=1 i=1 Xi) = n2 n2 Ip×p. Then by A6(i) 1 j Σj∗Σ∗Σj∗Xi) = ( XT n1n2 n − n2 1 n2 )tr(Σj∗Σ∗Σj∗) ≤ λmax(Σ) n1n2 n Similarly tr(BT Σj∗Σ∗Σj∗B) = Op(np). (cid:13)(cid:13)(cid:13)β(0) − β0 (cid:13)(cid:13)(cid:13) 2 = Op((cid:112) s Since √ n), (II) = Op( psηn,p) For (III), by A6(i), for any j = 1, 2, ..., q we also have: ( ˆβ =( ˆβ (0) − β0) (0) − β0)T XT ˙Σj(θ∗)X( ˆβ (cid:33) (0) − β0)T XT diagn−1Σj(θ∗)( ˜In−1,p + ˜Jn−1,p)X( ˆβ (0) − β0) n−1(cid:88) (cid:32)n−1(cid:88) n−1(cid:88) XT i Xi + ( Xi)T ( Xi) ( ˆβ ≤λmax(Σj∗)(( ˆβ (0) − β0)T =Op(n (0) − β0 i=1 i=1 i=1 ) = Op(s) (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13)2 2 (cid:13)(cid:13)(cid:13)Σj∗(cid:13)(cid:13)(cid:13)2 F = Op(np). (0) − β0)) (cid:113) 1 Thus (III) = Op(sηn,p). Both (II) and (III) are dominated by (I) while ηn,p = Op( np). This concludes the proof of (0) − θ0 = Op( (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13) 2 (cid:13)(cid:13)(cid:13)(cid:13) 2 (cid:13)(cid:13)(cid:13)(cid:13) 2 np) (cid:113) 1 = Op((cid:112) s (cid:113) 1 (c) Write ˆβ (1) = ( ˆβ (1) 1 , ˆβ (2) 2 )T . Then we prove (1) − β0 n) and ˆβ (1) 2 = 0 with probability tending to 1, where ˆβ a p− s subvector of ˆβ . Let ηn,p = Op( (1) (0) − θ0 1 (1) formed by elements in supp(β0) and ˆβ (1) 2 is ) = Op( np). We use two steps to prove the consistency and sparsity. step 1 We first prove consistency on s-dimensional space. 
Define the loglikelihood 97 function for β1 as (0) ¯Q(ˆθ , β1) = − np 2 − n n(cid:88) log(2π) − 1 Pλ((cid:12)(cid:12)βj (cid:12)(cid:12)) 2 log j=1 (cid:12)(cid:12)(cid:12)(cid:12) ˙Σ(ˆθ (0) ) (cid:12)(cid:12)(cid:12)(cid:12) − 1 2(Z − X1β1)T ˙Σ−1(ˆθ (0) )(Z − X1β1) where β1 is subvector of β0 = (β1T , β2T )T formed by elements in supp β0. We first n). It is sufficient to prove that for any  > 0, the smallest rate (cid:13)(cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13) ˆβ of ξn,p is(cid:112) s 1 − β1 (1) = Op((cid:112) s 2 n such that we have: (0) ¯Q(ˆθ , β1 + uξn,p) < ¯Q(ˆθ (0) , β1)) > 1 −  P ( sup (cid:107)u(cid:107) 2 =C where u ∈ Rs. This implies that with probability tending to 1, there is a local maxi- mizer ˆβ (1) 1 of the function ¯Q in the neighborhood of β1 0 with the radius of β1 0 at most proportional to ξn,p. (0) , β1 0 + uξn,p) − ¯Q(ˆθ (0) , β1 0) n,p − (Z − X1β1 (0) )X1uξ2 ¯Q(ˆθ = − 1 2 uT X1T ˙Σ−1(ˆθ m(cid:88) − np (cid:0)p (cid:48) λn,p j=1 (|β0j|)sgn(βj)uβj ξn,p + p (0) 0) ˙Σ−1(ˆθ (|β0j|)u2 βj (cid:48)(cid:48) λn,p )X1uξn,p n)(1 + o(1))(cid:1) ξ2 =(I) + (II) + (III) By Taylor’s expansion, ˙Σ−1(ˆθ (0) ) = ˙Σ−1(θ0)+(cid:80)q j=1 ˙Σj(θ∗), where θ∗ is a q dimension 98 vector between θ0 and ˆθ (0) . Therefore, (I) = − 1 2 uT X1T ˙Σ−1(θ0)X1uξ2 n,p − 1 2 =(1) + (2) and (II) = − (Z − X1β1 0) ˙Σ−1(ˆθ (0) )X1uξn,p + =(3) + (4) q(cid:88) j=1 uT X1T ˙Σj(θ∗)X1uξ2 n,pujηn,p q(cid:88) (Z − X1β1 0) ˙Σj(θ∗)X1uξn,pujηn,p j=1 Noticing λmin(Σ−1) > 0,(cid:80)n−1 (cid:32)n−1(cid:88) i=1 X1T i X1 1 n2 )Is and i = ( (cid:33)(cid:32)n−1(cid:88) n − n2 n1n2 (cid:33) X1 i=1 = X1T i=1 n2 1 n2 Is i=1 i=1 we have (1) = 1 2 uT X1T diagn−1(Σ−1)( ˜In−1,p + ˜Jn−1,p)X1uξ2 = − 1 i Σ−1X1 X1T i uξ2 i Σ−1 X1T 2 uT 2 uT n,p n−1(cid:88) n−1(cid:88) i=1 X1 i (cid:33) n,p − 1 n−1(cid:88) X1T i n−1(cid:88) (cid:32)n−1(cid:88) i=1 n−1(cid:88) i=1 X1 i uξ2 n,p n,pλmin(Σ−1) uξ2 ≤ − 1 2 uT X1T i X1 i + i=1 i=1 i=1 2 = − 1 ≤ − 1 2 2 (cid:107)u(cid:107)2 n1n2 n π(1 − π) λmax(Σ) nξ2 n,pλmin(Σ−1) ξ2 n,p (cid:107)u(cid:107)2 2 99 By similar argument and A6(i), (2) = Op(nξ2 n,pηn,p) = op((1)) while ηn,p = Op( np). (cid:113) 1 For (II), by lemma 3.6.1 and A3, (cid:113) (cid:114) (3) =Op( tr(X1 ˙Σ−1(θ0)X1)(cid:107)u(cid:107) 2 ξn,p) n1n2 n tr(Is×s)(cid:107)u(cid:107) ξn,p) 2 =Op( λmax(Σ) √ ns(cid:107)u(cid:107) 2 =Op( ξn,p)) √ Similarly, (4) = Op( ns(cid:107)u(cid:107) 2 √ ξn,pηn,p) = op((3)). So we have (II) = Op( ns(cid:107)u(cid:107) ξn,p)). 2 (III) = (5) + (6), where (5) = −n (6) = −n s(cid:88) s(cid:88) j=1 j=1 (cid:48) λn,p p (cid:48)(cid:48) λn,p p (|β0j|)sgn(βj)uβj ξn,p (|β0j|)u2 βj ξ2 n)(1 + o(1)) (3.6.41) Since an,p = Op( 1√ n ) by A7, √ |(5)| ≤ n san,p (cid:107)u(cid:107) 2 √ = Op( nsξn,p (cid:107)u(cid:107) 2 ) (3.6.42) By A8 |(6)| ≤ 2nξ2 n,p s(cid:88) j=1 (cid:48)(cid:48) p (β0j)u2 βj = op(nξ2 n,p) 100 ≤ 2nξ2 n,pbn,p (cid:107)u(cid:107)2 2 (3.6.43) By choosing large C = (cid:107)u(cid:107) by (I) is ξn,p = Op((cid:112) s 2 n). This completes the proof that , the smallest rate of ξnp that (II) and (III) are dominated step 2. in step 2 we prove that the vector ˆβ = ( ˆβ Rd. It is sufficient to prove for any given β ∈ Rd satisfying (cid:107)β − β0(cid:107) have Q(βs, ˆθ ) ≥ Q(β, ˆθ ), where β = (β1T , β2T )T and βs = (β1T , 0T )T . (0) (0) 2 (1) 1 , 0) is a strict local maximizer on n), we (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (1) 1 − β1 0 (cid:13)(cid:13)(cid:13)(cid:13) 2 n). 
= Op((cid:112) s = Op((cid:112) s Let  = C(cid:112) s n, it is sufficient to prove for j = s + 1, s + 2, ..., p: (0) (0) ) ) ∂Q(β, ˆθ ∂βj ∂Q(β, ˆθ ∂βj < 0 f or 0 < βj <  (3.6.44) > 0 f or −  < βj < 0 ˙Σ−1(ˆθ (0) )Xj(βl − β0l) XT l (3.6.45) (0) ) ∂Q(β, ˆθ ∂βj =(Z − Xβ0)T ˙Σ−1(ˆθ (0) )Xj + p(cid:88) l=1 − nP(cid:48) λ(|βj|)sgn(βj) =(I) + (II) + (III) where Xj is the jth column of X. (I) =(Z − Xβ0)T ˙Σ−1(θ0)Xj + (Z − Xβ0)T ˙Σk(θ∗)Xjukηn,p We first consider (I). By Taylor’s expansion, q(cid:88) =(5) + (6) k=1 101 Notice(cid:80)n−1 i=1 XT ijXij = n − n2 n1n2 1 n2 and ((cid:80)n−1 i=1 Xij)T ((cid:80)n−1 i=1 Xij) = n2 1 n2 . (cid:113) (5) =(Z − Xβ0)T diagn−1(Σ−1)( ˜In−1,p + ˜Jn−1,p)Xj = Op( ˙Σ−1Xj) Xj (cid:118)(cid:117)(cid:117)(cid:116)n−1(cid:88) n−1(cid:88) n−1(cid:88) =Op( ijΣ−1Xij + XT ijΣ−1 XT Xij) i=1 i=1 i=1 Noticing(cid:80)n−1 where Xij is the jth column of Xi. n − n2 n1n2 i=1 XT ijXij = n2 , ((cid:80)n−1 i=1 Xij)T ((cid:80)n−1 1 √ ∞, we have (5) = Op( √ n). Similarly, (6) = Op( i=1 Xij) = n2 1 n2 and λmax(Σ−1) ≤ nηn,p) = op((3)), which is dominated by (5) if ηn,p = o(1). For (II), by Taylor’s expansion, (II) = ˙Σ−1(θ0)Xj(βl − β0l) + XT l q(cid:88) p(cid:88) k=1 l=1 ˙Σk(θ∗)Xj(βl − β0l)(θ∗ k − θ0k) XT l p(cid:88) l=1 =(7) + (8) p(cid:88) p(cid:88) l=1 (7) = = (cid:32)n−1(cid:88) l diagn−1(Σ−1)( ˜In−1,p + ˜Jn−1,p)Xj(βl − β0l) XT n−1(cid:88) (cid:33) n−1(cid:88) il Σ−1Xij + XT il Σ−1 XT Xij (βl − β0l) Notice l=1 i=1 i=1 i=1 il Σ−1Xij ≤ n−1(cid:88) XT i=1 (XT il Σ−1Xil)) 1 2 (XT ijΣ−1Xij)) 1 2 λmax(Σ−1)(XT il Xil)) 1 2 (XT ijXij)) 1 2 n−1(cid:88) ≤ n−1(cid:88) i=1 i=1 102 =λmax(Σ−1)( − n2 1 n1n2 n n2 ) = Op(n). i=1 Xil. Then BT l Bl = (cid:16) n1 n (cid:17)2 . Xij = BT l Σ−1Bj ≤ (BT l Σ−1Bl)1/2(BT j Σ−1Bj)1/2 ≤ λmax(Σ−1) n2 1 n2 Also, let Bl =(cid:80)n−1 n−1(cid:88) n−1(cid:88) il Σ−1 XT i=1 i=1 Then √ Similarly, (8) = Op( (7) = Op(n(cid:107)β − β0(cid:107) 2 √ ) = OP ( ns) nsηn,p) Thus (cid:16) ∂Q(β) ∂βj = nλn,p √ s√ nλn,p OP ( )) + P (cid:48) λn,p (|βj|) λn,p (3.6.46) (3.6.47) (cid:17) sgn(βj) By assumption 9 and 10, the sign of 3.6.47 is determined by βj, hence 3.6.44 followed. This implies ˆβ (1) should satisfy sparse property and completes the proof of step 2. (d) Lastly we prove (1) − θ0 = Op( np). Since (1) − β0 n), it is the (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13) 2 (cid:113) 1 (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13) 2 = Op((cid:112) s same as the proof of (b). We omit the detail here and this completes the proof. 3.6.3 Proofs for consistency for PMLE with tappering Lemma 3.6.3. Assume A1, A2, A11 and A12 hold, we have: (1) (cid:107)Σ(θ) − Σ(θ)T(cid:107)∞ = O( 1 pδ0 ); 103 ); (2) (cid:13)(cid:13)Σk(θ) − Σk,T (θ)(cid:13)(cid:13)∞ = O( 1 (3) (cid:13)(cid:13)Σjk(θ) − Σjk,T (θ)(cid:13)(cid:13)∞ = O( 1 (cid:12)(cid:12)aij (cid:80)p The matrix norm (cid:107)·(cid:107)∞ for the p× p matrix A = [aij]p sumation, i.e. (cid:107)A(cid:107)∞ = maxi (cid:12)(cid:12) pδ0 pδ0 j=1 ). i,j=1 is defined as the maximum of row Proof. We show (1) in detail and omit the details for (2) and (3), as similar arguments can be applied. p(cid:88) (cid:12)(cid:12)γ(hij; θ)KT (hij, wp) − γ(hij; θ)(cid:12)(cid:12) (3.6.48) is the distance between site si and sj. 
For any i = 1, 2, ..., p, (cid:107)Σ(θ) − ΣT (θ)(cid:107)∞ = max i 2 j=1 (cid:13)(cid:13) where hij =(cid:13)(cid:13)si − sj p(cid:88) (cid:12)(cid:12)γ(hij; θ)KT (hij, wp) − γ(hij; θ)(cid:12)(cid:12) ≤ (cid:88) (cid:12)(cid:12)γ(hij; θ)KT (hij, wp) − γ(hij; θ)(cid:12)(cid:12) + j=1 hij wp} and Bi of n and p. Then Ai ⊂ (cid:83)∞ m = {j : (m − 1)δ ≤ hij < mδ}, where ∆ is independent ∆ (cid:99) Bi m. Let V (R) be the volume of a d-dimensional baa m = V (mδ) − V ((m − 1)δ) = fd−1(m)δd, where of radius R. Then the volume of Bi fd−1(m) is a polynomial function of m with degree of d − 1. By A1, the number of sites in any unit subset of D ⊂ Rd is bounded, say ρ. Let #{A} denote the cardinality of a discrete set A. Then we have #{Bi m} ≤ fd−1(m)δdρ. Then exist a constant K such that m is Bi 104 (3.6.50) (3.6.51) fd−1(m) ≤ Kmd−1 Then (cid:88) (II) = Let Ai 2 = {j : hij ≤ wp}. Then Ai (I) = ∞(cid:88) (cid:12)(cid:12)γ(θ, hij)(cid:12)(cid:12) ≤ ∞(cid:88) (cid:90) ∞ m=(cid:98) wp δ (cid:99) (cid:99) m=(cid:98) wp δ md−1δd max j∈Bi m (cid:12)(cid:12)γ(θ, hij)(cid:12)(cid:12) j∈Bi m (cid:88) (cid:12)(cid:12)γ(θ, hij)(cid:12)(cid:12) (cid:90) ∞ hij≥wp ≤ Kρ ≤ Kρ xd−1 |γ(θ, x)| dx ≤ Kρ wp 0 wp xd |γ(θ, x)| dx (cid:88) hij 0, limp→∞ λmax(ΣT ) < ∞; 105 (b) There exists an open subset ω that contains the true parameter θ0 such that for all θ∗ ∈ ω, we have: (i) −∞ < limp→∞ λmin(Σk (ii) −∞ < limp→∞ λmin(Σkj T (θ∗)) < limp→∞ λmax(Σk T (θ∗)) < limp→∞ λmax(Σkj T (θ∗)) < ∞; T (θ∗)) < ∞; (cid:12)(cid:12)(cid:12)(cid:12) ∂tij,T (θ∗) ∂θk (cid:12)(cid:12)(cid:12)(cid:12) = O(p) for all k = 1, 2, ..., q. (iii) Proof. (a) Let KT = [K(hij, w)]p i,j=1 be the tappering covariance. By eigenvalue inequal- ities of Schur product: min 1≤i≤p aiiλmin(Σ) ≤ λ(Σ ◦ KT ) ≤ max 1≤i≤p aiiλmax(Σ) (3.6.52) where aij are the (i, j)th entry of matrix KT . By A3, λmin(ΣT ) > 0 and limp→∞ λmax(ΣT ) < ∞ (b) Since Σk T = Σk ◦ KT and Σkj T = Σkj ◦ KT [d](i) and [d](ii) hold by A 6(ii). For [2](iii), since tij,T (θ) = tr(Σ−1 T Σi,T Σ−1 T Σj,T ) ∂tij,T (θ) ∂θl =tr(Σ−1 T Σl,T Σ−1 T Σi,T Σ−1 T Σj,T ) + tr(Σ−1 + tr(Σ−1 T Σi,T Σ−1 T Σl,T Σ−1 T Σj,T ) + tr(Σ−1 T Σjl,T ) T Σj,T ) T Σil,T Σ−1 T Σi,T Σ−1 =(1) + (2) + (3) + (4) Then (1) can be written as: (3.6.53) (3.6.54) tr(Σ−1 =tr((Σ−1 T Σl,T Σ−1 T Σi,T Σ−1 T − Σ−1)Σl,T Σ−1 T Σj,T ) T Σi,T Σ−1 T Σj,T ) + tr(Σ−1(Σl,T − Σl)Σ−1 T Σi,T Σ−1 T Σj,T )+ 106 T − Σ−1)Σi,T Σ−1 T Σj,T ) + tr(Σ−1ΣlΣ−1(Σi,T − Σi)Σ−1 T Σj,T )+ T − Σ−1)Σj,T ) + tr(Σ−1ΣlΣ−1ΣiΣ−1(Σj,T − Σj))+ tr(Σ−1Σl(Σ−1 tr(Σ−1ΣlΣ−1Σi(Σ−1 tr(Σ−1ΣlΣ−1ΣiΣ−1Σj) Define (cid:107)·(cid:107)s for matrix A by (cid:107)A(cid:107)s = maxi{|λi(A)| , i = 1, 2, ..., p}, where λi(A) is the (cid:13)(cid:13)(cid:13)Σ−1 T (cid:13)(cid:13)(cid:13)s . ith eigenvalue of matrix A. Notice (cid:13)(cid:13)s (cid:107)Σ − ΣT(cid:107)s (cid:13)(cid:13)(cid:13)Σ−1 T − Σ−1(cid:13)(cid:13)(cid:13)s ≤(cid:13)(cid:13)(cid:13)Σ−1(cid:13)(cid:13)(cid:13)s Since λmin(Σ−1) = 1/λmax(Σ) > 0,(cid:13)(cid:13)Σ−1(cid:13)(cid:13)s T − Σ−1(cid:13)(cid:13)(cid:13)s (cid:13)(cid:13)(cid:13)Σ−1 (cid:13)(cid:13)Σj,T (cid:12)(cid:12)(cid:12) (cid:13)(cid:13)(cid:13)s (cid:13)(cid:13)Σi,T (cid:13)(cid:13)s T − Σ−1)Σl,T Σ−1 T Σi,T Σ−1 T − Σ−1)Σl,T Σ−1 T Σi,T Σ−1 (cid:13)(cid:13)Σl,T T − Σ−1) (cid:12)(cid:12)(cid:12)tr((Σ−1 (cid:13)(cid:13)(cid:13)((Σ−1 (cid:13)(cid:13)(cid:13)(Σ−1 ≤ p < ∞ for all j = 1, 2, ..., q. Hence (cid:13)(cid:13)(cid:13)Σ−1 T T Σj,T ) T Σj,T ) (cid:13)(cid:13)(cid:13)s (cid:13)(cid:13)s (cid:13)(cid:13)(cid:13)2 s ≤ p = O(p/pδ0) = O(p1−δ0) ≤ λmax(Σ−1) < ∞. Also = Op(p−δ0). 
Then (cid:13)(cid:13)(cid:13)Σ−1 T (cid:13)(cid:13)(cid:13)s < ∞, (3.6.55) (cid:13)(cid:13)Σj,T (cid:13)(cid:13)s By similar argument, the first six terms in (1) all have the order of O(p1−δ0). Apply the same argument on (2) − (4) we have: ∂tij,T (θ) ∂θl = ∂tij(θ) ∂θl + O(p1−δ0) (3.6.56) By A6(iii), ∂tij,T (θ) ∂θl = Op(p). This completes the proof. 107 Proof of Theorem 5. From lemma 4, all regularity conditions for ΣT are satisfied. The proof of 5 is similar to that of Theorem 4. By replacing Σ by ΣT and replacing A3-A6(iii) by the results in lemma 4, the results in Theorem 5 follows. 3.6.4 Proofs for classification using PLDA Lemma 3.6.5. Let ˆθ be the estimate of θ0 and = Op( 1√ np). Define (cid:13)(cid:13)(cid:13)ˆθ − ˆθ0 (cid:13)(cid:13)(cid:13) 2 ˜Σ = ΣT (ˆθ) = Σ(ˆθ) ◦ K(w) where K(w) is defined in section 3.3.1.2. Assume A1, A2 and A11 and A12 hold,then (cid:13)(cid:13)(cid:13) ˜Σ − Σ (cid:13)(cid:13)(cid:13) 2 (cid:13)(cid:13)(cid:13) ˜Σ−1 − Σ−1(cid:13)(cid:13)(cid:13) = Op(cn) and = Op(cn) 2 where cn,p = max( wd√ np , 1 w ) Proof. (cid:13)(cid:13)(cid:13) ˜Σ − Σ (cid:13)(cid:13)(cid:13) 2 = (cid:13)(cid:13)(cid:13)Σ(ˆθ) ◦ K(w) − Σ(θ0) (cid:13)(cid:13)(cid:13) p(cid:88) (cid:12)(cid:12)(cid:12)r(ˆθ; hij)KT (hij, w) − r(θ0; hij) 2 (cid:12)(cid:12)(cid:12) ≤ max i j=1 where KT (h, w) = [(1 − h/w)+]2. For any i = 1, 2, ..., p, p(cid:88) ≤ (cid:88) j=1 (cid:12)(cid:12)(cid:12)r(ˆθ; hij)KT (hij, w) − r(θ0; hij) (cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)r(ˆθ; hij)KT (hij, w) − r(θ0; hij) (cid:12)(cid:12)(cid:12) + hij 0, (cid:90)  (cid:90)  hd−1γ(h)dh = O( 0 0 hd−1dh) = O(d/d) → 0 as  → 0 (3.6.71) Also, since Kν(h) ∝ e−hh− 1 2 (1 + O( 1 h)) as h → ∞, there exist a constant K, for any C sufficiently large, we have: (cid:90) ∞ hd−1γ(h)dh ≤ K (cid:90) ∞ 3.6.71 and 3.6.72 lead to (cid:82) ∞ C 0 hd−1+v− 1 2 e−hdh = Γ(d + v − 1 2) < ∞ (3.6.72) 0 hd−1γ(h)dh < ∞. Let p → ∞ in 3.6.70, we have lim supp→∞λmax(Σ) < ∞ if Σ is derived from M at´ern covariance function. Now consider the second part of A3. A1 assumes increasing domain framework. Bachoc and Furrer (2016) [4] showed that under A1 and some weak assumptions on the matrix covariance function, if the spectral density of the covariance function is positive, the smallest eigenvalue of the covariance matrix is asymptotically bounded away from zero. 117 Most standard covariance function such as M at´ern covariance function satisfy those assumptions hence satisfy the second part of A3. (cid:80)p Remarks on A4 and A5 : A4 and A5 are the same as the assumptions in [37]. (cid:107)Σk(cid:107) = F i,j=1 γ2 k(hij; θ), where γk(hij; θ) = ∂γ(hij ;θ) ∂θk , k = 1, 2, ..., q and θ is a k dimensional parameter. We now verify that M at´ern covariance function satisfy A4 for fixed ν. For M at´ern covariance function with fixed ν, we have 21−ν Γ(ν) = −σ2 21−ν Γ(ν) = σ2(1 − c) ∂γ(h) ∂σ2 = ∂γ(h) ∂c ∂γ(h) ∂(1/r) (h/r)νKν(h/r)(1 − c) (3.6.73) (h/r)νKν(h/r) 21−ν Γ(ν) h(h/r)ν(2 ν h/r Kν(h/r) − Kν−1(h/r)) It is easy to show that for each k, exist a constant  > 0 independent of n, p, for each i, there’s j such that γk(hij) > c. As a result, (cid:107)Σk(cid:107) Therefore (cid:107)Σk(cid:107)−1 = Op(p−1). i=1  = p. i,j=1 γ2 k(hij; θ) ≥(cid:80)p =(cid:80)p F F Remarks on A6 : We can also verify that the M at´ern covariance function with fixed ν satisfy A6(i) and A6(ii ). 
Similar to the procedure in the remarks on A3, it is sufficient to verify for any θ ∈ Θ, γk(h; θ) and γkj(h; θ) belong to the function space: (cid:90) ∞ (cid:61) = {f (x) : f (x)xd−1dx < ∞} 0 where d ≥ 1 is the dimension of the domain. We have the first-order partial derivative of M at´ern covariance function in (3.6.73). The second-order partial derivative of M at´ern 118 covariance function is as follows: (3.6.74) ∂2γ(h) (∂2σ2) ∂2γ(h) ∂σ2∂c ∂2γ(h) = 0 = −21−ν Γ(ν) = (1 − c) ∂σ2∂(1/r) ∂2γ(h) ∂c2 = 0 ∂2γ(h) ∂c∂(1/r) ∂2γ(h) ∂2(1/r) (h/r)νKν(h/r) 21−ν Γ(ν) h(h/r)ν(2 ν h/r Kν(h/r) − Kν−1(h/r)) Kν(h/r) − Kν−1(h/r)) ν h/r = −σ2 21−ν Γ(ν) = σ2(1 − c) − (4ν + 1)(h/r)ν−1h2Kν+1(h/r) − (h/r)νh2Kν+2(h/r)] [(h/r)ν−2h2Kν(h/r)(2ν − 1)2ν h(h/r)ν(2 21−ν Γ(ν) Note that the covariance function and its first-order and second-order partial deriva- to proving(cid:82) ∞ tives are linear combinations of a Bessel function of h times a polynomial of h. Similar 0 hd−1γ(h)dh < ∞ in (3.6.70), we have γk(h; θ) ∈ (cid:61) and γkj(h; θ) ∈ (cid:61). Hence A6(i) and A6(ii ) are satisfied. By similar procedure, we can verify that M at´ern covariance function also satisfy A12 and A13. 119 BIBLIOGRAPHY 120 BIBLIOGRAPHY [1] Stanis(cid:32)law Adaszewski, Juergen Dukart, Ferath Kherif, Richard Frackowiak, Bogdan Draganski, et al. How early can we predict Alzheimer’s disease using computational anatomy? Neurobiology of aging, 34(12):2815–2826, 2013. [2] Yaman Aksu, David J Miller, George Kesidis, Don C Bigler, and Qing X Yang. An MRI-derived definition of MCI-to-AD conversion for long-term, automatic prognosis of MCI patients. PLoS One, 6(10):e25074, 2011. [3] Juan Eloy Arco, Javier Ram´ırez, Juan Manuel G´orriz, Carlos G Puntonet, and Mar´ıa Ruz. Short-term prediction of MCI to AD conversion based on longitudinal MRI analysis and neuropsychological tests. In Innovation in Medicine and Healthcare 2015, pages 385–394. Springer, 2016. [4] Fran¸cois Bachoc and Reinhard Furrer. On the smallest eigenvalues of covariance matri- ces of multivariate spatial processes. Stat, 5(1):102–107, 2016. [5] Peter J Bickel and Elizaveta Levina. Some theory for Fisher’s linear discriminant func- tion,’naive bayes’, and some alternatives when there are many more variables than observations. Bernoulli, pages 989–1010, 2004. [6] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, 2008. [7] Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227, 2008. [8] T. Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011. [9] T. Tony Cai and Weidong Liu. A direct estimation approach to sparse linear discrim- inant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011. [10] T. Tony Cai, Zhao Ren, Harrison H Zhou, et al. Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics, 10(1):1–59, 2016. [11] T. Tony Cai and Anru Zhang. Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. Journal of multivariate analysis, 150:55–74, 2016. [12] T. Tony Cai, Cun-Hui Zhang, Harrison H Zhou, et al. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118–2144, 2010. 121 [13] T. Tony Cai and Linjun Zhang. 
High-dimensional linear discriminant analysis: Opti- mality, adaptive algorithm, and missing data1. Technical report, 2018. [14] Tingjin Chu, Jun Zhu, Haonan Wang, et al. Penalized maximum likelihood estimation and variable selection in geostatistics. The Annals of Statistics, 39(5):2607–2625, 2011. [15] Line Clemmensen, Trevor Hastie, Daniela Witten, and Bjarne Ersbøll. Sparse discrim- inant analysis. Technometrics, 53(4):406–413, 2011. [16] Robert W Cox. Afni: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical research, 29(3):162–173, 1996. [17] Noel Cressie. Statistics for spatial data. Terra Nova, 4(5):613–617, 1992. [18] Simon F Eskildsen, Pierrick Coup´e, Daniel Garc´ıa-Lorenzo, Vladimir Fonov, Jens C Pruessner, D Louis Collins, et al. Prediction of Alzheimer’s disease in subjects with mild cognitive impairment from the ADNI cohort using patterns of cortical thinning. Neuroimage, 65:511–521, 2013. [19] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of statistics, 36(6):2605, 2008. [20] Jianqing Fan, Yang Feng, and Xin Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771, 2012. [21] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001. [22] Christine Fennema-Notestine, Donald J Hagler, Linda K McEvoy, Adam S Fleisher, Elaine H Wu, David S Karow, and Anders M Dale. Structural MRI biomarkers for preclinical and mild Alzheimer’s disease. Human brain mapping, 30(10):3238–3253, 2009. [23] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, NY, USA:, 2001. [24] Peter Hall and Tapabrata Maiti. Choosing trajectory and data type when classifying functional data. Biometrika, page ass011, 2012. [25] Miriam Hartig, Diana Truran-Sacrey, Sky Raptentsetsang, Alix Simonson, Adam Mezher, Norbert Schuff, and Michael Weiner. UCSF freesurfer methods. ADNI: Alzheimers Disease Neuroimaging Initiative, San Francisco, 2014. [26] Roger Horn and Charles Johnson. Topics in matrix analysis. ZAMM-Journal of Applied Mathematics and Mechanics, 72(12):692–692, 1992. 122 [27] Clifford R Jack, Matt A Bernstein, Nick C Fox, Paul Thompson, Gene Alexander, Danielle Harvey, Bret Borowski, Paula J Britson, Jennifer L Whitwell, Chadwick Ward, et al. The alzheimer’s disease neuroimaging initiative (adni): Mri methods. Journal of magnetic resonance imaging, 27(4):685–691, 2008. [28] Clifford R Jack, David S Knopman, William J Jagust, Leslie M Shaw, Paul S Aisen, Michael W Weiner, Ronald C Petersen, and John Q Trojanowski. Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. The Lancet Neurology, 9(1):119–128, 2010. [29] David S Karow, Linda K McEvoy, Christine Fennema-Notestine, Donald J Hagler Jr, Robin G Jennings, James B Brewer, Carl K Hoh, and Anders M Dale. Relative capability of MR imaging and FDG PET to depict changes associated with prodromal and early Alzheimer’s disease. Radiology, 256(3):932–942, 2010. [30] Sang Han Lee, Alvin H Bachman, Donghyeon Yu, Johan Lim, Babak A Ardekani, et al. Predicting progression from mild cognitive impairment to Alzheimer’s disease using longitudinal callosal atrophy. 