HIGH DIMENSIONAL CLASSIFICATION FOR SPATIALLY DEPENDENT DATA WITH APPLICATION TO NEUROIMAGING

By

Yingjie Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2018

ABSTRACT

HIGH DIMENSIONAL CLASSIFICATION FOR SPATIALLY DEPENDENT DATA WITH APPLICATION TO NEUROIMAGING

By

Yingjie Li

The methods developed in this thesis are motivated by the problem of using structural magnetic resonance imaging (MRI) data to predict Alzheimer's disease (AD) or to discriminate between healthy subjects and AD patients. Imaging data is a typical example of spatially dependent data, where the correlation between data points collected at different voxels (pixels) can be described by proximity. It is also high dimensional data, since the number of voxels is extremely large compared to the number of subjects.

The first piece of work uses longitudinal volumetric MRI data from five regions of interest (ROIs), which are known to be vulnerable to Alzheimer's disease, for prediction. A longitudinal data prediction method based on functional data analysis is applied to identify when an early prediction can reasonably be made and to determine whether one ROI is superior to the others in predicting progression to AD. Using statistically validated procedures, we compared the prediction performance based on individual ROIs as well as their combinations. For all the models, the results show that the overall prediction accuracy one, two, and three years in advance is above 80%. MCI converter subjects can be correctly detected as early as two years prior to conversion.

The second piece of work uses voxel-level MRI data for classification of AD patients and healthy subjects. A supervised learning method based on linear discriminant analysis (LDA) was developed for high dimensional spatially dependent data. The theory shows that the proposed method achieves consistent parameter estimation, consistent feature selection, and an asymptotically optimal misclassification rate. An extensive simulation study showed a significant improvement in classification performance under spatial dependence. We applied the proposed method to voxel-level MRI data for classification, and its classification performance is superior to that of other comparable methods.

Copyright by
YINGJIE LI
2018

I dedicate this dissertation to my family, especially to my dearest grandmother Chen, Dezhen.

ACKNOWLEDGMENTS

I would like to express my sincerest gratitude to my advisor, Professor Tapabrata Maiti, for his invaluable assistance, constructive guidance, and immense knowledge. His vision of statistics and his enthusiasm for research encouraged me all the time. I am fully indebted to him for his understanding, patience, and encouragement. This thesis could not have been completed without his continuous support of me and my family. I would like to thank Professor David Zhu for sharing his precious experience of working on brain imaging data, for introducing many of the interesting practical problems in Alzheimer's disease, and for serving as one of my committee members. I also wish to thank Professor Chae Young Lim and Professor Pingshou Zhong for serving on my dissertation committee. I am extremely grateful for their constructive advice on this thesis work.
I would also thank the entire faculty and staff members in the Department of Statistics and Probability who have helped me during my study in Michigan State University. Last I would like to thank my parents and my sister for loving me and supporting me at every stage of my life. Many thanks go to my husband Yuzhen for standing by my side always. I give my special thanks to my daughter Aurora, who brought a wonderful light to my life. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Spatial statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Linear discriminant analysis (LDA) for high dimensional data . . . . . . . . 1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 2 Early prediction of Alzheimer’s disease using longitudinal vol- umetric MRI data from ADNI . . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Functional Principal Component Analysis with longitudinal data . . . 2.2.2 Early prediction and choosing trajectory . . . . . . . . . . . . . . . . 2.2.3 Prediction with logistic classifier . . . . . . . . . . . . . . . . . . . . . 2.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Prediction using hippocampus . . . . . . . . . . . . . . . . . . . . . . 2.4.2 Prediction results using single ROI . . . . . . . . . . . . . . . . . . . 2.4.3 Prediction results using combinations of ROIs . . . . . . . . . . . . . 2.4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix xi 1 1 2 6 10 13 13 17 17 21 22 24 28 28 32 33 35 37 40 3.3.1 The penalized maximum likelihood estimation (PMLE) Chapter 3 High dimensional discriminant analysis for spatially dependent data and its application in neuroimaging data . . . . . . . . . . 44 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 44 3.2 Classification using maximum likelihood estimation (MLE-LDA) . . . . . . . 50 3.3 Classification using penalized maximum likelihood estimation (PLDA) . . . . 54 . . . . . . . 54 3.3.1.1 Consistency of PMLE . . . . . . . . . . . . . . . . . . . . . 58 3.3.1.2 Covariance tappering and PMLE . . . . . . . . . . . . . . . 60 3.3.2 The penalized maximum likelihood estimation LDA (PLDA) classifier 62 3.4 Simulation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.6 Proofs of the main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.6.1 Proofs for classification using MLE . . . . . . . . . . . . . . . . . . . 75 3.6.2 Proofs for consistency of PMLE . . . . . . . . . . . . . . . . . . . . . 93 3.6.3 Proofs for consistency for PMLE with tappering . . . . . . . . . . . . 
103 3.6.4 Proofs for classification using PLDA . . . . . . . . . . . . . . . . . . 108 vii 3.6.5 Remarks on the assumptions . . . . . . . . . . . . . . . . . . . . . . . 116 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 viii LIST OF TABLES Table 2.1 Subjects characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 25 Table 2.2 Data characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Table 2.3 Subjects characteristics by year . . . . . . . . . . . . . . . . . . . . 28 Table 2.4 Parameter estimation (hippocampus for 1y prediction) . . . . . . . . 29 Table 2.5 Prediction accuracy rates (hippocampus) . . . . . . . . . . . . . . . 32 Table 2.6 Prediction performance using single ROI . . . . . . . . . . . . . . . 34 Table 2.7 Prediction performance of combinations of two ROIs . . . . . . . . . 41 Table 2.8 Prediction performance of combinations of three ROIs . . . . . . . 42 Table 2.9 Prediction performance of combinations of four or more ROIs . . . 43 Table 3.1 Classification Accuracy Rate (Exponential covariance, p=36) . . . . 67 Table 3.2 Classification Accuracy Rate (Exponential covariance, p=400) . . . 67 Table 3.3 Classification Accuracy Rate (Exponential covariance, p=1225) . . 67 Table 3.4 Parameter estimation (Exponential covariance) . . . . . . . . . . . 68 Table 3.5 Number of variables selected (Exponential covariance, p=36) . . . . 68 Table 3.6 Number of variables selected (Exponential covariance, p=400) . . . 68 Table 3.7 Number of variables selected (Exponential covariance, p=1225) . . . 69 Table 3.8 Classification Accuracy Rate (Polynomial covariance, p=36) . . . . 69 Table 3.9 Classification Accuracy Rate (Polynomial covariance, p=400) . . . 69 Table 3.10 Classification Accuracy Rate (Polynomial covariance, p=1225) . . . 69 Table 3.11 Parameter estimation (Polynomial covariance) . . . . . . . . . . . . 70 ix Table 3.12 Number of variables selected (Polynomial covariance, p=36) . . . . . 70 Table 3.13 Number of variables selected (Polynomial covariance, p=400) . . . . 70 Table 3.14 Number of variables selected (Polynomial covariance, p=1225) . . . 71 Table 3.15 Classification Accuracy Rate (Mis-specified covariance, p=36) . . . 71 Table 3.16 Classification Accuracy Rate (Mis-specified covariance, p=400) . . . 71 Table 3.17 Classification Accuracy Rate (Mis-specified covariance, p=1225) . . 71 Table 3.18 Subjects characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 73 Table 3.19 Subjects characteristics of training and testing set . . . . . . . . . . 74 Table 3.20 Classification performance for voxel level MRI data. Training and testing samples are of sizes 200 and 174, respectively . . . . . . . . . 74 x LIST OF FIGURES Figure 1.1 left: T1-weighted MRI of AD subject; right: TI-weighted MRI of NL subject. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Figure 2.1 Longitudinal volume of ROIs for MCI subjects. In (a)-(e), the value on Y axis is the normalized volume. In (f), the value on Y axis is the ICV (mm3). The value on X axis is “disease year”. Thin lines are observations for each subject. Blue lines are for MCI-c subjects and red lines are for MCI-nc subjects. Blue and red thick lines are pooled mean curves for MCI-c group and MCI-nc group respectivly. . . . . Figure 2.2 PACE analysis using hippocampus for 1-year prediction. (a)-(c) show the estimations of mean function µ(t), covariance surface G(t) and the first two eigen functions φ1(t) and φ2(t). 
(d) shows the first two eigenfunctions explained 98.907% of the total variance. . . . 30
Figure 2.3 Second versus first FPC scores for hippocampus (1-year prediction). The triangles indicate MCI-nc and the crosses indicate MCI-c. . . . 31
Figure 2.4 ROC curve for prediction using hippocampus. Solid line, dotted line and dot-dash line are the ROC curves for 1-year, 2-year and 3-year prediction respectively. . . . 32
Figure 2.5 Prediction sensitivity (a), specificity (b), accuracy (c), and AUC (d) using single and combinations of ROIs. In each panel, every bar represents a prediction result using a different combination of biomarkers (from left to right): hippocampus (H), whole brain (WB), entorhinal cortex (EC), fusiform gyrus (FG), middle temporal cortex (MTC), H+WB, H+EC, H+FG, H+MTC, WB+EC, WB+FG, WB+MTC, EC+FG, EC+MTC, FG+MTC, H+WB+EC, H+WB+FG, H+WB+MTC, H+EC+FG, H+EC+MTC, H+FG+MTC, WB+EC+FG, WB+EC+MTC, WB+FG+MTC, EC+FG+MTC, H+WB+EC+FG, H+WB+EC+MTC, H+WB+FG+MTC, H+EC+FG+MTC, WB+EC+FG+MTC, and the combination of all five ROIs. . . . 38
Figure 3.1 Two dimensional domain example. Left: 2D domain with p = 4 × 4; middle: µ1; right: µ2. . . . 64

Chapter 1
Introduction

1.1 Motivation

Brain imaging provides a non-invasive way to observe the human central nervous system. A great deal of research is devoted to improving imaging technologies, data analysis, and the application of imaging to investigate neurological disorders, including Alzheimer's disease (AD), Parkinson's disease, and dementia. AD is the most common form of dementia. Much research focuses on finding better ways to treat the disease, delay its onset, and prevent it from developing. There is also great interest in developing objective, biologically based methods to diagnose and predict Alzheimer's disease, track its progression, and monitor the efficacy of treatment. Brain atrophy is a primary pathologic process in AD due to widespread neuronal death (see, e.g., [28]). As a non-invasive, widely available and cost-effective way of obtaining brain imaging, MRI data plays an important role as a potential biomarker to monitor atrophy progression and thereby the progression of AD. Figure 1.1 gives an example of the MRI of a subject with AD and that of a normal (NL) subject.

Figure 1.1 Left: T1-weighted MRI of an AD subject; right: T1-weighted MRI of an NL subject.

This work is motivated by how to use structural magnetic resonance imaging (sMRI) data of the brain to discriminate between healthy subjects and AD patients and to predict AD progression. Imaging data is a typical example of spatially dependent data, where the relationship between data points collected at different voxels (pixels) can be described by proximity. It is also high dimensional data, since the number of voxels is extremely large compared to the number of subjects. If we consider the signal from each voxel as a feature, we may have millions of features for classification. Motivated by these characteristics of brain imaging data, the goal of this thesis is to develop classification methods for high dimensional data with spatial dependency.
1.2 Spatial statistical models

Spatial statistical models are widely applied in many scientific disciplines, including geology, agriculture, climatology, ecology, epidemiology, forestry, astronomy, atmospheric science, real estate, brain imaging, and any discipline in which the spatial dependence among observations cannot be ignored in the data analysis. According to [42], there are three types of spatial data.

1. Geostatistical data. Let D be a continuous domain in the d-dimensional Euclidean space R^d. Let Y be the variable of interest in the study, such as the air temperature at different locations in the US. That D is a continuous domain means Y can be observed everywhere within D; that is, between any two locations s_i and s_j there are, theoretically, infinitely many locations at which Y is well defined. We use a spatial process {Y(s), s ∈ D} to model the variable Y. If data are collected at some locations s_1, ..., s_n in D, say Y(s_1), ..., Y(s_n), the data set is called geostatistical data.

2. Lattice data. Lattice data are spatial data where the domain D is fixed and discrete, in other words, non-random and countable. For example, lattice data could be data collected by ZIP code, the crime rate of a county, or remotely sensed data reported by pixel. The distinction between geostatistical data and lattice data is not always clear-cut. For example, imaging data with very low resolution can be considered lattice data, imaging data with very high resolution can be considered geostatistical data, and data with moderate resolution can be considered either one, depending on the goal of the study. For notational convenience, we use {Y(s_i), s_i ∈ D, i = 1, 2, ...} to represent lattice data.

3. Point patterns. Geostatistical and lattice data have in common a fixed, non-stochastic domain. The defining feature of point pattern data is a random domain. Locations of avian flu outbreaks around the world and locations of longleaf pines in a natural forest are examples of point patterns.

Imaging data can be considered geostatistical data. Geostatistical data can be modeled by a spatial random field {y(s), s ∈ D ⊂ R^d} such that

    y(s) = µ(s) + ε(s),

where µ(s) is the mean effect function and ε(s) is the random noise. Assume the error term {ε(s) : s ∈ D} is a Gaussian random field with mean zero and a covariance function γ(s, s′; θ) = Cov(ε(s), ε(s′); θ), where s, s′ ∈ D and θ is a q × 1 vector of covariance function parameters that can be estimated from the data. If we further assume the random field ε(s) is isotropic and stationary, the covariance function can be represented as γ(h; θ) = Cov(ε(s), ε(s′); θ), where h = ‖s − s′‖ is the Euclidean distance between s and s′.

There are many ways to model the covariance function γ(h; θ). A widely used family is the Matérn covariance function, defined as

    γ(h; σ², c, ν, r) := σ²(1 − c) \frac{2^{1−ν}}{Γ(ν)} (h/r)^{ν} K_{ν}(h/r),    (1.2.1)

where K_ν(·) is a modified Bessel function of the second kind, σ² > 0 is the variance, 0 ≤ c ≤ 1 is a nugget effect, ν > 0 is the smoothness parameter, and r > 0 is a scale parameter (see, e.g., [17]). First, the Matérn covariance function is isotropic: the correlation decreases as the distance h increases. Second, as ν increases, the smoothness of the random field increases, and the Matérn covariance function converges to the Gaussian covariance function γ(h; σ², c, r) = σ²(1 − c) exp(−h²/r²) as ν → ∞. Last, if ν = 1/2, (1.2.1) reduces to the well-known exponential covariance function γ(h; σ², c, r) = σ²(1 − c) exp(−h/r). The parameter r is called the range parameter, since it measures the distance at which the correlation has decreased below a certain threshold.
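To make the covariance model concrete, the following sketch builds a Matérn covariance matrix, (1.2.1), over a grid of two-dimensional locations. This is our own illustration, not code from the thesis; the function name matern_cov and the parameter values are assumptions. With ν = 1/2 it reproduces the exponential covariance noted above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv   # kv: modified Bessel function of the second kind

def matern_cov(sites, sigma2=1.0, c=0.0, nu=0.5, r=2.0):
    """Matern covariance matrix for locations `sites` (n x d array), following (1.2.1)."""
    h = cdist(sites, sites)            # pairwise Euclidean distances
    scaled = h / r
    cov = np.empty_like(h)
    pos = scaled > 0
    cov[pos] = (sigma2 * (1 - c) * (2 ** (1 - nu) / gamma(nu))
                * scaled[pos] ** nu * kv(nu, scaled[pos]))
    cov[~pos] = sigma2 * (1 - c)       # limit as h -> 0 (nugget not added to the diagonal here)
    return cov

# Example: a 6 x 6 grid of pixel centers; nu = 0.5 recovers the exponential covariance.
grid = np.array([(i, j) for i in range(6) for j in range(6)], dtype=float)
Sigma = matern_cov(grid, sigma2=1.0, c=0.0, nu=0.5, r=2.0)
print(Sigma.shape, np.allclose(Sigma, np.exp(-cdist(grid, grid) / 2.0)))   # True
```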
Consider the spatial statistical model with replications. Assume there are n independent samples of the spatial random field: y_1(s), y_2(s), ..., y_n(s). Suppose that for each sample of the spatial random field there are observations at p discrete sites s_1, ..., s_p ∈ D. Let Y_{ij} = y_i(s_j) be the observation at the jth site from the ith sample of the spatial random field. For i = 1, 2, ..., n and j = 1, 2, ..., p, the model can be represented by

    Y_{ij} = µ_j + ε_{ij},    (1.2.2)

where µ_j = µ(s_j) is the mean effect at the jth location and ε_{ij} = ε_i(s_j) is the corresponding Gaussian random noise at the jth location for the ith sample. We can write the model in matrix notation. For i = 1, 2, ..., n, let Y_i = (Y_{i1}, ..., Y_{ip})^T be a p-dimensional vector. Then Y_i = µ + ε_i, where µ = (µ_1, ..., µ_p)^T is the mean vector and ε_i = (ε_{i1}, ..., ε_{ip})^T has a multivariate normal distribution N(0, Σ). Since ε(s) has covariance function γ(h; θ), the covariance matrix Σ can be represented by Σ(θ) = [γ(h_{ij}; θ)]_{i,j=1}^{p}, where h_{ij} = ‖s_i − s_j‖, i.e. the (i, j)th entry of Σ(θ) is γ(h_{ij}; θ).

Maximum likelihood estimation (MLE) is often used to estimate such a parametric model (see, e.g., [37]). Under model (1.2.2), Y_i ∼ N(µ, Σ(θ)). Let Y = (Y_1^T, ..., Y_n^T); the MLE of (µ, θ) is obtained by maximizing the log-likelihood function

    log l(µ, θ; Y) = −\frac{np}{2} log(2π) − \frac{n}{2} log|Σ(θ)| − \frac{1}{2} Σ_{i=1}^{n} (Y_i − µ)^T Σ^{−1}(θ)(Y_i − µ).

The unique µ maximizing log l(µ, θ; Y) is µ̂_MLE = (1/n) Σ_{i=1}^{n} Y_i = Ȳ. Plugging µ̂ into log l(µ, θ; Y), we obtain

    L(θ; Y) = −\frac{n}{2} log|Σ(θ)| − \frac{1}{2} Σ_{i=1}^{n} (Y_i − µ̂)^T Σ^{−1}(θ)(Y_i − µ̂).

Then θ̂_MLE can be obtained by maximizing L(θ; Y). Usually there is no closed-form solution, and numerical methods are applied to obtain it.
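As an illustration of this estimation scheme, the sketch below profiles out µ̂ = Ȳ and maximizes L(θ; Y) numerically for the exponential covariance (ν = 1/2). It is a minimal sketch under our own choices — the function names, the simulated data, and the use of a Nelder–Mead optimizer are assumptions, not the thesis code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def exp_cov(sites, sigma2, r):
    # exponential covariance: gamma(h) = sigma2 * exp(-h / r)
    return sigma2 * np.exp(-cdist(sites, sites) / r)

def profile_neg_loglik(log_theta, Y, sites):
    """-L(theta; Y), up to a constant, with mu profiled out by its MLE (the sample mean)."""
    sigma2, r = np.exp(log_theta)               # optimize on the log scale to keep parameters positive
    Sigma = exp_cov(sites, sigma2, r)
    n = Y.shape[0]
    resid = Y - Y.mean(axis=0)                  # plug in mu_hat = Ybar
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.sum(resid * np.linalg.solve(Sigma, resid.T).T)
    return 0.5 * n * logdet + 0.5 * quad

# Simulated example: n = 50 replications observed on a 5 x 5 grid (p = 25).
rng = np.random.default_rng(0)
sites = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
Y = rng.multivariate_normal(np.zeros(25), exp_cov(sites, 1.0, 2.0), size=50)

fit = minimize(profile_neg_loglik, x0=np.log([0.5, 1.0]), args=(Y, sites), method="Nelder-Mead")
print("MLE of (sigma2, r):", np.exp(fit.x))
```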
1.3 Linear discriminant analysis (LDA) for high dimensional data

Consider the p-dimensional discriminant problem between two classes C_1 and C_2. According to some classification rule T(x) : R^p → {1, 2}, a new observation X = x is classified into class C_1 (T(X) = 1) or C_2 (T(X) = 2). If X ∈ C_1, the misclassification rate is the probability that it is classified into class C_2, i.e. P(T(X) = 2 | X ∈ C_1). Similarly, P(T(X) = 1 | X ∈ C_2) is the misclassification rate if X ∈ C_2. Here X is a p-dimensional random variable, and by abuse of notation we write X ∈ C_k to denote that the observation X is from class C_k. It can be shown that the optimal classifier in terms of minimizing the misclassification rate is the Bayes rule, which classifies the new observation into the most probable class using the posterior probability of the observation (see Chapter 2 in [23]). Suppose f_k(x) is the conditional density of an observation X in class C_k (k = 1, 2). Let π_k be the prior probability of class k, with π_1 + π_2 = 1. By Bayes' theorem, the posterior probability that an observation X = x belongs to each class is

    P(X ∈ C_k | X = x) = \frac{f_k(x) π_k}{f_1(x) π_1 + f_2(x)(1 − π_1)}.

There are many ways to model the class densities. Suppose the densities of the two classes are multivariate Gaussian, N(µ_1, Σ) and N(µ_2, Σ) respectively, where µ_k (k = 1, 2) are the class mean vectors and Σ is the common positive definite covariance matrix. Then the density of an observation X = x from C_k is

    f_k(x) = \frac{1}{(2π)^{p/2} |Σ|^{1/2}} exp(−\frac{1}{2}(x − µ_k)^T Σ^{−1}(x − µ_k)).

Under this assumption, the Bayes rule assigns X = x to C_1 if π_1 f_1(x) ≥ π_2 f_2(x). Equivalently, x is assigned to C_1 if

    log(π_1/π_2) + (x − µ̄)^T Σ^{−1}(µ_1 − µ_2) ≥ 0,

where µ̄ = (µ_1 + µ_2)/2. Note that this classifier is linear in x.

In practice, the parameters of the Gaussian distributions are unknown, and we estimate them using training data. Suppose Y_{k1}, ..., Y_{kn_k} are from class C_k, where k ∈ {1, 2} and the Y_{kj} ∈ R^p are independent and identically distributed as N_p(µ_k, Σ(θ_0)), with µ_k = (µ_{k1}, ..., µ_{kp})^T and n_k the sample size for class C_k. Here Σ(θ) is the covariance matrix with parameter θ = θ_0. Let n = n_1 + n_2 be the total sample size, and assume n_1/n → π with 0 < π < 1 as n → ∞; p depends on n. Assume the two classes have equal prior probabilities, i.e. the probability that a new observation comes from class C_1 and the probability that it comes from class C_2 are both 1/2. Consider the classification rule δ̂:

    δ̂(X) = (X − \frac{µ̂_1 + µ̂_2}{2})^T Σ̂^{−1} Δ̂ > 0,    (1.3.1)

where µ̂_1, µ̂_2, Σ̂ and Δ̂ are estimates of µ_1, µ_2, Σ and Δ from the data, and Δ = (Δ_1, ..., Δ_p)^T = µ_1 − µ_2 is the difference of the two class means. A new observation x is classified into class C_1 if δ̂(x) > 0 and into C_2 otherwise. Let Θ = (µ_1, µ_2, θ). If the new observation X comes from C_1, then the conditional misclassification rate of δ̂ is

    W_1(δ̂; Θ) = P(δ̂(X) ≤ 0 | X ∈ C_1, Y_{ki}, i = 1, ..., n_k, k = 1, 2) = 1 − Φ(Ψ_1),    (1.3.2)

where, with µ̂ = (µ̂_1 + µ̂_2)/2,

    Ψ_1 = \frac{(µ_1 − µ̂)^T Σ̂^{−1}(µ̂_1 − µ̂_2)}{\sqrt{(µ̂_1 − µ̂_2)^T Σ̂^{−1} Σ Σ̂^{−1}(µ̂_1 − µ̂_2)}}.    (1.3.3)

Similarly we can define the error rate for observations from C_2. If a new observation X comes from class C_2, the conditional misclassification rate of δ̂ is

    W_2(δ̂; Θ) = P(δ̂(X) > 0 | X ∈ C_2, Y_{ki}, k = 1, 2; i = 1, ..., n_k) = Φ(Ψ_2),    (1.3.4)

where

    Ψ_2 = \frac{(µ_2 − µ̂)^T Σ̂^{−1}(µ̂_1 − µ̂_2)}{\sqrt{(µ̂_1 − µ̂_2)^T Σ̂^{−1} Σ Σ̂^{−1}(µ̂_1 − µ̂_2)}}.    (1.3.5)

Since we assumed equal prior probabilities for the two classes, the overall misclassification rate is

    W(δ̂; Θ) = \frac{1}{2}(W_1(δ̂; Θ) + W_2(δ̂; Θ)).    (1.3.6)

If µ_1, µ_2 and Σ are known, the optimal classification rule is the Bayes rule, which classifies a new observation X = x into class C_1 if

    δ(x) = (x − \frac{µ_1 + µ_2}{2})^T Σ^{−1} Δ > 0.    (1.3.7)

The Bayes rule has the smallest misclassification rate. For a new observation X from class C_1, since X has normal distribution N(µ_1, Σ(θ)), we can calculate that the conditional misclassification rate of the Bayes rule δ is

    W_1(δ; Θ) = W_2(δ; Θ) = 1 − Φ(\frac{\sqrt{C_p}}{2}),

where C_p = Δ^T Σ^{−1}(θ) Δ and Φ(·) is the standard Gaussian distribution function. The overall misclassification rate of the Bayes rule is

    W(δ; Θ) = \frac{1}{2}(W_1(δ; Θ) + W_2(δ; Θ)) = 1 − Φ(\frac{\sqrt{C_p}}{2}).    (1.3.8)

Since the Bayes rule has the smallest misclassification rate, we write W_OPT = 1 − Φ(√C_p/2) for the optimal misclassification rate.

In practice the parameters Σ(θ), µ_1, µ_2 are unknown, and we need to estimate them from data. Linear discriminant analysis (LDA) estimates them with the sample estimates (k = 1, 2):

    µ̂_k = Σ_{i=1}^{n_k} Y_{ki}/n_k = Ȳ_{k·},    (1.3.9)

    Σ̂ = Σ_k Σ_i (Y_{ki} − µ̂_k)(Y_{ki} − µ̂_k)^T / (n_1 + n_2 − 2).    (1.3.10)

Then LDA classifies X into class C_1 if

    δ̂_LDA(X) = (X − \frac{1}{2}(Ȳ_{1·} + Ȳ_{2·}))^T Σ̂^{−1}(Ȳ_{1·} − Ȳ_{2·}) > 0.    (1.3.11)
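For a concrete, low dimensional illustration of the plug-in rule (1.3.11) — before turning to the high dimensional difficulties discussed next — here is a minimal sketch; the function names and the simulated AR(1)-type covariance are our own choices, not part of the thesis.

```python
import numpy as np

def lda_fit(Y1, Y2):
    """Sample estimates used by classical LDA: class means and pooled covariance, (1.3.9)-(1.3.10)."""
    n1, n2 = len(Y1), len(Y2)
    m1, m2 = Y1.mean(axis=0), Y2.mean(axis=0)
    S = ((Y1 - m1).T @ (Y1 - m1) + (Y2 - m2).T @ (Y2 - m2)) / (n1 + n2 - 2)
    return m1, m2, S

def lda_predict(X, m1, m2, S):
    """Classify rows of X into class 1 if delta_hat(x) > 0, following (1.3.11)."""
    w = np.linalg.solve(S, m1 - m2)                 # Sigma_hat^{-1} (mu1_hat - mu2_hat)
    scores = (X - (m1 + m2) / 2.0) @ w
    return np.where(scores > 0, 1, 2)

# Low-dimensional example (p << n), where the pooled covariance is invertible.
rng = np.random.default_rng(1)
p, n1, n2 = 10, 100, 100
mu1, mu2 = np.zeros(p), 0.8 * np.ones(p) / np.sqrt(p)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1)-type correlation
Y1 = rng.multivariate_normal(mu1, Sigma, n1)
Y2 = rng.multivariate_normal(mu2, Sigma, n2)
m1, m2, S = lda_fit(Y1, Y2)

Xtest = np.vstack([rng.multivariate_normal(mu1, Sigma, 500),
                   rng.multivariate_normal(mu2, Sigma, 500)])
labels = np.repeat([1, 2], 500)
print("test misclassification rate:", np.mean(lda_predict(Xtest, m1, m2, S) != labels))
```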
However, LDA is not applicable in the high dimensional setting (p ≫ n), for two reasons. First, the sample covariance matrix Σ̂ is singular, and it is difficult to estimate the precision matrix Ω = Σ^{−1}. Second, even when the true covariance is known, the classification performance for p ≫ n is poor due to the error accumulated in estimating the p-dimensional features. Fan and Fan (2008) [19] and Shao et al. (2011) [46] each show theoretically that variable selection is critical. Although there is a large literature on high dimensional classification seeking better classification performance under a general correlation structure in Σ (see, e.g., [5, 9, 15, 19, 20, 33, 35, 36, 43, 46, 48, 51, 52, 55]), little has been said about high dimensional classification for spatially dependent data. In this thesis we fill in this gap. As defined in [46], a classifier δ̂ is (1) asymptotically optimal if W(δ̂; Θ)/W_OPT → 1 in probability, and (2) asymptotically sub-optimal if W(δ̂; Θ) − W_OPT → 0 in probability. Our goal in this thesis is to develop an asymptotically optimal classifier.

1.4 Overview

The rest of this dissertation is organized as follows. As a pilot study, in Chapter 2 we use longitudinal volumetric MRI data from five regions of interest (ROIs), which are known to be vulnerable to Alzheimer's disease (AD), for prediction. A longitudinal data classification method based on functional data analysis is developed and applied to identify when an early prediction can reasonably be made and to determine whether one ROI is superior to the others in predicting progression to AD. We first recover the longitudinal observations into trajectories using PACE (principal components analysis through conditional expectation; see [56]). PACE is a version of functional principal components (FPC) analysis in which the FPC scores are estimated from conditional expectations. To check the appropriate time period for early prediction, using the idea from [24], we use only part of the observations from the longitudinal data in this step. Logistic regression with the FPC scores as explanatory variables is then applied for classification. We compare the accuracy, sensitivity, and specificity of prediction based on individual ROIs as well as their combinations. For all the models, the results show that the overall prediction accuracy one, two, and three years in advance is above 80%. MCI converter subjects can be correctly detected with a greater-than-chance probability as early as two years prior to conversion.

The data used in Chapter 2 are highly summarized data derived from volumetric MRI, which measures the size of each ROI. Summarized volumetric MRI data is a good way to measure brain atrophy; however, a lot of information is potentially lost because the spatial pattern is ignored. How to do classification incorporating the spatial dependency is of great interest, not only for brain imaging data but also for imaging data from a wide range of sources. In Chapter 3 we use voxel-level MRI data for the classification of AD patients and healthy subjects. These data are characterized by high dimension and spatial dependency. We develop a supervised learning method based on linear discriminant analysis for high dimensional spatially dependent data. To solve the two issues of standard LDA for high dimensional data, we first incorporate the information on spatial dependency among all features using a spatial statistical model. Then a non-singular covariance structure can be derived by maximum likelihood estimation.
Second, penalized maximum likelihood estimation (PMLE) is applied for selecting important features and estimating parameters simultane- ously. To reduce computation load, the tapering technique is applied. Additionally, we develop the theories which show that the method proposed can achieve consistent parameter estimation, consistent feature selection, and asymptotically optimal misclassification rate. Extensive simulation study shows a significant improvement in classification performance under spatial dependence compared to other comparable methods. In the end we apply this method in using brain imaging data for diagnosis of Alzheimer’s disease. 12 Chapter 2 Early prediction of Alzheimer’s disease using longitudinal volumetric MRI data from ADNI 2.1 Introduction The accurate prediction of Alzheimer’s disease (AD) in individuals with Mild Cognitive Im- pairment (MCI) is essential for clinical management and selection of potential interventions. It also plays an important role in improving the efficiency of clinical trials of AD-modifying treatments by enriching the study population with a higher proportion of individuals who will develop the disease during the trial period. Structural MRI has been a primary re- search tool for the development of prognostic markers to aid in disease prediction, taking advantage of brain atrophy, a primary pathologic process in AD due to widespread neuronal death (see, e.g., [28]). Many studies suggest that structural changes in the brain can be detected in the early stages of AD (see, e.g., [2, 22, 29]). Several studies analyzed brain atrophy patterns in different regions of interests (ROIs) and different disease stages (see, e.g., [2, 22, 29, 31, 32, 38]. They found that rates of atrophy are not uniform, but vary in accordance with disease stage and by region. For example, Leung et al. (2010) [31] found a 13 higher rate of hippocampal atrophy in MCI converters (MCI-c) than in MCI nonconverters (MCI-nc). Also, Fennema-Notestine et al. (2009) [22] found that some regions such as the mesial temporal, exhibited a linear rate of atrophy throughout both MCI and AD. Other regions such as lateral temporal, middle gyrus showed accelerated atrophy later in the dis- ease. These results imply that brain atrophy in certain regions can be used to differentiate MCI-c and MCI-nc. Longitudinal data captures atrophy patterns that change with time at different disease stages. Compared to single point brain imaging data, longitudinal data does a better job of describing brain atrophy and likely performs better in predicting conversion from MCI to AD. MRI offers a non-invasive, widely available and cost-effective way for obtaining imaging biomarkers. Image analysis software such as FreeSurfer has been able to provide accurate estimation of regional brain volume, cortical thickness, and even curvature. Regional brain volume has been investigated most as a potential biomarker to monitor atrophy progression and thereby the progression of AD. In this paper we studied longitudinal volumetric data of five regions of interests (ROIs) measured with MRI, including the hippocampus (H), entorhinal cortex (EC), middle temporal cortex (MTC), fusiform gyrus (FG) and whole brain (WB), in order to develop a better potential biomarker strategy for use with structural data. Among the papers which investigate prediction of conversion from MCI to AD using longitudinal MRI data, several stand out as particularly pertinent to our present work. Misra et al. 
(2009) [40] applied high-dimensional pattern classification to longitudinal MRI scans for prediction of conversion within an average period of 15 months. An abnormality score was calculated for classification and achieved an accuracy rate of 81.5%. Zhang et al. (2012) [57] performed predictions on multimodal longitudinal data (i.e., MRI, PET, etc.) using a longitudinal feature selection method. Multi-kernel SVM was used for classification. 14 They achieved an accuracy of 78.4%, sensitivity of 79.0%, and specificity of 78.0% for at least six months ahead of the conversion from MCI to AD. In Li et al. (2012) [34], 262 features were calculated from longitudinal cortical thickness MRI data. They first applied mRMR (minimum redundancy maximum relevance) for feature selection, then an SVM (support vector machine) was trained for classification. Their method can detect 81.7% (AUC = 0.875) of the MCI converters six months ahead of conversion to AD. Lee et al. (2016) [30] used baseline plus 1-year follow-up callosal MRI scans for predicting conversion to AD in the following 2 to 6 years. Logistic regression model with fused lasso regularization was applied. They found the accuracy of prediction was 84% in females and 61% in males. Arco et al. (2016) [3] performed prediction based on baseline and 1-year follow-up MRI data and clinical scores (MMSE and ADAS-Cog). Feature selection based on a two sample t-test and classification based on maximum-uncertainty linear discriminant analysis were applied. They achieved a classification accuracy of 73.95%, with a sensitivity of 72.14% and a specificity of 73.77% for six months before conversion. In Adaszewski et al. (2013) [1], voxel-based longitudinal structure MRI data was used. Weighted based feature selection and SVM were applied for prediction. Both sensitivity and specificity were up to 65% at 1 to 4 years before conversion. Notably they found that conversion could be detected as early as four years before conversion with a sensitivity of 62%. As mentioned in Eskildsen et al. (2013) [18], one factor preventing high predictive power is the heterogeneity of the MRI data due to the variability in underlying disease stage among subjects. Without standardizing the data based on disease stage, it is impossible to establish a specific pattern of atrophy at specific disease stages. Adaszewski et al. (2013) [1] attempted to solve this issue by aligning the data by “time of conversion”. We also aligned the data by “time of conversion” in this paper. This alignment enables us to estimate the pattern of 15 atrophy in accordance with the stages of disease. Moreover, it makes it feasible to compare ahead of how many years we can reasonably predict the conversion. Another source of variability in the longitudinal MRI data is that the subjects have dif- ferent numbers of observations. In the ADNI cohort, some subjects have nine years of data with up to 10 longitudinal observations, but many subjects only have 1 to 3 data points. Due to the sparsity and irregularity of the data, it is not feasible for the traditional longitudinal data analysis method to make predictions and meanwhile, compare the prediction windows. In this study, we applied a technique known as the principal component analysis through conditional expectation (PACE, see [56]) to analyze the realigned longitudinal MRI data. In PACE, the mean curve and covariance structure of the biomarker along with time are obtained based on the pooled data from all individuals. 
In this way, longitudinal observations from each subject can be recovered as a smooth trajectory even if only one or a few observations are available. Finally, once the longitudinal observations are recovered as smooth trajectories, they can be treated as functional data. A functional prediction method from Hall et al. (2012) [24] can then be employed to determine when an early decision can reasonably be made and to identify which ROI is best for prediction, using only part of the trajectory. Moreover, we also examined whether any combination of the ROIs provides a better prediction result.

The remainder of this chapter is organized as follows. Section 2.2 presents the methods: the PACE approach for analyzing longitudinal data and the functional prediction method. Section 2.3 introduces the data used in the analysis. Section 2.4 presents the numerical results and conclusions. The discussion is in Section 2.5. More tables of results can be found in the Appendix.

2.2 Methodology

2.2.1 Functional Principal Component Analysis with longitudinal data

We first recover the longitudinal observations into trajectories using PACE (principal components analysis through conditional expectation; [56]). PACE is a version of functional principal components (FPC) analysis in which the FPC scores are framed as conditional expectations. This method extends the applicability of FPC analysis to longitudinal data analysis, where only a few repeated and irregularly spaced measurements are available for each subject. In PACE, the longitudinal data are modeled as noisy sampled points from a collection of trajectories that are assumed to be independent realizations of a smooth random function X(t), with unknown mean function EX(t) = µ(t) and covariance function cov(X(s), X(t)) = G(s, t), where s, t ∈ T and T is a bounded and closed time interval. The covariance function G of the process has an orthogonal expansion (in the L² sense) in terms of eigenfunctions φ_k and non-increasing eigenvalues λ_k: G(s, t) = Σ_k λ_k φ_k(s) φ_k(t), s, t ∈ T. From classical functional principal component analysis, the ith underlying random curve can be expressed through the Karhunen–Loève expansion as

    X_i(t) = µ(t) + Σ_k ξ_{ik} φ_k(t),  t ∈ T,    (2.2.1)

where the ξ_{ik} are uncorrelated random variables with mean 0 and variance Eξ²_{ik} = λ_k, with Σ_k λ_k < ∞ and λ_1 ≥ λ_2 ≥ · · ·. Let Y_{ij} be the jth observation of the random function X_i(·), observed at a random time T_{ij} ∈ T. Then Y_{ij} can be modeled as

    Y_{ij} = X_i(T_{ij}) + ε_{ij} = µ(T_{ij}) + Σ_{k=1}^{∞} ξ_{ik} φ_k(T_{ij}) + ε_{ij},    (2.2.2)

where the ε_{ij} are additional measurement errors assumed to be independent, identically distributed, and independent of the ξ_{ik}, for i = 1, ..., n and j = 1, ..., N_i, with N_i the number of observations for the ith subject, E(ε_{ij}) = 0, and var(ε_{ij}) = σ². To reflect sparse and irregular designs, the N_i are assumed to be independent and identically distributed random variables, independent of all other random variables.

We now need to estimate the mean function µ(t), the covariance function G(s, t), the eigenfunctions φ_k(t) and eigenvalues λ_k, as well as the functional principal component scores (FPC scores) ξ_{ik}, k = 1, 2, ..., for each subject i = 1, 2, ..., n. Assuming that the mean function, covariance function, and eigenfunctions are smooth, they can be estimated by local linear smoothing.
First, the mean function µ(t) is estimated from the pooled data of all individuals by minimizing

    Σ_{i=1}^{n} Σ_{j=1}^{N_i} κ_1((T_{ij} − t)/h_µ) (Y_{ij} − β_0 − β_1(t − T_{ij}))²

with respect to β_0 and β_1, where κ_1(·) is a kernel function and h_µ is a bandwidth. The estimate of µ(t) is then µ̂(t) = β̂_0(t).

Let G_i(T_{ij}, T_{il}) = (Y_{ij} − µ̂(T_{ij}))(Y_{il} − µ̂(T_{il})) be the "raw" covariances. The local linear surface smoother for G(s, t) is defined by minimizing

    Σ_{i=1}^{n} Σ_{1≤j≠l≤N_i} κ_2((T_{ij} − s)/h_G, (T_{il} − t)/h_G) × (G_i(T_{ij}, T_{il}) − f(β, (s, t), (T_{ij}, T_{il})))²,

where f(β, (s, t), (T_{ij}, T_{il})) = β_0 + β_{11}(s − T_{ij}) + β_{12}(t − T_{il}), κ_2(·,·) is a kernel function and h_G is a bandwidth. Minimization is with respect to β = (β_0, β_{11}, β_{12}). The estimate of G(s, t) is then Ĝ(s, t) = β̂_0(s, t).

The estimates of the eigenfunctions φ̂_k(t) and eigenvalues λ̂_k are the solutions of the eigen-equations

    ∫_T Ĝ(s, t) φ̂_k(s) ds = λ̂_k φ̂_k(t),

where the eigenfunctions can be estimated by discretizing the smoothed covariance.

Traditionally, when the measurements for each subject are densely sampled, the FPC scores ξ_{ik} = ∫ (X_i(t) − µ(t)) φ_k(t) dt are estimated by numerical integration. For longitudinal data, however, the measurement times vary widely across subjects and are sparse, so the FPC scores cannot be well approximated by the usual integration method. Instead, PACE further assumes that the ξ_{ik} and ε_{ij} in (2.2.2) are jointly Gaussian, so the FPC scores for the ith subject can be estimated by the best prediction

    ξ̃_{ik} = E[ξ_{ik} | Y⃗_i] = λ_k φ_{ik}^T Σ_{Y_i}^{−1} (Y⃗_i − µ_i),    (2.2.3)

where Y⃗_i = (Y_{i1}, Y_{i2}, ..., Y_{iN_i})^T, φ_{ik} = (φ_k(T_{i1}), ..., φ_k(T_{iN_i}))^T, µ_i = (µ(T_{i1}), ..., µ(T_{iN_i}))^T, and Σ_{Y_i} = cov(Y⃗_i, Y⃗_i), i.e. the (j, l)th entry of Σ_{Y_i} is (Σ_{Y_i})_{j,l} = G(T_{ij}, T_{il}) + σ² δ_{jl}, with δ_{jl} = 1 if j = l and 0 if j ≠ l. Substituting the estimates of µ_i, λ_k, φ_{ik}, and Σ_{Y_i} into (2.2.3), we obtain the estimated FPC scores

    ξ̂_{ik} = Ê[ξ_{ik} | Y⃗_i] = λ̂_k φ̂_{ik}^T Σ̂_{Y_i}^{−1} (Y⃗_i − µ̂_i),    (2.2.4)

where the (j, l)th element of Σ̂_{Y_i} is Ĝ(T_{ij}, T_{il}) + σ̂² δ_{jl}.

Assume that the infinite-dimensional process in (2.2.1) can be approximated by its projection onto the function space spanned by the first K eigenfunctions. The prediction of the trajectory X_i(t) for the ith subject, using the first K eigenfunctions, is

    X̂_i^K(t) = µ̂(t) + Σ_{k=1}^{K} ξ̂_{ik} φ̂_k(t).    (2.2.5)

According to [56], the number of eigenfunctions K can be selected by cross-validation based on the prediction error or by an AIC criterion. In this study we selected K by setting a threshold: we keep the first K eigenfunctions that together explain more than 95% of the total variation.

In summary, to recover a trajectory for each subject from the longitudinal observations, we first estimate the mean curve µ̂(t) and covariance Ĝ(s, t) from the pooled data of all individuals, from which the eigenvalues λ̂_k and eigenfunctions φ̂_k are estimated. The FPC scores are then estimated from each subject's available observations by conditional expectation, even if only one observation is available. From (2.2.5), the FPC scores ξ̂_{ik} (k = 1, ..., K) characterize each subject i = 1, ..., n and can be used to describe differences between subjects. As a result, we can use the FPC scores for classification or prediction [41]. For details of the PACE methodology, please refer to Yao et al. (2005) [56]. The MATLAB package for PACE is available from http://www.stat.ucdavis.edu/PACE/.
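The conditional-expectation step (2.2.4) is what distinguishes PACE from numerical-integration scoring. The sketch below (our own illustration, not the PACE package) computes the FPC scores of one subject from already-estimated model components evaluated on a time grid; the smoothing steps that produce those components are omitted, and all inputs are assumed given.

```python
import numpy as np

def pace_scores(t_obs, y_obs, mu_hat, phi_hat, lam_hat, sigma2_hat, G_hat):
    """FPC scores for one subject by conditional expectation, as in (2.2.4).

    t_obs      : indices of the subject's observation times on the estimation grid (length N_i)
    y_obs      : the subject's observations (length N_i)
    mu_hat     : estimated mean function on the grid
    phi_hat    : estimated eigenfunctions on the grid, shape (grid, K)
    lam_hat    : estimated eigenvalues, length K
    sigma2_hat : estimated measurement-error variance
    G_hat      : estimated covariance surface on the grid, shape (grid, grid)
    """
    Sigma_Yi = G_hat[np.ix_(t_obs, t_obs)] + sigma2_hat * np.eye(len(t_obs))
    centered = y_obs - mu_hat[t_obs]
    # xi_hat_k = lambda_k * phi_k(T_i)' Sigma_Yi^{-1} (Y_i - mu_i)
    return lam_hat * (phi_hat[t_obs].T @ np.linalg.solve(Sigma_Yi, centered))

# Toy example: a coarse time grid, two eigenfunctions, a subject observed at three grid points.
grid = np.linspace(-9.0, -0.5, 40)
phi = np.column_stack([np.ones_like(grid), np.cos(np.pi * grid / 9.0)])
phi /= np.sqrt(np.trapz(phi ** 2, grid, axis=0))      # roughly orthonormalize on the grid
lam = np.array([2.0, 0.5])
G = (phi * lam) @ phi.T                               # sum_k lambda_k phi_k(s) phi_k(t)
mu = 0.005 - 0.0001 * grid
subj_idx, subj_y = np.array([3, 17, 30]), np.array([0.0052, 0.0047, 0.0044])
print("FPC scores:", pace_scores(subj_idx, subj_y, mu, phi, lam, 1e-4, G))
```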
2.2.2 Early prediction and choosing trajectory

Hall et al. (2012) [24] introduced a methodology for classifying and predicting the future state using functional data. It provides an approach for determining when an early decision can reasonably be made using only part of the trajectory, and it shows how to use the method to choose a better biomarker as a predictor.

Hall et al. (2012) [24] assume there are q types of biomarkers from two classes of the population. Let X^{[k]}_{ji}(t) be the observed data function of the kth biomarker from the ith subject in population Π_j, where j = 1, 2, i = 1, 2, ..., n_j, k = 1, 2, ..., q, t ∈ I. Without loss of generality, assume I = [0, 1]. First, the dimension of the functional data is reduced by discretizing on a grid, i.e. by confining attention to X^{[k]}_{ji}(t_l), where l = 1, 2, ..., p and p denotes the number of grid points. Second, a classifier based on the resulting p-variate data is constructed (linear discriminant analysis and logistic classification were considered in [24]). Then the classifier is applied to each type of biomarker using only a portion of the trajectory, for example on the interval I = [0, t] with t ∈ [0, 1], to predict which class the subject belongs to in the end, i.e. at t = 1. The estimated error rates are denoted by êrr^{[k]}(t). Comparing êrr^{[k]}(t) over a range of values of t of interest (t ∈ [0, 1]) and over k = 1, ..., q gives an idea of when a relatively early prediction can be made and of which biomarker is more reliable. The results are examined both numerically and theoretically by checking the consistency of êrr^{[k]}(t).

In this paper, we use longitudinal data for prediction, as opposed to [24], which used dense functional data. As a result, instead of reducing the dimension by discretizing on a grid, we reduce the dimension by PACE, introduced in Section 2.2.1. There are two main steps in this process. First, we apply PACE to the data to calculate FPC scores for each subject; to check the appropriate time period for early prediction, following the idea of [24], we use only part of the observations from the longitudinal data in this step. Second, we apply the logistic classification method to the resulting FPC scores for prediction. In the numerical study, we compare the one-year, two-year, and three-year early prediction results. We also compare the error rates for different ROIs to choose which one is the most reliable biomarker.

2.2.3 Prediction with logistic classifier

Logistic regression is a special case of the generalized linear model that can be used for classification. In the generalized linear model setting, logistic regression has a response Y with a binomial distribution and a logit link function logit(p) = log[p/(1 − p)], where p is the probability that Y = 1. In the classification framework, the response Y denotes the index of the two groups, say G_1 and G_0, with Y = 1 if the observation comes from class G_1 and Y = 0 if it comes from class G_0. Here we set Y = 1 for MCI-c subjects, who convert from MCI to AD, and Y = 0 for MCI-nc subjects, who stay MCI. Suppose x_{i1}, x_{i2}, ..., x_{iq} are the predictors for subject i. The logistic regression equation is

    logit(p_i) = log \frac{p_i}{1 − p_i} = β_0 + Σ_{k=1}^{q} x_{ik} β_k,  for i = 1, 2, ..., n,

where n is the number of subjects and p_i is the probability that the ith subject belongs to class G_1. The regression coefficients β_k are usually estimated by maximum likelihood estimation (MLE). Then p_i is given by

    p_i = \frac{1}{1 + e^{−X_i β}},    (2.2.6)
If pi is greater than a certain threshold (usually the threshold is set to be 0.5), subject i will be classified into class G1, otherwise, classified into class G0. In this paper, we would use the trajectory of ROIs’ volume as the predictor (we use function X(t), t ∈ [0, T ] to denote the trajectory). As introduced in the previous section, we firstly use PACE to reduce dimension and denote the predicted trajectory for subject i as ˆXi(t) = ˆξi1 ˆφ1(t) + ˆξi2 ˆφ1(t) + ... + ˆξiK ˆφ1(t). Thus we can do logistic regression with FPCSs ˆξi1, ..., ˆξiK as predictors. For example, if two FPC scores are chosen, i.e. K=2, the logistic regression is logit(pi) = β0 + β1 ˆξi1 + β2 ˆξi2, i = 1, 2, ..., n (2.2.7) We can easily include more variables in the logistic regression model. In the numerical study in section 2.4, in addition to FPCSs of certain ROIs, we also included some basic but impor- tant clinical features in the logistic regression model: baseline age, baseline MMSE (Mini- Mental State Examination) and APOE (apolipoprotein E) genotype. The corresponding logistical regression model is: logit(pi) = β0 + β1 ˆξi1 + β2 ˆξi2 + β3Agei + β4AP OEi + β5M M SEi, i = 1, 2, ..., n (2.2.8) 23 2.3 Data Description The longitudinal volume of H, EC, MTC, FG and WB are considered to be potential pre- dictors of conversion from MCI to AD in this paper. Baseline and follow-up volumetric MRI data, if available, were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org/) database along with the corresponding clinical informa- tion. The structural MRI scans were acquired from 1.5T or 3T scanners, manufactured by GE, Siemens or Philips. Regional brain segmentation and volume estimation were carried with FreeSurfer by the UCSF/SF VA Medical Center [25], [49]. From a total of 872 individuals with a baseline diagnosis of MCI who were recruited for ADNI, 66 subjects were excluded due to missing data and 5 subjects were excluded due to reverting from AD to MCI. In the end 801 MCI subjects were included in this study. They were split into two categories. The MCI subjects who converted to AD after some years are labeled as MCI-c (mild cognitive impairment converters; n = 272); the rest of them who did not convert to AD during the follow-up period were labeled as MCI-nc (mild cognitive impairment nonconverters; n = 529). At study entry (baseline), all subjects underwent a comprehensive clinical evaluation, cognitive/functional assessments, and a structural brain MRI scan. Subjects also provided a blood sample for apolipoprotein E (APOE) genotyping and proteomic analysis. The subjects were then followed longitudinally at specific time points (6, 12, 18, 24, 36... months). However, due to the variability of their visits, the data collected is irregular in the number of observations and in time intervals for each subject. Table 2.1 shows the characteristics of the MCI subjects included in this study. 
2.3 Data Description

The longitudinal volumes of H, EC, MTC, FG and WB are considered as potential predictors of conversion from MCI to AD in this paper. Baseline and follow-up volumetric MRI data, where available, were downloaded from the Alzheimer's Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org/) database along with the corresponding clinical information. The structural MRI scans were acquired from 1.5T or 3T scanners manufactured by GE, Siemens or Philips. Regional brain segmentation and volume estimation were carried out with FreeSurfer by the UCSF/SF VA Medical Center [25], [49].

From a total of 872 individuals with a baseline diagnosis of MCI who were recruited for ADNI, 66 subjects were excluded due to missing data and 5 subjects were excluded due to reverting from AD to MCI. In the end, 801 MCI subjects were included in this study. They were split into two categories: the MCI subjects who converted to AD after some years are labeled MCI-c (mild cognitive impairment converters; n = 272), and the rest, who did not convert to AD during the follow-up period, are labeled MCI-nc (mild cognitive impairment nonconverters; n = 529). At study entry (baseline), all subjects underwent a comprehensive clinical evaluation, cognitive/functional assessments, and a structural brain MRI scan. Subjects also provided a blood sample for apolipoprotein E (APOE) genotyping and proteomic analysis. The subjects were then followed longitudinally at specific time points (6, 12, 18, 24, 36, ... months). However, due to the variability of their visits, the collected data are irregular in the number of observations and in the time intervals for each subject.

Table 2.1 shows the characteristics of the MCI subjects included in this study. It shows that, except for gender, the other three features (baseline age, baseline MMSE, and APOE) are significantly different between MCI-c and MCI-nc subjects at a significance level of 0.05, which suggests that these three features might be helpful in prediction.

Table 2.1 Subject characteristics

                      MCI-c          MCI-nc         p-value
  n                   272            529
  Age (mean ± sd)     73.5 ± 7.03    72.3 ± 7.59    0.019
  Gender (F/M)        110/162        220/309        0.813
  MMSE (mean ± sd)    26.9 ± 1.77    28.0 ± 1.70    < 0.001
  APOE (+/−)          182/90         226/303        < 0.001

Key: MCI-c, mild cognitive impairment (converters); MCI-nc, mild cognitive impairment (nonconverters); Age, baseline age; MMSE, baseline Mini-Mental State Examination; APOE, apolipoprotein E genotype.

For all the subjects, the longest observed period is about 9.18 years and the shortest is 0 years (only one data point collected; see Table 2.2). The numbers of observations per subject are also summarized in Table 2.2. The maximum number of observations is 10, the minimum is 1, and the median is about 3-5 in each category.

Table 2.2 Data characteristics

                            n     year    year    year      obsN    obsN    obsN
                                  (max)   (min)   (median)  (max)   (min)   (median)
  All subjects     MCI-c    272   8.15    0       1.65      8       1       3
                   MCI-nc   529   9.18    0       2.07      10      1       3
  All 3y subjects  MCI-c    30    8.10    2.97    3.03      8       4       5
                   MCI-nc   97    8.03    2.52    3.02      10      3       4

Key: MCI-c, mild cognitive impairment (converters); MCI-nc, mild cognitive impairment (nonconverters). "All 3y subjects" are the subjects who have observations at all of the "-3y", "-2y", and "-1y" time points. Column 2 is the number of subjects in each category. Columns 3-5 are the maximum, minimum, and median length of the observed period in years. Columns 6-8 are the maximum, minimum, and median number of observations.

During the data analysis, all regional volumes were normalized by dividing by the intracranial volume (ICV) to correct for individual differences in head size. For example, to normalize the hippocampal volume, we calculate the ratio of the hippocampal volume to the ICV from the same subject at the same time point; the resulting fractions are then used in the analysis. In the spaghetti plot (Figure 2.1), panel (f) shows the longitudinal ICV for each subject. It shows that the ICV for MCI-c and MCI-nc subjects has a similar pattern on average; also, for each subject, the ICV is relatively constant during the observed period, as expected. For the other five normalized regional volumes (panels (a)-(e) in Figure 2.1), on average the MCI-c group has a higher rate of volume atrophy than the MCI-nc group.

Figure 2.1 Longitudinal volume of ROIs for MCI subjects: (a) normalized hippocampus, (b) normalized EC, (c) normalized FG, (d) normalized MTC, (e) normalized WB, (f) ICV (mm³). In (a)-(e) the Y axis is the normalized volume; in (f) it is the ICV (mm³). The X axis is the "disease year". Thin lines are the observations for each subject: blue lines are MCI-c subjects and red lines are MCI-nc subjects. The blue and red thick lines are the pooled mean curves for the MCI-c and MCI-nc groups, respectively.

As shown in the spaghetti plot (Figure 2.1), we first aligned the longitudinal volumetric data of the ROIs by the disease timeline. Specifically, the MCI-c subjects were aligned by "time of conversion", i.e. the time point of conversion is defined as year 0, and year −1 is defined for the time points observed one year prior to conversion.
The MCI-nc subjects were aligned by the last time point of observation, i.e. the time point of the last observation is defined as year 0, the time point observed one year prior to the last observation is defined as year "−1", and so forth. In this way, the data were homogenized by the timeline of disease progression. The goal is to predict whether a subject will convert to AD in the end, i.e. at year 0.

One of the goals of this paper is also to determine a reasonable prediction window, so we need to compare the prediction accuracy across different prediction windows. Here we compare one-year, two-year, and three-year early prediction. To make the three years' prediction results comparable, we pick the subjects who have observations at all three of the most recent years — year "−3", year "−2", and year "−1" — as the testing set (labeled "all 3y" subjects in Tables 2.2 and 2.3). There are 127 "all 3y" subjects in total, of which 30 are MCI-c and 97 are MCI-nc. In Table 2.3, the characteristics of the subjects who have observations at year "−3", "−2", and "−1" are listed in the first three columns, respectively; the last column lists the characteristics of the testing set. All the subjects are considered the training set, and leave-one-out testing is applied to the subjects in the testing set.

Table 2.3 Subject characteristics by year

  MCI-c                -3y            -2y            -1y            all 3y
  n                    71             158            245            30
  Age (mean ± SD)      73.94 ± 7.06   73.61 ± 7.05   73.47 ± 7.20   73.94 ± 5.98
  MMSE (mean ± SD)     27.14 ± 1.82   27.05 ± 1.69   26.94 ± 1.78   27.03 ± 1.63
  Gender (F/M)         28/43          58/100         100/145        10/20
  APOE (+/−)           40/31          103/55         162/83         19/11

  MCI-nc               -3y            -2y            -1y            all 3y
  n                    242            371            473            97
  Age (mean ± SD)      72.09 ± 7.61   72.43 ± 7.36   72.39 ± 7.47   70.22 ± 7.20
  MMSE (mean ± SD)     28.03 ± 1.64   28.08 ± 1.61   27.97 ± 1.70   28.12 ± 1.62
  Gender (F/M)         99/143         154/217        197/276        40/57
  APOE (+/−)           90/152         148/223        203/270        40/57

Key: MCI-c, mild cognitive impairment (converters); MCI-nc, mild cognitive impairment (nonconverters); MMSE, baseline Mini-Mental State Examination; APOE, apolipoprotein E genotype. The "-3y", "-2y", and "-1y" columns give the characteristics of subjects who have observations at year "-3", "-2", and "-1", respectively, after alignment. The "all 3y" column gives the characteristics of the subjects who have observations at all three years: "-3", "-2", and "-1".

2.4 Numerical results

This section presents the numerical results and conclusions. We first take the hippocampus as an example to illustrate the details of the prediction procedure using a single ROI's volume in Section 2.4.1. We then list all prediction results using a single ROI in Section 2.4.2. Finally, we investigate whether any combination of the ROIs improves the prediction performance in Section 2.4.3; there are 26 different combinations of the 5 ROIs, and the prediction results from the 26 combination models are listed in that section.

2.4.1 Prediction using hippocampus

In this section, we take the volume of the hippocampus as an example to describe how to use a single ROI for prediction. We started with the 1-year early prediction.
The first step is the PACE analysis with the partially observed data, calculating the FPC scores for each subject. As outlined in Section 2.3, the longitudinal data were realigned according to disease status (Figure 2.1). We wanted to predict whether or not a subject would convert to AD at year 0. One-year prediction means using all the data observed prior to year "−1" to predict the state at year 0. Since the data points were irregularly collected, we use the data with year t ∈ [−9.18, −0.5) for 1-year prediction (9.18 years is the longest trajectory over all subjects and all ROIs; see Figure 2.1).

Figure 2.2 shows the PACE analysis results for 1-year early prediction using the hippocampus. Panels (a) and (b) show the estimates of the mean curve µ(t) and covariance surface G(s, t) for s, t ∈ [−9.18, −0.5), which were introduced in Section 2.2.1. We selected the number of eigenfunctions used in (2.2.5) by setting the threshold to 95%. Panel (d) shows that the first two principal components explain 98.9% of the variance, so only the first two eigenfunctions and the corresponding FPC scores ξ̂_{i1} and ξ̂_{i2} of each subject i are employed for prediction. Panel (c) shows the estimates of the first two eigenfunctions φ_1(t) and φ_2(t) for t ∈ [−9.18, −0.5).

Figure 2.2 PACE analysis using the hippocampus for 1-year prediction: (a) mean curve µ̂(t) of the hippocampus, (b) covariance surface Ĝ(s, t) of the hippocampus, (c) the first two eigenfunctions, (d) fraction of variance explained by the eigenfunctions. (a)-(c) show the estimates of the mean function µ(t), the covariance surface G(s, t), and the first two eigenfunctions φ_1(t) and φ_2(t); (d) shows that the first two eigenfunctions explain 98.907% of the total variance.

The FPC scores are calculated by (2.2.4) and then used in the logistic regression model (2.2.8). Figure 2.3 shows the scatter plot of the first two FPC scores for all the training subjects.

Figure 2.3 Second versus first FPC scores for the hippocampus (1-year prediction). The triangles indicate MCI-nc and the crosses indicate MCI-c.

Table 2.4 provides sample statistics for the coefficient estimates obtained from the logistic regression of the group indicator ("MCI-c" = 1, "MCI-nc" = 0) on the linear model (2.2.8). It shows that both of the first two FPC scores ξ̂_1 and ξ̂_2 are significant in increasing the odds of being MCI-c.

Table 2.4 Parameter estimation (hippocampus, 1-year prediction)

                 Estimate   Std. Error   z value   p-value
  (Intercept)    6.597      1.779        3.71      2.07e-04
  APOE4          0.8120     0.1802       4.51      6.62e-06
  Age            -0.0225    0.0132       -1.71     8.70e-02
  MMSE           -0.2243    0.0500       -4.48     7.41e-06
  FPC Score 1    5265       1631         3.23      1.25e-03
  FPC Score 2    20364      5776         3.53      4.22e-04

The resulting coefficient estimates from the logistic model (2.2.8) are plugged into (2.2.6) to calculate the conversion probability. Let p_i be the probability of conversion to AD for subject i, and let p_0 be a threshold: if p_i ≥ p_0, subject i is classified as MCI-c; otherwise it is classified as MCI-nc. With p_0 = 0.5, the sensitivity, specificity, and accuracy for 1-year early prediction are 0.70, 0.86, and 0.82. By varying the threshold p_0 from 0 to 1, we can derive a ROC (receiver operating characteristic) curve, with AUC (area under the curve) equal to 0.838.
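The threshold sweep behind the ROC curve and AUC can be written in a few lines. The sketch below is our own illustration with hypothetical held-out probabilities (not the code used for the reported results); it computes sensitivity, specificity, and accuracy at p_0 = 0.5 and approximates the AUC by the trapezoidal rule.

```python
import numpy as np

def confusion_rates(y_true, p_hat, p0=0.5):
    """Sensitivity, specificity, accuracy at threshold p0 (y_true: 1 = MCI-c, 0 = MCI-nc)."""
    pred = (p_hat >= p0).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)

def roc_auc(y_true, p_hat, n_grid=101):
    """Sweep the threshold p0 over [0, 1] and return sensitivities, specificities, and AUC."""
    thresholds = np.linspace(0.0, 1.0, n_grid)
    sens = np.array([confusion_rates(y_true, p_hat, t)[0] for t in thresholds])
    spec = np.array([confusion_rates(y_true, p_hat, t)[1] for t in thresholds])
    fpr = 1.0 - spec
    order = np.argsort(fpr)
    return sens, spec, np.trapz(sens[order], fpr[order])   # AUC by the trapezoidal rule

# Hypothetical held-out labels and fitted probabilities, for illustration only.
rng = np.random.default_rng(3)
y_test = rng.binomial(1, 0.25, size=127)
p_test = np.clip(0.25 + 0.35 * y_test + rng.normal(0, 0.2, size=127), 0.0, 1.0)
sens, spec, auc = roc_auc(y_test, p_test)
print("sensitivity/specificity/accuracy at p0=0.5:", confusion_rates(y_test, p_test))
print("AUC:", round(auc, 3))
```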
We then apply the same procedure to 2-year and 3-year early prediction with the hippocampus volume. With the threshold p0 set to 0.5, the corresponding sensitivity, specificity, and accuracy are summarized in Table 2.5. Varying the threshold p0 from 0 to 1 gives the ROC curves shown in Figure 2.4. From the ROC curves, shorter prediction windows perform better, i.e., 1-year early prediction is the best and 3-year early prediction is the worst. This is expected, and the procedure provides a risk evaluation in early prediction.

Table 2.5 Prediction accuracy rates (hippocampus)

               -1 y    -2 y    -3 y
sensitivity    0.7     0.53    0.23
specificity    0.86    0.89    0.98
accuracy       0.82    0.80    0.80

Key: The probability threshold p0 is set to 0.5, i.e. if p ≥ 0.5 the subject is classified as MCI-c, otherwise as MCI-nc.

Figure 2.4 ROC curves for prediction using the hippocampus. The solid, dotted and dot-dash lines are the ROC curves for 1-year (AUC = 0.838), 2-year (AUC = 0.83) and 3-year (AUC = 0.813) prediction, respectively.

2.4.2 Prediction results using single ROI

In section 2.4.1, we took the hippocampus as an example to illustrate the procedure of 1-year, 2-year and 3-year early prediction using a single ROI. The results shown in section 2.4.1 were based on one training data set. In this section, we apply the same procedure to all five ROIs: H, WB, EC, FG and MTC. To check the robustness of the prediction performance, we repeated the procedure on 100 samples of the training data. That is, we randomly sampled 2/3 of the training data (2/3 of the MCI-c and 2/3 of the MCI-nc subjects) 100 times and repeated the prediction procedure of section 2.4.1 on each sample. The mean and standard deviation of the sensitivity, specificity, and accuracy were calculated with the classification probability threshold p0 set to 0.5. When p0 was varied from 0 to 1, the mean and standard deviation of the AUCs from the ROC curves were also calculated. The results are shown in Table 2.6.

From Table 2.6, all five ROIs have prediction accuracy above 75% and AUC around 0.8 for 1-year, 2-year, and 3-year prediction. Both H and EC can identify MCI-c correctly with greater-than-chance probability as early as two years in advance. EC is the best ROI, with prediction accuracy above 80% and AUC above 0.8 for all three prediction windows; moreover, its 2-year sensitivity is 62% and its 1-year sensitivity reaches 70%.

2.4.3 Prediction results using combinations of ROIs

We then check the prediction performance of combinations of ROIs. Combination prediction is realized by plugging the FPC scores of several ROIs into the logistic regression model. For example, to predict using the combination of H and EC, we first calculate the FPC scores of H (ξ̂i1, ξ̂i2) as well as the FPC scores of EC (η̂i1, η̂i2) for each subject i (suppose the number of eigenfunctions selected for both H and EC is 2). The logistic regression model for prediction using the combination of H and EC is then

logit(pi) = α + β1 ξ̂i1 + β2 ξ̂i2 + β3 η̂i1 + β4 η̂i2 + β5 Agei + β6 APOEi + β7 MMSEi,  for i = 1, 2, ..., n.
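The combination model above can be fit with standard logistic regression software. The sketch below (Python, statsmodels) assembles the design matrix from the FPC scores of two ROIs plus Age, APOE and MMSE; every array is a synthetic placeholder rather than the thesis data, and the fit is only meant to show the structure of the combined model.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative sketch of the combination model (H + EC); placeholder data, not ADNI.
n = 200
rng = np.random.default_rng(0)
xi = rng.normal(size=(n, 2))       # first two FPC scores of the hippocampus volume
eta = rng.normal(size=(n, 2))      # first two FPC scores of the entorhinal cortex volume
age = rng.normal(73, 7, size=n)
apoe = rng.integers(0, 2, size=n)  # APOE4 carrier indicator
mmse = rng.normal(27, 2, size=n)
y = rng.integers(0, 2, size=n)     # 1 = MCI-c, 0 = MCI-nc (synthetic labels here)

# Design matrix: intercept + FPC scores of both ROIs + clinical covariates,
# matching logit(p_i) = alpha + beta1*xi_i1 + ... + beta7*MMSE_i.
X = sm.add_constant(np.column_stack([xi, eta, age, apoe, mmse]))
fit = sm.Logit(y, X).fit(disp=0)
p_hat = fit.predict(X)             # conversion probabilities, thresholded at p0
```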
Table 2.6 Prediction performance using single ROI (Mean ± SD)

ROI    Window    sensitivity    specificity    accuracy      AUC
H      1 y       0.64±0.044     0.85±0.013     0.8±0.013     0.84±0.003
       2 y       0.51±0.029     0.89±0.006     0.8±0.007     0.83±0.004
       3 y       0.27±0.054     0.96±0.018     0.8±0.011     0.81±0.006
WB     1 y       0.54±0.039     0.86±0.015     0.79±0.012    0.8±0.004
       2 y       0.37±0.051     0.88±0.013     0.76±0.01     0.8±0.004
       3 y       0.1±0.027      0.95±0.015     0.75±0.01     0.78±0.005
EC     1 y       0.7±0.011      0.85±0.016     0.82±0.012    0.84±0.003
       2 y       0.62±0.03      0.91±0.012     0.84±0.009    0.84±0.004
       3 y       0.32±0.037     0.95±0.01      0.8±0.009     0.81±0.004
FG     1 y       0.53±0.03      0.88±0.006     0.8±0.006     0.82±0.003
       2 y       0.46±0.026     0.89±0.008     0.79±0.006    0.81±0.003
       3 y       0.23±0.039     0.95±0.015     0.78±0.01     0.79±0.003
MTC    1 y       0.57±0.043     0.88±0.015     0.8±0.011     0.84±0.005
       2 y       0.42±0.027     0.9±0.011      0.79±0.01     0.83±0.006
       3 y       0.24±0.038     0.94±0.012     0.78±0.01     0.81±0.007

Key: H stands for hippocampus, WB for whole brain, EC for entorhinal cortex, FG for fusiform gyrus and MTC for middle temporal cortex. Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of correct predictions. p0 is set to 0.5. AUC is the area under the ROC (receiver operating characteristic) curve. The means and standard deviations of sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets.

There are 26 different combinations in total. Table 2.7 shows the results for combinations of two ROIs, Table 2.8 the results for combinations of three ROIs, and Table 2.9 the results for combinations of more than three ROIs. Please see Tables 2.7, 2.8 and 2.9 in the Appendix.

2.4.4 Conclusion

A graphical comparison of the prediction performance of the 31 different models (5 single-ROI models and 26 combination models) is given in Figure 2.5. From Tables 2.6, 2.7, 2.8 and 2.9 and Figure 2.5, we have the following observations.

First, the longitudinal volumes of the listed brain ROIs measured by MRI have prediction power as early as three years in advance. In 1-year, 2-year, and 3-year prediction for all the models, the overall specificity is above 80%, the accuracy is above 70%, and the AUC is above 0.8. The sensitivity varies from 10% to 70% across prediction windows and models.

Second, short-term prediction performs better, i.e. 1-year early prediction performs best compared to 2-year and 3-year prediction. This is consistent with intuition, because more information is included and less noise is introduced in the short-term prediction procedure. But the error curves over time are clearly helpful for early interventions.

To compare the models using different combinations of ROIs, since the overall specificity, accuracy, and AUC are similar among models, we mainly focus on the sensitivity. Sensitivity is the probability that MCI-c subjects are correctly identified, which is crucial in prognosis and clinical trials. Overall, most of the models achieve a sensitivity greater than 50% for 1-year and 2-year prediction. However, the maximum sensitivity for 3-year-in-advance prediction is only about 35%, which implies it is not easy to correctly identify MCI-c subjects three years in advance using the information in our model.

Among the five single-ROI models, EC performs the best, followed by H, and WB is the worst. All five ROIs have sensitivity higher than 50% for 1-year prediction.
Both H and EC have a sensitivity greater than 50% in 2-year early prediction (Table 2.6). The 3-year prediction sensitivity is about 10% to 30% for all the ROIs.

For the models using combinations of two or more ROIs (see Table 2.7), except for the combinations WB+FG and WB+MTC, all the other combinations have a sensitivity higher than chance for both 1-year and 2-year prediction. The overall sensitivity for 3-year prediction is about 20%-30% for combinations of two ROIs, around 30% for combinations of three ROIs, and around 35% for combinations of more than three ROIs (see Table 2.7 and Table 2.8).

In terms of overall prediction performance across prediction window, sensitivity, specificity, accuracy, and AUC, the models EC (see Table 2.6), H+EC+FG (see Table 2.8), and H+WB+EC+FG and H+WB+EC+FG+MTC (see Table 2.9) are the best. They all have the highest level of prediction sensitivity, specificity, accuracy, and AUC. The highest 1-year sensitivity is about 70%, with specificity about 86%, accuracy about 82%, and AUC about 0.85. For 2-year prediction they also have the highest rates, with sensitivity about 62%-65%, specificity about 90%, accuracy about 84%, and AUC about 0.85. For 3-year early prediction, they all have sensitivity above 32%, specificity above 90%, accuracy above 80%, and AUC above 0.81. Among these four best models, the combination models work slightly better than the single-ROI model, because the three combination models have a higher AUC than the EC model; however, they are not significantly superior to EC. On the other hand, the combinations without EC and H, which are the first and second best single biomarkers, perform relatively worse than the other combinations (WB+FG, WB+MTC, FG+MTC, WB+FG+MTC; see Table 2.7 and Table 2.8). This observation implies that the prediction power of the combination models comes from the prediction power of EC and H.

2.5 Discussion

In this study we use sparsely observed volumes of ROIs quantified by longitudinal volumetric MRI to predict whether or not an MCI subject will convert to AD in a specified time window. A longitudinal data prediction method based on functional data analysis is developed for early prediction in varying time windows, as well as for identifying the most important ROI(s) in the process. The application of existing methods is not obvious due to the complexity of the data structure. To apply the prediction method based on functional data analysis, we first use PACE to extract statistically validated information from the longitudinal data; FPC scores are then calculated and used for prediction. This method is new for analyzing ADNI data. Our best prediction model (1-year prediction using H+WB+EC+FG+MTC, with AUC = 0.86, accuracy = 82%, specificity = 86%, and sensitivity = 69%) performs favorably compared to the existing literature for prediction using the same longitudinal data, even though the comparison is not entirely fair since those studies used different biomarkers and different prediction windows. In fact, our model used fewer predictors (longitudinal ROI volumes and three clinical features) and has a longer prediction window (one year, two years, and three years), which suggests that our model made efficient use of less information and derived comparable prediction results.
Since the primary goal of this paper is to introduce a statistically validated, advanced methodology for prediction using longitudinal data, we do not intend to elaborate the results by including more predictors. However, one could continue experimenting with other available biomarkers using the techniques adopted here.

Figure 2.5 Prediction sensitivity (a), specificity (b), accuracy (c), and AUC (d) using single ROIs and combinations of ROIs. In each panel, each bar represents a prediction result using a different combination of biomarkers (from left to right): hippocampus (H), whole brain (WB), entorhinal cortex (EC), fusiform gyrus (FG), middle temporal cortex (MTC), H+WB, H+EC, H+FG, H+MTC, WB+EC, WB+FG, WB+MTC, EC+FG, EC+MTC, FG+MTC, H+WB+EC, H+WB+FG, H+WB+MTC, H+EC+FG, H+EC+MTC, H+FG+MTC, WB+EC+FG, WB+EC+MTC, WB+FG+MTC, EC+FG+MTC, H+WB+EC+FG, H+WB+EC+MTC, H+WB+FG+MTC, H+EC+FG+MTC, WB+EC+FG+MTC, and the combination of all five ROIs.

Besides, due to the irregularity and sparsity of the longitudinal data, most other studies make predictions in only one prediction window. Our proposed method makes it feasible to predict over more flexible prediction windows, even if only one observation is available. This enables us to compare the prediction performance across prediction windows and helps in understanding the disease risk over time.

Moreover, most of the methods employed in the literature are machine learning techniques, which are not easily accessible for statistical validation and interpretation. Our model is based on two well-established statistical methods, which makes it easy to validate and interpret.

One limitation of the proposed model is the assumption required by the PACE method: we need to assume that the measurement errors εij and the FPC scores ξij are jointly Gaussian. This assumption is not always valid; however, existing simulation studies indicate that the method is robust to some violations of the Gaussian assumption (see [56]). Also, the prediction sensitivity is poor for all combinations of the ROIs, especially for 3-year prediction. This reveals a limitation of using ROI volumes for prediction; more advanced classification techniques will be explored to improve the prediction sensitivity in future work.

In summary, the proposed procedure provides a method to predict conversion from MCI to AD using longitudinal data. The longitudinal prediction curve can be utilized for understanding disease risk over time. The procedure can be easily applied to other longitudinal data sets for prediction in clinical trials, to decide a better prediction window and to choose better biomarker(s).
39 APPENDIX 40 Table 2.7 Prediction performance of combinations of two ROIs H+WB (Mean ± SD) H+EC (Mean ± SD) H+FG (Mean ± SD) H+MTC (Mean ± SD) WB+EC (Mean ± SD) WB+FG (Mean ± SD) WB+MTC (Mean ± SD) EC+FG (Mean ± SD) EC+MTC (Mean ± SD) FG+MTC (Mean ± SD) sensitivity 1 y 0.63±0.044 0.5±0.037 2 y 3 y 0.22±0.065 1 y 0.66±0.031 2 y 0.58±0.044 3 y 0.33±0.029 1 y 0.67±0.032 2 y 0.57±0.035 3 y 0.27±0.049 1 y 0.64±0.039 2 y 0.53±0.033 0.3±0.052 3 y 1 y 0.66±0.025 2 y 0.62±0.022 3 y 0.29±0.026 1 y 0.55±0.052 2 y 0.46±0.026 3 y 0.22±0.048 0.6±0.054 1 y 2 y 0.48±0.047 3 y 0.24±0.043 1 y 0.63±0.032 0.6±0.011 2 y 3 y 0.34±0.038 1 y 0.64±0.016 0.6±0.029 2 y 0.36±0.03 3 y 0.6±0.02 1 y 2 y 0.53±0.032 3 y 0.28±0.055 specificity 0.85±0.012 0.88±0.008 0.94±0.016 0.84±0.011 0.89±0.013 0.94±0.012 0.88±0.016 0.9±0.01 0.95±0.009 0.85±0.014 0.88±0.016 0.94±0.011 0.83±0.015 0.89±0.014 0.94±0.015 0.88±0.007 0.89±0.006 0.93±0.012 0.87±0.014 0.88±0.016 0.94±0.011 0.85±0.014 0.89±0.01 0.95±0.009 0.86±0.012 0.9±0.012 0.94±0.009 0.89±0.009 0.9±0.01 0.94±0.01 accuracy 0.8±0.01 0.79±0.011 0.77±0.013 0.8±0.01 0.82±0.015 0.8±0.009 0.83±0.01 0.82±0.01 0.79±0.01 0.8±0.011 0.79±0.012 0.79±0.013 0.79±0.012 0.82±0.012 0.78±0.009 0.8±0.012 0.79±0.006 0.76±0.011 0.8±0.012 0.79±0.014 0.78±0.011 0.8±0.01 0.82±0.007 0.81±0.009 0.81±0.008 0.83±0.007 0.81±0.008 0.82±0.006 0.81±0.009 0.78±0.012 AUC 0.84±0.004 0.83±0.006 0.81±0.008 0.85±0.003 0.84±0.003 0.82±0.005 0.85±0.003 0.84±0.004 0.82±0.005 0.85±0.004 0.84±0.005 0.83±0.006 0.85±0.004 0.85±0.004 0.82±0.005 0.83±0.006 0.83±0.007 0.8±0.007 0.84±0.005 0.83±0.007 0.81±0.005 0.85±0.003 0.84±0.003 0.82±0.003 0.86±0.003 0.85±0.004 0.83±0.005 0.85±0.004 0.83±0.005 0.82±0.005 Key: H stands for hippocampus, WB stands for whole brain, EC stands for entorhinal cortex, FG stands for fusiform gyrus and MTC for middle temporal cortex. Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of true prediction. p0 is set to be 0.5. AUC is the area under the curve from the ROC (receiver operating characteristic) curve. The mean and standard deviation of the sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets. 
41 Table 2.8 Prediction performance of combinations of three ROIs H+WB+EC (Mean ± SD) H+EC+FG (Mean ± SD) H+EC+MTC (Mean ± SD) sensitivity 1 y 0.66±0.029 2 y 0.61±0.031 0.3±0.03 3 y 0.68±0.03 1 y H+WB+FG 2 y 0.57±0.055 (Mean ± SD) 3 y 0.27±0.052 H+WB+MTC 1 y 0.64±0.048 2 y 0.55±0.045 (Mean ± SD) 0.29±0.05 3 y 0.7±0.013 1 y 0.63±0.03 2 y 3 y 0.33±0.037 1 y 0.66±0.018 2 y 0.59±0.041 0.35±0.03 3 y 1 y 0.67±0.036 H+FG+MTC (Mean ± SD) 2 y 0.61±0.041 0.29±0.05 3 y WB+EC+FG 1 y 0.69±0.023 2 y 0.62±0.032 (Mean ± SD) 3 y 0.34±0.047 WB+EC+MTC 1 y 0.67±0.032 2 y 0.61±0.029 (Mean ± SD) 3 y 0.35±0.034 WB+FG+MTC 1 y 0.66±0.047 2 y 0.54±0.034 (Mean ± SD) 3 y 0.29±0.055 EC+FG+MTC 1 y 0.65±0.029 2 y 0.62±0.017 (Mean ± SD) 3 y 0.34±0.032 specificity 0.84±0.014 0.88±0.011 0.93±0.015 0.87±0.013 0.9±0.011 0.94±0.01 0.84±0.013 0.87±0.013 0.94±0.013 0.86±0.017 0.9±0.012 0.95±0.009 0.86±0.012 0.89±0.012 0.94±0.01 0.86±0.016 0.9±0.011 0.94±0.008 0.86±0.017 0.89±0.009 0.94±0.009 0.85±0.011 0.89±0.012 0.94±0.011 0.88±0.011 0.9±0.013 0.93±0.008 0.87±0.016 0.9±0.011 0.94±0.008 accuracy 0.79±0.011 0.82±0.012 0.78±0.012 0.82±0.009 0.82±0.016 0.78±0.012 0.79±0.012 0.79±0.011 0.78±0.014 0.82±0.013 0.84±0.012 0.8±0.01 0.81±0.009 0.82±0.009 0.8±0.008 0.82±0.01 0.83±0.011 0.79±0.012 0.82±0.014 0.83±0.009 0.8±0.012 0.81±0.01 0.82±0.01 0.8±0.011 0.83±0.011 0.81±0.013 0.78±0.013 0.82±0.01 0.83±0.009 0.8±0.01 AUC 0.85±0.003 0.84±0.005 0.81±0.006 0.85±0.005 0.84±0.007 0.82±0.008 0.85±0.005 0.84±0.006 0.82±0.006 0.86±0.003 0.85±0.004 0.83±0.004 0.86±0.004 0.85±0.003 0.82±0.006 0.86±0.003 0.84±0.005 0.83±0.005 0.85±0.004 0.85±0.006 0.82±0.006 0.86±0.004 0.85±0.005 0.82±0.006 0.85±0.006 0.84±0.009 0.82±0.007 0.86±0.003 0.85±0.004 0.83±0.004 Key: H stands for hippocampus, WB stands for whole brain, EC stands for entorhinal cortex, FG stands for fusiform gyrus and MTC for middle temporal cortex. Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of true prediction. p0 is set to be 0.5. AUC is the area under the curve from the ROC (receiver operating characteristic) curve. The mean and standard deviation of the sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets. 42 Table 2.9 Prediction performance of combinations of four or more ROIs H+WB+FG+MTC (Mean ± SD) H+WB+EC+MTC (Mean ± SD) H+WB+EC+FG (Mean ± SD) sensitivity 0.7±0.011 1 y 2 y 0.64±0.038 3 y 0.33±0.039 1 y 0.68±0.029 2 y 0.61±0.036 3 y 0.34±0.036 0.68±0.04 1 y 2 y 0.63±0.046 3 y 0.31±0.054 1 y 0.68±0.022 2 y 0.64±0.023 3 y 0.33±0.034 1 y 0.68±0.031 2 y 0.64±0.028 3 y 0.35±0.038 H+WB+EC+FG+MTC 1 y 0.69±0.022 2 y 0.65±0.033 3 y 0.35±0.038 (Mean ± SD) H+EC+FG+MTC (Mean ± SD) WB+EC+FG+MTC (Mean ± SD) specificity 0.86±0.015 0.9±0.012 0.95±0.01 0.85±0.012 0.88±0.013 0.94±0.011 0.86±0.017 0.9±0.012 0.94±0.008 0.87±0.014 0.9±0.012 0.94±0.008 0.86±0.013 0.9±0.01 0.94±0.009 0.86±0.012 0.9±0.012 0.94±0.008 accuracy 0.82±0.012 0.84±0.013 0.8±0.011 0.81±0.009 0.82±0.011 0.8±0.012 0.82±0.012 0.83±0.014 0.79±0.013 0.82±0.011 0.84±0.01 0.8±0.009 0.82±0.011 0.84±0.01 0.8±0.011 0.82±0.01 0.84±0.012 0.8±0.011 AUC 0.85±0.005 0.85±0.007 0.82±0.007 0.86±0.004 0.85±0.005 0.82±0.007 0.86±0.005 0.85±0.008 0.82±0.007 0.86±0.003 0.85±0.004 0.83±0.005 0.86±0.004 0.85±0.006 0.82±0.007 0.86±0.005 0.85±0.007 0.82±0.008 Key: H stands for hippocampus, WB stands for whole brain, EC stands for entorhinal cortex, FG stands for fusiform gyrus and MTC for middle temporal cortex. 
Sensitivity is the proportion of true MCI-c in prediction, specificity is the proportion of true MCI-nc in prediction, and accuracy is the proportion of correct predictions. p0 is set to 0.5. AUC is the area under the ROC (receiver operating characteristic) curve. The means and standard deviations of the sensitivity, specificity, accuracy, and AUC are based on 100 samples of training data sets.

Chapter 3

High dimensional discriminant analysis for spatially dependent data and its application in neuroimaging data

3.1 Introduction

Spatial statistics is developing rapidly for analyzing data featuring spatial structure. Spatial statistics is applied in a broad range of disciplines such as agriculture, geology, soil science, oceanography, forestry, meteorology and climatology, as well as to imaging data. With the development of computer technology for image processing, spatial statistics has played an important role in image processing and pattern recognition. Cressie (1992) [17] introduced detailed spatial methodologies for the analysis of images. Motivated by the problem of how to use brain imaging data for diagnosis and classification, we focus on discriminant analysis of data featuring spatial correlation.

Another important characteristic of imaging data is its high dimensionality. For example, in MRI scans signals are collected over a 3D space, and the number of voxels can reach millions while the number of images is usually only in the hundreds. If we use the signals from all voxels as features for classification, the number of features far exceeds the sample size, so it is a high dimensional problem.

LDA is an asymptotically optimal classifier under the traditional large-sample scenario, that is, when the dimension of the variables (p) is fixed and the sample size (n) tends to infinity. However, it is not applicable to high dimensional data: Bickel and Levina (2004) [5] demonstrated that LDA performs asymptotically no better than random guessing if p/n → ∞. As mentioned in section 1.3, there are two main issues in high dimensional classification using LDA. The first issue is that when p > n, the sample covariance Σ̂ is singular. To address this, the independence rule (IR) ignores the correlations among features and replaces Σ̂ by its diagonal, which is applicable for any high dimension. Bickel and Levina (2004) [5] also show theoretically that IR leads to a better classification result than naive LDA, in which the Moore-Penrose inverse is used in place of Σ̂⁻¹. Another similar approach is the nearest shrunken centroid classifier [48]. Fan and Fan (2008) [19] propose the feature annealed independence rule (FAIR), which performs feature selection by t-tests in addition to IR. However, by ignoring the covariance structure of the features, these methods cannot asymptotically achieve the optimal classification rate.

Many other high dimensional LDA methods have been proposed in which the covariance structure in Σ is incorporated. Again, the first challenge for high dimensional LDA is the singularity of Σ̂. Many methods have been proposed for covariance matrix or precision matrix estimation in the high dimensional scenario (see, e.g., [6, 7, 8, 10, 11, 12, 44, 45, 47, 53]), and the covariance matrix estimated by these methods can be used directly in LDA. However, an accurate estimate of Σ does not necessarily lead to better classification. Fan and Fan (2008) [19] and Shao et al.
(2011) [46] show that even when the true covariance matrix is known, the classification can be no better than random guessing because of the noise accumulated from estimating the means. This leads to the second challenge for high dimensional LDA: the classification performance is poor due to the noise brought in by the estimation of many non-informative features. To address this issue, Shao et al. [46] assume sparseness in both ∆, which measures the difference of the two class means, and the covariance matrix Σ; Σ and ∆ are then estimated by hard thresholding for classification. This method was later extended to quadratic discriminant analysis in [33]. One issue with this method is that the sparseness assumption on the covariance matrix is too restrictive. Witten and Tibshirani (2011) [51] proposed penalized LDA by applying penalties on the discriminant vectors. Cai and Liu (2011) [9], Mai and Zhou (2012) [36] and Fan et al. (2012) [20] assume sparsity of the discriminant direction βBayes = Σ⁻¹(µ2 − µ1). These methods borrow the idea of penalization in regression to regularize the estimated discriminant direction βBayes directly, avoiding estimation of Σ⁻¹; the same idea is still used in [13]. Although the discriminant direction plays a direct role in the discriminant function, selecting the informative features is easier to interpret. Xu et al. (2014) [55] proposed a covariance-enhanced method to select informative features for linear discriminant analysis; however, that method does not directly enforce the selection of informative features.

In this research, we take advantage of the spatial dependence among features and assume the features are equipped with a spatial covariance structure. We then develop a penalized maximum likelihood estimation procedure to simultaneously estimate the covariance parameters and the means and to select important features for discriminant analysis.

For a spatial domain of interest D in Rᵈ, we consider two classes of spatial processes {yk(s) : s ∈ D}, k = 1, 2, such that

yk(s) = µk(s) + ε(s),   (3.1.1)

where µk(s) is the mean effect function and ε(s) is the corresponding random noise. Assume that the error process {ε(s) : s ∈ D} is a Gaussian process with mean zero and covariance function

γ(s, s′; θ) = cov(ε(s), ε(s′)),   (3.1.2)

where s, s′ ∈ D and θ is a q × 1 vector of covariance function parameters. We assume the spatial domain is expanding as the number of sampling sites on the domain increases:

A 1. Assume the sample set D ⊂ Rᵈ (d ≥ 1) is predetermined and non-random, with the restriction ‖si − sj‖2 ≥ ϵ > 0 for all pairs si, sj ∈ D, i, j = 1, 2, ..., p, to ensure that the sampling domain increases in extent as p increases.

Assume that for any sample of the spatial processes there are observations at p discrete sites s1, ..., sp ∈ D. Suppose yki(s) (i = 1, 2, ..., nk) is from class Ck (k = 1, 2). Let Ykij = yki(sj) be the observation at the jth site for the ith sample of the spatial process yk(s), where k = 1, 2, i = 1, 2, ..., nk, j = 1, 2, ..., p. Then the jth observation for sample i can be represented by

Ykij = µkj + εkij,   (3.1.3)

where µkj = µk(sj) is the mean effect at the jth location in class Ck and εkij = εki(sj) is the corresponding Gaussian random noise for the ith sample at the jth location.
In matrix notation, the above model can be written as

Yki = µk + εki,   (3.1.4)

where Yki = (Yki1, ..., Ykip)ᵀ, µk = (µk1, ..., µkp)ᵀ is the mean vector of class Ck, and εki = (εki1, ..., εkip)ᵀ has a multivariate normal distribution N(0, Σ). Since ε(s) has covariance function (3.1.2), the covariance matrix Σ can be represented by Σ(θ) = [γ(si, sj; θ)], i, j = 1, ..., p, i.e. γ(si, sj; θ) is its (i, j)th entry. By (3.1.4), we have

Yki ∼ N(µk, Σ(θ)).   (3.1.5)

Let θ0 be the true parameter in (3.1.2); if θ = θ0, we write Σ(θ) as Σ for simplicity. We need some assumptions on the covariance function γ(si, sj; θ):

A 2. (i) Let Ξ be the parameter space for θ. Assume the covariance function γ(s, t; θ) is stationary, isotropic, and twice differentiable with respect to θ for all θ ∈ Ξ and s, t ∈ D. (ii) γ(s, t; θ) is positive-definite in the sense that for every finite subset {s1, s2, ..., sp} of D the covariance matrix Σ = [γ(si, sj; θ)] is positive-definite.

Under the stationarity and isotropy assumption, we can write Σ(θ) = [γ(hij; θ)], i, j = 1, ..., p, where hij = ‖si − sj‖2 is the distance between sites si and sj.

Let Y = (Y11ᵀ, ..., Y1n1ᵀ, Y21ᵀ, ..., Y2n2ᵀ)ᵀ. By (3.1.5), Y has a multivariate normal distribution, and the log-likelihood function for µ1, µ2 and θ is

L(θ, µ1, µ2; Y) = −(p(n1 + n2)/2) log(2π) − ((n1 + n2)/2) log|Σ(θ)| − (1/2) Σ_{k=1}^{2} Σ_{i=1}^{nk} (Yki − µk)ᵀ Σ(θ)⁻¹ (Yki − µk).   (3.1.6)

µ1, µ2 and θ can be estimated by maximum likelihood estimation (MLE) even in high dimensional settings, and under the spatial statistical model the resulting Σ(θ̂) is a positive definite matrix. This resolves the first challenge in high dimensional classification. We denote the resulting estimates by µ̂1MLE, µ̂2MLE and θ̂MLE; they can be plugged into (1.3.1) to obtain the MLE-LDA classifier δ̂MLE:

δ̂MLE(X) = (X − (µ̂1MLE + µ̂2MLE)/2)ᵀ Σ⁻¹(θ̂MLE) (µ̂1MLE − µ̂2MLE).

In section 3.2, we investigate the properties of the MLE-LDA classifier. We first derive the parameter estimation consistency of the MLE for p/n → 0 and for p/n → C > 0. We then show that MLE-LDA is asymptotically optimal if p/n → 0 under some regularity conditions. However, when p/n → C > 0, MLE-LDA can be no better than random guessing even if the true covariance is known, unless the signals are very strong. This indicates that feature selection is still necessary in the high dimensional case even when the covariance parameter can be estimated consistently. In section 3.3, we propose to estimate the parameters by penalized MLE (PMLE), applying a penalty on ∆ = µ2 − µ1, which measures the difference of the two class means. We assume sparseness of ∆, which means that only a fraction of the p features are important for classification. We then derive the parameter estimation consistency and feature selection consistency of the PMLE. Finally, a classifier (the PLDA classifier) is developed using the PMLE, and we show that it is asymptotically optimal even if p/n → C > 0. A simulation study and real data analysis are conducted in sections 3.4 and 3.5. All proofs and remarks about the assumptions are given in section 3.6.
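As a concrete illustration of the MLE-LDA rule just defined, the following Python sketch evaluates δ̂MLE(x) from given mean and covariance estimates. It is a minimal sketch assuming those estimates are already available; the linear-algebra shortcut (solving instead of inverting) is a generic implementation choice, not the thesis code.

```python
import numpy as np

def mle_lda_score(x, mu1_hat, mu2_hat, Sigma_hat):
    """MLE-LDA discriminant: (x - (mu1+mu2)/2)^T Sigma^{-1} (mu1 - mu2).
    Classify x into class 1 if the score is positive, class 2 otherwise."""
    diff = mu1_hat - mu2_hat
    centered = x - 0.5 * (mu1_hat + mu2_hat)
    a = np.linalg.solve(Sigma_hat, diff)   # Sigma_hat^{-1} (mu1_hat - mu2_hat)
    return centered @ a

# Usage (illustrative): with class sample means as the MLE of mu1, mu2 and a fitted
# spatial covariance Sigma_hat = Sigma(theta_hat),
#   score = mle_lda_score(x_new, Y1.mean(axis=0), Y2.mean(axis=0), Sigma_hat)
#   label = 1 if score > 0 else 2
```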
3.2 Classification using maximum likelihood estimation (MLE-LDA)

In this section, we investigate the consistency of the parameter estimation in (3.1.6) as well as the classification performance of δ̂MLE. We first introduce some notation. Let µ1 = (µ11, µ12, ..., µ1p), µ2 = (µ21, µ22, ..., µ2p) and θ0 = (θ01, θ02, ..., θ0q) be the true parameters. Let Σk(θ), k = 1, 2, ..., q, be the partial derivative of the matrix Σ(θ) with respect to θk, i.e. Σk(θ) = ∂Σ(θ)/∂θk, and let Σ^k(θ), k = 1, 2, ..., q, denote the partial derivative of Σ(θ)⁻¹ with respect to θk, i.e. Σ^k(θ) = ∂Σ⁻¹(θ)/∂θk. Similarly, denote the second-order derivatives by Σkj(θ) = ∂²Σ(θ)/∂θk∂θj and Σ^{kj}(θ) = ∂²Σ⁻¹(θ)/∂θk∂θj. When θ = θ0 we simplify the notation: we write Σ(θ0) as Σ, Σ⁻¹(θ0) as Σ⁻¹, Σk(θ0) as Σk and Σ^k(θ0) as Σ^k. Denote the set of all eigenvalues of a square matrix A by λ(A), and denote the maximum and minimum eigenvalues of A by λmax(A) and λmin(A), respectively.

Here are the regularity conditions assumed in Theorem 1.

A 3. lim sup_{p→∞} λmax(Σ) < ∞ and lim inf_{p→∞} λmin(Σ) > 0.

A 4. ‖Σk‖F⁻² = Op(p⁻¹).

A 5. Assume lim_{p→∞} aij exists, where aij = tij/(tii^{1/2} tjj^{1/2}) and tij = tr(Σ⁻¹ Σi Σ⁻¹ Σj).

A 6. There exists an open subset ω containing the true parameter point θ0 such that for all θ* ∈ ω:
(i) −∞ < lim_{p→∞} λmin(Σk(θ*)) < lim_{p→∞} λmax(Σk(θ*)) < ∞;
(ii) −∞ < lim_{p→∞} λmin(Σkj(θ*)) < lim_{p→∞} λmax(Σkj(θ*)) < ∞;
(iii) ‖∂tij(θ*)/∂θ‖² = Op(p), where tij(θ*) = tr(Σ⁻¹(θ*) Σi(θ*) Σ⁻¹(θ*) Σj(θ*)).

Since Σ^k = −Σ⁻¹ Σk Σ⁻¹ and Σ^{kj} = Σ⁻¹(Σk Σ⁻¹ Σj + Σj Σ⁻¹ Σk − Σkj)Σ⁻¹, by A 3 and A 6 we have

−∞ < lim_{p→∞} λmin(Σ^k(θ*)) < lim_{p→∞} λmax(Σ^k(θ*)) < ∞ and −∞ < lim_{p→∞} λmin(Σ^{kj}(θ*)) < lim_{p→∞} λmax(Σ^{kj}(θ*)) < ∞.

Noticing that for any p × p matrix A we have ‖A‖F ≤ √p ‖A‖2 = √p λmax(A), it follows from A 6 that
(1) ‖Σ^k(θ*)‖F = Op(√p);
(2) ‖Σ^{kj}(θ*)‖F = Op(√p).

First we have the following theorem about the MLE consistency of (3.1.6).

Theorem 1. Assume A1-A6 hold, and let (µ1, µ2, θ0) be the true parameters. The maximum likelihood estimates of (3.1.6) are µ̂1MLE = Ȳ1·, µ̂2MLE = Ȳ2· and θ̂MLE, where Ȳk· = Σ_{i=1}^{nk} Yki/nk. Moreover,
(i) if p/n → 0, then ‖θ̂MLE − θ0‖2 = Op(1/√(np));
(ii) if p/n → C with 0 < C ≤ ∞ and √p/n → 0, then ‖θ̂MLE − θ0‖2 = Op(1/√n).
Proof. See the proof in section 3.6.

Theorem 1 shows that under the spatial statistical model, all the parameters (µ1, µ2, θ) can be estimated consistently by the MLE whether p/n → 0 or p/n tends to a positive constant or ∞. Therefore we obtain a positive definite covariance matrix estimate Σ(θ̂). We can plug the MLE µ̂1 = µ̂1MLE, µ̂2 = µ̂2MLE and Σ̂ = Σ(θ̂MLE) into (1.3.7) to build the classification function

δ̂MLE(X) = (X − (µ̂1MLE + µ̂2MLE)/2)ᵀ Σ⁻¹(θ̂MLE) (µ̂1MLE − µ̂2MLE).   (3.2.1)

A new observation x is classified into class C1 if δ̂MLE(x) > 0 and into C2 otherwise. Using the same notation as in section 1.3, the conditional misclassification rates are defined by (1.3.2) and (1.3.4). For simplicity, we write µ̂1 for µ̂1MLE, µ̂2 for µ̂2MLE and θ̂ for θ̂MLE in this section. We will see in the following theorem that the approximately optimal error rate can be achieved when p/n → 0; however, if p/n → C with 0 < C ≤ ∞, the error rate can be no better than random guessing even if we know the true covariance, due to the error accumulated in the estimation of µ1 and µ2.
Theorem 2. Let Cp = ∆ᵀΣ⁻¹∆. Assume p/n → 0, Cp → C0 with 0 ≤ C0 ≤ ∞, and nCp → ∞ as n, p → ∞.
(1) The overall misclassification rate W(δ̂MLE; Θ) is asymptotically sub-optimal; in other words, W(δ̂MLE; Θ) − WOPT →P 0.
(2) Moreover, if Cp → C0 with 0 ≤ C0 < ∞, or if Cp → ∞ and Cp·p/n → 0, then W(δ̂MLE; Θ) is asymptotically optimal, i.e. W(δ̂MLE; Θ)/WOPT →P 1.
Proof. See the proof in section 3.6.

The following theorem shows that when p/n goes to a positive constant or to ∞, even though the true covariance is known, the error accumulated in the estimation of µ1 and µ2 biases the misclassification rate unless the signal level is extremely high. This discovery suggests that even though parameter estimation poses no problem in our model in the high dimensional case, it is still necessary to select important features for classification.

Theorem 3. Assume the true covariance Σ is known, and denote the classifier by

δμ̂(x) = (x − μ̂)ᵀ Σ⁻¹ (μ̂1 − μ̂2),   (3.2.2)

where μ̂1, μ̂2 are the MLE in (3.1.6) and μ̂ = (μ̂1 + μ̂2)/2. Assume p/n → C with 0 < C ≤ ∞ and Cp → C0 with 0 ≤ C0 ≤ ∞. Assume n1 ≠ n2 and nk > n/4 (k = 1, 2). Then
(1) if Cp/(p/n) → ∞, then W(δμ̂) →P 0 and WOPT →P 0, but W(δμ̂)/WOPT →P ∞;
(2) if Cp/(p/n) → c with 0 < c < ∞, then (i) if p/n → C < ∞, lim W(δμ̂) > 1 − Φ(√C0/2) in probability; (ii) if p/n → ∞, then W(δμ̂) →P 0 and WOPT →P 0, but W(δμ̂)/WOPT →P ∞;
(3) if Cp/(p/n) → 0, then W(δμ̂) →P 1/2.
Proof. See the proof in section 3.6.

Corollary 1. Under the same conditions as in Theorem 3, if n1 = n2, then
(1) if Cp/√(p/n) → ∞, then W(δμ̂) →P 0 and WOPT →P 0, but W(δμ̂)/WOPT →P ∞;
(2) if Cp/√(p/n) → c with 0 < c < ∞, then (i) if p/n → C, W(δμ̂) →P 1 − Φ(c/(2√(4 + c/√C))) and WOPT →P 1 − Φ(√(c√C)/2); (ii) if p/n → ∞, W(δμ̂) →P 1 − Φ(c/4) and WOPT →P 0;
(3) if Cp/√(p/n) → 0, then W(δμ̂) →P 1/2.
Proof. See the proof in section 3.6.

Theorem 3 and Corollary 1 show that when p/n → C with 0 < C ≤ ∞, δμ̂ is never asymptotically optimal; it is asymptotically sub-optimal only if Cp → ∞. This reveals that although there is no difficulty in applying LDA to spatially dependent data when the parameters are estimated by MLE, in the high dimensional case (p/n → C with 0 < C ≤ ∞) the discriminant performance may be poor due to the noise accumulated in the estimation of µ1 and µ2 (see, e.g., [19] and [46]). Therefore, feature selection remains critical for classification with high dimensional features. Fan and Fan (2008) [19] extract salient features by two-sample t-tests and prove that the t-test can pick up all important features by choosing an appropriate critical value once the features are assumed to be independent. Shao et al. (2011) [46] propose to select features by thresholding. For spatially correlated features, we use penalized maximum likelihood estimates (PMLE) for feature selection.

3.3 Classification using penalized maximum likelihood estimation (PLDA)

3.3.1 The penalized maximum likelihood estimation (PMLE)

In this section, we consider the high dimensional classification problem (i.e. p/n → C with 0 < C ≤ ∞ as p → ∞ and n → ∞). We continue to use the notation of section 3.1. ∆ = µ2 − µ1 is a p-dimensional vector, so ∆ = (∆1, ..., ∆p) = (µ21 − µ11, ..., µ2p − µ1p) is the difference of the mean effects between classes C1 and C2. Define the signal set S = {j : ∆j ≠ 0} and let s be the number of nonzero elements of ∆; the important features are those in S. Instead of assuming sparsity of the discriminant direction as in [9], [20] and [36], we assume sparsity of ∆, that is, s ≪ n and s/n → 0.
Then, employing the idea of penalization in regression, we develop penalized maximum likelihood estimates of ∆ and θ. As defined in section 3.1, the observations Yki are normally distributed, Yki ∼ N(µk, Σ(θ0)) for k = 1, 2 and i = 1, 2, ..., nk. Let Ip be the p × p identity matrix and Jp the p × p matrix with all entries equal to 1. For a square matrix A, define the block-diagonal matrix diagn(A) with n × n blocks, i.e. diagn(A) has A in each of its n diagonal blocks and 0 elsewhere. Define Ĩn,p = diagn(Ip), let J̃n,p be the np × np block matrix whose n × n blocks are all equal to Ip, and let 1̃n,p = (Ip, Ip, ..., Ip)ᵀ be the np × p matrix that stacks n copies of Ip.

Let Y = (Y11ᵀ, ..., Y1n1ᵀ, Y21ᵀ, ..., Y2n2ᵀ)ᵀ. In order to estimate ∆ = µ2 − µ1, we transform Y: let Z = VY, where V is the (n − 1)p × np matrix made up of the first (n − 1)p rows of Ĩn,p − (1/n)J̃n,p. Then Z = (Z1ᵀ, Z2ᵀ, ..., Zn−1ᵀ)ᵀ, where Zi = Y1i − Ȳ for i = 1, 2, ..., n1, Zi = Y2(i−n1) − Ȳ for i = n1 + 1, n1 + 2, ..., n − 1, and Ȳ = (1/n) Σ_{k=1}^{2} Σ_{i=1}^{nk} Yki. Then

• Zi ∼ N(−τ2∆, ((n−1)/n) Σ(θ0)) for i = 1, 2, ..., n1;
• Zi ∼ N(τ1∆, ((n−1)/n) Σ(θ0)) for i = n1 + 1, ..., n − 1,

where τ1 = n1/n and τ2 = n2/n. Also, cov(Zi, Zj) = −(1/n)Σ for i ≠ j. Define X(1) = −τ2 Ip and X(2) = τ1 Ip, and set Xi = X(1) for i = 1, 2, ..., n1 and Xi = X(2) for i = n1 + 1, ..., n − 1. Write X = (X1ᵀ, X2ᵀ, ..., Xn−1ᵀ)ᵀ and β = ∆. Then Z is a (n − 1)p × 1 vector with multivariate normal distribution N(Xβ, Σ̇), where Σ̇ = (Ĩn−1,p − (1/n)J̃n−1,p) diagn−1(Σ). Denote all the unknown parameters by η = (β, θ) ∈ R^{p+q}. Since

|Σ̇| = |Ĩn−1,p − (1/n)J̃n−1,p| |diagn−1(Σ(θ))| = (1/n)^p |Σ(θ)|^{n−1}

and (Ĩn−1,p − (1/n)J̃n−1,p)⁻¹ = Ĩn−1,p + J̃n−1,p, we can write the penalized log-likelihood function for β and θ as

Q(θ, β; Z) = −((n−1)p/2) log(2π) − (1/2) log|Σ̇| − (1/2)(Z − Xβ)ᵀ Σ̇⁻¹ (Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|)
           = Cn,p − ((n−1)/2) log|Σ| − (1/2)(Z − Xβ)ᵀ diagn−1(Σ⁻¹)(Ĩn−1,p + J̃n−1,p)(Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|),   (3.3.1)

where Cn,p = −((n−1)p/2) log(2π) + (p/2) log n. Here Pλ(x) is a generic sparsity-inducing penalty; it could be the lasso penalty or a folded concave penalty (such as the SCAD or the MCP). We use the SCAD penalty in this research.

The true parameter β0 is a p × 1 vector and θ0 = (θ01, θ02, ..., θ0q) is the q-dimensional parameter of the covariance function. Assume β0 is sparse: without loss of generality we can write β0 = (β0,1ᵀ, β0,2ᵀ)ᵀ, where β0,1 ∈ R^s is the non-zero component, β0,2 = 0_{(p−s)×1} is the zero component of β0, and s/n → 0 as n, p, s → ∞. Also, we can write Xi = (Xi¹, Xi²) for i = 1, 2, ..., n, where Xi¹ is the p × s submatrix of Xi formed by the columns in supp(β0) and Xi² is the p × (p − s) complementary submatrix.
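The transformation from Y to Z can be written in a few lines. The Python sketch below forms Zi = Yki − Ȳ and drops the last centered row, as in the construction above; the array names and shapes are illustrative.

```python
import numpy as np

def transform_to_Z(Y1, Y2):
    """Build Z = V Y: Z_i = Y_1i - Ybar for i <= n1 and Z_i = Y_2(i-n1) - Ybar otherwise,
    dropping the last centered row so that Z has (n - 1) rows of length p."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    Ybar = (Y1.sum(axis=0) + Y2.sum(axis=0)) / (n1 + n2)
    Z = np.vstack([Y1 - Ybar, Y2 - Ybar])[:-1]      # (n - 1) x p
    tau1, tau2 = n1 / (n1 + n2), n2 / (n1 + n2)
    return Z, Ybar, tau1, tau2

# With Delta = mu2 - mu1, rows built from class 1 have mean -tau2 * Delta and rows built
# from class 2 have mean tau1 * Delta, matching the design X used in (3.3.1).
```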
We follow the one-step estimation procedure of [14]. As demonstrated in [58], the one-step method is as efficient as the fully iterative method both empirically and theoretically, provided that the initial estimators are reasonably good. The algorithm is as follows.

Algorithm
1. Initialize β by minimizing R(β) = (Z − Xβ)ᵀ(Z − Xβ) + n Σ_{j=1}^{p} Pλ(|βj|) with respect to β. Denote the result by β̂(0).
2. Fix β = β̂(0) and estimate θ by maximizing Q(θ, β̂(0); Z) in (3.3.1) with respect to θ. Denote the result by θ̂(0).
3. Fix θ = θ̂(0) and update β by maximizing Q(θ̂(0), β; Z) in (3.3.1) with respect to β. Denote the result by β̂(1).
4. Fix β = β̂(1) and estimate θ by maximizing Q(θ, β̂(1); Z) in (3.3.1) with respect to θ. Denote the result by θ̂(1).

Then θ̂PMLE = θ̂(1) and β̂PMLE = β̂(1) are the desired estimates; we call θ̂(1) and β̂(1) the one-step PMLE.

Denote ∆̂PMLE = β̂(1) and θ̂PMLE = θ̂(1) from the algorithm. Then µ1 and µ2 can be estimated by µ̂1PMLE = Ȳ − τ2∆̂PMLE and µ̂2PMLE = Ȳ + τ1∆̂PMLE. We also have the estimated covariance Σ̂ = Σ(θ̂PMLE), whose (i, j)th element is

σ̂ij = γ(hij; θ̂PMLE),   (3.3.2)

where hij = ‖sj − si‖2 is the distance between sites si and sj.

3.3.1.1 Consistency of PMLE

The penalty function largely determines the sampling properties of the penalized likelihood estimators. Some additional assumptions on the penalty function and the tuning parameter λ are needed:

A 7. an = Op(1/√n), where an = max_{1≤j≤p} {P′λn(|β0j|) : β0j ≠ 0}.

A 8. bn → 0 as n → ∞, where bn = max_{1≤j≤p} {P″λn(|β0j|) : β0j ≠ 0}.

A 9. λn → 0 and λn/√(s/n) → ∞.

A 10. lim inf_{n→∞, p→∞} lim inf_{θ→0+} P′λn(|θ|)/λn > 0.

A 7 ensures the unbiasedness property for large parameters and the existence of a consistent penalized likelihood estimator. A 8 ensures that the penalty function does not influence the penalized likelihood estimators more than the likelihood function. A 10 makes the penalized likelihood estimators possess the sparsity property, and A 9 leads to variable selection consistency. The smoothly clipped absolute deviation (SCAD) penalty satisfies all of these assumptions, and we apply SCAD in the simulation and real data analysis in this paper. Fan and Li (2001) [21] proposed the SCAD penalty function and showed that it satisfies three desirable properties: unbiasedness, sparsity and continuity. Unbiasedness means there is no over-penalization of large coefficients, avoiding unnecessary modeling bias. Sparsity means insignificant parameters are set to 0 by a thresholding rule to reduce model complexity. Continuity ensures that the penalized likelihood produces continuous estimators. The SCAD penalty is defined as

pλ(β) = λ|β|                                  if |β| ≤ λ,
        −(β² − 2aλ|β| + λ²)/(2(a − 1))         if λ < |β| ≤ aλ,
        (a + 1)λ²/2                            if |β| > aλ,

for some a > 2. More details can be found in [21].

Theorem 4. Assume conditions A 1-A 10 hold. Assume β0 = (β1,0ᵀ, β2,0ᵀ)ᵀ, where β1,0 ∈ R^s is the non-zero component and β2,0 = 0_{(p−s)×1} is the zero component of β0, with s/n → 0 and p/n → C, 0 < C ≤ ∞, as n, p, s → ∞. Let η̂ = (β̂, θ̂) be the PMLE of (3.3.1) obtained from the algorithm in section 3.3.1, with β̂ = (β̂1ᵀ, β̂2ᵀ)ᵀ and β̂1 the subvector of β̂ formed by the components in supp(β0). Then η̂ satisfies:
a. (consistency) ‖θ̂ − θ0‖2 = Op(1/√(np)) and ‖β̂ − β0‖2 = Op(√(s/n));
b. (sparsity) β̂2 = 0 with probability tending to 1 as n → ∞.
Proof. See the proof in section 3.6.
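The following Python sketch mimics the alternating structure of the algorithm above in a deliberately simplified form: a soft-threshold of the mean difference stands in for the SCAD-penalized least squares of step 1, and the covariance step fits only a single range parameter of an exponential model with unit variance. It is a sketch under these assumptions, not the estimator analyzed in Theorem 4.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik_r(r, Zc, D):
    """Negative Gaussian log-likelihood (up to constants) of centered rows Zc ~ N(0, Sigma(r)),
    with Sigma(r) = exp(-D / r) built from the pairwise site distances D."""
    Sigma = np.exp(-D / r)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->', Zc, np.linalg.inv(Sigma), Zc)
    return 0.5 * (Zc.shape[0] * logdet + quad)

def one_step_sketch(Y1, Y2, D, lam=0.1):
    n1, n2 = Y1.shape[0], Y2.shape[0]
    n = n1 + n2
    Ybar = (Y1.sum(0) + Y2.sum(0)) / n
    # Step 1: initial Delta by penalized least squares (soft threshold as a SCAD stand-in).
    diff = Y2.mean(0) - Y1.mean(0)
    Delta0 = np.sign(diff) * np.maximum(np.abs(diff) - lam, 0.0)
    # Step 2: estimate the covariance parameter with Delta fixed (mu1 = Ybar - tau2*Delta, etc.).
    Zc = np.vstack([Y1 - (Ybar - (n2 / n) * Delta0),
                    Y2 - (Ybar + (n1 / n) * Delta0)])
    r_hat = minimize_scalar(neg_loglik_r, bounds=(0.1, 20.0), args=(Zc, D),
                            method='bounded').x
    # Steps 3-4 would re-update Delta by penalized GLS with Sigma(r_hat) and refit r; omitted.
    return Delta0, r_hat
```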
3.3.1.2 Covariance tapering and PMLE

When the number of sites p is large for each realization of the spatial process, evaluating the likelihood can be computationally infeasible (it requires O(p³) operations). Covariance tapering can be used to approximate the likelihood: when the covariance matrix is replaced by a tapered one, the resulting matrices can be manipulated with efficient sparse matrix algorithms, which reduces the computation substantially.

In section 3.1, the covariance matrix is defined as Σ(θ) = [γ(si, sj; θ)], i, j = 1, ..., p. Under A 2, we can simply write it as Σ = [γ(hij; θ)], where hij = ‖si − sj‖2 is the distance between sites si and sj. Let KT(h, w) denote a tapering function, i.e. an isotropic autocorrelation function for 0 < h < w that equals 0 for h ≥ w, for a given threshold w > 0. Any compactly supported correlation function can be used as the tapering function; we use the tapering function from [50],

KT(h, w) = [(1 − h/w)+]²,   (3.3.3)

where x+ = max(x, 0), so the correlation is 0 at lag distances greater than the threshold distance w. Let K(w) = [KT(hii′, w)], i, i′ = 1, ..., p, denote the p × p tapering matrix. A tapered covariance of Σ is then defined as ΣT = Σ ◦ K(w), where ◦ is the Schur (i.e. elementwise) product. By the properties of the Schur product (see, e.g., [26], chap. 5), the tapered covariance matrix retains positive definiteness and thus is still a valid covariance matrix. When p is large, we approximate the penalized log-likelihood (3.3.1) by replacing Σ with ΣT and obtain the covariance-tapered penalized log-likelihood

QT(θ, β; Z) = −((n−1)p/2) log(2π) − (1/2) log|Σ̇T| − (1/2)(Z − Xβ)ᵀ Σ̇T⁻¹ (Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|)
            = Cn,p − ((n−1)/2) log|ΣT| − (1/2)(Z − Xβ)ᵀ diagn−1(ΣT⁻¹)(Ĩn−1,p + J̃n−1,p)(Z − Xβ) − n Σ_{j=1}^{p} Pλ(|βj|),   (3.3.4)

where Cn,p = −((n−1)p/2) log(2π) + (p/2) log n. All notation is the same as in (3.3.1), except that Σ is replaced by ΣT.
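A tapered covariance of the form ΣT = Σ ◦ K(w) with the taper (3.3.3) can be built directly. The Python/SciPy sketch below is illustrative only: the exponential covariance, grid layout and taper range are arbitrary choices, not the thesis settings.

```python
import numpy as np
from scipy import sparse

def tapered_cov(Sigma, D, w):
    """Tapered covariance Sigma_T = Sigma ∘ K(w) with taper K_T(h, w) = [(1 - h/w)_+]^2.
    Entries at lag distance h >= w become exactly zero, so Sigma_T is stored sparsely."""
    K = np.maximum(1.0 - D / w, 0.0) ** 2      # tapering matrix K(w)
    return sparse.csr_matrix(Sigma * K)         # Schur (elementwise) product

# Example: exponential covariance on a 20 x 20 grid of unit-spaced sites, taper range w = 3.
u = 20
sites = np.array([(i, j) for i in range(u) for j in range(u)], dtype=float)
D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)   # pairwise distances
Sigma = np.exp(-D / 2.0)                        # gamma(h) = exp(-h/r) with r = 2 (illustrative)
Sigma_T = tapered_cov(Sigma, D, w=3.0)
print("retained nonzeros:", Sigma_T.nnz / Sigma.size)
```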
Let ∆̂PMLE,T = β̂PMLE,T and θ̂PMLE,T be the penalized maximum likelihood estimates with tapered covariance (PMLET). We now check the consistency of PMLET. Let γk(θ, h) = ∂γ(θ, h)/∂θk and γjk(θ, h) = ∂²γ(θ, h)/∂θk∂θj. Two additional regularity assumptions are made here.

A 11. 0 < infp{wp/p^δ0} < supp{wp/p^δ0} < ∞ for some δ0 > 0, where wp is the threshold distance in the tapering function.

A 12. Let d (d ≥ 1) be the dimension of the domain, i.e. D ⊂ Rᵈ. Assume that for all θ ∈ Ξ and 1 ≤ k, j ≤ q, the functions γ(θ, h), γk(θ, h), γjk(θ, h) belong to the function space £ = {f(h) : ∫₀^∞ hᵈ f(h) dh < ∞}.

Let Σ be the covariance matrix and ΣT the tapered covariance matrix, with Σk,T = ∂ΣT/∂θk and Σjk,T = ∂²ΣT/∂θj∂θk. Using the tapering function (3.3.3), we have the following result.

Theorem 5. Assume conditions A 1-A 12 hold. Assume β0 = (β1,0ᵀ, β2,0ᵀ)ᵀ, where β1,0 ∈ R^s is the non-zero component and β2,0 = 0_{(p−s)×1} is the zero component of β0, with s/n → 0 and p/n → C, 0 < C ≤ ∞, as n, p, s → ∞. Let η̂T = (β̂T, θ̂T) be the PMLE of (3.3.4) obtained from the algorithm in section 3.3.1, with β̂T = (β̂1,Tᵀ, β̂2,Tᵀ)ᵀ and β̂1,T the subvector of β̂T formed by the components in supp(β0). Then η̂T satisfies:
a. (consistency) ‖θ̂T − θ0‖2 = Op(1/√(np)) and ‖β̂T − β0‖2 = Op(√(s/n));
b. (sparsity) β̂2,T = 0 with probability tending to 1 as n → ∞.
Proof. See the proof in section 3.6.

3.3.2 The penalized maximum likelihood estimation LDA (PLDA) classifier

Whether we use PMLE or PMLET, we obtain estimates of ∆ and θ, denoted ∆̂ and θ̂. The estimates of µ1 and µ2 are µ̂1 = Ȳ − τ2∆̂ and µ̂2 = Ȳ + τ1∆̂. We also have the estimated covariance Σ̂ = Σ(θ̂), whose (i, j)th element is

σ̂ij = γ(‖sj − si‖; θ̂).   (3.3.5)

Since p > n, the error accumulated in the estimation of each σ̂ij may also cause problems in classification (see, e.g., [6] and [46]). To regularize the covariance matrix, we use the tapered covariance matrix in the classification function. Specifically, we define Σ̃ = ΣT(θ̂) = Σ(θ̂) ◦ K(w), where K(w) is defined in section 3.3.1.2. We then replace µ1, µ2, Σ in the LDA rule (1.3.7) by µ̂1, µ̂2 and Σ̃ for classification. The PLDA function is

δ̂PLDA(X) = (X − Ȳ − ((n1 − n2)/(2n)) ∆̂)ᵀ Σ̃⁻¹ ∆̂,   (3.3.6)

where Ȳ = (1/n) Σ_{k=1}^{2} Σ_{i=1}^{nk} Yki. The conditional misclassification rates for class 1 and class 2 are defined by (1.3.2) and (1.3.4) with Σ̂ replaced by Σ̃, and similarly the overall misclassification rate is defined as in (1.3.6).

A 13. Assume ∫₁^∞ hᵈ r(h; θ) dh < ∞ and ∫₀^1 h^{d−1} r(h; θ) dh < ∞ for θ ∈ Ξ.

A 14. Assume there exists a constant M such that for any h ≥ 0 and θ ∈ Ξ, ‖∂γ(h; θ)/∂θ‖2 ≤ M.

Theorem 6. Assume θ̂, ∆̂ in (3.3.6) are estimated as in Theorems 4 and 5, and suppose assumptions A1-A3 and A13-A14 hold. Assume s/n → 0, p/n → C with 0 < C ≤ ∞, Cp → C0 with 0 ≤ C0 ≤ ∞, and Cp√(s/n) → 0. Also assume w = O((√(np))^{α/d}) with 0 < α < 1 and w⁻¹ = O(p^{−δ}) → 0 with δ > 0, where d is the dimension of the domain. Then the classification error rate of δ̂PLDA is asymptotically sub-optimal, i.e. W(δ̂; Θ) →P 1 − Φ(√C0/2). Moreover,
(1) if Cp → C0 < ∞, then W(δ̂; Θ) is asymptotically optimal, i.e. W(δ̂; Θ)/WOPT →P 1;
(2) if Cp → ∞ and Cpκn,p → 0, then W(δ̂; Θ) is asymptotically optimal, i.e. W(δ̂; Θ)/WOPT →P 1, where κn,p = max(wᵈ/√(np), 1/w, √(s/n)).
Proof. See the proof in section 3.6.
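The PLDA rule (3.3.6) is cheap to evaluate once ∆̂, Ȳ and the tapered covariance are available, since Σ̃ is sparse. The sketch below is a minimal illustration assuming those quantities have already been computed (for example by the PMLE fit and the tapering step above); it is not the thesis implementation.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def plda_score(x, Ybar, Delta_hat, Sigma_tilde, n1, n2):
    """PLDA discriminant (3.3.6): (x - Ybar - (n1-n2)/(2n) * Delta)^T Sigma_tilde^{-1} Delta.
    Sigma_tilde is the tapered covariance Sigma(theta_hat) ∘ K(w), stored as a sparse matrix."""
    n = n1 + n2
    a = spsolve(sparse.csc_matrix(Sigma_tilde), Delta_hat)   # Sigma_tilde^{-1} Delta_hat
    return (x - Ybar - (n1 - n2) / (2 * n) * Delta_hat) @ a

# Classify a new image x into class 1 if plda_score(...) > 0 and class 2 otherwise;
# Ybar, Delta_hat and Sigma_tilde would come from the PMLE fit and the tapering step.
```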
3.4 Simulation Analysis

We conduct the simulation analysis in this section. Assume the spatial domain of interest D in R² is a u × u square and that a signal is observed at each lattice point, so we have p = u × u features for classification. Assume the mean effects of the signal for classes C1 and C2 are µ1 and µ2, with µ1 = (1₁₀, 0_{p−10}) and µ2 = 0p, where 1k is a k-dimensional vector with all elements equal to 1 and 0k is a k-dimensional vector with all elements equal to 0. For example, if u = 4 and hence p = 16, the corresponding spatial domain D, µ1 and µ2 are as in Figure 3.1.

Figure 3.1 Two-dimensional domain example. Left: 2D domain with p = 4 × 4; middle: µ1; right: µ2.

In the simulation setting, we let p = 36, 400, 1225, respectively. For the spatial covariance, we generate the error terms from a stationary and isotropic Gaussian process with zero mean. The Matérn covariance function defined in 1.2.1 is widely used as a spatial covariance function; we consider two special cases of it: the exponential covariance function and the polynomial covariance function. Let h be the Euclidean distance between two sites on the domain D. Specifically, on the domain D ⊂ R², the distance between site i with coordinates si = (xi, yi) and site j with coordinates sj = (xj, yj) is hij = √((xi − xj)² + (yi − yj)²).

• Exponential covariance function. Setting the smoothness parameter ν = 1/2 in 1.2.1, the spatial dependence of the error terms reduces to an exponential covariance function:

γ(h) = σ²(1 − c) exp(−h/r)   if h > 0,
       σ²                     if h = 0,

where σ² is the variance parameter, c is the nugget effect and r is the range parameter. In the simulation, we set σ² = 1, c = 0.2 and r = 1, 2, ..., 9. The classification performance is given in Tables 3.1, 3.2 and 3.3, the parameter estimation results in Table 3.4, and the model selection results in Tables 3.5, 3.6 and 3.7.

• Polynomial covariance function. The polynomial covariance function is γ(h) = σ²ρ^h, where ρ is the correlation parameter, so the covariance matrix Σ (p × p) has σ² on the diagonal and σ²ρ^{hij} as its (i, j)th off-diagonal entry. In the simulation, we let σ² = 1 and ρ = 0, 0.1, 0.2, ..., 0.9. The polynomial correlation function ρ^h is a special case of the exponential correlation function exp(−h/r), since ρ^h = exp(−h log(1/ρ)). The classification performance is given in Tables 3.8, 3.9 and 3.10, the parameter estimation results in Table 3.11, and the model selection results in Tables 3.12, 3.13 and 3.14.

For each setting of µ1, µ2 and Σ(θ0), 100 training sets with n1 = n2 = 30 are generated. For each training set, we estimate the parameters µ1, µ2 and θ0 by MLE, tapered MLE, PMLE and tapered PMLE. 100 testing data sets with n1 = n2 = 100 are generated for evaluating the classification performance, and the average classification error rate is calculated over the 100 testing data sets. All the reported numbers in the tables are means, with their standard errors (calculated over the 100 pairs of training and testing data sets) in parentheses.

We name the classification method proposed in this research PLDA. For each choice of p, we compare the classification performance of PLDA with the oracle classifier, MLE-LDA, PREG-LDA, FAIR (Feature Annealed Independence Rule; [19]) and NB (Naive Bayes; [5]). Specifically, MLE-LDA uses µ̂1MLE, µ̂2MLE and Σ(θ̂MLE) in the LDA function for classification. PREG-LDA uses ∆̂ = β̂(0) and θ̂ = θ̂(0) in the LDA classification function, where β̂(0) and θ̂(0) are the initial estimates from the algorithm in section 3.3.1; this method does not consider spatial correlation in feature selection. NB uses the sample means µ̂1, µ̂2 and the diagonal of the sample covariance Σ̂ in LDA; this method is also known as the independence rule (IR). FAIR assumes independence between variables and uses t-tests for variable selection in IR. The oracle uses the true means µ1, µ2 and the true covariance Σ(θ0). We also compare the average number of variables selected by PMLE, PREG and FAIR. The tuning parameter λ in PREG and PMLE is selected by 10-fold cross-validation, minimizing the classification error rate.

For p = 36, the performance of MLE-LDA, PREG-LDA, PMLE, FAIR and NB is shown. For p = 400, the performance of MLE-LDA, PREG-LDA, PMLE, FAIR and NB is shown, together with the classification performance of MLE-LDA and PLDA with the parameters estimated by tapered MLE and tapered PMLE. For p = 1225, the classification performance of tapered MLE-LDA, tapered PLDA, PREG-LDA, FAIR and NB is shown.
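The two covariance families and the sampling scheme of this simulation can be sketched as follows in Python. The parameter values shown correspond to one of the settings listed above, but the code is illustrative rather than the thesis's simulation scripts.

```python
import numpy as np

def exp_nugget_cov(D, sigma2=1.0, c=0.2, r=3.0):
    """Exponential covariance with nugget: sigma^2 on the diagonal (h = 0),
    sigma^2 * (1 - c) * exp(-h / r) off the diagonal (h > 0)."""
    S = sigma2 * (1.0 - c) * np.exp(-D / r)
    np.fill_diagonal(S, sigma2)
    return S

def poly_cov(D, sigma2=1.0, rho=0.5):
    """Polynomial covariance gamma(h) = sigma^2 * rho^h (note rho^h = exp(-h log(1/rho)))."""
    return sigma2 * rho ** D

# Simulation design sketch: u x u grid, mu1 nonzero in its first 10 coordinates.
u = 6                                            # p = 36; the thesis also uses p = 400, 1225
sites = np.array([(i, j) for i in range(u) for j in range(u)], dtype=float)
D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
Sigma = exp_nugget_cov(D, r=3.0)
mu1 = np.r_[np.ones(10), np.zeros(u * u - 10)]
mu2 = np.zeros(u * u)
rng = np.random.default_rng(0)
train1 = rng.multivariate_normal(mu1, Sigma, size=30)   # n1 = 30 training samples, class 1
train2 = rng.multivariate_normal(mu2, Sigma, size=30)   # n2 = 30 training samples, class 2
```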
66 Table 3.1 Classification Accuracy Rate (Exponential covariance, p=36) TURE p=36 0.88(0.02) 0.88(0.02) 0.89(0.02) 0.90(0.02) 0.91(0.02) 0.92(0.02) 0.93(0.02) 0.93(0.02) 0.94(0.02) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 MLE PREG PMLE FAIR NB 0.84(0.03) 0.84(0.03) 0.85(0.03) 0.87(0.03) 0.88(0.02) 0.89(0.02) 0.90(0.02) 0.91(0.02) 0.91(0.02) 0.84(0.04) 0.84(0.04) 0.85(0.04) 0.87(0.03) 0.88(0.02) 0.89(0.02) 0.90(0.03) 0.91(0.02) 0.92(0.02) 0.84(0.05) 0.84(0.05) 0.85(0.05) 0.87(0.04) 0.88(0.04) 0.89(0.04) 0.90(0.03) 0.91(0.03) 0.92(0.02) 0.81(0.04) 0.76(0.05) 0.74(0.04) 0.73(0.05) 0.72(0.05) 0.72(0.06) 0.71(0.06) 0.71(0.06) 0.71(0.05) 0.84(0.04) 0.78(0.05) 0.77(0.04) 0.76(0.05) 0.75(0.05) 0.75(0.06) 0.75(0.06) 0.75(0.06) 0.75(0.05) Table 3.2 Classification Accuracy Rate (Exponential covariance, p=400) TURE p=400 0.92(0.02) 0.92(0.02) 0.94(0.02) 0.95(0.01) 0.95(0.01) 0.96(0.01) 0.96(0.01) 0.97(0.01) 0.97(0.01) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 MLE MLET PREG PREGT PMLE PMLET FAIR NB 0.74(0.03) 0.75(0.03) 0.78(0.03) 0.80(0.03) 0.81(0.03) 0.83(0.03) 0.84(0.03) 0.85(0.03) 0.86(0.02) 0.75(0.03) 0.76(0.03) 0.78(0.03) 0.80(0.03) 0.82(0.03) 0.84(0.03) 0.85(0.02) 0.86(0.02) 0.87(0.02) 0.84(0.05) 0.85(0.05) 0.86(0.05) 0.88(0.04) 0.90(0.05) 0.91(0.05) 0.91(0.04) 0.92(0.04) 0.93(0.04) 0.84(0.05) 0.85(0.05) 0.87(0.05) 0.89(0.05) 0.90(0.04) 0.91(0.04) 0.92(0.04) 0.93(0.04) 0.93(0.04) 0.82(0.05) 0.85(0.05) 0.88(0.04) 0.91(0.03) 0.92(0.03) 0.93(0.02) 0.95(0.02) 0.95(0.02) 0.96(0.02) 0.84(0.04) 0.86(0.05) 0.89(0.05) 0.91(0.03) 0.93(0.03) 0.94(0.02) 0.94(0.02) 0.95(0.02) 0.95(0.02) 0.83(0.04) 0.79(0.04) 0.77(0.04) 0.75(0.04) 0.74(0.04) 0.73(0.05) 0.72(0.06) 0.72(0.06) 0.72(0.05) 0.74(0.04) 0.67(0.04) 0.63(0.05) 0.61(0.05) 0.60(0.05) 0.59(0.05) 0.58(0.05) 0.58(0.05) 0.58(0.05) Finally, the Table 3.15 and Table 3.16 and Table 3.17 showed the classification perfor- mance when the covariance are misspecified. We generate data using Gaussian covariance function with σ2 = 1, c = 0.2, and r = 1, 2, ..., 9. Then we estimate the parameters using exponential covariance function. It shows the PLDA classification method still yielded the best performance in the misspecified case. 
Table 3.3 Classification Accuracy Rate (Exponential covariance, p=1225) TURE p=1225 0.92(0.02) 0.92(0.02) 0.93(0.02) 0.94(0.02) 0.95(0.02) 0.96(0.01) 0.96(0.01) 0.96(0.01) 0.97(0.01) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 MLET PREGT PMLET FAIR NB 0.67(0.03) 0.68(0.03) 0.70(0.04) 0.72(0.04) 0.73(0.04) 0.75(0.03) 0.76(0.04) 0.77(0.04) 0.77(0.04) 0.83(0.05) 0.83(0.06) 0.85(0.06) 0.87(0.06) 0.88(0.05) 0.89(0.06) 0.89(0.06) 0.90(0.05) 0.91(0.05) 0.80(0.06) 0.82(0.06) 0.86(0.05) 0.89(0.03) 0.91(0.03) 0.92(0.03) 0.93(0.02) 0.93(0.02) 0.94(0.02) 0.83(0.05) 0.78(0.05) 0.76(0.04) 0.75(0.05) 0.74(0.05) 0.73(0.04) 0.73(0.04) 0.72(0.05) 0.72(0.05) 0.66(0.05) 0.61(0.05) 0.58(0.04) 0.57(0.05) 0.56(0.05) 0.55(0.04) 0.55(0.04) 0.54(0.05) 0.54(0.05) 67 Table 3.4 Parameter estimation (Exponential covariance) TURE 1 0.2 1 2 0.2 1 3 0.2 1 4 0.2 1 5 0.2 1 6 0.2 1 7 0.2 1 8 0.2 1 9 0.2 1 p=36 p=400 p=1225 MLE 1.0(0.2) 0.2(0.1) 1.0(0.0) 2.0(0.3) 0.2(0.1) 1.0(0.1) 3.0(0.4) 0.2(0.0) 1.0(0.1) 4.1(0.6) 0.2(0.0) 1.0(0.1) 5.1(0.8) 0.2(0.0) 1.0(0.1) 6.1(0.9) 0.2(0.0) 1.0(0.1) 7.1(1.2) 0.2(0.0) 1.0(0.1) 8.1(1.4) 0.2(0.0) 1.0(0.1) 9.2(1.6) 0.2(0.0) 1.0(0.1) PMLE 1.0(0.2) 0.2(0.1) 1.0(0.0) 2.1(0.3) 0.2(0.1) 1.0(0.1) 3.1(0.4) 0.2(0.0) 1.0(0.1) 4.1(0.6) 0.2(0.0) 1.0(0.1) 5.1(0.8) 0.2(0.0) 1.0(0.1) 6.1(1.0) 0.2(0.0) 1.0(0.1) 7.1(1.2) 0.2(0.0) 1.0(0.1) 8.1(1.4) 0.2(0.0) 1.0(0.1) 9.1(1.6) 0.2(0.0) 1.0(0.1) MLE 1.0(0.1) 0.2(0.0) 1.0(0.0) 2(0.1) 0.2(0.0) 1.0(0.0) 3.0(0.1) 0.2(0.0) 1.0(0.0) 4.0(0.2) 0.2(0.01) 1.0(0.0) 5.0(0.3) 0.2(0.0) 1.0(0.0) 6.1(0.4) 0.2(0.0) 1.0(0.0) 7.1(0.5) 0.2(0.0) 1.0(0.0) 8.1(0.6) 0.2(0.0) 1.0(0.1) 9.1(0.7) 0.2(0.0) 1.0(0.1) PMLE 1.0(0.1) 0.2(0.0) 1(0.0) 2.0(0.1) 0.2(0.0) 1.0(0.0) 3.0(0.1) 0.2(0.0) 1.0(0.0) 4.0(0.2) 0.2(0.01) 1.0(0.0) 5.0(0.3) 0.2(0.0) 1.0(0.0) 6.1(0.4) 0.2(0.0) 1.0(0.0) 7.1(0.5) 0.2(0.0) 1.0(0.1) 8.1(0.6) 0.2(0.0) 1.0(0.1) 9.1(0.7) 0.2(0.01) 1.0(0.1) MLET 1.0(0.1) 0.3(0.1) 1.0(0.0) 2.1(0.1) 0.3(0.0) 1.0(0.0) 3.2(0.2) 0.3(0.0) 1.0(0.0) 4.6(0.4) 0.3(0.0) 1.0(0.0) 6.4(0.8) 0.3(0.0) 1.0(0.0) 8.8(1.4) 0.2(0.0) 1.0(0.0) 12.0(2.5) 0.2(0.0) 1.0(0.1) 16.6(4.6) 0.2(0.0) 1.0(0.1) 24.0(9.6) 0.2(0.0) 1.0(0.1) PMLET 1.0(0.1) 0.3(0.0) 1.0(0.0) 2.1(0.1) 0.3(0.0) 1.0(0.0) 3.2(0.2) 0.3(0.0) 1.0(0.0) 4.6(0.4) 0.3(0.0) 1.0(0.0) 6.4(0.8) 0.2(0.0) 1.0(0.0) 8.8(1.4) 0.2(0.0) 1.0(0.0) 12.0(2.6) 0.2(0.0) 1.0(0.1) 16.8(4.9) 0.2(0.0) 1.0(0.1) 24.3(10.3) 0.2(0.0) 0.99(0.1) MLET 1.2(0.1) 0.5(0.0) 1.0(0.0) 2.5(0.1) 0.5(0.0) 1.0(0.0) 3.9(0.2) 0.5(0.0) 1.0(0.0) 5.6(0.4) 0.5(0.0) 1.0(0.0) 7.6(0.7) 0.5(0.0) 0.9(0.0) 10.1(1.2) 0.5(0.0) 0.9(0.0) 13.3(1.9) 0.5(0.0) 0.9(0.0) 17.6(3.3) 0.4(0.0) 0.9(0.0) 23.6(5.9) 0.4(0.0) 0.9(0.0) PMLET 1.2(0.1) 0.5(0.0) 1.0(0.0) 2.5(0.1) 0.5(0.0) 1.0(0.0) 3.9(0.2) 0.5(0.0) 1.0(0.0) 5.6(0.4) 0.5(0.0) 1.0(0.0) 7.6(0.7) 0.5(0.0) 1.0(0.0) 10.1(1.1) 0.5(0.0) 1.0(0.0) 13.3(1.9) 0.4(0.0) 1.0(0.0) 17.7(3.3) 0.4(0.0) 1.0(0.0) 23.7(5.8) 0.4(0.0) 1.0(0.0) r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 r c σ r c σ r c σ r c σ r c σ r c σ r c σ r c σ r c σ Table 3.5 Number of variables selected (Exponential covariance, p=36) PMLE PREG FAIR p=36 r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 N-s 21(6.2) 20(5.9) 19(5.6) 20(5.6) 20(5.5) 20(5.2) 20(5.1) 20(4.8) 20(5.0) N-c 9(1.7) 9(1.7) 9(1.8) 9(1.7) 10(1.5) 10(1.2) 10(1.2) 10(1.0) 10(1.1) N-s 19(7.0) 19(6.8) 20(7.7) 20(7.4) 20(7.1) 20(7.4) 19(7.3) 20(7.7) 19(7.6) N-c 10(0.8) 10(0.9) 10(0.8) 10(0.4) 10(0) 10(0.1) 10(0) 10(0) 10(0) N-s 6(4.0) 5(3.3) 4(2.4) 3(1.6) 3(1.5) 3(1.4) 3(1.4) 3(1.4) 3(1.4) N-c 5(2.4) 4(1.9) 3(1.7) 3(1.5) 3(1.5) 3(1.4) 3(1.4) 3(1.4) 3(1.3) Key: N-s: the number 
of variables selected by the model; N-c: the number of correct variables selected by the model. Table 3.6 Number of variables selected (Exponential covariance, p=400) PMLE PREG PMLET PREGT FAIR p=400 N-s r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 85(55.3) 92(39.3) 82(31.5) 80(35.7) 74(32.2) 67(27.8) 59(20.5) 57(22.3) 53(21.9) N-c 9(1.5) 10(1.4) 10(1.2) 10(0.8) 10(0.6) 10(0.6) 10(0.5) 10(0.5) 10(0.5) N-s 42(55.2) 56(77.8) 50(69.1) 53(68.6) 51(73.3) 64(91.7) 56(91) 61(99) 64(98.5) N-c 9(1.4) 9(1.1) 9(1.2) 9(1.1) 9(1.2) 10(1.1) 10(1.1) 10(1.1) 10(1.1) N-s 82(51.3) 90(38.9) 80(37) 72(29.7) 65(32.2) 58(24.6) 56(35.1) 49(28.9) 47(35.7) N-c 10(1) 10(1.2) 10(1.4) 10(0.7) 10(0.7) 10(0.4) 10(0.5) 10(0.5) 10(0.4) N-s 46(64.7) 44(62.3) 51(76.9) 55(79.9) 65(97.5) 64(96) 64(91.4) 62(85) 65.(91.8) N-c 9(1.4) 9(1.4) 9(1.3) 10(1.1) 10(0.9) 10(1.1) 10(0.5) 10(1.2) 10(0.7) N-s 21(15.1) 20(15.6) 18(14.7) 14(11.7) 11(10.6) 9(9.2) 8(8.5) 8(8.2) 7(7.6) N-c 7(2.1) 7(2.5) 6(2.6) 6(2.7) 5(2.8) 5(2.8) 4(2.7) 4(2.8) 4(2.7) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 68 Table 3.7 Number of variables selected (Exponential covariance, p=1225) PMLTT PREGT FAIR p=1225 N-s r=1 r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 116(185.7) 143(192.3) 133(135.1) 104(91.5) 95(77.4) 76(74.0) 66(54.3) 48(46.3) 48(48.1) N-c 8(2.1) 9(2.2) 9(1.8) 9(1.3) 10(1.0) 10(0.8) 10(0.7) 10(0.6) 10(0.7) N-s 52(97.2) 53(128.3) 63(151.9) 56(129.7) 56(115.2) 49(107.9) 49(87.2) 53(110.2) 55(104.3) N-c 9(1.8) 9(1.7) 9(1.7) 9(1.7) 9(1.7) 9(2.0) 9(1.8) 9(1.6) 9(1.7) N-s 31(22.3) 37(22.7) 34(24.5) 28(19.6) 26(20.6) 23(18.1) 21(16.2) 18(15.6) 15(14.0) N-c 7.0(1.7) 7.0(2.1) 6.9(2.4) 6.5(2.8) 6.4(2.8) 6.2(3.0) 6.1(3.1) 5.6(3.3) 5.3(3.4) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 
Table 3.8 Classification Accuracy Rate (Polynomial covariance, p=36) TURE MLE 0.94(0.02) 0.92(0.02) 0.90(0.02) 0.88(0.02) 0.88(0.02) 0.88(0.02) 0.89(0.02) 0.91(0.02) 0.94(0.02) 0.99(0.01) 0.92(0.02) 0.89(0.03) 0.86(0.03) 0.84(0.03) 0.83(0.03) 0.83(0.03) 0.85(0.03) 0.88(0.02) 0.92(0.02) 0.98(0.01) PREG p=36 0.92(0.03) 0.89(0.03) 0.86(0.04) 0.84(0.04) 0.83(0.03) 0.83(0.04) 0.85(0.04) 0.88(0.03) 0.92(0.03) 0.97(0.02) ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 PMLE FAIR NB 0.92(0.02) 0.89(0.03) 0.86(0.04) 0.84(0.05) 0.83(0.05) 0.84(0.04) 0.85(0.04) 0.88(0.04) 0.92(0.03) 0.97(0.02) 0.89(0.04) 0.87(0.04) 0.84(0.04) 0.81(0.04) 0.78(0.04) 0.76(0.05) 0.74(0.04) 0.72(0.05) 0.71(0.05) 0.70(0.05) 0.92(0.04) 0.89(0.04) 0.87(0.04) 0.84(0.04) 0.81(0.04) 0.79(0.05) 0.77(0.04) 0.75(0.05) 0.74(0.05) 0.73(0.05) Table 3.9 Classification Accuracy Rate (Polynomial covariance, p=400) TURE MLE MLET 0.94(0.02) 0.93(0.02) 0.92(0.02) 0.91(0.02) 0.92(0.02) 0.92(0.02) 0.93(0.02) 0.95(0.01) 0.97(0.01) 1.0(0) 0.79(0.03) 0.76(0.03) 0.75(0.03) 0.74(0.03) 0.74(0.03) 0.75(0.03) 0.77(0.03) 0.81(0.03) 0.87(0.02) 0.97(0.01) 0.79(0.03) 0.77(0.03) 0.76(0.03) 0.74(0.03) 0.75(0.03) 0.76(0.03) 0.78(0.03) 0.81(0.03) 0.87(0.02) 0.97(0.01) PREG p=400 0.89(0.04) 0.87(0.03) 0.85(0.04) 0.84(0.05) 0.83(0.05) 0.84(0.05) 0.86(0.05) 0.88(0.05) 0.93(0.04) 0.98(0.02) ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 PREGT PMLE PMLET FAIR NB 0.89(0.04) 0.87(0.04) 0.85(0.04) 0.85(0.05) 0.84(0.05) 0.84(0.05) 0.86(0.04) 0.88(0.05) 0.93(0.04) 0.98(0.02) 0.90(0.03) 0.86(0.04) 0.83(0.05) 0.82(0.05) 0.83(0.05) 0.85(0.05) 0.88(0.04) 0.92(0.03) 0.96(0.03) 0.99(0.02) 0.90(0.03) 0.87(0.04) 0.84(0.04) 0.84(0.04) 0.84(0.04) 0.86(0.05) 0.89(0.04) 0.92(0.03) 0.96(0.03) 0.99(0.02) 0.88(0.04) 0.87(0.04) 0.86(0.04) 0.83(0.04) 0.82(0.04) 0.80(0.04) 0.78(0.04) 0.75(0.04) 0.73(0.04) 0.70(0.06) 0.79(0.03) 0.78(0.03) 0.77(0.03) 0.74(0.04) 0.71(0.04) 0.68(0.04) 0.65(0.04) 0.61(0.05) 0.59(0.05) 0.56(0.05) Table 3.10 Classification Accuracy Rate (Polynomial covariance, p=1225) TURE MLET 0.94(0.02) 0.93(0.02) 0.92(0.02) 0.91(0.02) 0.91(0.02) 0.92(0.02) 0.93(0.02) 0.95(0.02) 0.97(0.01) 1.00(0) 0.70(0.04) 0.68(0.04) 0.67(0.03) 0.67(0.03) 0.67(0.03) 0.68(0.03) 0.69(0.03) 0.71(0.03) 0.75(0.04) 0.85(0.04) PREGT p=1225 0.88(0.04) 0.86(0.05) 0.83(0.05) 0.83(0.04) 0.82(0.05) 0.83(0.06) 0.84(0.05) 0.86(0.06) 0.89(0.07) 0.95(0.04) ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 PMLET FAIR NB 0.88(0.04) 0.85(0.05) 0.81(0.06) 0.80(0.06) 0.80(0.06) 0.83(0.05) 0.86(0.04) 0.89(0.04) 0.94(0.02) 0.98(0.02) 0.88(0.05) 0.86(0.04) 0.85(0.05) 0.83(0.04) 0.81(0.04) 0.79(0.04) 0.76(0.05) 0.74(0.06) 0.71(0.07) 0.69(0.07) 0.70(0.05) 0.69(0.04) 0.68(0.05) 0.66(0.04) 0.64(0.04) 0.61(0.04) 0.59(0.05) 0.57(0.06) 0.55(0.07) 0.53(0.07) 69 Table 3.11 Parameter estimation (Polynomial covariance) p=36 p=400 p=1225 TURE ρ = 0 σ = 1 ρ = 0.1 σ = 1 ρ = 0.2 σ = 1 ρ = 0.3 σ = 1 ρ = 0.4 σ = 1 ρ = 0.5 σ = 1 ρ = 0.6 σ = 1 ρ = 0.7 σ = 1 ρ = 0.8 σ = 1 ρ = 0.9 σ=1 MLE 0.0(0.0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.3(0.0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.1) 0.7(0.0) 1.0(0.1) 0.8(0.0) 1.0(0.1) 0.9(0.0) 1.0(0.1) PMLE 0.0(0.0) 1(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.3(0.0) 1(0.0) 0.4(0.0) 1(0.0) 0.5(0.0) 1(0.0) 0.6(0.0) 1(0.1) 0.7(0.0) 1.0(0.1) 0.8(0.0) 1.0(0.1) 0.9(0.01) 1.1(0.1) MLE 0(0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.3(0.0) 1.0(0.01) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.0) 1.0(0.0) 0.8(0.0) 1.0(0.0) 0.9(0.0) 1.0(0.1) MLET 0(0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 
0.2(0.0) 1(0.0) 0.3(0.0) 1(0.0) 0.4(0.0) 1(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.01) 1.0(0.0) 0.8(0.0) 1.0(0.0) 0.9(0.01) 1.0(0.1) PMLE 0(0) 1.0(0.0) 0.1(0.1) 1.0(0.0) 0.1(0.1) 1.0(0.0) 0.3(0.0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.0) 1.0(0.0) 0.8(0.0) 1.0(0.1) 0.9(0.0) 1.0(0.1) 0(0) PMLET MLET 0(0) 1.0(0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.2(0) 1.0(0.0) 0.3(0.0) 1.0(0.0) 0.3(0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 0.9(0.0) 0.5(0.0) 0.9(0.0) 0.7(0.0) 0.8(0.0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1(0.0) 0.3(0.0) 1.0(0.0) 0.4(0.0) 1(0.0) 0.5(0.0) 1(0.0) 0.6(0.0) 1.0(0.0) 0.7(0.0) 1.0(0.0) 0.8(0.0) 1.0(0.1) 0.9(0.0) 1.1(0.1) PMLET 0(0) 1.0(0.0) 0.1(0.0) 1.0(0.0) 0.2(0.0) 1.0(0.0) 0.2(0) 1(0.0) 0.3(0.0) 1.0(0.1) 0.3(0) 1.0(0.0) 0.4(0.0) 1.0(0.0) 0.5(0.0) 1.0(0.0) 0.5(0.0) 0.9(0.0) 0.7(0.0) 0.9(0.0) Table 3.12 Number of variables selected (Polynomial covariance, p=36) p=36 PMLE PREG FAIR ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 N-s 20(5.0) 21(5.4) 20(5.2) 21(6.2) 20(5.9) 21(5.6) 21(5.6) 20(5.9) 19(5.6) 15(5.9) N-c 10(0.4) 10(0.7) 10(1.4) 10(1.8) 10(1.8) 10(1.7) 10(1.6) 10(1.9) 9(2.2) 7(2.8) N-s 17(7.2) 18(6.5) 18(6.4) 19(7.0) 20(7.0) 20(6.9) 21(7.3) 20(7.3) 19(7.7) 15(7.9) N-c 10(0.7) 10(0.6) 10(0.4) 10(0.8) 10(0.8) 10(0.7) 10(0.7) 10(0.2) 10(0) 10(0.3) N-s 9(3.0) 8(3.2) 7(3.7) 7(4.3) 6(4.1) 5(3.9) 4(2.9) 3(2.4) 3(1.5) 2(1.1) N-c 7(2.0) 7(2.3) 6(2.5) 6(2.4) 5(2.4) 4(2.2) 4(2) 3(1.8) 3(1.5) 2(1.0) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. Table 3.13 Number of variables selected (Polynomial covariance, p=400) p=400 PMLE PREG PMLET PREGT FAIR ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 N-s 46(47.1) 51(51.9) 77(57.3) 84(50.5) 97(47.3) 93(42.7) 85(32.4) 78(26.4) 60(27.9) 29(30) N-c 9(1) 9(1.2) 9(1.4) 9(1.5) 10(1.5) 10(1.5) 10(1.3) 10(0.7) 10(1.2) 9(1.5) N-s 38.5(49.8) 41(50.4) 35(54.5) 40(51.9) 59(75.3) 55(78.7) 52(75.2) 54(69.4) 68(95.6) 55(97.1) N-c 9(1.3) 9(1.1) 9(1.5) 9(1.4) 9(1.3) 9(1.1) 9(1.2) 10(1.2) 10(0.5) 10(1.2) N-s 47.4(47.4) 52(57.6) 73(60.6) 82(48.9) 88(43.3) 92(42.6) 83(31.2) 72(24.9) 56(27.5) 29(24.8) N-c 9.4(0.9) 9(1.1) 9(1.3) 9(1.2) 10(0.9) 10(1.2) 10(1) 10(0.7) 10(1.1) 9(1.6) N-s 38(49.6) 36(46.7) 39(58.5) 39(56.4) 52(70.3) 46(57.1) 47(63.2) 59(78.6) 69(96) 56(98.1) N-c 9(1.3) 9.1(1.2) 9(1.5) 9(1.3) 9(1.3) 9(1.2) 9(1.2) 10(1.3) 10(0.5) 10(1.2) N-s 15(12.1) 17(13.1) 18(12.8) 19(13.9) 22(15.7) 22(15.6) 21(15.8) 16(13) 10(9.6) 5(5.8) N-c 7(1.9) 7(1.9) 7(1.9) 7(2.1) 7(2.2) 7(2.4) 7(2.4) 6(2.8) 5(3) 3(2.6) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 70 Table 3.14 Number of variables selected (Polynomial covariance, p=1225) p=1225 PMLTT PREGT FAIR ρ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 N-s 43(67.9) 53(73.5) 112(136.7) 121(147.5) 166(160.5) 152(99.5) 140(91.2) 113(82.5) 87(77.6) 25(40.2) N-c 9(1.4) 8(1.7) 8(2.2) 8(2.3) 8(2.5) 9(1.6) 9(1.4) 9(1.5) 10(0.7) 10(1.1) N-s 35(48.0) 35(49.1) 45(70.1) 52(87.2) 58(123.7) 49(74.9) 58(96.0) 79(184.8) 82(171.0) 76(173.7) N-c 9(1.6) 8(1.8) 8(1.8) 9(1.6) 9(1.7) 9(2.1) 9(2.1) 9(1.8) 9(2.0) 9(1.8) N-s 17(13.2) 24(17.7) 27(19.0) 31(19.4) 34(19.8) 36(21.6) 38(23.5) 31(22.2) 23(19.7) 12(11.8) N-c 7(1.8) 7(1.7) 7(1.7) 7(1.8) 7(1.9) 7(2.0) 7(2.3) 7(2.8) 6(3.3) 5(3.6) Key: N-s: the number of variables selected by the model; N-c: the number of correct variables selected by the model. 
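To make the meaning of the accuracy entries in these simulation tables concrete, the following is a small illustrative sketch in R, not the actual simulation code of Section 3.4: two Gaussian classes sharing an exponential spatial covariance on a regular grid are simulated, and the LDA rule with the true parameters plugged in (our reading of the benchmark reported in the TRUE columns) is evaluated on the draws. The grid size, range, sparse mean difference, and sample size below are arbitrary placeholders rather than the settings used in the thesis.

    ## Illustrative sketch only: two spatially correlated Gaussian classes on a
    ## grid, classified by the LDA rule with the true parameters plugged in.
    set.seed(1)

    side  <- 6; p <- side^2                        # 6 x 6 grid, p = 36
    sites <- expand.grid(x = 1:side, y = 1:side)
    D     <- as.matrix(dist(sites))                # pairwise site distances
    r     <- 3; Sigma <- exp(-D / r)               # exponential correlation, range r

    mu1 <- rep(0, p)
    mu2 <- mu1; mu2[1:10] <- 0.7                   # sparse mean difference (placeholder)

    # draw N(mu, Sigma) samples via the Cholesky factor
    rmvn <- function(n, mu, Sigma) {
      L <- chol(Sigma)                             # upper triangular, t(L) %*% L = Sigma
      sweep(matrix(rnorm(n * length(mu)), n) %*% L, 2, mu, "+")
    }
    n <- 100
    X <- rbind(rmvn(n, mu1, Sigma), rmvn(n, mu2, Sigma))
    y <- rep(1:2, each = n)

    # LDA rule with the true (mu1, mu2, Sigma): assign class 1 when the score is positive
    a     <- solve(Sigma, mu1 - mu2)
    score <- (X - matrix((mu1 + mu2) / 2, nrow(X), p, byrow = TRUE)) %*% a
    mean(ifelse(score > 0, 1, 2) == y)             # empirical accuracy

The empirical accuracy returned by the last line is the kind of quantity a single cell in these tables would correspond to, up to averaging over repeated replications (standard deviation in parentheses).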
Table 3.15 Classification Accuracy Rate (Mis-specified covariance, p=36). Entries are mean accuracy rates with standard deviations in parentheses; methods in rows, r = 1, ..., 9 in columns.

p=36    r=1          r=2          r=3          r=4          r=5          r=6          r=7          r=8          r=9
TRUE    0.887(0.02)  0.879(0.02)  0.904(0.02)  0.925(0.02)  0.939(0.02)  0.951(0.02)  0.958(0.01)  0.963(0.01)  0.969(0.01)
MLE     0.841(0.03)  0.851(0.03)  0.888(0.03)  0.907(0.02)  0.922(0.02)  0.935(0.02)  0.943(0.02)  0.95(0.02)   0.956(0.01)
PREG    0.84(0.04)   0.85(0.04)   0.883(0.03)  0.906(0.03)  0.922(0.02)  0.935(0.02)  0.943(0.02)  0.949(0.02)  0.954(0.02)
PMLE    0.843(0.05)  0.848(0.04)  0.889(0.03)  0.91(0.03)   0.925(0.03)  0.935(0.02)  0.946(0.02)  0.953(0.02)  0.959(0.02)
FAIR    0.826(0.05)  0.755(0.05)  0.727(0.06)  0.713(0.05)  0.709(0.05)  0.701(0.06)  0.7(0.05)    0.701(0.05)  0.705(0.05)
NB      0.854(0.05)  0.771(0.05)  0.746(0.06)  0.737(0.05)  0.735(0.05)  0.734(0.06)  0.735(0.05)  0.736(0.05)  0.736(0.05)

Table 3.16 Classification Accuracy Rate (Mis-specified covariance, p=400)

p=400   r=1          r=2          r=3          r=4          r=5          r=6          r=7          r=8          r=9
TRUE    0.91(0.02)   0.93(0.02)   0.95(0.01)   0.97(0.01)   0.98(0.01)   0.98(0.01)   0.99(0.01)   0.99(0.01)   1.00(0.01)
MLE     0.73(0.03)   0.76(0.03)   0.82(0.03)   0.85(0.02)   0.88(0.02)   0.90(0.02)   0.92(0.02)   0.93(0.02)   0.94(0.02)
MLET    0.74(0.03)   0.78(0.03)   0.82(0.03)   0.86(0.02)   0.88(0.02)   0.90(0.02)   0.87(0.06)   0.69(0.11)   0.61(0.05)
PREG    0.84(0.05)   0.86(0.04)   0.89(0.04)   0.92(0.03)   0.94(0.03)   0.95(0.03)   0.96(0.03)   0.96(0.03)   0.97(0.02)
PREGT   0.84(0.05)   0.87(0.04)   0.90(0.04)   0.93(0.03)   0.94(0.03)   0.95(0.03)   0.94(0.04)   0.86(0.07)   0.84(0.06)
PMLE    0.83(0.05)   0.89(0.04)   0.93(0.03)   0.96(0.02)   0.97(0.02)   0.97(0.02)   0.98(0.01)   0.98(0.01)   0.98(0.01)
PMLET   0.84(0.04)   0.90(0.04)   0.93(0.03)   0.96(0.02)   0.96(0.02)   0.97(0.01)   0.96(0.02)   0.88(0.05)   0.87(0.03)
FAIR    0.85(0.05)   0.79(0.04)   0.76(0.04)   0.74(0.04)   0.73(0.04)   0.71(0.05)   0.71(0.07)   0.70(0.07)   0.70(0.06)
NB      0.75(0.03)   0.67(0.04)   0.62(0.04)   0.60(0.04)   0.59(0.04)   0.58(0.04)   0.57(0.04)   0.56(0.04)   0.56(0.04)

Table 3.17 Classification Accuracy Rate (Mis-specified covariance, p=1225)

p=1225  r=1          r=2          r=3          r=4          r=5          r=6          r=7          r=8          r=9
TRUE    0.91(0.02)   0.93(0.02)   0.95(0.02)   0.97(0.01)   0.98(0.01)   0.98(0.01)   0.99(0.01)   0.99(0.01)   0.99(0.01)
MLET    0.66(0.04)   0.69(0.03)   0.72(0.03)   0.74(0.03)   0.75(0.04)   0.76(0.04)   0.75(0.04)   0.72(0.05)   0.65(0.07)
PREGT   0.83(0.04)   0.84(0.05)   0.87(0.05)   0.89(0.06)   0.90(0.06)   0.91(0.06)   0.91(0.06)   0.90(0.07)   0.86(0.09)
PMLET   0.81(0.05)   0.88(0.04)   0.92(0.04)   0.94(0.02)   0.95(0.02)   0.95(0.02)   0.95(0.02)   0.94(0.02)   0.90(0.05)
FAIR    0.83(0.05)   0.78(0.05)   0.75(0.04)   0.73(0.05)   0.73(0.05)   0.72(0.05)   0.712(0.05)  0.71(0.06)   0.71(0.06)
NB      0.67(0.05)   0.60(0.05)   0.57(0.04)   0.55(0.05)   0.54(0.05)   0.54(0.05)   0.54(0.05)   0.54(0.06)   0.53(0.06)

3.5 Real Data Analysis

Data used in the real data example were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI), which was launched in 2004 with the aim of improving clinical trials for the prevention and treatment of Alzheimer's disease (AD). In the interest of promoting consistency in data analysis, the ADNI Core has created standardized analysis sets of the structural MRI scans, comprising only image data that have passed quality control (QC) assessments conducted at the Aging and Dementia Imaging Research Laboratory at the Mayo Clinic (see [27]). In this study, we used T1-weighted MRI images from the collection of standardized datasets. A description of the standardized MRI imaging from ADNI can be found at http://adni.loni.usc.edu/methods/mri-analysis/adni-standardized-data/ and in [54].
According to [27], the images were obtained using magnetization prepared rapid gradient echo (MPRAGE) or equivalent protocols with varying resolutions (typically 1.0 × 1.0 mm in-plane spatial resolution and 1.2 mm thick sagittal slices with 256 × 256 × 166 voxels). The images were then pre-processed according to a number of steps detailed in [27] and http://adni.loni.usc.edu/methods/mri-analysis/mri-pre-processing/, which corrected gradient non-linearity, intensity inhomogeneity and phantom-based distortion. In addition, the pre-processed images were processed with FreeSurfer for cortical reconstruction and volumetric segmentation by the Center for Imaging of Neurodegenerative Diseases, UCSF. The skull-stripped volumes (brain masks) obtained from the FreeSurfer cross-sectional processing were used in this study.

Only images from ADNI-1 subjects obtained using 1.5 T scanners at screening visits were used in this study, and we used the first time point if there were multiple images of the same subject acquired at different times. The study contains 187 subjects diagnosed with Alzheimer's disease at the screening visit and 227 healthy subjects, for a total of 414 subjects. Details of the subjects can be found in Table 3.18.

Table 3.18 Subjects characteristics

                  AD              NL              p-value
n                 187             227
Age (Mean±sd)     75.28 ± 7.55    75.80 ± 4.98    0.4168
Gender (F/M)      88/99           110/117         0.813
MMSE (Mean±sd)    23.28 ± 2.04    29.11 ± 1.00    < 1e−15

Key: AD, subjects with Alzheimer's disease; NL, healthy subjects; Age, baseline age; MMSE, baseline Mini-Mental State Examination.

After downloading the pre-processed imaging data from ADNI, the R package ANTsR was applied for image registration. We then used the "3dresample" command of the AFNI software ([16]) to adjust the resolution and reduce the total number of voxels per image to 18 × 22 × 18. Taking the x and y axes for the horizontal plane, the x and z axes for the coronal plane, and the y and z axes for the sagittal plane, only the 1100 voxels located in the center of the brain were used as features for classification (i.e., the voxels with x coordinate from 5 to 14, y coordinate from 6 to 16, and z coordinate from 5 to 14). After removing the voxels with zero signal for most of the subjects (more than 409 subjects), 1077 voxels remained in use.

The distance between each pair of voxels can be calculated from their coordinates. For example, for two voxels $s_1$ and $s_2$ with coordinates $s_1 = (x_1, y_1, z_1)$ and $s_2 = (x_2, y_2, z_2)$, the distance between $s_1$ and $s_2$ is defined by
$$d(s_1, s_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}.$$
From the plot of the sample covariance, spatial correlations can be observed among the voxels. (A short computational sketch of these steps is given after Table 3.19 below.)

We randomly sampled 100 of the 187 AD subjects and 100 of the 227 healthy subjects as the training set, leaving 87 AD subjects and 127 healthy subjects. The testing set includes the remaining 87 AD subjects and a random sample of 87 of the 127 remaining healthy subjects. Details of the subjects in the training and testing sets can be found in Table 3.19.

Table 3.19 Subjects characteristics of training and testing set

                         training set     testing set     p-value
AD   n                   100              87
     Age (Mean±sd)       75.64 ± 7.39     74.85 ± 7.75    0.478
     Gender (F/M)        47/53            41/46           0.999
     MMSE (Mean±sd)      23.22 ± 2.08     23.36 ± 2.01    0.649
NL   n                   100              87
     Age (Mean±sd)       75.99 ± 5.39     75.34 ± 4.56    0.3723
     Gender (F/M)        42/58            50/37           0.05
     MMSE (Mean±sd)      29.06 ± 1.04     29.09 ± 1.01    0.8307

Key: AD, subjects with Alzheimer's disease; NL, healthy subjects; Age, baseline age; MMSE, baseline Mini-Mental State Examination.
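To make the voxel-distance and plug-in classification steps above concrete, here is a minimal sketch in R. It is illustrative only and is not the code used in the thesis: the working covariance below uses the simplified exponential form σ² exp(−h/r), whereas the exact nugget/scale parameterization of the spatial model in this chapter may differ; the values of r and σ² are the PMLE estimates quoted below, inserted purely as placeholders; and mu1.hat and mu2.hat stand for class mean vectors estimated from the training images.

    ## Minimal sketch: voxel grid, pairwise distances d(s_i, s_j), an exponential
    ## working covariance, and the plug-in LDA score used for classification.

    # Central voxel block used as features: x in 5:14, y in 6:16, z in 5:14
    coords <- expand.grid(x = 5:14, y = 6:16, z = 5:14)   # 10 * 11 * 10 = 1100 voxels

    # Pairwise Euclidean distances between voxel centers
    D <- as.matrix(dist(coords))                          # 1100 x 1100

    # Exponential working covariance (simplified form; placeholder parameter values)
    r <- 6.59; sigma2 <- 230.38
    Sigma.hat <- sigma2 * exp(-D / r)

    # Plug-in LDA score for a new image z over the same voxels:
    # a positive score classifies the image into class 1 (e.g., AD)
    lda.score <- function(z, mu1.hat, mu2.hat, Sigma.hat) {
      a <- solve(Sigma.hat, mu1.hat - mu2.hat)            # Sigma^{-1} (mu1 - mu2)
      drop(crossprod(z - (mu1.hat + mu2.hat) / 2, a))
    }

In the PLDA fit, the same score is computed only over the voxels retained by the penalty (5 of the 1077 here) and with all quantities replaced by their penalized maximum likelihood estimates.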
Table 3.20 Classification performance for voxel level MRI data. Training and testing samples are of sizes 200 and 174, respectively

                          MLE     PREG    PMLE    FAIR    NB
Accuracy                  0.707   0.747   0.776   0.621   0.689
No. of training err       38      47      46      77      55
No. of testing err        51      44      39      66      54
No. of selected voxels    1077    3       5       7       1077

We assume an exponential correlation among the voxels and apply the PLDA method proposed in this research for classification. First, the parameters are estimated by PMLE from the training data, giving r = 6.59, c = 0.924, σ² = 230.38, with 5 voxels simultaneously selected for classification. We then plug these estimates into the classification function and classify the testing data. The classification accuracy rate of PLDA is listed in Table 3.20, together with the classification accuracy of the other methods. The accuracy rate of our method is about 77.6%, which is superior to the other comparable methods (MLE-LDA: 70.7%, PREG-LDA: 74.7%, FAIR: 62.1%, and NB: 68.9%).

3.6 Proofs of the main results

3.6.1 Proofs for classification using MLE

Lemma 3.6.1. Let $\epsilon$ be a $p$-dimensional random vector with $\epsilon \sim N(0, \Sigma)$, where $\Sigma$ is a $p \times p$ positive definite covariance matrix. For an $m$-dimensional vector $u$ with $\|u\|_2 = \sqrt{u^T u} = C$ and a $p \times m$ matrix $X$, we have
$$\left|\epsilon^T X u\right| = O_p\left(\sqrt{\mathrm{tr}(X^T \Sigma X)}\,\|u\|_2\right). \quad (3.6.1)$$

Proof. Since $E(\epsilon^T X u) = 0$, by the Cauchy–Schwarz inequality,
$$E(\epsilon^T X u)^2 \le E(\epsilon^T X X^T \epsilon)\,\|u\|_2^2 = \mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2. \quad (3.6.2)$$
By Chebyshev's inequality, for any $M$,
$$P\left(\frac{|\epsilon^T X u|}{\sqrt{\mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2}} > M\right) \le \frac{E(\epsilon^T X u)^2}{M^2\,\mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2} \le \frac{1}{M^2}. \quad (3.6.3)$$
Thus for any $\varepsilon > 0$ there exists $M$ large enough such that
$$P\left(\frac{|\epsilon^T X u|}{\sqrt{\mathrm{tr}(X^T \Sigma X)\,\|u\|_2^2}} > M\right) < \varepsilon. \quad (3.6.4)$$
This leads to $\left|\epsilon^T X u\right| = O_p\left(\sqrt{\mathrm{tr}(X^T \Sigma X)}\,\|u\|_2\right)$.

Lemma 3.6.2. Let $\epsilon_i$ ($i = 1, 2, \ldots, n$) be $p$-dimensional random vectors with $\epsilon_i \sim N(0, c(n)\Sigma)$, where $c(n)$ is a function of $n$ and $\Sigma$ is a $p \times p$ positive definite covariance matrix with $\lambda_{\max}(\Sigma) < \infty$. For a $p \times p$ matrix $A$, we have
$$\sum_{i=1}^{n}\left[\epsilon_i^T A \epsilon_i - c(n)\,\mathrm{tr}(A\Sigma)\right] = O_p\left(c(n)\sqrt{n}\,\|A\|_F\right).$$

Proof. Since $E(\epsilon_i^T A \epsilon_i) = c(n)\,\mathrm{tr}(A\Sigma)$, we have
$$E\left(\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right)^2 = \sum_{i=1}^{n} E\left(\epsilon_i^T A \epsilon_i - c(n)\,\mathrm{tr}(A\Sigma)\right)^2 = \sum_{i=1}^{n} E(\epsilon_i^T A \epsilon_i)^2 - n c^2(n)\,\mathrm{tr}^2(A\Sigma). \quad (3.6.5)$$
Let $B = c(n)\,\Sigma^{1/2} A\, \Sigma^{1/2}$; then there exists an orthogonal matrix $Q$ such that $B = Q^T \Lambda Q$, where $\Lambda = \mathrm{diag}(\lambda_j)$ and the $\lambda_j$ are the eigenvalues of $B$. Let $\tilde{\epsilon}_i = Q\,\Sigma^{-1/2}\epsilon_i/\sqrt{c(n)}$; then $\tilde{\epsilon}_i \sim N(0, I_{p\times p})$, where $I_{p\times p}$ is the identity matrix, $\epsilon_i^T A \epsilon_i = \tilde{\epsilon}_i^T \Lambda \tilde{\epsilon}_i$, and
$$E(\epsilon_i^T A \epsilon_i)^2 = E\left(\sum_{j=1}^{p}\lambda_j \tilde{\epsilon}_{ij}^2\right)^2 = 3\sum_{j=1}^{p}\lambda_j^2 + \sum_{j\neq k}\lambda_j\lambda_k = 2\,\mathrm{tr}(B^T B) + \mathrm{tr}^2(B) = c(n)^2\left[2\,\mathrm{tr}(\Sigma A \Sigma A) + \mathrm{tr}^2(A\Sigma)\right]. \quad (3.6.6)$$
Hence
$$E\left(\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right)^2 = 2 n c(n)^2\,\mathrm{tr}(\Sigma A^T \Sigma A) \le 2 n c(n)^2\,\lambda_{\max}^2(\Sigma)\,\mathrm{tr}(A^T A). \quad (3.6.7)$$
By Chebyshev's inequality, for any $M$ we have
$$P\left(\frac{\left|\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right|}{\sqrt{n c(n)^2\,\|A\|_F^2}} > M\right) \le \frac{2 n c(n)^2\,\mathrm{tr}(\Sigma A^T \Sigma A)}{M^2\, n c(n)^2\,\|A\|_F^2} \le \frac{2\lambda_{\max}^2(\Sigma)}{M^2}. \quad (3.6.8)$$
Then for any $\varepsilon > 0$ there exists $M$ large enough such that
$$P\left(\frac{\left|\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma)\right|}{\sqrt{n c(n)^2\,\|A\|_F^2}} > M\right) < \varepsilon, \quad (3.6.9)$$
which means $\sum_{i=1}^{n}\epsilon_i^T A \epsilon_i - n c(n)\,\mathrm{tr}(A\Sigma) = O_p\left(c(n)\sqrt{n}\,\|A\|_F\right)$.

Proof of Theorem 1.
Take derivative with respect to µk (k = 1, 2) with the function L(θ, µ1, µ2) defined in 3.1.6. Considering Σ−1(θ) is nonsingular, we have µk = ˆµkM LE = ¯Y k·. For ˆθM LE, we first consider the case of p/n → 0. It is sufficient to prove that for any given  > 0, there is a large constant C such that for large p and n, the smallest rate of (cid:113) 1 convergence ηn,p is np such that we have L(θ0 + uηn,p, ˆµ1, ˆµ2) < L(θ0, ˆµ1, ˆµ2)) > 1 −  (3.6.10) P ( sup (cid:107)u(cid:107) 2 =C where u ∈ Rq. This implies that there exists a local maximum for the function L in the neighborhood of θ0 with the radius at most proportional to ηn,p. L(θ0 + uηn,p, ˆµ1, ˆµ2) − L(θ0, ˆµ1, ˆµ2) (3.6.11) = ( ∂L ∂θ = −n 2 (θ0))T uηn,p + 1 uT T (θ0)uη2 n,p + ( (θ∗))uη2 ∂2L ∂θ∂θT (θ0))T uηn,p + 1 n,p 2 uT ( ∂L ∂θ 2 uT ( ∂2L ∂θ∂θT (θ∗) + mnT (θ0))uη2 n,p 77 = (I) + (II) + (III) where T (θ0) is a q × q matrix with its (i, j)th element tij(θ0) = tr(Σ−1ΣiΣ−1Σj). From A4 and A5, tij = aij(tii) 1 2 (tjj) 1 2 ≥ aijλ−2 min(Σ)(cid:107)Σi(cid:107) F . There exists a (cid:13)(cid:13)Σj (cid:13)(cid:13) F constant M such that (I) = −n 2 uT T uη2 n,p = −n 2 q(cid:88) i,j=1 tijuiujη2 n,p ≤ −M np 2 n,p (cid:107)u(cid:107)2 η2 2 (3.6.12) nk(cid:88) q(cid:88) (cid:20) i=1 j=1 (Y ki − ˆµk)T Σj(Y ki − ˆµk) − ( (cid:21) )tr(ΣΣj) nk − 1 nk uθj ηn,p (3.6.13) and (II) = − 1 2(cid:88) q(cid:88) k=1 2 j=1 = (1) + (2) + 1 2 tr(ΣΣj)uθj ηn,p n (cid:13)(cid:13)(cid:13)Σj(cid:13)(cid:13)(cid:13) ≤(cid:13)(cid:13)Σi(cid:13)(cid:13)2 2 √ ηn,p) = Op( np(cid:107)u(cid:107) (cid:107)u(cid:107) 2 ηn,p) 2 (3.6.14) F = pλ2 max(Σi) = O(p). Also by A6(i), tr(ΣΣj) = tr(Σ1/2ΣjΣ1/2) ≤ λmax(Σj)tr(Σ). Then noticing that tr(Σ) = nk−1 nk Because Y ki − ˆµk ∼ N (0, 2(cid:88) |(1)| = Op( k=1 nk − 1√ nk (cid:13)(cid:13)(cid:13)Σj(cid:13)(cid:13)(cid:13) Σ), by lemma 3.6.2, √ ηn,p) = Op( (cid:107)u(cid:107) 2 F The last equality is because from A6(i),(cid:13)(cid:13)Σi(cid:13)(cid:13)2 F 78 O(p), and p/n → 0 |(2)| = Thus (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 2 q(cid:88) j=1 tr(ΣΣj)ujηn,p (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = Op(tr(Σ)ηn,p (cid:107)u(cid:107) 2 ) = Op(pηn,p (cid:107)u(cid:107) ) 2 √ (II) = Op(( np + p)ηn,p (cid:107)u(cid:107) ) 2 √ • If p/n → 0, (II) = Op( np)ηn,p. By choosing sufficient large C = (cid:107)u(cid:107) , the minimal 2 rate of ηn,p to have (II) be dominated by (I) is ηn,p = Op( (cid:113) 1 np); • If p/n → C with 0 < C ≤ ∞, (II) = Op(pηn,p). Then the minimal rate of ηn,p to have (II) dominated by (I) is ηn,p = Op( (cid:113) 1 n) Since ∂L ∂θj∂θl (θ) = (cid:104) n 2 tr(Σj(θ)Σ(θ)) + tr(Σj(θ)Σl(θ)) (cid:105) − 1 2 2(cid:88) nk(cid:88) k=1 i=1 (Y ki − ˆµk)T Σjl(θ)(Y ki − ˆµk) )tr(Σ(θ0)Σjl(θ∗))(cid:1)(cid:35) nk − 1 nk uθj uθl η2 n,p (III) can be written as q(cid:88) − 1 2 2n1n2 j,l=1 n q(cid:88) q(cid:88) j,l=1 + + + n 2 n 2 k=1 i=1 nk(cid:88) (cid:0)(Y ki − ˆµk)T Σjl(θ∗)(Y ki − ˆµk) − ( tr(Σjl(θ∗)Σ(θ0))uθj (cid:34) 2(cid:88) q(cid:88) (cid:104) (cid:105) tr(Σjk(θ∗)Σ(θ∗)) − tr(Σjk(θ∗)Σ(θ0)) (cid:104) (cid:105) tr(Σj(θ∗)Σl(θ∗)) − tr(Σj(θ0)Σl(θ0)) η2 n,p uθl j,l=1 uθj uθl η2 n,p uθj η2 n,p uθl j,l=1 79 =(3) + (4) + (5) + (6) By lemma 3.6.2 and A6(ii), √ |(3)| = Op( n For (4), by A6(ii ) (cid:13)(cid:13)(cid:13)Σjl(θ∗) (cid:13)(cid:13)(cid:13) F √ η2 n,p) = Op( npη2 n,p) tr(Σjl(θ∗)Σ(θ0)) = Op(p) thus |(4)| = Op( p n η2 n,p) It is easy to see (3), (4) are dominated by (I). 
For (5), |(5)| = n tr(Σkj(θ∗)(Σ(θ0) − Σ(θ∗)))uθk η2 n,p uθj (3.6.15) Let dil(θ∗) be the i, lth entry of matrix Σkj(θ∗), γil(θ) be the i, lth entry of Σ(θ), then by (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)1 4 q(cid:88) j,k=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . A2 and A6(ii) tr(Σkj(θ∗)(Σ(θ0) − Σ(θ∗))) = ≤ ≤ dil(θ∗)(γli(θ0) − γli(θ∗)) |dil(θ∗)|(cid:107) ∂γil(θ∗) ∂θ (cid:107)2(cid:107) θ0 − θ∗ (cid:107)2 (3.6.16) i,l=1 p(cid:88) p(cid:88) p(cid:88) i,l=1 i,l=1 80 2)M ηn,p (cid:13)(cid:13)(cid:13)Σkj(θ∗) (cid:13)(cid:13)(cid:13) dil(θ∗ (cid:113) F ≤ M ηn,pp = Op( p3ηn,p) Hence |(5)| = OP (n (cid:113) p3η3) = Op((p3/2nηn,p)η2 n,p)) (3.6.17) • If p/n → 0 and ηn,p = Op( 1√ np), (5) is dominated by (I). • If p/n → C with 0 < C ≤ ∞, etan,p = Op( 1 n) and √ p/n → 0, (5) is dominated by (I). For (6), let tij(θ∗) = tr(Σ−1(θ∗)Σi(θ∗)Σ−1(θ∗)Σj(θ∗)), by A6(iii), q(cid:88) k,j=1 (cid:13)(cid:13)(cid:13)(cid:13) ∂tij(θ∗) ∂θ (cid:13)(cid:13)(cid:13)(cid:13) 2 |(6)| ≤ n (cid:107)θ∗ − θ0(cid:107) uθk uθj 2 η2 n,p (3.6.18) = OP (npη3 n,p) While ηn,p = Op( 1√ np) or ηn,p = Op( 1 n), (6) is also dominated by (I). Hence (III) is dominated by (I). This completes the proof. Proof of Theorem 2. We start with W1(ˆδM LE; Θ) = 1 − Φ(Ψ1), where W1(ˆδM LE; Θ) is the conditional misclassification rate defined in 1.3.2 and Ψ1 is defined in 1.3.3. The idea is to prove lim infn,p→0 Ψ1 → (cid:105)p np). Recall that Σ = Σ(θ) =(cid:2)γ(hij; θ)(cid:3)p From Therom 1, we have (cid:107) ˆθ − θ (cid:107)2= Op( 1√ √ C0 2 (cid:104) i,j=1 . . and ˆΣ = Σ(ˆθ) = γ(hij; ˆθ) i,j=1 By A2, we have: |γ(hij; θ) − γ(hij; ˆθ)| ≤ M (cid:107) θ − ˆθ (cid:107)2 max i,j (3.6.19) 81 Thus there exist  > 0 and matrix E =(cid:2)eij (cid:3)p i,j=1 such that ˆΣ = Σ + E (3.6.20) where  = Op( 1√ i.e. (cid:12)(cid:12)eij np) and E is a p × p matrix with absolute values of all entries less than 1, (cid:12)(cid:12) ≤ 1 for any i, j = 1, 2, ..., p. As a result, for large p and n, the inverse of Σ can be written as: ˆΣ−1 = Σ−1 − Σ−1EΣ−1 + O(2)E2 (3.6.21) where E2 is a p × p matrix with all entries less than 1, see [39]. Now we consider the denominator of 1.3.3. We first claim the denominator can be written as: T ˆ∆ ( ˆΣ−1Σ ˆΣ−1) ˆ∆ = ˆ∆ T Σ−1 ˆ∆(1 + op(1)). (3.6.22) Because by 3.6.21, we have ˆΣ−1Σ ˆΣ−1 = (Σ−1 − Σ−1EΣ−1 + O(2)E2)Σ(Σ−1 − Σ−1EΣ−1 + O(2)E2) = Σ−1 − 2A + 2AΣA + O(2)E2 + O(3)EE2Σ−1 + O(4)E2E2 (3.6.23) where A = Σ−1EΣ−1. Also, noticing that  = O( 1√ np), k1 ≤ λmin(Σ) ≤ λmax(Σ) ≤ k2 and λmax(E) ≤ tr(E) ≤ 82 p, we have: T T ˆ∆ ˆ∆ (A) ˆ∆ Σ−1 ˆ∆ = yT Ey yT Σy ≤ λmax(E) λmin(Σ) ≤ p/λmin(Σ) = O( (cid:114) p n ) (3.6.24) where yT = ˆ∆ T Σ−1. Similarly, we have: ˆ∆ T (2AΣA) ˆ∆ T ˆ∆ Σ−1 ˆ∆ ≤ 2 λ2 max(E) λ2 min(Σ) ≤ 2p2/λ2 min(Σ) = O( p n ) (3.6.25) ˆ∆ T (O(2)E2) ˆ∆ T ˆ∆ Σ−1 ˆ∆ ≤ O(2)λmax(E2)λmin(Σ) ≤ O(2)pλmax(Σ) = O( 1 n ) (3.6.26) T ˆ∆ (O(3)EE2Σ−1) ˆ∆ ˆ∆ Σ−1 ˆ∆ T ≤ O(3) λmax(E2)λmax(E)λmax(Σ−1) λmin(Σ) ≤ O(3)p2 λmax(Σ) λmin(Σ) (cid:114) p 1 n ) n = O( (3.6.27) (3.6.28) ≤ O(4)p2λmax(Σ) = Op( 1 n2 ) T ˆ∆ (O(4)E2E2) ˆ∆ ˆ∆ Σ−1 ˆ∆ T Since p n → 0 as n → ∞ and p → ∞, (3.6.22) is derived by combining (3.6.24)- (3.6.28). Now we investigate ˆ∆ T Σ−1 ˆ∆ and claim that: T ˆ∆ Σ−1 ˆ∆ = ∆T Σ−1∆(1 + op(1)) + (cid:80)n1 i=1 Y 1i, which is normally distributed as N (µ1, 1 n1 (1 + op(1)) np n1n2 (3.6.29) Σ). Also, Recall ˆµ1 = ¯Y 1· = 1 n1 (cid:80)n2 ˆµ2 = ¯Y 2· = 1 n2 i=1 Y 2i, which is normally distributed as N (µ2, 1 n2 Σ). Let ˆµ1 = µ1 + ˆ1 83 and ˆµ2 = µ2 + ˆ2 where ˆ1 ∼ N (0, 1 n1 Σ) and ˆ2 ∼ N (0, 1 n2 Σ). 
Then we have: T ˆ∆ Σ−1 ˆ∆ = ∆T Σ−1∆ + 2∆T Σ−1(ˆ1 − ˆ2) + (ˆ1 − ˆ2)T Σ−1(ˆ1 − ˆ2) Noticing ˆ1 − ˆ2 ∼ N (0, n n1n2 Σ), by Chebyshev’s inequality, for any 0 > 0 ∆T Σ−1(ˆ1 − ˆ2) ∆T Σ−1∆ P ( > 0) ≤ E(cid:0)∆T Σ−1(ˆ1 − ˆ2)(cid:1)2 (0∆T Σ−1∆)2 = ≤ n n1n2∆T Σ−1∆ n n1n2Cp = 1 π(1 − π)nCp → 0 It goes to 0 because nCp → ∞. Then (3.6.30) ∆T Σ−1(ˆ1 − ˆ2) = op(∆T Σ−1∆) (cid:113) n1n2 1 2 (ˆ1 − ˆ2). Then ˜ ∼ n Σ Then we consider the third term in 3.6.29. Let ˜ = N (0, Ip×p). Now for any ε0 > 0 P (|(ˆ1 − ˆ2)T Σ−1(ˆ1 − ˆ2) − np/n1n2 np/n1n2 | > 0) p | > 0) = P (|˜T ˜ − p ≤ E(cid:0)˜T ˜(cid:1)2 2 0p2 2 → 0 2 p 1 = as p → ∞. Then (ˆ1 − ˆ2)T Σ−1(ˆ1 − ˆ2) = np n1n2 (1 + op(1)) (3.6.31) 84 Then 3.6.29 followed. Now 3.6.22 and 3.6.29 yield: T ˆ∆ ( ˆΣ−1Σ ˆΣ−1) ˆ∆ = ∆T Σ∆(1 + op(1)) + np n1n2 (1 + op(1)) (3.6.32) Now we consider the nominator of 1.3.3. (cid:17) ˆΣ−1ˆ1 − 2∆T ˆΣ−1ˆ2 (3.6.33) (cid:16) = (µ1 − ˆµ)T ˆΣ ˆ∆ = 1 2 ˆΣ−1ˆ2 − ˆT ∆T ˆΣ−1∆ + ˆT ((1) + (2) − (3) − (4)) 1 2 2 1 (1) = ∆T (Σ−1 − E + O(2)E2)∆ By the assumption that k1 ≤ λmin(Σ) ≤ λmax ≤ k2, λmax(E) ≤ p and λmax(E2) ≤ p, we have and thus ∆T (E)∆ ∆T Σ−1∆ → 0 ∆T (O(2)E2)∆ ∆T Σ−1∆ → 0 (1) = ∆T Σ−1∆(1 + op(1)) (3.6.34) By the same argument, we have (2) = ˆT 2 Σ−1ˆ2(1+op(1)) and (3) = ˆT 1 Σ−1ˆ1(1+op(1)). 85 Since ˆ1 ∼ N (0, 1 n1 Σ) and ˆ1 ∼ N (0, 1 n1 Σ),similar to the proof of (3.6.31), we have: (2) = p n2 (1 + op(1)) and (3) = p n1 (1 + op(1)) (3.6.35) Now we consider term (4) in 3.6.33. (4) =∆T (Σ−1 − Σ−1EΣ−1 + O(2)E2)ˆ2 =∆T Σ−1ˆ2 + ∆T Σ−1EΣ−1ˆ2 + O(2)∆T E2ˆ2 Similar to the proof of 3.6.30, all the three terms in (4) are small order of ∆T Σ−1∆. Thus we have (4) = op(∆T Σ−1∆) (3.6.36) (3.6.37) (cid:17) Now the nominator can be written as: (cid:18) ∆T Σ−1∆(1 + op(1)) + (cid:19) (n1 − n2)(1 + op(1)) p n1n2 (µ1 − ˆµ)T ˆΣ−1 ˆ∆ = 1 2 3.6.32 and 3.6.37 yield W1(ˆδM LE; Θ) = 1 − Φ (cid:16)∆T Σ−1∆(1 + op(1)) + p (cid:113) ∆T Σ−1∆(1 + op(1)) + np n1n2 n1n2 2 (n1 − n2)(1 + op(1)) (1 + op(1)) By the same argument, we have: W2(ˆδM LE; Θ) = Φ (cid:16)−∆T Σ−1∆(1 + op(1)) + p (cid:113) ∆T Σ−1∆(1 + op(1)) + np n1n2 n1n2 2 (n1 − n2)(1 + op(1)) (1 + op(1)) (cid:17) 86 Since p n → 0 and Cp = ∆T Σ−1∆ → C0 with 0 ≤ C0 ≤ ∞, we have W (ˆδM LE; Θ) = 1 2 (cid:16) 1 − Φ Cp(1 + op(1)) + p (cid:113) −Cp(1 + op(1)) + p (cid:113) n1n2 2 + Φ n1n2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) (n1 − n2)(1 + op(1)) 2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) →1 − Φ( √ C0 2 ) (n1 − n2)(1 + op(1))  (cid:17) as p → ∞ and n → ∞. If Cp → C0 < ∞, 1 − Φ( optimal. Now we check the asymptotically optimal when Cp → ∞. From the inequality ) > 0. Thus ˆδM LE is asymptotically √ C0 2 x 1 + x2 e − x2 2 ≤ Φ(−x) ≤ 1 x − x2 2 , x > 0 e (3.6.38) we have xy 1 + x2 e − x2−y2 2 ≤ W (ˆδM LE) ) Φ(− √ Cp 2 ≤ 1 + y2 xy − x2−y2 e 2 (cid:113) Cp(1+op(1))± p n1n2 Cp(1+op(1))+ 2 (n1−n2)(1+op(1)) np n1n2 (1+op(1)) where x = 1 and 1+y2 . It is easy to check that xy xy → 1 as Cp → ∞. Also x2−y2 → 0 if Cp(p/n) → 0. This completes the proof. and y = √ Cp 2 1+x2 → Proof of Theorem 3. The misclassification rate of δ ˆµ is: W (δ ˆµ; Θ) = 1 2(W1(δ ˆµ; Θ) + W2(δ ˆµ; Θ)) 87 where W1(δ ˆµ; Θ) = 1 − Φ(Ψ1) and W2(δ ˆµ; Θ) = Φ(Ψ2) where Ψ1 and Ψ2 is defined by 1.3.3 and 1.3.5 with ˆΣ replaced by Σ. 
We start with (cid:112) (µ1 − ˆµ)T Σ−1( ˆµ1 − ˆµ2) ( ˆµ1 − ˆµ2)T Σ−1( ˆµ1 − ˆµ2) Ψ1 = The denominator is (3.6.29) and it can be represented as: ˆ∆Σ−1 ˆ∆ = ∆T Σ∆(1 + op(1)) + np n1n2 (1 + op(1)) The nominator is: (µ1 − ˆµ)T Σ−1 ˆ∆ = 1 2(∆T Σ−1∆ + ˆT 1 Σ−1ˆ1 − ˆT 2 Σ−1ˆ2 − 2∆T Σ−1ˆ2) By the similar procedure in the proof of (3.6.37), it can be represented by (µ1 − ˆµ)T Σ−1 ˆ∆ = 1 2 (cid:18) ∆T Σ−1∆(1 + op(1)) + (cid:19) (n1 − n2)(1 + op(1)) p n1n2 Thus we have: W1(ˆδ ˆµ; Θ) = 1 − Φ (cid:16)Cp(1 + op(1)) + p (n1 − n2)(1 + op(1)) (cid:17) n1n2 (cid:113) 2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) 88 (3.6.39) Similarly, we have W2(ˆδ ˆµ; Θ) = Φ (cid:16)−Cp(1 + op(1)) + p (n1 − n2)(1 + op(1)) (cid:17) n1n2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) (cid:113) 2 (i) If Cp p/n → ∞. Then (cid:113) ±Cp(1 + op(1)) + p n1n2 (n1 − n2)(1 + op(1)) 2 Cp(1 + op(1)) + np n1n2 (1 + op(1)) p n1n2Cp (n1 − n2)(1 + op(1))) Cp(1 + np (1 + op(1)) n1n2Cp p n1n2Cp (1 + np n1n2Cp (n1 − n2)(1 + op(1))) (1 + op(1)) ±Cp(1 ± (cid:113) ±(cid:112)Cp(1 ± (cid:113) 2 2 = = →±√ C0 2 which yields W (ˆδ ˆµ) → 0 since p → ∞ in probability. that W (ˆδ ˆµ) WOP T n → C with 0 < C < ∞ and Cp → C0 = ∞. Now we show Noticing the fact that we have x 1 + x2 e − x2 2 ≤ Φ(−x) ≤ 1 x − x2 2 , x > 0 e Φ(cid:0)− x WOP T (cid:1) = 2 √ Φ(− Φ(cid:0)− x Cp 2 (cid:1) ≤ 4 + x2 x(cid:112)Cp ) 2 − 1 8 (Cp−x2) e 89 (cid:113) Cp(1+op(1)± p(n1−n2) n1n2 np n1n2 Cp(1+op(1))+ (1+op(1))) (1+op(1)) 4 + x2 x(cid:112)Cp x(cid:112)Cp 4 x(cid:112)Cp + = → a constant where x = because and → 0 1 x(cid:112)Cp  → 1 → a0 if c = ∞, if c < ∞ x(cid:112)Cp = . Also where a0 = √ √ c+1/ √ c+1/(π(1−π)) c → ∞ (3.6.40) Cp − x2 = Thus we have As a result C2 p op(1) + Cp (n±(n1−n2))p n1n2 (1 + op(1)) + p2(n1−n2)2 n2 1n2 2 (1 + op(1)) Cp(1 + op(1)) + np n1n2 Φ(− √ Cp 2 Φ(cid:0)− x 2 (cid:1) → 0 ) W (ˆδ ˆµ) WOP T → ∞ 90 (ii)While Cp p/n → c with 0 < c < ∞ (cid:113) 2 ±Cp(1 + op(1)) + p n1n2 (n1 − n2)(1 + op(1)) Cp(1 + op(1)) + np n1n2 (1 + op(1)) p n1n2Cp (n1 − n2)(1 + op(1))) Cp(1 + np (1 + op(1)) 2 ±Cp(1 ± (cid:113) ±(cid:112)Cp(1 ± (cid:113) ±√ (cid:113) C0(1 ± 1 1 + 1 c 2 2 = = → n1n2Cp p n1n2Cp (1 + np n1n2Cp 2π−1 π(1−π)) c 1 π(1−π) (n1 − n2)(1 + op(1))) (1 + op(1)) Since Cp → C0, and W1(ˆδ ˆµ) → 1 − Φ( W2(ˆδ ˆµ) → 1 − Φ( √ (cid:113) 2π−1 C0(1 + 1 π(1−π)) c 1 + 1 π(1−π) c 1 2 √ (cid:113) 2π−1 C0(1 − 1 π(1−π)) c 1 + 1 π(1−π) 2 c 1 ) ) If p n → C with 0 < C < ∞, then 0 < C0 < ∞. Since Φ(x) is convex function in the sense that 1 2(Φ(x + ) + Φ(x − )) ≤ Φ(x) for any x > 0 and x >  > 0, limP W (ˆδ ˆµ) = limP 1 2(W1(ˆδ ˆµ) + W2(ˆδ ˆµ)) ≥ 1 − Φ( √ (cid:113) 2 C0 1 + 1 π(1−π) c 1 ) > 1 − Φ( √ C0 2 ) where limP means converge in probability with p → ∞ and n → ∞. If p n → ∞, then C0 = ∞. Hence W (ˆδ ˆµ) = 1 2(W1(ˆδ ˆµ) + W2(ˆδ ˆµ)) → 0. By similar argument in (i), we have W (ˆδ ˆµ) WOP T → ∞. 91 (iii)While Cp p/n → 0, (cid:113) 2 ±Cp(1 + op(1)) + p n1n2 (n1 − n2)(1 + op(1)) Cp(1 + op(1)) + np n1n2 (1 + op(1)) =  Which yields W (ˆδ ˆµ) → 1 2. (cid:113) p (cid:114) n(±Cp/( p (Cp/( p 2 n(n1−n2) n) + n1n2 n) + n2 n1n2 (1 + op(1))) (1 + op(1)) → ∞ if n1 > n2, → −∞ if n1 < n2. Proof of Corollary 1. Noticing that when n1 = n2 = n/2, ˆ1 ∼ N (0, 1 n1 N (0, 1 n1 2 ˆi for i = 1, 2. Then ˜i ∼ N (0, Ip) and Σ). 
Let ˜i = niΣ √ 1 Σ) and ˆ2 ∼ E(ˆT 1 Σ−1ˆ1 − ˆT 2 Σ−1ˆ2)2 =E(˜T 1 ˜1 − ˜T 2 ˜2)2/N 2 1 = 1j − ˜2 (˜2 2j)2/n2 1 = 6p/n2 1 p(cid:88) Hence we have: j=1 √ p n ) 1 Σ−1ˆ1 − ˆT ˆT 2 Σ−1ˆ2 = Op( Similar to the proof of (3.6.39) we have: W1(ˆδ ˆµ; Θ) ≤ 1 − Φ (cid:16) Cp(1 + op(1)) + (cid:113) 2 Cp(1 + op(1)) + 4p (cid:17) √ p n (1 + op(1)) n (1 + op(1)) and W2( ˆδ ˆµ; Θ) ≤ Φ (cid:16) −Cp(1 + op(1)) + (cid:113) 2 Cp(1 + op(1)) + 4p √ p n (1 + op(1)) n (1 + op(1)) (cid:17) 92 (cid:113) ±Cp(1 + op(1)) + 2 Cp(1 + op(1)) + 4p √ p n (1 + op(1)) n (1 + op(1)) → ±∞ (cid:113) p (cid:113) ±Cp/ 2 Cp/ p n(1 + op(1)) + 1√ n(1 + op(1)) + 4 + op(1)) (1 + op(1)) n =  (cid:113) 2 → ± c 4 → ± → 0 c √ 4+c/ C if if if if Cp√ p/n Cp√ p/n Cp√ p/n Cp√ p/n → ∞ → c and p/n → ∞ → c and p/n → C < ∞ → 0 W (ˆδ ˆµ) WOP T The proof of the proof. → ∞ is the same as that in the proof of Theorem 3(1). This completes 3.6.2 Proofs for consistency of PMLE Proof of Theorem 4. In the algorithm, we estimate β(0) first. Then θ(0) is estimated by fixing β = ˆβ (0) . Then update β = ˆβ (1) by fixing θ = ˆθ (0) . θ = ˆθ (1) is updated in last step by fixing β = β(1). So the idea is to prove the theorem in the following sequence: (a) The consistency and sparsity of β(0); (b) The consistency of θ(0); (c) The consistency and sparsity of β(1); (d) The consistency of θ(1). (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13) 2 = Op((cid:112) s (a) We first prove (0) − β0 n) and ˆβ (0) 2 = 0 with probability tending to 1, where ˆβ (0) 2 is the p − s dimension sub-vector of ˆβ (0) = ( ˆβ (0)T 1 ˆβ (0)T 2 )T . The proof of (a) is the same as the proof of (c), except the loss function if defined as R(β) which is negative of the penalized MLE function (3.3.1) with covariance matrix ˙Σ replaced by diagn−1(Ip). Then the parameters are estimated by minimize the loss function. We 93 omit the proof here and illustrate the details in (c). (b) Second we prove (0) − θ0 = Op( (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13) 2 np). Write (cid:113) 1 (cid:12)(cid:12)(cid:12) ˙Σ(θ) (cid:12)(cid:12)(cid:12) − 1 F (θ, ˆβ (0) ; Z) = −np 2 log(2π) − 1 2 log 2(Z − X ˆβ (0) )T ˙Σ−1(θ)(Z − X ˆβ (0) It is sufficient to prove for any given  > 0 the smallest convergence rate of ηn,p is such that we have ) (cid:113) 1 np F (θ0 + uηn,p, ˆβ (0) ; Z) < F (θ0, ˆβ (0) ; Z)) > 1 −  P ( sup (cid:107)u(cid:107)=C This implies there exists a local maximum for the function Q(θ, ˆβ (0) ; Z) of θ in the neighborhood of θ0 with the radius at most proportional to ηn,p. ∂ ˙Σ(θ∗) ∂θj ˙Σ(θ0 + uηn,p) − ˙Σ(θ0) = (cid:80)q between θ0 + uηn,p and θ0. Denote ˙Σj(θ∗) = ∂ ˙Σ(θ∗) By Taylor’s expansion, , then j=1 ∂θj ηn,p, where θ∗ is uθj (0) ; Z) − F (θ0, ˆβ F (θ0 + uηn,p, ˆβ =(cid:2)F (θ0 + uηn,p, β0) − F (θ0, β0)(cid:3) − (0) ; Z) q(cid:88) j=1 (Z − Xβ0)T ˙Σj(θ∗)X( ˆβ (0) − β0)uθj ηn,p − 1 2 (0) − β0)T XT ˙Σj(θ∗)X( ˆβ ( ˆβ (0) − β0)uθj ηn,p q(cid:88) j=1 =(I) + (II) + (III) 94 where uT T uη2 n,p + (θ0) uηn,p + uT ( 1 2 ∂2F ∂θ∂θT (θ∗) + (n − 1)T )uη2 n,p (I) =F (θ0 + uηn,p, β0) − F (θ0, β0) (cid:19)T (cid:18) ∂F ∂θ 2 = − n − 1 q(cid:88) =(1) + (2) + (3) (II) = j=1 (III) = − 1 2 ( ˆβ (0) − β0)T XT ˙Σj(θ∗)(Z − Xβ0)ujηn,p q(cid:88) (0) − β0)T XT ˙Σj(θ∗)X( ˆβ ( ˆβ (0) − β0)ujηn,p j=1 We consider (I) first. T in (1) is q × q matrix with its (i, j)th element as tij(θ0), where tij(θ) = tr(Σ−1(θ)Σi(θ)Σ−1(θ)Σj(θ)). 
By the similar argument in proving the bound of (I) in theorem 1, there exist a constant K, such that (1) = − n i,j=1 tij(θ0)uiujη2 −Knpη2 with probability tending to 1. In regarding to (2), (cid:80)q 2 n,p ≤ n,p (cid:107)u(cid:107)2 2 ∂F ∂θj (θ0) = n − 1 2 tr(Σj(θ0)Σ(θ0)) − 1 2(Z − Xβ)T ˙Σj(θ0)(Z − Xβ) Notice that Z − Xβ ∼ N (0, ˙Σ(θ)) and tr( ˙Σj ˙Σ) = (n − 1)tr(ΣjΣ). By lemma 3.6.2, (2) =Op( tr( ˙Σ ˙Σj ˙Σ ˙Σj)ηn,p (cid:107)u(cid:107) (n − 1)tr(ΣΣjΣΣj)ηn,p (cid:107)u(cid:107) ) 2 (cid:113) (cid:114) (n − 1)(cid:13)(cid:13)Σj(cid:13)(cid:13)2 =Op( (cid:113) 2 ) = Op( )(cid:107)u(cid:107) 2 ) 2 ηn,p (cid:107)u(cid:107) F √ By A3, A6(i), (2) = Op( npηn,p). 95 Then we consider (3). For any j, k = 1, 2, ..., q ∂2(F ) ∂θj∂θk (θ∗) = n − 1 2 (tr(Σjk(θ∗)Σ(θ∗) − tjk(θ∗)) − 1 2(Z − Xβ)T ˙Σjk(θ∗)(Z − Xβ) Thus (3) could be written as: j,k=1 q(cid:88) q(cid:88) q(cid:88) + + j,k=1 j,k=1 n − 1 2 n − 1 2 n − 1 2 =(i) + (ii) + (iii) (3) = (tr(Σjk(θ∗)Σ(θ0)) − 1 2(Z − Xβ)T Σjk(θ∗)(Z − Xβ))uθj η2 n,p uθk (tr(Σjk(θ∗)Σ(θ∗)) − tr(Σjk(θ∗)Σ(θ0)))uθj η2 n,p uθk (tjk(θ0) − tjk(θ∗))uθj η2 n,p uθk By lemma 3.6.2 and A6(ii ), (i) = Op( (n − 1)tr(ΣΣjkΣΣjk)η2 √ n,p) = Op( npη2 n,p). Similar to the deriving the order of (5) and (6) in the proof of Theorem1, (ii) = n,p). By choosing large C = (cid:107)u(cid:107) , the minimal rate 2 (cid:112) np). p3η3 n,p) and (iii) = Op(npη3 Op(n that (2) and (3) are dominated by (1) is ηn,p = Op( 1√ Now we consider (II). Denote B = (cid:80)n−1 (cid:113) (0) − β0)T XT ˙Σj(θ∗)(Z − Xβ0) = Op( 1, 2, ..., q, ( ˆβ =Op( (cid:118)(cid:117)(cid:117)(cid:116)tr( (cid:13)(cid:13)(cid:13)(cid:13) ˆβ n−1(cid:88) j Σj∗Σ∗Σj∗Xi) + tr(BT Σj∗Σ∗Σj∗B) XT where Σ∗ = Σ(θ∗) and Σj∗ = Σj(θ∗). Notice that(cid:80)n−1 i=1 (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (0) − β0 (cid:13)(cid:13)(cid:13)(cid:13) 2 ) tr(XT ˙Σj(θ∗) ˙Σ(θ∗) ˙Σj(θ∗)X) (cid:13)(cid:13)(cid:13)(cid:13) ) 2 (0) − β0 i=1 XT i Xi = ( n − n2 n1n2 1 n2 )Ip×p i=1 Xi. Then by lemma 3.6.1, for any j = (cid:113) 96 i=1 Xi)T ((cid:80)n−1 and ((cid:80)n−1 n−1(cid:88) tr( i=1 i=1 Xi) = n2 n2 Ip×p. Then by A6(i) 1 j Σj∗Σ∗Σj∗Xi) = ( XT n1n2 n − n2 1 n2 )tr(Σj∗Σ∗Σj∗) ≤ λmax(Σ) n1n2 n Similarly tr(BT Σj∗Σ∗Σj∗B) = Op(np). (cid:13)(cid:13)(cid:13)β(0) − β0 (cid:13)(cid:13)(cid:13) 2 = Op((cid:112) s Since √ n), (II) = Op( psηn,p) For (III), by A6(i), for any j = 1, 2, ..., q we also have: ( ˆβ =( ˆβ (0) − β0) (0) − β0)T XT ˙Σj(θ∗)X( ˆβ (cid:33) (0) − β0)T XT diagn−1Σj(θ∗)( ˜In−1,p + ˜Jn−1,p)X( ˆβ (0) − β0) n−1(cid:88) (cid:32)n−1(cid:88) n−1(cid:88) XT i Xi + ( Xi)T ( Xi) ( ˆβ ≤λmax(Σj∗)(( ˆβ (0) − β0)T =Op(n (0) − β0 i=1 i=1 i=1 ) = Op(s) (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13)2 2 (cid:13)(cid:13)(cid:13)Σj∗(cid:13)(cid:13)(cid:13)2 F = Op(np). (0) − β0)) (cid:113) 1 Thus (III) = Op(sηn,p). Both (II) and (III) are dominated by (I) while ηn,p = Op( np). This concludes the proof of (0) − θ0 = Op( (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13) 2 (cid:13)(cid:13)(cid:13)(cid:13) 2 (cid:13)(cid:13)(cid:13)(cid:13) 2 np) (cid:113) 1 = Op((cid:112) s (cid:113) 1 (c) Write ˆβ (1) = ( ˆβ (1) 1 , ˆβ (2) 2 )T . Then we prove (1) − β0 n) and ˆβ (1) 2 = 0 with probability tending to 1, where ˆβ a p− s subvector of ˆβ . Let ηn,p = Op( (1) (0) − θ0 1 (1) formed by elements in supp(β0) and ˆβ (1) 2 is ) = Op( np). We use two steps to prove the consistency and sparsity. step 1 We first prove consistency on s-dimensional space. 
Define the loglikelihood 97 function for β1 as (0) ¯Q(ˆθ , β1) = − np 2 − n n(cid:88) log(2π) − 1 Pλ((cid:12)(cid:12)βj (cid:12)(cid:12)) 2 log j=1 (cid:12)(cid:12)(cid:12)(cid:12) ˙Σ(ˆθ (0) ) (cid:12)(cid:12)(cid:12)(cid:12) − 1 2(Z − X1β1)T ˙Σ−1(ˆθ (0) )(Z − X1β1) where β1 is subvector of β0 = (β1T , β2T )T formed by elements in supp β0. We first n). It is sufficient to prove that for any  > 0, the smallest rate (cid:13)(cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13) ˆβ of ξn,p is(cid:112) s 1 − β1 (1) = Op((cid:112) s 2 n such that we have: (0) ¯Q(ˆθ , β1 + uξn,p) < ¯Q(ˆθ (0) , β1)) > 1 −  P ( sup (cid:107)u(cid:107) 2 =C where u ∈ Rs. This implies that with probability tending to 1, there is a local maxi- mizer ˆβ (1) 1 of the function ¯Q in the neighborhood of β1 0 with the radius of β1 0 at most proportional to ξn,p. (0) , β1 0 + uξn,p) − ¯Q(ˆθ (0) , β1 0) n,p − (Z − X1β1 (0) )X1uξ2 ¯Q(ˆθ = − 1 2 uT X1T ˙Σ−1(ˆθ m(cid:88) − np (cid:0)p (cid:48) λn,p j=1 (|β0j|)sgn(βj)uβj ξn,p + p (0) 0) ˙Σ−1(ˆθ (|β0j|)u2 βj (cid:48)(cid:48) λn,p )X1uξn,p n)(1 + o(1))(cid:1) ξ2 =(I) + (II) + (III) By Taylor’s expansion, ˙Σ−1(ˆθ (0) ) = ˙Σ−1(θ0)+(cid:80)q j=1 ˙Σj(θ∗), where θ∗ is a q dimension 98 vector between θ0 and ˆθ (0) . Therefore, (I) = − 1 2 uT X1T ˙Σ−1(θ0)X1uξ2 n,p − 1 2 =(1) + (2) and (II) = − (Z − X1β1 0) ˙Σ−1(ˆθ (0) )X1uξn,p + =(3) + (4) q(cid:88) j=1 uT X1T ˙Σj(θ∗)X1uξ2 n,pujηn,p q(cid:88) (Z − X1β1 0) ˙Σj(θ∗)X1uξn,pujηn,p j=1 Noticing λmin(Σ−1) > 0,(cid:80)n−1 (cid:32)n−1(cid:88) i=1 X1T i X1 1 n2 )Is and i = ( (cid:33)(cid:32)n−1(cid:88) n − n2 n1n2 (cid:33) X1 i=1 = X1T i=1 n2 1 n2 Is i=1 i=1 we have (1) = 1 2 uT X1T diagn−1(Σ−1)( ˜In−1,p + ˜Jn−1,p)X1uξ2 = − 1 i Σ−1X1 X1T i uξ2 i Σ−1 X1T 2 uT 2 uT n,p n−1(cid:88) n−1(cid:88) i=1 X1 i (cid:33) n,p − 1 n−1(cid:88) X1T i n−1(cid:88) (cid:32)n−1(cid:88) i=1 n−1(cid:88) i=1 X1 i uξ2 n,p n,pλmin(Σ−1) uξ2 ≤ − 1 2 uT X1T i X1 i + i=1 i=1 i=1 2 = − 1 ≤ − 1 2 2 (cid:107)u(cid:107)2 n1n2 n π(1 − π) λmax(Σ) nξ2 n,pλmin(Σ−1) ξ2 n,p (cid:107)u(cid:107)2 2 99 By similar argument and A6(i), (2) = Op(nξ2 n,pηn,p) = op((1)) while ηn,p = Op( np). (cid:113) 1 For (II), by lemma 3.6.1 and A3, (cid:113) (cid:114) (3) =Op( tr(X1 ˙Σ−1(θ0)X1)(cid:107)u(cid:107) 2 ξn,p) n1n2 n tr(Is×s)(cid:107)u(cid:107) ξn,p) 2 =Op( λmax(Σ) √ ns(cid:107)u(cid:107) 2 =Op( ξn,p)) √ Similarly, (4) = Op( ns(cid:107)u(cid:107) 2 √ ξn,pηn,p) = op((3)). So we have (II) = Op( ns(cid:107)u(cid:107) ξn,p)). 2 (III) = (5) + (6), where (5) = −n (6) = −n s(cid:88) s(cid:88) j=1 j=1 (cid:48) λn,p p (cid:48)(cid:48) λn,p p (|β0j|)sgn(βj)uβj ξn,p (|β0j|)u2 βj ξ2 n)(1 + o(1)) (3.6.41) Since an,p = Op( 1√ n ) by A7, √ |(5)| ≤ n san,p (cid:107)u(cid:107) 2 √ = Op( nsξn,p (cid:107)u(cid:107) 2 ) (3.6.42) By A8 |(6)| ≤ 2nξ2 n,p s(cid:88) j=1 (cid:48)(cid:48) p (β0j)u2 βj = op(nξ2 n,p) 100 ≤ 2nξ2 n,pbn,p (cid:107)u(cid:107)2 2 (3.6.43) By choosing large C = (cid:107)u(cid:107) by (I) is ξn,p = Op((cid:112) s 2 n). This completes the proof that , the smallest rate of ξnp that (II) and (III) are dominated step 2. in step 2 we prove that the vector ˆβ = ( ˆβ Rd. It is sufficient to prove for any given β ∈ Rd satisfying (cid:107)β − β0(cid:107) have Q(βs, ˆθ ) ≥ Q(β, ˆθ ), where β = (β1T , β2T )T and βs = (β1T , 0T )T . (0) (0) 2 (1) 1 , 0) is a strict local maximizer on n), we (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (1) 1 − β1 0 (cid:13)(cid:13)(cid:13)(cid:13) 2 n). 
= Op((cid:112) s = Op((cid:112) s Let  = C(cid:112) s n, it is sufficient to prove for j = s + 1, s + 2, ..., p: (0) (0) ) ) ∂Q(β, ˆθ ∂βj ∂Q(β, ˆθ ∂βj < 0 f or 0 < βj <  (3.6.44) > 0 f or −  < βj < 0 ˙Σ−1(ˆθ (0) )Xj(βl − β0l) XT l (3.6.45) (0) ) ∂Q(β, ˆθ ∂βj =(Z − Xβ0)T ˙Σ−1(ˆθ (0) )Xj + p(cid:88) l=1 − nP(cid:48) λ(|βj|)sgn(βj) =(I) + (II) + (III) where Xj is the jth column of X. (I) =(Z − Xβ0)T ˙Σ−1(θ0)Xj + (Z − Xβ0)T ˙Σk(θ∗)Xjukηn,p We first consider (I). By Taylor’s expansion, q(cid:88) =(5) + (6) k=1 101 Notice(cid:80)n−1 i=1 XT ijXij = n − n2 n1n2 1 n2 and ((cid:80)n−1 i=1 Xij)T ((cid:80)n−1 i=1 Xij) = n2 1 n2 . (cid:113) (5) =(Z − Xβ0)T diagn−1(Σ−1)( ˜In−1,p + ˜Jn−1,p)Xj = Op( ˙Σ−1Xj) Xj (cid:118)(cid:117)(cid:117)(cid:116)n−1(cid:88) n−1(cid:88) n−1(cid:88) =Op( ijΣ−1Xij + XT ijΣ−1 XT Xij) i=1 i=1 i=1 Noticing(cid:80)n−1 where Xij is the jth column of Xi. n − n2 n1n2 i=1 XT ijXij = n2 , ((cid:80)n−1 i=1 Xij)T ((cid:80)n−1 1 √ ∞, we have (5) = Op( √ n). Similarly, (6) = Op( i=1 Xij) = n2 1 n2 and λmax(Σ−1) ≤ nηn,p) = op((3)), which is dominated by (5) if ηn,p = o(1). For (II), by Taylor’s expansion, (II) = ˙Σ−1(θ0)Xj(βl − β0l) + XT l q(cid:88) p(cid:88) k=1 l=1 ˙Σk(θ∗)Xj(βl − β0l)(θ∗ k − θ0k) XT l p(cid:88) l=1 =(7) + (8) p(cid:88) p(cid:88) l=1 (7) = = (cid:32)n−1(cid:88) l diagn−1(Σ−1)( ˜In−1,p + ˜Jn−1,p)Xj(βl − β0l) XT n−1(cid:88) (cid:33) n−1(cid:88) il Σ−1Xij + XT il Σ−1 XT Xij (βl − β0l) Notice l=1 i=1 i=1 i=1 il Σ−1Xij ≤ n−1(cid:88) XT i=1 (XT il Σ−1Xil)) 1 2 (XT ijΣ−1Xij)) 1 2 λmax(Σ−1)(XT il Xil)) 1 2 (XT ijXij)) 1 2 n−1(cid:88) ≤ n−1(cid:88) i=1 i=1 102 =λmax(Σ−1)( − n2 1 n1n2 n n2 ) = Op(n). i=1 Xil. Then BT l Bl = (cid:16) n1 n (cid:17)2 . Xij = BT l Σ−1Bj ≤ (BT l Σ−1Bl)1/2(BT j Σ−1Bj)1/2 ≤ λmax(Σ−1) n2 1 n2 Also, let Bl =(cid:80)n−1 n−1(cid:88) n−1(cid:88) il Σ−1 XT i=1 i=1 Then √ Similarly, (8) = Op( (7) = Op(n(cid:107)β − β0(cid:107) 2 √ ) = OP ( ns) nsηn,p) Thus (cid:16) ∂Q(β) ∂βj = nλn,p √ s√ nλn,p OP ( )) + P (cid:48) λn,p (|βj|) λn,p (3.6.46) (3.6.47) (cid:17) sgn(βj) By assumption 9 and 10, the sign of 3.6.47 is determined by βj, hence 3.6.44 followed. This implies ˆβ (1) should satisfy sparse property and completes the proof of step 2. (d) Lastly we prove (1) − θ0 = Op( np). Since (1) − β0 n), it is the (cid:13)(cid:13)(cid:13)(cid:13)ˆθ (cid:13)(cid:13)(cid:13)(cid:13) 2 (cid:113) 1 (cid:13)(cid:13)(cid:13)(cid:13) ˆβ (cid:13)(cid:13)(cid:13)(cid:13) 2 = Op((cid:112) s same as the proof of (b). We omit the detail here and this completes the proof. 3.6.3 Proofs for consistency for PMLE with tappering Lemma 3.6.3. Assume A1, A2, A11 and A12 hold, we have: (1) (cid:107)Σ(θ) − Σ(θ)T(cid:107)∞ = O( 1 pδ0 ); 103 ); (2) (cid:13)(cid:13)Σk(θ) − Σk,T (θ)(cid:13)(cid:13)∞ = O( 1 (3) (cid:13)(cid:13)Σjk(θ) − Σjk,T (θ)(cid:13)(cid:13)∞ = O( 1 (cid:12)(cid:12)aij (cid:80)p The matrix norm (cid:107)·(cid:107)∞ for the p× p matrix A = [aij]p sumation, i.e. (cid:107)A(cid:107)∞ = maxi (cid:12)(cid:12) pδ0 pδ0 j=1 ). i,j=1 is defined as the maximum of row Proof. We show (1) in detail and omit the details for (2) and (3), as similar arguments can be applied. p(cid:88) (cid:12)(cid:12)γ(hij; θ)KT (hij, wp) − γ(hij; θ)(cid:12)(cid:12) (3.6.48) is the distance between site si and sj. 
For any i = 1, 2, ..., p, (cid:107)Σ(θ) − ΣT (θ)(cid:107)∞ = max i 2 j=1 (cid:13)(cid:13) where hij =(cid:13)(cid:13)si − sj p(cid:88) (cid:12)(cid:12)γ(hij; θ)KT (hij, wp) − γ(hij; θ)(cid:12)(cid:12) ≤ (cid:88) (cid:12)(cid:12)γ(hij; θ)KT (hij, wp) − γ(hij; θ)(cid:12)(cid:12) + j=1 hij wp} and Bi of n and p. Then Ai ⊂ (cid:83)∞ m = {j : (m − 1)δ ≤ hij < mδ}, where ∆ is independent ∆ (cid:99) Bi m. Let V (R) be the volume of a d-dimensional baa m = V (mδ) − V ((m − 1)δ) = fd−1(m)δd, where of radius R. Then the volume of Bi fd−1(m) is a polynomial function of m with degree of d − 1. By A1, the number of sites in any unit subset of D ⊂ Rd is bounded, say ρ. Let #{A} denote the cardinality of a discrete set A. Then we have #{Bi m} ≤ fd−1(m)δdρ. Then exist a constant K such that m is Bi 104 (3.6.50) (3.6.51) fd−1(m) ≤ Kmd−1 Then (cid:88) (II) = Let Ai 2 = {j : hij ≤ wp}. Then Ai (I) = ∞(cid:88) (cid:12)(cid:12)γ(θ, hij)(cid:12)(cid:12) ≤ ∞(cid:88) (cid:90) ∞ m=(cid:98) wp δ (cid:99) (cid:99) m=(cid:98) wp δ md−1δd max j∈Bi m (cid:12)(cid:12)γ(θ, hij)(cid:12)(cid:12) j∈Bi m (cid:88) (cid:12)(cid:12)γ(θ, hij)(cid:12)(cid:12) (cid:90) ∞ hij≥wp ≤ Kρ ≤ Kρ xd−1 |γ(θ, x)| dx ≤ Kρ wp 0 wp xd |γ(θ, x)| dx (cid:88) hij 0, limp→∞ λmax(ΣT ) < ∞; 105 (b) There exists an open subset ω that contains the true parameter θ0 such that for all θ∗ ∈ ω, we have: (i) −∞ < limp→∞ λmin(Σk (ii) −∞ < limp→∞ λmin(Σkj T (θ∗)) < limp→∞ λmax(Σk T (θ∗)) < limp→∞ λmax(Σkj T (θ∗)) < ∞; T (θ∗)) < ∞; (cid:12)(cid:12)(cid:12)(cid:12) ∂tij,T (θ∗) ∂θk (cid:12)(cid:12)(cid:12)(cid:12) = O(p) for all k = 1, 2, ..., q. (iii) Proof. (a) Let KT = [K(hij, w)]p i,j=1 be the tappering covariance. By eigenvalue inequal- ities of Schur product: min 1≤i≤p aiiλmin(Σ) ≤ λ(Σ ◦ KT ) ≤ max 1≤i≤p aiiλmax(Σ) (3.6.52) where aij are the (i, j)th entry of matrix KT . By A3, λmin(ΣT ) > 0 and limp→∞ λmax(ΣT ) < ∞ (b) Since Σk T = Σk ◦ KT and Σkj T = Σkj ◦ KT [d](i) and [d](ii) hold by A 6(ii). For [2](iii), since tij,T (θ) = tr(Σ−1 T Σi,T Σ−1 T Σj,T ) ∂tij,T (θ) ∂θl =tr(Σ−1 T Σl,T Σ−1 T Σi,T Σ−1 T Σj,T ) + tr(Σ−1 + tr(Σ−1 T Σi,T Σ−1 T Σl,T Σ−1 T Σj,T ) + tr(Σ−1 T Σjl,T ) T Σj,T ) T Σil,T Σ−1 T Σi,T Σ−1 =(1) + (2) + (3) + (4) Then (1) can be written as: (3.6.53) (3.6.54) tr(Σ−1 =tr((Σ−1 T Σl,T Σ−1 T Σi,T Σ−1 T − Σ−1)Σl,T Σ−1 T Σj,T ) T Σi,T Σ−1 T Σj,T ) + tr(Σ−1(Σl,T − Σl)Σ−1 T Σi,T Σ−1 T Σj,T )+ 106 T − Σ−1)Σi,T Σ−1 T Σj,T ) + tr(Σ−1ΣlΣ−1(Σi,T − Σi)Σ−1 T Σj,T )+ T − Σ−1)Σj,T ) + tr(Σ−1ΣlΣ−1ΣiΣ−1(Σj,T − Σj))+ tr(Σ−1Σl(Σ−1 tr(Σ−1ΣlΣ−1Σi(Σ−1 tr(Σ−1ΣlΣ−1ΣiΣ−1Σj) Define (cid:107)·(cid:107)s for matrix A by (cid:107)A(cid:107)s = maxi{|λi(A)| , i = 1, 2, ..., p}, where λi(A) is the (cid:13)(cid:13)(cid:13)Σ−1 T (cid:13)(cid:13)(cid:13)s . ith eigenvalue of matrix A. Notice (cid:13)(cid:13)s (cid:107)Σ − ΣT(cid:107)s (cid:13)(cid:13)(cid:13)Σ−1 T − Σ−1(cid:13)(cid:13)(cid:13)s ≤(cid:13)(cid:13)(cid:13)Σ−1(cid:13)(cid:13)(cid:13)s Since λmin(Σ−1) = 1/λmax(Σ) > 0,(cid:13)(cid:13)Σ−1(cid:13)(cid:13)s T − Σ−1(cid:13)(cid:13)(cid:13)s (cid:13)(cid:13)(cid:13)Σ−1 (cid:13)(cid:13)Σj,T (cid:12)(cid:12)(cid:12) (cid:13)(cid:13)(cid:13)s (cid:13)(cid:13)Σi,T (cid:13)(cid:13)s T − Σ−1)Σl,T Σ−1 T Σi,T Σ−1 T − Σ−1)Σl,T Σ−1 T Σi,T Σ−1 (cid:13)(cid:13)Σl,T T − Σ−1) (cid:12)(cid:12)(cid:12)tr((Σ−1 (cid:13)(cid:13)(cid:13)((Σ−1 (cid:13)(cid:13)(cid:13)(Σ−1 ≤ p < ∞ for all j = 1, 2, ..., q. Hence (cid:13)(cid:13)(cid:13)Σ−1 T T Σj,T ) T Σj,T ) (cid:13)(cid:13)(cid:13)s (cid:13)(cid:13)s (cid:13)(cid:13)(cid:13)2 s ≤ p = O(p/pδ0) = O(p1−δ0) ≤ λmax(Σ−1) < ∞. Also = Op(p−δ0). 
Then (cid:13)(cid:13)(cid:13)Σ−1 T (cid:13)(cid:13)(cid:13)s < ∞, (3.6.55) (cid:13)(cid:13)Σj,T (cid:13)(cid:13)s By similar argument, the first six terms in (1) all have the order of O(p1−δ0). Apply the same argument on (2) − (4) we have: ∂tij,T (θ) ∂θl = ∂tij(θ) ∂θl + O(p1−δ0) (3.6.56) By A6(iii), ∂tij,T (θ) ∂θl = Op(p). This completes the proof. 107 Proof of Theorem 5. From lemma 4, all regularity conditions for ΣT are satisfied. The proof of 5 is similar to that of Theorem 4. By replacing Σ by ΣT and replacing A3-A6(iii) by the results in lemma 4, the results in Theorem 5 follows. 3.6.4 Proofs for classification using PLDA Lemma 3.6.5. Let ˆθ be the estimate of θ0 and = Op( 1√ np). Define (cid:13)(cid:13)(cid:13)ˆθ − ˆθ0 (cid:13)(cid:13)(cid:13) 2 ˜Σ = ΣT (ˆθ) = Σ(ˆθ) ◦ K(w) where K(w) is defined in section 3.3.1.2. Assume A1, A2 and A11 and A12 hold,then (cid:13)(cid:13)(cid:13) ˜Σ − Σ (cid:13)(cid:13)(cid:13) 2 (cid:13)(cid:13)(cid:13) ˜Σ−1 − Σ−1(cid:13)(cid:13)(cid:13) = Op(cn) and = Op(cn) 2 where cn,p = max( wd√ np , 1 w ) Proof. (cid:13)(cid:13)(cid:13) ˜Σ − Σ (cid:13)(cid:13)(cid:13) 2 = (cid:13)(cid:13)(cid:13)Σ(ˆθ) ◦ K(w) − Σ(θ0) (cid:13)(cid:13)(cid:13) p(cid:88) (cid:12)(cid:12)(cid:12)r(ˆθ; hij)KT (hij, w) − r(θ0; hij) 2 (cid:12)(cid:12)(cid:12) ≤ max i j=1 where KT (h, w) = [(1 − h/w)+]2. For any i = 1, 2, ..., p, p(cid:88) ≤ (cid:88) j=1 (cid:12)(cid:12)(cid:12)r(ˆθ; hij)KT (hij, w) − r(θ0; hij) (cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)r(ˆθ; hij)KT (hij, w) − r(θ0; hij) (cid:12)(cid:12)(cid:12) + hij 0, (cid:90)  (cid:90)  hd−1γ(h)dh = O( 0 0 hd−1dh) = O(d/d) → 0 as  → 0 (3.6.71) Also, since Kν(h) ∝ e−hh− 1 2 (1 + O( 1 h)) as h → ∞, there exist a constant K, for any C sufficiently large, we have: (cid:90) ∞ hd−1γ(h)dh ≤ K (cid:90) ∞ 3.6.71 and 3.6.72 lead to (cid:82) ∞ C 0 hd−1+v− 1 2 e−hdh = Γ(d + v − 1 2) < ∞ (3.6.72) 0 hd−1γ(h)dh < ∞. Let p → ∞ in 3.6.70, we have lim supp→∞λmax(Σ) < ∞ if Σ is derived from M at´ern covariance function. Now consider the second part of A3. A1 assumes increasing domain framework. Bachoc and Furrer (2016) [4] showed that under A1 and some weak assumptions on the matrix covariance function, if the spectral density of the covariance function is positive, the smallest eigenvalue of the covariance matrix is asymptotically bounded away from zero. 117 Most standard covariance function such as M at´ern covariance function satisfy those assumptions hence satisfy the second part of A3. (cid:80)p Remarks on A4 and A5 : A4 and A5 are the same as the assumptions in [37]. (cid:107)Σk(cid:107) = F i,j=1 γ2 k(hij; θ), where γk(hij; θ) = ∂γ(hij ;θ) ∂θk , k = 1, 2, ..., q and θ is a k dimensional parameter. We now verify that M at´ern covariance function satisfy A4 for fixed ν. For M at´ern covariance function with fixed ν, we have 21−ν Γ(ν) = −σ2 21−ν Γ(ν) = σ2(1 − c) ∂γ(h) ∂σ2 = ∂γ(h) ∂c ∂γ(h) ∂(1/r) (h/r)νKν(h/r)(1 − c) (3.6.73) (h/r)νKν(h/r) 21−ν Γ(ν) h(h/r)ν(2 ν h/r Kν(h/r) − Kν−1(h/r)) It is easy to show that for each k, exist a constant  > 0 independent of n, p, for each i, there’s j such that γk(hij) > c. As a result, (cid:107)Σk(cid:107) Therefore (cid:107)Σk(cid:107)−1 = Op(p−1). i=1  = p. i,j=1 γ2 k(hij; θ) ≥(cid:80)p =(cid:80)p F F Remarks on A6 : We can also verify that the M at´ern covariance function with fixed ν satisfy A6(i) and A6(ii ). 
Similar to the procedure in the remarks on A3, it is sufficient to verify for any θ ∈ Θ, γk(h; θ) and γkj(h; θ) belong to the function space: (cid:90) ∞ (cid:61) = {f (x) : f (x)xd−1dx < ∞} 0 where d ≥ 1 is the dimension of the domain. We have the first-order partial derivative of M at´ern covariance function in (3.6.73). The second-order partial derivative of M at´ern 118 covariance function is as follows: (3.6.74) ∂2γ(h) (∂2σ2) ∂2γ(h) ∂σ2∂c ∂2γ(h) = 0 = −21−ν Γ(ν) = (1 − c) ∂σ2∂(1/r) ∂2γ(h) ∂c2 = 0 ∂2γ(h) ∂c∂(1/r) ∂2γ(h) ∂2(1/r) (h/r)νKν(h/r) 21−ν Γ(ν) h(h/r)ν(2 ν h/r Kν(h/r) − Kν−1(h/r)) Kν(h/r) − Kν−1(h/r)) ν h/r = −σ2 21−ν Γ(ν) = σ2(1 − c) − (4ν + 1)(h/r)ν−1h2Kν+1(h/r) − (h/r)νh2Kν+2(h/r)] [(h/r)ν−2h2Kν(h/r)(2ν − 1)2ν h(h/r)ν(2 21−ν Γ(ν) Note that the covariance function and its first-order and second-order partial deriva- to proving(cid:82) ∞ tives are linear combinations of a Bessel function of h times a polynomial of h. Similar 0 hd−1γ(h)dh < ∞ in (3.6.70), we have γk(h; θ) ∈ (cid:61) and γkj(h; θ) ∈ (cid:61). Hence A6(i) and A6(ii ) are satisfied. By similar procedure, we can verify that M at´ern covariance function also satisfy A12 and A13. 119 BIBLIOGRAPHY 120 BIBLIOGRAPHY [1] Stanis(cid:32)law Adaszewski, Juergen Dukart, Ferath Kherif, Richard Frackowiak, Bogdan Draganski, et al. How early can we predict Alzheimer’s disease using computational anatomy? Neurobiology of aging, 34(12):2815–2826, 2013. [2] Yaman Aksu, David J Miller, George Kesidis, Don C Bigler, and Qing X Yang. An MRI-derived definition of MCI-to-AD conversion for long-term, automatic prognosis of MCI patients. PLoS One, 6(10):e25074, 2011. [3] Juan Eloy Arco, Javier Ram´ırez, Juan Manuel G´orriz, Carlos G Puntonet, and Mar´ıa Ruz. Short-term prediction of MCI to AD conversion based on longitudinal MRI analysis and neuropsychological tests. In Innovation in Medicine and Healthcare 2015, pages 385–394. Springer, 2016. [4] Fran¸cois Bachoc and Reinhard Furrer. On the smallest eigenvalues of covariance matri- ces of multivariate spatial processes. Stat, 5(1):102–107, 2016. [5] Peter J Bickel and Elizaveta Levina. Some theory for Fisher’s linear discriminant func- tion,’naive bayes’, and some alternatives when there are many more variables than observations. Bernoulli, pages 989–1010, 2004. [6] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, 2008. [7] Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227, 2008. [8] T. Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011. [9] T. Tony Cai and Weidong Liu. A direct estimation approach to sparse linear discrim- inant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011. [10] T. Tony Cai, Zhao Ren, Harrison H Zhou, et al. Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics, 10(1):1–59, 2016. [11] T. Tony Cai and Anru Zhang. Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. Journal of multivariate analysis, 150:55–74, 2016. [12] T. Tony Cai, Cun-Hui Zhang, Harrison H Zhou, et al. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118–2144, 2010. 121 [13] T. Tony Cai and Linjun Zhang. 
High-dimensional linear discriminant analysis: Opti- mality, adaptive algorithm, and missing data1. Technical report, 2018. [14] Tingjin Chu, Jun Zhu, Haonan Wang, et al. Penalized maximum likelihood estimation and variable selection in geostatistics. The Annals of Statistics, 39(5):2607–2625, 2011. [15] Line Clemmensen, Trevor Hastie, Daniela Witten, and Bjarne Ersbøll. Sparse discrim- inant analysis. Technometrics, 53(4):406–413, 2011. [16] Robert W Cox. Afni: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical research, 29(3):162–173, 1996. [17] Noel Cressie. Statistics for spatial data. Terra Nova, 4(5):613–617, 1992. [18] Simon F Eskildsen, Pierrick Coup´e, Daniel Garc´ıa-Lorenzo, Vladimir Fonov, Jens C Pruessner, D Louis Collins, et al. Prediction of Alzheimer’s disease in subjects with mild cognitive impairment from the ADNI cohort using patterns of cortical thinning. Neuroimage, 65:511–521, 2013. [19] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of statistics, 36(6):2605, 2008. [20] Jianqing Fan, Yang Feng, and Xin Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771, 2012. [21] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001. [22] Christine Fennema-Notestine, Donald J Hagler, Linda K McEvoy, Adam S Fleisher, Elaine H Wu, David S Karow, and Anders M Dale. Structural MRI biomarkers for preclinical and mild Alzheimer’s disease. Human brain mapping, 30(10):3238–3253, 2009. [23] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, NY, USA:, 2001. [24] Peter Hall and Tapabrata Maiti. Choosing trajectory and data type when classifying functional data. Biometrika, page ass011, 2012. [25] Miriam Hartig, Diana Truran-Sacrey, Sky Raptentsetsang, Alix Simonson, Adam Mezher, Norbert Schuff, and Michael Weiner. UCSF freesurfer methods. ADNI: Alzheimers Disease Neuroimaging Initiative, San Francisco, 2014. [26] Roger Horn and Charles Johnson. Topics in matrix analysis. ZAMM-Journal of Applied Mathematics and Mechanics, 72(12):692–692, 1992. 122 [27] Clifford R Jack, Matt A Bernstein, Nick C Fox, Paul Thompson, Gene Alexander, Danielle Harvey, Bret Borowski, Paula J Britson, Jennifer L Whitwell, Chadwick Ward, et al. The alzheimer’s disease neuroimaging initiative (adni): Mri methods. Journal of magnetic resonance imaging, 27(4):685–691, 2008. [28] Clifford R Jack, David S Knopman, William J Jagust, Leslie M Shaw, Paul S Aisen, Michael W Weiner, Ronald C Petersen, and John Q Trojanowski. Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. The Lancet Neurology, 9(1):119–128, 2010. [29] David S Karow, Linda K McEvoy, Christine Fennema-Notestine, Donald J Hagler Jr, Robin G Jennings, James B Brewer, Carl K Hoh, and Anders M Dale. Relative capability of MR imaging and FDG PET to depict changes associated with prodromal and early Alzheimer’s disease. Radiology, 256(3):932–942, 2010. [30] Sang Han Lee, Alvin H Bachman, Donghyeon Yu, Johan Lim, Babak A Ardekani, et al. Predicting progression from mild cognitive impairment to Alzheimer’s disease using longitudinal callosal atrophy. 