ADVANCED CLASSIFICATION METHODS FOR LARGE SPATIAL-TEMPORAL DATA: APPLICATIONS TO NEUROIMAGING

By

Rejaul Karim

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics - Doctor of Philosophy

2019

ABSTRACT

ADVANCED CLASSIFICATION METHODS FOR LARGE SPATIAL-TEMPORAL DATA: APPLICATIONS TO NEUROIMAGING

By

Rejaul Karim

Spatial data are characterized by dependence between an observation indexed by a fixed point in space and its "neighbors". Exploiting such dependence leads to improvements in estimation and inference. Because such data are abundant in nature, existing methodologies are being extended to incorporate this proximal information. For example, when the latent model generating the data is spatially dependent, one would like to investigate how such dependence affects variable selection performance. This work is centered on a penalized estimating equation approach for a model with an expanding number (pn) of predictor variables, where the responses are generated from a Poisson model driven by a latent Gaussian model (a log-Gaussian Cox process). In the past this approach has been studied extensively in longitudinal data analysis. For Gaussian random fields with a conditional autoregressive (CAR) structure, we provide theoretical results for the estimator obtained from the penalized estimating equation. The oracle properties of the estimator, as described by Fan & Li (2001), are established.

Pattern detection in imaging data has led to a rise of classification methods that are effective in separating objects and structures in an image. This provides a major impetus in the context of medical imaging. Magnetic resonance images (MRI) collected in four dimensions (3D space and time) may be used to predict different disease phases of a particular patient. Linear discriminant analysis (LDA) is a classical tool for both dimension reduction and classification. However, in the high-dimensional setting where the number of features is much larger than the sample size, the within-class covariance estimate used by LDA is singular, rendering the classification rule unusable. Sparse discriminant methods have therefore been proposed to implement LDA in a high-dimensional setup. These methods do not incorporate dependencies in the feature covariance structure when the data acquired are spatially and temporally correlated. This article proposes a regularized high-dimensional LDA for spatio-temporal imaging data. Theoretically, we show that the proposed method achieves consistent parameter estimation and feature selection at an asymptotically optimal misclassification rate. An extensive simulation study shows a significant improvement in classification performance under spatial-temporal dependence. The method is applied to longitudinal structural MRI data obtained from the ADNI initiative.

The LDA classification rule is restrictive, since this paradigm is based on the strong assumption that the two classes are generated from Normal distributions with a common covariance function. In contrast, the support vector machine is considered a much more robust classifier due to its distribution-free approach. The tensor counterpart of the SVM, known as the support tensor machine, is widely used in the analysis of MRI images, which are tensors in their original format. The tensor structure preserves neighboring spatial information that is lost after vectorization.
In this work, we apply a memory-efficient random projection to tensor data as a dimension reduction method that preserves pairwise distances with high probability. Near-optimal classification consistency is shown, along with a few simulation studies.

To the memory of my family members who have motivated and supported me.

ACKNOWLEDGMENTS

I would like to sincerely thank my chair advisor, Professor Tapabrata (Taps) Maiti, for everything. Few words can do justice to his contribution to my career. His patience and support resuscitated my academic career from time to time. I have the utmost gratitude for Professor Chae Young Lim, who spent a great deal of effort proof-checking and guided me with numerous expert suggestions. Her smart ideas helped me steer through research bottlenecks. I am indebted to my committee member Professor Yimin Xiao, who introduced me to spatial statistics. His mathematical intuition still amazes me. I am grateful to Professor Arun Ross for his generosity in unconditionally agreeing to serve on my committee. He inspires me through his interesting applied approaches, which have revolutionized biometrics. A very special mention goes to my senior and collaborator Abdhi Sarkar, who helped me with her superb coding skills, especially regarding MRI data preprocessing and analysis. I would have been unable to complete this dissertation without her. I am also thankful to my colleague Peide Li (Peter), who responded immediately to distress calls. The third work of my thesis is largely based on an extension of his work. I am grateful to Yingjie Li, who has permitted me to use her material in my second work, which is an extension of her thesis. Lastly, I would like to thank my parents and my sister for their moral support.

TABLE OF CONTENTS

LIST OF TABLES
KEY TO SYMBOLS
Chapter 1  Analysis of Spatial Count Data: Penalized Estimating Equation Approach
    1.1 Introduction
    1.2 Literature review
    1.3 Model
    1.4 Estimation
        1.4.1 Transformation approach by Yasui & Lele (1997)
            1.4.1.1 High dimensional curse
    1.5 In search for transformation function
        1.5.1 Integral equation approach
            1.5.1.1 Charlier Polynomial
        1.5.2 Properties of marginal model
    1.6 Penalized quasi likelihood on Induced model
    1.7 GEE
        1.7.0.1 Short range dependence
    1.8 Penalty
        1.8.1 Smooth Clipped Absolute Deviation Penalty
    1.9 Assumptions
    1.10 Consistency
    1.11 CLT
    1.12 Simulation results
    APPENDIX
Chapter 2  High Dimensional Sparse-LDA for spatio-temporal Data
    2.1 Introduction
    2.2 Review of classical Linear discriminant Analysis (LDA)
    2.3 Spatio-temporal LDA
    2.4 Spatio-temporal LDA
        2.4.1 Spatio-temporal covariance
        2.4.2 Irregular lattice points in space and time
        2.4.3 Well separateness in space and time domain
    2.5 Regularity Conditions
        2.5.1 Asymptotic optimal misclassification rate
    2.6 Penalized Linear Discriminant Analysis (pLDA)
        2.6.1 Across sample independence
        2.6.2 REML estimation
        2.6.3 Validation of REML assumptions
            2.6.3.1 Tapered REML
        2.6.4 Regularity conditions for penalty
    2.7 Algorithm and methodology to obtain optimal solutions
        2.7.1 Asymptotic properties of one-step estimates
    2.8 Computational Complexity
        2.8.1 Covariance Tapering
        2.8.2 One way Tapering vs Two way Tapering
        2.8.3 Tapering range
    2.9 Misclassification Optimality
    2.10 MRI Data Preprocessing
    2.11 Simulation Studies
        2.11.1 Exponential Space Time Covariance (weak correlations)
            2.11.1.1 With ∆
        2.11.2 Exponential Space Time Covariance (strong correlations)
            2.11.2.1 With ∆
        2.11.3 Matern Space Covariance and exponential time (separable)
            2.11.3.1 With ∆
        2.11.4 Non-separable space and time Gneiting covariance
            2.11.4.1 With ∆
    APPENDIX
Chapter 3  Random Projection for Tensor data
    3.1 Introduction
    3.2 Kronecker factors
        3.2.1 Literature reviews
    3.3 Weak dependence
    3.4 Concentration Inequality
        3.4.1 Choice of d
        3.4.2 Memory efficiency
            3.4.2.1 Tensor type data
        3.4.3 Variance reduction through averaging
    3.5 Simulation result
    3.6 Future scope
    3.7 Introduction to Tensors
        3.7.1 Limitation for Gaussian assumptions
    3.8 Preliminaries
        3.8.1 Mathematical Background for Tensor
    3.9 Kernelized Support Tensor Machine
        3.9.1 Framework of the Classification Problem
        3.9.2 Support Tensor Machine
        3.9.3 STM with random projection
        3.9.4 Solving the STM
        3.9.5 Estimation with Complete Tensor Data
    3.10 Statistical Property of STM
    APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Bias of absolute value, Sample Standard Deviation, Empirical Coverage probability of β
Table 2.1: Exponential separable space and time covariance estimation when ∆ is the mean vector
Table 2.2: Estimation and selection of ∆ using the estimated covariance
Table 2.3: Estimation and selection of ∆ under independence
Table 2.4: Exponential separable space and time covariance estimation when ∆small is the mean vector
Table 2.5: Estimation and selection of ∆small using the estimated covariance
Table 2.6: Estimation and selection of ∆small under independence
Table 2.7: Exponential separable space and time covariance estimation when ∆ is the mean vector
Table 2.8: Estimation and selection of ∆ using the estimated covariance
Table 2.9: Estimation and selection of ∆ under independence
Table 2.10: Exponential separable space and time covariance estimation when ∆small is the mean vector
Table 2.11: Estimation and selection of ∆small using the estimated covariance
Table 2.12: Estimation and selection of ∆small under independence
Table 2.13: Matern covariance with separable exponential time when ∆ is the mean vector
Table 2.14: Estimation and selection of ∆ using the estimated covariance
Table 2.15: Estimation and selection of ∆ under independence
Table 2.16: Gneiting non-separable covariance when ∆ is the mean vector
Table 2.17: Estimation and selection of ∆ using the estimated covariance
Table 3.1: Average total deviation from 1 of the ratios of pairwise distances between projected and actual data (variability)

KEY TO SYMBOLS

1. ≤ denotes the positive semidefinite ordering, i.e., we write A ≤ B if the matrix B − A is positive semidefinite.
2. | · | denotes the cardinality of a set.
3. ∇_β K is the gradient of a vector K with respect to β.
4. ◦ signifies the Schur or Hadamard product of two matrices.
5. ⊗ signifies the Kronecker product of two matrices.
6. μ̈ denotes the second derivative of the function μ; similarly, the triple-dot notation denotes the third derivative, and so on.
7. := signifies assignment and is read as "denoted by".
8. Tr A denotes the trace of a matrix A.
9. LHS stands for the Left Hand Side of an equation.
10. ∼ denotes neighbors, i.e., u ∼ v implies u and v are adjacent voxels.
11. ⟨·, ·⟩ signifies the inner product of two vectors.
12. sign(a) signifies −1 or +1 depending on whether a < 0 or a > 0, respectively.
13. I(·) denotes the indicator operator.
14. TP (True Positives) represents the average number of correctly detected nonzero coefficients.
15. FP (False Positives) represents the average number of incorrectly detected nonzero coefficients.
16. CP gives the average empirical Coverage Probability of the 95% confidence intervals.
17. OSE stands for One-Step Estimation.

Chapter 1
Analysis of Spatial Count Data: Penalized Estimating Equation Approach

1.1 Introduction

Geographical factors play a significant role in epidemiology. Poisson regression is popularly used for the analysis of disease rates, plant growth, and similar counts; it assumes that the rates in nearby regions are independent and that the variance of the response equals its mean (Yasui & Lele, 1997). Hierarchical models have been proposed to utilize spatial locations and neighborhood structure in the analysis of disease rates. Under a hierarchical model the marginal likelihood becomes intractable, and this issue can be addressed using estimating equations.

1.2 Literature review

Mardia and Marshall (stationary setup, CAR, low-dimensional estimation) produced consistent estimators in the Gaussian CAR model using likelihood methods. The likelihood, however, is not tractable for the Poisson log-normal distribution. The mean field method cannot be used here, since it assumes the latent variables are independent. Following the method of Yasui & Lele (1997), we need the exact form of a function Y* that is an unbiased estimator of log Y in order to prove theoretical results. One way to achieve this is to solve a Fredholm equation of the first kind; however, this equation is not solvable, since log Y does not have a finite expectation when Y ∼ Poisson(λ). Liang and Zeger (1986) used the idea of estimating equations for longitudinal data. Wang L. (2011) extended this to a diverging number of covariates. Wang L., Zhou J., and Qu A. (2011) used penalized estimating equations for high-dimensional inference. CLS-type estimates, MM estimates, and the estimating equation approach are used here. The central limit theorem for these statistics can be obtained from Peligrad's results on ρ-mixing. A correlation upper bound is obtained for banded precision matrices; for the general CAR model the problem remains open.
Some progress can be made using the partitioned matrix inverse

(Σ^{−1})_{ii} = ( Σ_{ii} − Σ_{i,−i} (Σ_{−i,−i})^{−1} Σ_{−i,i} )^{−1}.

Lin and Clayton (2005) extended the use of estimating equations to spatial binary data, referring to the work of Zeger (1988) on time series of counts. These works show that the estimating equation approach can be used for spatially dependent data under appropriate mixing conditions. These papers estimated the regression parameters from a quasi-likelihood score function. The asymptotic covariance of such estimators depends on unknown nuisance parameters, such as the scale and correlation parameters (σ², γ). Lele (1991) used jackknifing tools to obtain reduced-bias estimators of the nuisance parameters. Heagerty and Lumley (2000) proposed nonparametric estimation of the covariance matrix using subsampling windows for time series and general lattice data. Prentice (1988) provided a consistent estimate of the scale and correlation parameters using a second quasi-likelihood score function; there, the precision matrix in the quasi-score requires knowledge of third and fourth moments, which is replaced by a "working" precision matrix. This work generalizes the idea of joint optimal estimation of (β_n, σ², γ) of Godambe and Thompson (1989) for independent data when the skewness and kurtosis of the distribution are known in advance.

1.3 Model

For county i ∈ {1, 2, .., n}, let
Y_i = observed disease cases in county i,
E_i = expected disease cases in county i (the E_i are known),
Ψ_i = logarithm of the ratio of the disease rate in county i to some reference rate.

Y_i | Ψ_i ~ind Poisson(e^{Ψ_i} E_i),
Ψ_{n×1} ~ N( X_{n×p_n} β_{p_n×1}, σ² V_n(γ) ),
V_n(γ) = (I − γ M_n W_n)^{−1} M_n,   γ ∈ (−1, 1),  σ² ∈ (0, ∞).

Define the n×n adjacency matrix W by W_ij = 1 if i and j are neighbours and W_ij = 0 otherwise, and let W_{i+} = Σ_{j∈N(i)} W_ij and M_{n×n} = Diag(1/W_{i+}).

The matrix W_n can be interpreted as the adjacency matrix of a graph whose vertex set is the index set {1, 2, ..., n} of the random variables; (W_n^r)_{i,j} is the number of paths of length r from vertex i to vertex j. W_n is irreducible if it is the adjacency matrix of a connected graph; mathematically, (W_n^r)_{i,j} > 0 for some r and all i, j ∈ {1, 2, .., n}. If W_n is reducible, then it can be arranged as a block diagonal matrix of irreducible matrices,

W = Diag( W_(1), W_(2), ..., W_(c) ),

where each block W_(j) is the irreducible adjacency matrix of the graph restricted to the j-th connected component; the block diagonal structure of W_n reflects the isolated connected components of the corresponding graph. The corresponding V_{1,n} = (I − γ M_n W_n)^{−1} M_n is then also block diagonal. Without loss of generality we prove all results assuming W_n is irreducible, since they extend to block diagonal W_n in the same way.

1.4 Estimation

The objective is to estimate β, σ², and γ from the observations {Y_i}. The most straightforward method would be maximum likelihood estimation (MLE), but the likelihood is intractable, so Yasui & Lele (1997) used a transformation followed by MLE.

1.4.1 Transformation approach by Yasui & Lele (1997)

Define
Y* = f_1(Y) such that E(Y* | Ψ) = Ψ + b*(Ψ),
Y** = f_2(Y) such that E(Y** | Ψ) = Ψ² + b**(Ψ),
with b*(·) and b**(·) satisfying some regularity conditions.
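Before writing down the estimating equations, the following minimal numpy sketch (added here for illustration; it is not code from the thesis) simulates the hierarchical model of Section 1.3 on a ring lattice and checks empirically that a natural candidate transformation, Y* = log(Y + 1/2), recovers Ψ only up to a bias b*(Ψ). The ring adjacency, the sample sizes, and the specific choice of f_1 are assumptions made purely for this illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes (not the thesis settings): n counties on a ring, p covariates.
    n, p = 200, 5
    gamma, sigma2 = 0.6, 0.5

    # Adjacency W: ring graph, so each county has two neighbours; M = Diag(1 / w_{i+}).
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0
    M = np.diag(1.0 / W.sum(axis=1))

    # Latent CAR field: Psi ~ N(X beta, sigma2 * (I - gamma M W)^{-1} M).
    X = rng.normal(size=(n, p))
    beta = np.concatenate([[1.0, -0.5], np.zeros(p - 2)])      # sparse truth
    V1 = sigma2 * np.linalg.inv(np.eye(n) - gamma * M @ W) @ M
    V1 = (V1 + V1.T) / 2.0                                      # symmetrize numerically
    Psi = rng.multivariate_normal(X @ beta, V1)

    # Observed counts: Y_i | Psi_i ~ Poisson(E_i * exp(Psi_i)), with E_i = 1.
    Y = rng.poisson(np.exp(Psi))

    # Candidate transformation Y* = log(Y + 1/2): E(Y*|Psi) = Psi + b*(Psi),
    # and the bias b* is clearly visible at small intensities.
    Ystar = np.log(Y + 0.5)
    print("average of Y* - Psi over counties:", np.mean(Ystar - Psi))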
With q_{λ_n} denoting the SCAD penalty function with parameter λ_n, the following set of estimating equations is solved:

U_n(β_n) = (1/n) X' [σ² V_{3,n}(γ)]^{−1} (Y* − Xβ) − q_{λ_n}(β_n),   where S_n = (1/n) X' [σ² V_{3,n}(γ)]^{−1} (Y* − Xβ) is the unpenalized score,
F_{1,n}(γ) = (Y* − Xβ)' M_n W_n (I − γ M_n W_n)(Y* − Xβ) = 0,
F_{2,n}(σ²) = (Y* − Xβ)' (M_n^{−1} − γ W_n)(Y* − Xβ) − nσ² = 0.

1.4.1.1 High dimensional curse

These equations are derived assuming asymptotic unbiasedness of Y* for large values of Ψ and a plug-in method-of-moments step, i.e., Y* is used in place of Ψ in the estimating equations. But when p_n exceeds n, the above method breaks down due to the accumulation of bias of order p_n/n, as shown in the appendix.

1.5 In search for transformation function

Here we considered two approaches to obtain a function that can serve as an unbiased estimate of Xβ. Using these two methods we could only approximate the target function, and in high dimensions such approximations fail badly.

1.5.1 Integral equation approach

We want to solve for a function Y* = f(Y) such that E(Y* | Ψ) = Ψ, or equivalently

Σ_{i=0}^∞ e^{−λ} (λ^i / i!) f(i) = log λ   for all λ ∈ (0, ∞).

This is a Fredholm integral equation of the first kind, of the form Kf = g, where K is the integral (expectation) operator of a Poisson(λ) random variable with kernel K(λ, i) = λ^i/i!, g is the known data function, here g(λ) = log λ, and f is the solution function to be solved for. A natural heuristic is to think of K as a matrix whose rows are indexed by λ and whose columns are indexed by i ∈ {0, 1, ..., ∞}, with f an unknown vector and g a vector depending on λ. Assuming all regularity conditions for the Poisson kernel hold, so that K admits a singular value decomposition (SVD), the equation could in principle be solved. But the data function g(λ) = log(λ) ∉ L¹ with respect to Lebesgue measure on (R, B). Hence log(λ) cannot be expanded with respect to an orthonormal basis, and therefore the equation admits no series solution.

1.5.1.1 Charlier Polynomial

Charlier polynomials {C_m}_{m=0}^∞ are a family of polynomials orthogonal with respect to Poisson weights:

Σ_{i=0}^∞ (λ^i / i!) C_m(λ, i) C_n(λ, i) = λ^{−n} e^{λ} n! δ_{mn},   λ > 0.

Suppose log(λ) = Σ_{n=0}^∞ a_n λ^{−n} n! for some constants a_n; then the function f(y) could be approximated analytically. However, log(λ) does not admit such an expansion on (0, +∞), so the solution f cannot be found using Charlier polynomials. Many numerical approximations are nonetheless possible, using techniques such as quadrature.

1.5.2 Properties of marginal model

Finally, we turned to work that gives the basic properties of the marginal distribution of this hierarchical model, also known as the Poisson log-normal model. We derive the first few moments needed in order to solve the estimating equation. The induced model is a multivariate Poisson log-normal distribution, PΛ(Xβ, σ²V_{1,n}(γ)), with moments provided in the appendix.

1.6 Penalized quasi likelihood on Induced model

Basawa discussed likelihood estimation of the parameter of interest under a mixture model in which the nuisance parameter has a prior. He showed that, under an exponential family setup, conditional likelihood estimation is asymptotically as efficient as estimation under the mixture distribution. However, our problem does not fit this paradigm, since here the prior is placed on the latent random variable itself. Extending the idea of conditional least squares, it can be shown that there are two cases for estimation. In Case 1 we can estimate all parameters β, γ, σ² jointly by differentiating the same pseudo-likelihood with respect to the different parameters.
In Case 2 we can use two different pseudo-likelihoods, namely

h_i = ( Y_i − E(Y_i | F_{−i}) )²   and   h'_i = ( ( Y_i − E(Y_i | F_{−i}) )² − Var(Y_i | F_{−i}) )²,

for i = 1, ..., n, each multiplied by appropriate weights for optimality. Here F_{−i} = σ(Y_1, ..., Y_{i−1}, Y_{i+1}, ..., Y_n) is the sigma field generated by all samples other than Y_i. In our setup the conditional expectation E(Y_i | F_{−i}) is intractable, so we work with the simplified case: estimate all parameters through h_i = (Y_i − μ_i(β_n, σ², γ))², then multiply by appropriate weights for optimality.

1.7 GEE

Since we have the first few moments of the marginal distribution, we can formulate the estimating equation. Let X = (X_1, X_2, ..., X_n)' be the coefficient matrix for β, so that X_i' is the i-th row of X. Then

μ_i(β_n, σ², γ) = E(Y_i) = exp(−X_i'β_n + σ² V_{1,n}(γ)_{ii}/2) = θ_{n,i} R_{ii},

V_n(σ², γ, β_n) = Diag(μ_i) + Diag(μ_i) [ R^{◦2}(σ², γ) − J_n ] Diag(μ_i),

where exp(−X_i'β_n) = θ_{n,i}, exp(σ² V_{1,n}(γ)_{ii}/2) = R_{ii},

R^{◦2}(σ², γ) − J_n = ( exp(σ² V_{1,n i,j}) − 1 )_{i,j},

R^{◦2} denotes the Hadamard product R ◦ R, and J_n = 1_n 1_n'.

The equations are derived from stationarity of the KKT conditions. The following score functions arise when we treat β, γ, and σ² as mean parameters under the constraints:
• a sparsity-inducing penalty on β,
• σ² ≥ 0,
• |γ| ≤ 1.

U_n(β_n, σ², γ) = −(1/n) (∂μ/∂β)' [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + q_{λ_n}(|β_n|) ◦ sgn(β_n)
               = −(1/n) X' Diag(μ_i) [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + q_{λ_n}(|β_n|) ◦ sgn(β_n) = 0,   λ_n ≥ 0,

where the first term is denoted S_n and q_{λ_n} is the derivative of the SCAD penalty with respect to β with penalty parameter λ_n.

F'_{3,n}(γ, β_n, σ²) = (1/n) (∂μ/∂γ)' [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + v sgn(γ)
                    = (σ²/2n) μ' Diag( V_{1,n}^{−1} W_n V_{1,n}^{−1}(γ) ) [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) + v sgn(γ) = 0,   v ≥ 0,

F'_{2,n}(σ², γ, β_n) = (1/n) (∂μ/∂σ²)' [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) − u
                    = (1/2n) μ' Diag( V_{1,n}(γ) ) [V_n(σ², γ, β_n)]^{−1} ( Y − μ(β_n, σ², γ) ) − u = 0,   u ≥ 0.

1.7.0.1 Short range dependence

We need a short range dependence property, which guarantees variable selection consistency: the variance-covariance matrix V_n(σ², γ, β_n) of Y defined above must have row (and column) sums bounded from above and below. Here the entries of V_n are obtained from those of V_{1,n} through an entrywise (Hadamard) function f : R → R.

Lemma 1.7.1. ||V_n||_∞ ≥ sup_i ( μ_i + μ_i² σ² / (w_{i+}(1 + |γ|)) ).

Lemma 1.7.2. ||V_n||_∞ ≤ sup_i ( μ_i + μ_i² · (exp(σ² V_{1,n i,i}) − 1)/(σ² V_{1,n i,i}) · σ²/(w_{i+}(1 − |γ|)) ).

Since ||V_n||_∞ < ∞ for all n, it follows that Σ_{j=1}^∞ Cov(Y_i, Y_j) < ∞, which represents short range dependence.
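As a concrete illustration of the quantities defined in Section 1.7, the sketch below assembles μ, the working covariance V_n, and the unpenalized quasi-score S_n in numpy. It is only a sketch under stated assumptions, not code from the thesis: the function name and its inputs are hypothetical, the sign convention for μ follows the text as printed, and the penalty term is omitted.

    import numpy as np

    def quasi_score(beta, sigma2, gamma, X, Y, W):
        # V_{1,n}(gamma) = (I - gamma M W)^{-1} M, with M = Diag(1 / w_{i+}).
        n = len(Y)
        M = np.diag(1.0 / W.sum(axis=1))
        V1n = np.linalg.inv(np.eye(n) - gamma * M @ W) @ M
        # mu_i = exp(-X_i' beta + sigma2 * V_{1,n,ii} / 2), sign as printed in Section 1.7.
        mu = np.exp(-X @ beta + sigma2 * np.diag(V1n) / 2.0)
        # Working covariance V_n = Diag(mu) + Diag(mu) [exp(sigma2 V_{1,n}) - 1] Diag(mu).
        Vn = np.diag(mu) + np.diag(mu) @ (np.exp(sigma2 * V1n) - 1.0) @ np.diag(mu)
        # Unpenalized quasi-score S_n = (1/n) (d mu / d beta)' V_n^{-1} (Y - mu),
        # where d mu / d beta = -Diag(mu) X under this parametrization.
        dmu = -X * mu[:, None]
        return dmu.T @ np.linalg.solve(Vn, Y - mu) / n

With simulated (X, Y, W) such as those in the earlier sketch, quasi_score(beta, sigma2, gamma, X, Y, W) returns a length-p_n vector whose components should fluctuate around zero when evaluated at the true parameters.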
1.8 Penalty

To prove consistency, the penalized score function needs a Taylor expansion, so an unbiased, sufficiently smooth penalty function is desirable in the sense of Heyde's constrained estimating equations.

1.8.1 Smooth Clipped Absolute Deviation Penalty

A nonconcave penalty with oracle properties, due to Fan and Li (2001), with derivative

q'_λ(|θ|) = λ [ I(|θ| < λ) + (aλ − |θ|)_+ / ((a − 1)λ) · I(|θ| ≥ λ) ],   a > 2;

for computation, a = 3.7 is used. An alternative penalty is the MCP,

q'_λ(|θ|) = λ ( 1 − |θ|/(aλ) )_+,   a > 0,

where the value of a is chosen by cross-validation. Here S_n(β̂_n) is the unpenalized score; we add the penalty term and prove an asymptotic approximate zero-crossing, i.e., U_n = o_p(a_n) with a_n → 0, which is sufficient since U_n may be discontinuous due to the penalty.

1.9 Assumptions

1. E_i = 1 for all i.
2. The true parameter is β_n' (1×s_n) = ( β_n' (1×p_n), 0 (1×(s_n − p_n)) ).
3. The true parameter σ² is bounded, and |γ| ≤ 1.
4. The true parameter β_n lies in the interior of a compact subset B ⊂ R^{p_n}.
5. X is bounded elementwise.
6. 0 < λ_min( (1/n) Σ_{i=1}^n X_i X_i' ) ≤ λ_max( (1/n) Σ_{i=1}^n X_i X_i' ) < ∞.
7. min_{1≤j≤s_n} β_{n0j}/λ_n → ∞. If p_n = o(n^α), then λ_n = O(n^{−η}) with 0 < α < 4/3 and 0 < η < 2 − α.
8. max_i w_{i+,n} = k < ∞.
9. Σ_{i=1}^{s_n} |β_{i,n}| < ∞.
10. min_{1≤i≤p_n} β_{n,0}(i)/λ → ∞.
11. p_n³/n = o(1).
12. λ_n → 0.
13. p_n² (log n)⁴ = o(n λ_n²).

1.10 Consistency

Theorem 1.10.1. There exists an approximate GEE solution at the n-th step, β̂_{n,n}' = (β̂_{n1,n}', β̂_{n2,n}'), such that

P( |U_{nj}(β̂)| = 0, j = 1, 2, ..., p_n ) → 1,
P( |U_{nj}(β̂)| ≤ λ_n/log n, j = p_n + 1, p_n + 2, ..., s_n ) → 1,
P( β̂_{n2} = 0 ) → 1.

Remark 1. The approximate nature of the solution is conveyed through the second item, which says that elements p_n + 1 through s_n of the score equation evaluated at β̂_{n,n} are not exactly 0 but are bounded by λ_n/log n, so by choosing λ_n carefully we can attain a good approximation.

1.11 CLT

Define Ξ_n = (∂μ'/∂β_n) V_n^{−1} (∂μ/∂β_n), and define the sandwich variance estimator

Ĥ_n = Ξ̂_n^{−1} [ (∂μ'/∂β_n) V̂_n^{−1} ( Y − μ̂(β_n) )( Y − μ̂(β_n) )' V̂_n^{−1} (∂μ/∂β_n) ] Ξ̂_n^{−1}.

Then, for all α_n ∈ R^{p_n} with ||α_n|| = 1,

α_n' Ĥ_n(β_{n,0})^{−1/2} ( β̂_n − β_{n,0} ) → N(0, 1).
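Before turning to the simulation results, here is a minimal sketch (an illustration added here, not code from the thesis) of the SCAD and MCP penalty derivatives of Section 1.8.1 that enter the penalized score as q_{λ_n}; a = 3.7 for SCAD follows the text, while a = 3.0 for MCP is only a placeholder default.

    import numpy as np

    def scad_derivative(theta, lam, a=3.7):
        # q'_lambda(|theta|) = lam [ I(|theta| < lam) + (a lam - |theta|)_+ / ((a-1) lam) I(|theta| >= lam) ]
        t = np.abs(theta)
        return lam * ((t < lam) + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t >= lam))

    def mcp_derivative(theta, lam, a=3.0):
        # q'_lambda(|theta|) = lam (1 - |theta| / (a lam))_+
        t = np.abs(theta)
        return lam * np.maximum(1.0 - t / (a * lam), 0.0)

    # The derivative equals lam near zero (promoting sparsity) and vanishes for large
    # coefficients (limiting bias), e.g. scad_derivative(np.array([0.01, 5.0]), lam=0.1).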
1.12 Simulation results

We simulate using X = I and a maximal number of neighbors k = 10. We record the standard errors, and set p = n^1.5 and s = n^0.8. Table 1.1 reports, for each setting, the triple (bias in absolute value, sample standard deviation, empirical coverage probability) of β.

Table 1.1: Bias of absolute value, Sample Standard Deviation, Empirical Coverage probability of β

n = 400, s_n = 30, p_n = 500
 σ²      γ = 0.05              γ = 0.75              γ = 0.9
 0.1     (0.28, 0.81, 0.85)    (0.24, 0.74, 0.84)    (0.21, 0.70, 0.87)
 10      (0.27, 0.73, 0.76)    (0.21, 0.73, 0.85)    (0.22, 0.68, 0.86)
 100     (0.38, 0.85, 0.60)    (0.35, 0.82, 0.72)    (0.32, 0.83, 0.85)

n = 1000, s_n = 30, p_n = 2000
 σ²      γ = 0.05              γ = 0.75              γ = 0.9
 0.1     (0.22, 0.49, 0.91)    (0.21, 0.27, 0.91)    (0.21, 0.28, 0.93)
 10      (0.25, 0.57, 0.89)    (0.23, 0.59, 0.91)    (34.90, 0.91)
 100     (0.31, 0.61, 0.83)    (0.23, 0.61, 0.84)    (0.30, 0.58, 0.84)

APPENDIX

Variance Components

V_{3,n}(β_n, σ²_n, γ_n) = Var(Y*) = Var_Ψ( E_{Y|Ψ}(Y*) ) + E_Ψ( Var_{Y|Ψ}(Y*) ),

where the first term is V_{1,n} and the second is V_{2,n}, with

V_{1,n} = σ² (I − γ M_n W_n)^{−1} M_n,
V_{2,n} = Diag[ (1/E_i) exp(−X_i'β + σ² V_{1,n}(γ)_{ii}/2) + E_Ψ( b*'(Ψ_i)² exp(Ψ_i) + b*(Ψ_i)² ) ].

(The CRLB is attained for the exponential family.)

Var_{Y|Ψ}(Y*) = ( ∂(Ψ + bias(Ψ)) / ∂ exp(Ψ) )² / I( exp(Ψ) ) = exp(−Ψ/E) [1 + b*'(Ψ)]².

By the moment generating function of the Normal distribution,

E_Ψ( exp(−Ψ_i/E_i) ) = exp(−X_i'β + σ² V_{1,n}(γ)_{ii}/2) / E_i.

Assuming negligible bias and a derivative of the bias of order 1/n,

(V_{2,n})_{ii} = E_Ψ Var_{Y|Ψ}(Y*_i) + E_Ψ b*(Ψ_i)²,

which we approximate to get

V_{2,n} = Diag( (1/E_i) exp(−X_i'β + σ² V_{1,n}(γ)_{ii}/2) ).

Inequalities using the function Y*

Proof. We write e^Ψ = λ from here onwards. Since log(x + 0.5) is a concave function, Jensen's inequality gives E log(Y + 0.5) ≤ log(E Y + 0.5) = log(λ + 0.5), using Y | λ ~ind Poisson(λ). By the Stein-Chen identity, E(Y g(Y)) = λ E(g(Y + 1)) whenever E|Y g(Y)| and E|g(Y + 1)| exist. Hence the Taylor approximation is valid, and central moments can be derived using Stirling numbers.
From Log Sovolev inequality in Poisson Measure for any function f : R → (0,∞) (cid:20) f (Y )log(cid:0)f (Y )(cid:1) − E[f (Y )] log(cid:0) E[f (Y )](cid:1)(cid:21) Eλ ≤ λ E (cid:18)(cid:0) f (Y + 1) − f (Y )(cid:1)2 (cid:19) f (Y ) Here E denotes Eλ Using f (Y ) = Y + 1 2 we obtain using Log sovolev inequality (cid:20) E (Y + ≤ λ E ≤ λ E E + 1 2 1 2 1 2 1 2 1 2 λ E ) log(cid:0)Y + (cid:18) 1 (cid:19) (cid:20) (cid:18) 1 (cid:19) (cid:20) log(cid:0)Y + Y + 1 2 Y + 1 2 (cid:1) − (λ + log(cid:0)Y + (cid:20) (cid:1)(cid:21) ) log(cid:0) λ + + 1(cid:1) − log(cid:0) λ + (cid:18) 1 (cid:19) (cid:1) + log(cid:0)Y + (cid:18) 1 (cid:1)(cid:21) (cid:1) − log(cid:0) λ + ≤ λ E Y + 1 2 λ E 1 2 1 2 1 2 1 2 1 2 Y + 1 2 (cid:20) E 1 2 (cid:1) − log(cid:0) λ + log(cid:0)Y + (cid:1)(cid:21) (cid:19)2 − log(cid:0) λ + 1 2 1 2 1 2 1 Y + 1 2 + b (cid:1)(cid:21) 1 2 −, (cid:19) + (cid:18) (cid:1)(cid:21) (cid:4) 17 Moments of Log Normal Poisson E(Y i) = exp (X (cid:124) i β + σ2V 1,n(γ)ii/2) = µi (cid:0)exp(σ2V 1,n(γ)ii) − 1(cid:1) (cid:0)exp(σ2V 1,n(γ)ij) − 1(cid:1) V ar(Y i) = µi + µ2 i Cov(Y i, Y j) = µiµj Corr(Yi, Yj) = exp(σ2V 1,n(γ)ij − 1 [exp(σ2V 1,n(γ)ii) − 1 + µi] 1 2 [exp(σ2V 1,n(γ)jj) − 1 + µj] 1 2 (cid:18) Y i − µi(β) E (cid:19)3 = exp(3X +2 exp(3X (cid:124) i β + 9σ2V 1,n(γ)ii/2) − 3 exp(3X (cid:124) i β + 3σ2V 1,n(γ)ii/2) − 3 exp(2X (cid:124) i β + 2σ2V 1,n(γ)ii/2) + exp(1X (cid:124) i β + 5σ2V 1,n(γ)ii/2) (cid:124) i β + 4σ2V 1,n(γ)ii/2) (cid:124) i β + 1σ2V 1,n(γ)ii/2) −3 exp(2X E(Y i − E(Y i))4 = exp(4X (cid:124) i β + 16σ2V 1,n(γ)ii/2) − 4 exp(4X (cid:124) i β + 6σ2V 1,n(γ)ii/2) − 3 exp(4X (cid:124) i β + 9σ2V 1,n(γ)ii/2) − 12 exp(3X (cid:124) i β + 3σ2V 1,n(γ)ii/2) + 7 exp(2X (cid:124) i β + 2σ2V 1,n(γ)ii/2) + exp(1X (cid:124) i β + 10σ2V 1,n(γ)ii/2) (cid:124) i β + 4σ2V 1,n(γ)ii/2) (cid:124) i β + 5σ2V 1,n(γ)ii/2) (cid:124) i β + 4σ2V 1,n(γ)ii/2) (cid:124) i β + 1σ2V 1,n(γ)ii/2) +6 exp(4X +6 exp(3X +6 exp(3X −4 exp(2X Lemma 1.12.1. Corr(Yi, Yj) ≤ Corr(Ψi, Ψj) Proof. Standardize ( Ψi σ2V 1,n(γ)ii , Ψj σ2V 1,n(γ)jj ) Let resulting marginal model be (Zi , Zj) 18 Standardize ( Yi V ar(Yi) , Yj V ar(Yj ) ) Use the inequality Corr(Yi, Yj) = b > 0 , c > 0 (cid:0)ea−1+b(cid:1)0.5(cid:0)ea−1+c(cid:1)0.5 < a = Corr(Zi, Zj) for |a| < 1 and ea−1 (cid:4) Lemma 1.12.2. Log normal distribution is dependent on first two moments of the latent model Proof. E(eit (cid:124)Y ) = EΨEY |Ψ(eit (cid:80)n j=1 Ψj (eitj−1) = exp EΨ e (cid:0) n(cid:89) j=1 j Y )(cid:1) = EΨ (cid:124) n(cid:89) j=1 EY j|Ψj (eit eΨj (eitj−1) (cid:124)Xβ + (cid:124) σ2 (cid:94) (eit − 1) V 1,n(γ) (cid:94) (eit − 1) 2 (cid:124)Y ) = EΨ (cid:16) (eit − 1) (cid:17) (cid:4) (cid:94) (eit − 1) denotes vector here (cid:94) (eit − 1)j = (eitj − 1) Relationship between induced and latent Model co variance (cid:18) 1 (cid:18) 1 (cid:18) 1 Ei Ei (V 2,n)ii = ≤ ≥ exp (−X iβ + σ2V 1,n(γ)ii/2) exp (−X iβ + σ2 2(1 − |γ|)wi+ ) 1 (cid:19) exp (−X iβ + σ2 1 2wi+ ) Ei (cid:19) (cid:19) Range of γ For the matrix V 1,n(γ) to be positive definite matrix , it is sufficient for γ ∈ (−1, 1) Since iff |γ| ≤ 1 V −1 1,n(γ) = M−1 n − γWn 19 Define strictly diagonal dominant matrix ∆(A)i = |aii| −(cid:88) i(cid:54)=j |aij| ≥ 0for all i Proof. Symmetric strictly diagonal dominant matrix with positive diagonal elements is positive definite from Varah (1975) . 
V −1 1,n(γ) = M−1 n − γWn satisfies this condition for all n iff |γ| ≤ 1 (cid:4) Useful bounds about V 1,n Define spectral norm (cid:107)A(cid:107)∗ = max i |λi(A)| Form the theory of diagonal dominant matrices we conclude following bounds. When γ > 0 V −1 1,n is M Matrix. We can exploit various results on M Matrix and extend them to the case γ < 0 as ∆ and (cid:107)(cid:107)∞ operator operates on absolute value of matrix entries. (1 − |γ|) min wi+ 1 i 1 2 3 (cid:18) ≤ (cid:107) I − γ M nW n (cid:19)−1 Mn(cid:107)∗ ≤ wi+ 1 (1 − |γ|) max i ≤ V 1,n(γ)ii ≤ 1 wi+ 1 (1 − |γ|)wi+ |V 1,n(γ)ij| ≤ |γ||V 1,n(γ)ii| 20 V 1,n(γ)ii ≥ 1 V −1 1,n(γ)ii (cid:18) Structure of V 1,n V 1,n = I − γ M nW n (cid:19)−1 Mn can be expanded by(cid:80)∞ m=0 γm (M nW n)m Mn since (cid:107)(MnWn)(cid:107)∗ < 1 where (cid:107)(cid:107)∗ represents spectral norm or maximum of absolute eigenvalue. Lemma 1.12.3. It can be proved by induction that (MnWn)r n i,j ≤ maxi 1 wi+ for all ≥ 1 and all n Proof. Note that (MnWn)i,m = wi,m wi+ Diag ( 1 Wi+ )ii where wi,m = wm,i is 0 or 1(cid:80) j∈N (i) wij = wi+. Let M n×n = (MnWn)r n i,m = n(cid:88) j=1 (MnWn)1 n i,m = wi,m wi+ ≤ maxi 1 wi+ (MnWn)1 i,j (MnWn)r−1 (MnWn)i,j maxi ≤ n(cid:88) j=0 1 wi+ wi,j wi+ 1 ≤ maxi 1 wi+ (cid:4) n − γWn)−1 = V 1,n are positive. The row sum can be written as Lemma 1.12.4. (cid:107)V −1 n (cid:107)∞ ≥ inf i Proof. Case γ > 0: Here all entries of matrix (M−1  = (1 − γ)  1 1 ··· 1 V −1 1,n n j,m ≤ n(cid:88) j=0 µi + inf i µ2 i σ2 wi+(1−γ)   w1+ w2+ ··· wi+ ··· wn+   1 1 ··· 1 ≤ max i wi+ 21 (cid:26) exp(σ2V 1,n i,j) − 1 (cid:27)  1 1 ··· 1  ≥ σ2V 1,n  1 1 ··· 1  ≥ max i σ2 wi+(1 − γ)  1 1 ··· 1 (cid:107)Vn(cid:107)∞ ≥ inf i µi + inf i µ2 i σ2 wi+(1 − γ)  (cid:4) Lemma 1.12.5. If 1 − γ − γ 2 1−γ 2 (1 − 2 1 k ) > 0 is strictly diagonally dominant matrix Proof. Since (M nW n) is non negative irreducible matrix by Perron Frobenius therom: for all k  = 1   1 1 ··· 1  1 1 ··· 1 (M nW n)k . using this result geometric expansion of V 1,n = (cid:80)∞ (cid:18) necessary and sufficient condition 1 − γ + min (cid:80)∞ m=2 γm i m=0 γm (M nW n)m Mn we can arrive at (cid:19) 2(M nW n)m i,i − 1 > 0 for strict diag- onal dominance of V 1,n Consider wort scenario where (M nW n)2k+1 equivalent to 1 − γ − γ 2 k ) > 0 which is cubic in gamma. 1−γ 2 (1 − 2 1 i,i = 0 then above condition is (cid:4) 22 Lemma 1.12.6. If V 1,n is strictly diagonally dominant matrix then matrix (Vn)i,j = f (V 1,n)i,j is strictly diagonally dominant matrix or equivalently if |V1,n i,i| ≥(cid:88) j(cid:54)=i |V1,n i,j| then |f (V1,n i,i)| ≥(cid:88) j(cid:54)=i |f (V1,n i,j)| where f (x) = exp(σ2 x) − 1 Proof. We need to show that if from the results 1.12 it is known that CW,γ > Vn i,i > 0 and Vn i,i > |γ||Vn i,j| for all i (cid:54)= j. Using expansion exp((x) − 1 = ∞(cid:88) i=0 xk K! We prove that for all n and each k by induction n,i,i = V k−1 V k n,i,i Vn,i,i ≥ V k−1 n,i,i |V1,n i,j| (cid:88) n,i,j |(cid:88) j(cid:54)=i |V k−1 j(cid:54)=i |V1,n i,j| ≥(cid:88) j(cid:54)=i ≥ max j(cid:54)=i |V k 1,n i,j| (cid:4) The above result can hold large class smooth function f : R → R with domain being variance matrix elements. Lemma 1.12.7. (cid:107)V −1 n (cid:107)∗ ≥ inf i wi+(1+|γ|) wi+(1+|γ|)µi+µ2 i σ2 23 Proof. min i (cid:107)Vn(cid:107)∗ ≤ (cid:107)Vn(cid:107)∞ n ) ≥ (cid:107)Vn(cid:107)∞ 1 λi(V −1 Lemma 1.12.8. 
If A, B, C, A − B be symmetric positive semidefinite matrix then λ(i)(A + C) ≥ min i min i λ(i)(B + C) Proof. Using results in Hiai & Lin (2017) take k = n to obtain λ(i)(AC−1) ≥ min i min i λ(i)(BC−1) Since all matrices commute with I ] − 1 w+i min λ(i)(AC−1 + I) ≥ min λ(i)(cid:0)(AC−1 + I)C(cid:1) ≥ min λ(i)(BC−1 + I) λ(i)(cid:0)(BC−1 + I)C(cid:1) i i i min i λ(i)(A + C) ≥ min i min i λ(i)(B + C) Lemma 1.12.9. (cid:107)V −1 n (cid:107)∗ ≤ 1 λ(i)(V1,n)] − 1 ≤ exp[ exp[min i Proof. Since Vn = exp((V1,n) − 1 = ∞(cid:88) i=0 V ◦k 1,n K! 1 1 i (1−γ) min 24 (cid:4) (cid:4) where ◦ represents hadamard power We can obtain trivial inequality λi(Vn) ≥ λi(V1,n) But we can obtain bounds dependent upon function form f We use results λ(i)(A ◦ B) ≥ min i min i λ(i)(A) min i λ(i)(B) and using induction and 1.12.8 we conclude that λ(i)[exp(V1,n) − 1] ≥ exp[min i min i λ(i)(V1,n)] − 1 Lemma 1.12.10. (cid:107)Vn(cid:107)∞ ≥ sup (µi + µ2 i σ2 i 1 wi+(1+|γ|) ) (cid:4) Proof. Vn(σ2, γ, βn) = Diag(µi) + Diag(µi)[R(cid:12)2(σ2, γ) − J n]Diag(µi) (cid:18) (cid:19) R(cid:12)2(σ2, γ) − J n = exp(σ2V 1,n i,j) − 1 σ2 V 1,n i,j = ˜σ2 i,j Using the expansion to obtain exp(˜σ2x) − 1 − σ2x = σ2 exp(˜σ2x) − 1 ≥ ˜σ2x max i,j (cid:18) x2/2! + x3/3! + ... + x2n/2n! + x2n+1/(2n + 1)! (cid:19) ≥ 0 25 for all x ∈ [−1, 1] Therefore (cid:107)Vn(cid:107)∞ ≥ sup (µi + µ2 i(cid:107)V1,n(cid:107)∞) ≥ sup (µi + i µ2 i σ2 wi+(1 + |γ|) i ) (cid:4) (cid:0) exp(σ2V 1,n i,i)−1(cid:1) (cid:18) (cid:19) σ2V 1,n i,i (cid:107)∞ ≤ (cid:107)Diag 1 wi+(1−|γ|) (cid:19) var−1(Ψi) Lemma 1.12.11. (cid:107)Vn(cid:107)∞ ≤ sup i µi+µ2 i (cid:107)Diag var−1(Yi) µi+µ2 i (cid:107)Vn(cid:107)∞ ≤ sup (cid:18) (cid:19) (cid:0) exp(σ2V 1,n i,i)−1(cid:1) Vn Diag var−1(Yi) V 1,n i,i 1 wi+(1−|γ|) (cid:18) i Proof. We use the previous result Corr(Yi, Yj) ≤ Corr(Ψi, Ψj) for all i, j hence (cid:18) var−1(Ψi) (cid:19) (cid:107)∞ (cid:4) Derivation of estimating equations 1 derivation of Un is straightforward. 2 V1,n Diag = = ∂µ ∂γ βn + σ2 Diag(V 1,n(γ))n×1/2) ∂ exp (−X (cid:124)  µ1 σ2/2  Diag( ∂γ µn . . . . . . µ1(V −1 1,nW nV −1 1,n)1,1 ∂V 1,n(γ) ∂γ ) =  =  σ2/2 µn(V −1 1,nW nV −1 1,n)n,n 26 σ2/2 Diag(V −1 1,nW nV −1 1,n) µ 3 Derivation of F(cid:48) 2,n is also similar Stationary Case Joint estimate of parameters may be difficult to compute. We solve for simplest case first where (Wn)i,j = f (|i − j|)I(|i − j| ≤ k) for some fixed k. Therefore yielding matrix V 1,n(γ) = (M−1 n − γWn)−1 stationary co variance matrix. 
It can be seen by Cramer’s rule (V 1,n)i,j = −1 det[(V 1,n)−(i,j)] det[(V 1,n)−1] Re parametrization µi(βn, σ2, γ) = exp (−X (cid:124) i βn + σ2V 1,n(γ)ii/2)  βn σ2 V0 2 ) = exp(−[X (cid:124) i , 1] βn ˜σ2 ) = exp(−[X (cid:124) i , 1] = exp(− ˜X (cid:124) i ˜β) = µi( ˜βn) since V 1,n(γ)ii = V0 1×1 is same for all i due to stationarity Vn(σ2, γ, βn) = Diag(µi) + Diag(µi) (cid:18) (cid:19) (cid:18) (cid:124) [R(cid:12)2( ˜σ2, γ) − J n]Diag(µi) (cid:19) R(cid:12)2( ˜σ2, γ) − J n = exp i,j ˜σ2 V 1,n i,j − 1 V0 27 Here V 1,n(γ)ii = V0 1×1 is same for all i due to stationarity GEE in stationary setup 1 Score equation for mean (cid:1) (cid:0)˜βn Un (pn+1)×1 = (cid:124)(cid:2)Vn(˜σ2, γ, ˜βn)(cid:3)−1 (cid:123)(cid:122) ˜Sn ∂µ ∂ ˜β 1 n (cid:124) (cid:18) (cid:19) (cid:125) Y − µ(˜βn) + q(λn)(|˜βn ,−n|) (cid:12) sgn(˜βn ,−n) 0  under λn ≥ 0 and ˜βn ,−n pn×1 denote sub vector such that ˜βn ,−n ˜βn ,n  = ˜βn q(λn) is derivative of SCAD penalty function wrt β with penalty parameter λn 2 Method of Moment (cid:19) (cid:18) ˜σ2 Vt(γ) V0(γ) exp (cid:20)(cid:80)n − 1 = E i=t+1 (cid:18) (cid:19)(cid:18) (cid:80)n Y i − µi( ˜βn) (cid:19) Y i+t − µi+t( ˜βn) (cid:21) i=t+1 µi( ˜βn)µi+t( ˜βn) 2.1 Score for scale parameter (cid:18) (cid:80)n i=1 ˆexp( ˜σ2) = (cid:18) (cid:20)(cid:80)n i=1 ˆ˜σ2 = log (cid:80)n Y i − ˆµi( ˜βn) i=1 ˆµi( ˜βn)2 (cid:80)n Y i − ˆµi( ˜βn) i=1 ˆµi( ˜βn)2 (cid:19)2 − ˆµi( ˜βn) (cid:19)2 − ˆµi( ˜βn) + 1 (cid:21) + 1 28 ˆ˜σ2 can be negative by using this method. 2.2 Score for correlation parameter (cid:20)(cid:80)n i=t+1 (cid:18) (cid:19)(cid:18) (cid:80)n Y i − ˆµi( ˜βn) (cid:19) Y i+t − ˆµi+t( ˜βn) i=t+1 ˆµi( ˜βn) ˆµi+t( ˜βn) (cid:21) /ˆ˜σ2 + 1 ˆVt(γ) ˆV0(γ) = log ˆγ can be outside (-1,1) Vt(γ) = V|i−j|=t(γ) due to stationarity time series of counts. 3 Cressie variogram estimate (cid:20)(cid:80)n i=t+1(Yi − EYi)(Yi − EYi+t) (cid:21) E n − t (cid:18) (cid:20)(cid:80)n i=t+1 Thus we obtain E exp( ˜σ2) − exp( ˜σ2 Vt(γ)) = V AR(Yi) + V AR(Yi+t) − 2COV (Yi, Yi+t) Y i ˜βn) µi( − Y i+t µi+t( ˜βn) − ˆµi( ˜βn) −1 2 − − ˆµi+t( ˜βn) −1 2 (cid:21) = (cid:19)2 2(n−t) We take t = 1 find ˆγ form ˆV1(γ) when V1(γ) is known function of γ. Cressie showed that when dimension of β is fixed given σ2 |ˆγ − γ| = Op( 1√ n ) Matrix Inverse approximation V 1,n is diagonal matrix (V 3,n)−1 = (V 1,n + σ2 V 2,n)−1 = (V 2,n)−1 − ( σ2 V 2,n)−1 (cid:18) (V 1,n)−1 + ( σ2 V 2,n)−1 ( σ2 V 2,n)−1 (cid:18) (cid:19)−1 (cid:19)−1 = ( σ2 V 2,n)−1 − ( σ2 V 2,n)−1 M−1 n + ( σ2 V 2,n)−1 − γW n ( σ2 V 2,n)−1 (cid:18) σ2 V 2,n)−1 − ( σ2 V 2,n)−1( σ2 V 2,n + M−1 n ) I − γ( σ2 V 2,n + M−1 n ) −1 2 W n σ2 V 2,n + M−1 n ) −1 2 (cid:19)−1 −1 2 29 ( σ2 V 2,n + M−1 n ) −1 2 ( σ2 V 2,n)−1 −1 2 Lemma 1.12.12. The spectral norm of −1 2 W n( σ2 V 2,n + M−1 ( σ2 V 2,n + M−1 n ) n ) is same as spectral norm ( σ2 V 2,n + M−1 n )−1Wn which is less than max row sum of ( σ2 V 2,n + M−1 Hence the spectral norm of ( σ2 V 2,n + M−1 n ) I − γ( σ2 V 2,n + M−1 n ) −1 2 W n( σ2 V 2,n + M−1 n ) can be approximated by −1 2 W n( σ2 V 2,n + M−1 n ) −1 2 n )−1Wn (cid:18) (cid:19) −1 2 ≤ 1 γm(cid:0)( σ2 V 2,n + M−1 n ) n(cid:88) m=0 −1 2 W n( σ2 V 2,n + M−1 n ) 2 (cid:1)m + O(γn) −1 Remark 2. Hence forward Y ∗ will be denoted as Y Bound on Frobenius norm of score Further E(cid:13)(cid:13)(cid:13)X(cid:124)(cid:2)V3,n(γ)(cid:3)−1(Y − Xβ) (cid:13)(cid:13)(cid:13)2 2 ≤ E(cid:13)(cid:13)X(cid:124) Dn(γ)−1(Z − Xβ)(cid:13)(cid:13)2 2 where is Dn is diagonal positive definite matrix.So Z have mutually independent random elements. 
Using Since (V 2,n)−1 (cid:18) (cid:19)−1 (V 1,n)−1 + (V 2,n)−1 (V 2,n)−1is Postive definite Matrix E(cid:13)(cid:13)(cid:13)X(cid:124)(cid:2)V3,n(γ)(cid:3)−1(Y − Xβ) (cid:13)(cid:13)(cid:13)2 2 = TR(X (cid:124) V −1 3,nX) 30 3,n)tr(X (cid:124) X) ≤ λmax(V −1 ≤ λmax(V −1 (cid:124) 2,n)tr(X ≤ E(cid:13)(cid:13)(cid:13)X(cid:124)(cid:2)Dn(γ)(cid:3)−1(Z − Xβ) X) (cid:13)(cid:13)(cid:13)2 2 where Dn = 1 min Since λmax(V −1 i 2,n) ≤ Ei exp (−X (cid:18) 1 min i Ei (cid:124) i β) − σ2 exp (−X 2 (cid:19) i 1 max wi+ (cid:124) i β) − σ2 1 2 max i wi+ Convergence of scale and correlation parameters n − σ2| = Op((cid:107)ˆβn,n − βn(cid:107)2) | ˆσ2 |(ˆγn − γ)| = Op((cid:107)ˆβn,n − βn(cid:107)2) Convergence of variance parameter In Yasui and Lele [2012] estimation of σ2 are done using both hierarchical and marginal methods of conditional least squares Hierarchical method ˆσ2 n = 1 n (Y ∗ − Xˆβn,n) (cid:124) (M−1 n − ˆγnW n)(Y ∗ − Xˆβn,n) − 1 n (Y ∗∗ − Y ∗) (cid:124) M−1 n (Y ∗∗ − Y ∗) Marginal method ˆσ2 n = 1 n (Y ∗−Xˆβn,n) (cid:124) (M−1 n − ˆγnW n)(Y ∗−Xˆβn,n)− 1 n (cid:18) TR ˆV 2,n−1(ˆβn,n, ˆσ2 n−1, ˆγn)M−1 n (cid:19) (1.1) For both of these methods, it can be shown that rates are same 31 Proof. ˆσ2 n − σ2 1 n (Y ∗ − Xˆβn,n) (cid:124) (cid:18) = − Xˆβn,n) − 1 n − 1 n TR (M−1 (Y ∗ − Xβn) (cid:124) n − γnW n)(Y ∗ − Xβn) (M−1 n−1, ˆγn)M−1 n ) n − ˆγnW n)(Y ∗ (cid:18) (cid:19) V 2,n(βn, σ2, γ)M−1 n ) (cid:1)(Y ∗ − Xˆβn,n) (cid:124)(cid:0) (ˆγn − γ)W n TR 1 n + ˆV 2,n−1(ˆβn,n, ˆσ2 = 1 n (Y ∗ − Xˆβn,n) (cid:19) 1 + n − βn) 1 n + (cid:18)(cid:0) TR 2 (Y ∗(cid:124) − Xβn)(M−1 (cid:124)X(cid:124) (M−1 n − γW n)X(ˆβn,n − βn) n − γW n)X(ˆβn,n − βn) + 1 n (ˆβn,n n−1, ˆγn) − V 2,n(βn, σ2, γ) (cid:1)M−1 n (cid:19) ˆV 2,n−1(ˆβn,n, ˆσ2 (cid:19) (cid:1)(Y ∗ − Xˆβn,n)(cid:107)2 (cid:1)(Y ∗ − Xˆβn,n)(cid:107)2 (cid:124)(cid:0) W n (cid:1)(Y ∗ − Xˆβn,n)) (cid:124)(cid:0) W n (cid:18) (cid:107)(Y ∗ − Xˆβn,n) E (cid:124)(cid:0) (ˆγn − γ)W n ≤ E(cid:107)(ˆγn − γ)(cid:107)2 E(cid:107)(Y ∗ − Xˆβn,n) = (cid:107)(ˆγn − γ)(cid:107)2TR( (Y ∗ − Xˆβn,n) ≤ n (cid:107)(ˆγn − γ)(cid:107)2 λmax(V 3,nW n) ≤ n (cid:107)(ˆγn − γ)(cid:107)2 max (V 3,n)iiλmax(W n) i (cid:107) (Y ∗(cid:124) − Xβn)(M−1 ≤ (cid:107) (Y ∗(cid:124) − Xβn)(M−1 ≤ maxi λi (M−1 (cid:18) n − γW n)X(ˆβn,n − βn)(cid:107)2 n − γW n)X(cid:107)2 (cid:107)(ˆβn,n − βn)(cid:107)2 n − γW n)V 3(M−1 n − γW n) (cid:19) (cid:112)TR(X(cid:124)X) (cid:107)(ˆβn,n − βn)(cid:107)2 32 (cid:18) (M−1 n − γW n)V 3(M−1 n − γW n) (cid:19) (cid:112)TR(XX(cid:124)) = (cid:107)(ˆβn,n − βn)(cid:107)2 (cid:114) 2 maxi λi ) O(cid:112)(n pn) ( = Op pn n = Op(pn) (M−1 n − γW n)X(ˆβn,n − βn) 2 maxiλi (X(cid:124) (M−1 n − γW n) X) (cid:124)X(cid:124) (ˆβn,n − βn) ≤ (cid:107)(ˆβn,n − βn)(cid:107)2 ≤ Op n ( ) pn n = Op (pn) By recursion when n is large ˆσ2 n−1 − σ2 is same order of ˆσ2 n − σ2 (cid:19) V 1,n(γ)ii) V 2,n(βn, σ2, γ) (cid:18) 1 n n Ei σ2 2 exp (−X iβ + = Diag ˆV 2,n−1(ˆβn,n, ˆσ2 − V 2,n(βn, σ2, γ) M−1 n−1, ˆγn)M−1 (cid:1) TR(cid:0) ≤ TR(cid:0) − V 2,n(βn, σ2, γ) (cid:1) (cid:107)λmax(cid:107)(cid:0)M−1 (cid:18) 1 n(cid:88) (cid:18) exp (−X iβ + ˆV 2,n−1(ˆβn,n, ˆσ2 n−1, ˆγn) σ2 2 Ei i=1 = n ≤ n Op |X i| (cid:107)ˆβn,n − βn(cid:107)2 (cid:1) (cid:19) V 1,n(γ)ii/2) (cid:19) |V 1,n(γ)ii − ˆV 1,n(ˆγn)ii| + | ˆσ2 n − σ2| sup V 1,n(γ)ii + σ2 sup i i (cid:19) sup |X i| (cid:107)ˆβn,n − βn(cid:107)2 sup i (cid:18) (cid:114) i ( = n Op = Op n pn n ) 33 Observing that order of last term dominates hence (cid:107) ˆσ2 n− σ2 (cid:107)2 = Op due to contribution from last term. 
(cid:113) n ) contrary to Op ( pn ( pn n ) (cid:4) the marginal method do not have the last term contributed through bias correction. Therfore the convergence rate for σ2 n follows classical rate of Op ( pn n ) Convergence of γn γ = (Y ∗ − Xβn)(cid:124)M nW nM nW n(Y ∗ − Xβn) + (Y ∗ − Y ∗∗)(cid:124)M nW nM nW n(Y ∗ − Y ∗∗) (Y ∗ − Xβn) (cid:124) M nW n(Y ∗ − Xβn) = f1(βn) f2(βn) + Cn ˆγn = = = = (Y ∗ − Xˆβn,n)(cid:124)M nW nM nW n(Y ∗ − Xˆβn,n) + (Y ∗ − Y ∗∗)(cid:124)M nW nM nW n(Y ∗ − Y ∗∗) (Y ∗ − Xˆβn,n) (cid:124) M nW n(Y ∗ − Xˆβn,n) f1(ˆβn,n) f2(ˆβn,n) + Cn f1(βn) + 2(ˆβn,n − βn) An(Y ∗ − Xβn) + (ˆβn,n − βn) (cid:124)X(cid:124) f2(βn) + 2(ˆβn,n − βn)(cid:124)X(cid:124)AnA n(Y ∗ − Xβn) + (ˆβn,n − βn)(cid:124)X(cid:124)AnA (cid:124) (cid:124)X(cid:124) AnX(ˆβn,n − βn) (cid:124) n X(ˆβn,n − βn) + Cn f1(βn) + g1(βn, ˆβn,n) f2(βn) + g2(βn, ˆβn,n) + Cn Therefore (ˆγn − γ) can be written as f1(βn) + g1(βn, ˆβn,n) f2(βn) + g2(βn, ˆβn,n) + Cn − f1(βn) f2(βn) + Cn It is proven that f2g1 + Cng1 − f1g2 (f2 + Cn)(f2 + g2 + Cn) g1 f2 + g2 + Cn − g2 f1 f2 + g2 + Cn f2 + Cn = = g1(βn, ˆβn,n) ≤ Op(pn) 34 g2(βn, ˆβn,n) ≥ Op(pn) f1(βn) = Op(n) f2(βn) = Op(n) Cn = Op(n) (cid:107)(ˆγn − γ)(cid:107)2 = Op(pn) Op(n) = Op( pn n ) By Slutsky’s theoreom Convergence of precision matrix Lemma 1.12.13. Let A and An be sequences of invertible matrices of same order belong to Ba- nach space over(cid:0)Rn×n ,(cid:107) (cid:107)2 (cid:1) such that (cid:107)An − A(cid:107)2 = Op(rn) then (cid:107)A−1 n − A−1(cid:107)2 = Op(rn) for large n Proof. (cid:107)A−1 n − A−1(cid:107)2 = (cid:107)A−1 n (An − A) A−1(cid:107)2 (cid:107)A(cid:107)2 = (cid:113)(cid:80) i λ2 i where λi is ith singular value of A (cid:107)A−1(cid:107)2 = (cid:113)(cid:80) i 1 λ2 i λi (cid:54)= 0 for any i due to non singularity of A (cid:4) Using above lemma on A = σ2 An = ˆσ2 n The Frobenius norm inequality (cid:107)A(cid:107)2 ≤ √ spectral norm of A n (cid:107)A(cid:107)∗ here A is n × n matrix and (cid:107)A(cid:107)∗ denotes (cid:107) ˆσ−2 n ˆV −1 1,n − σ−2V −1 n − σ−2(cid:107)2 (cid:107)(M n − γW n)(cid:107)2 n) (1 − γ) max i wi+ n 1,n(cid:107)2 ≤ ˆσ−2 ||(ˆγn − γ)||2 (cid:107)W n(cid:107)2 + (cid:107) ˆσ−2 √ √ ≤ Op( pn ) O( ) O( n √ n) + Op pn n ( (cid:114) ≤ Op ( pn) 35 √ Hence (cid:107) ˆV 1,n−V 1,n(cid:107)2 = Op ( pn) implying that (cid:107) ( ˆV 1,n)i,i −(V 1,n)i,i (cid:107)2 = Op ( all diagonal elements ˆV 1,n and V 1,n are of same order. (cid:113) pn n ) provided Now all diagonal elements of ˆV −1 1,n and V −1 1,n are of same order as well as off- diagonal elements due to assumption 8 on wi,js . 
This property of same order for inverse matrices can be verified using law (V −1)i,i = det(V {i,i}c ) det(V ) V {i,i}c represents sub matrix of A obtained by deleting ith row and ith colomn from A (cid:20) 1 Ei V 2,n = Diag exp (−X iβ + σ2V 1,n(γ)ii/2) (cid:21) By Mean Value Theorem (multivariate) (cid:107) ˆV 2,n − V 2,n(cid:107)2 ≤ n max i + n max i i + n max √ ≤ Op ( pn) exp (−X iβ∗ + σ2V 1,n(γ)ii/2) max exp (−X iβ∗ + σ2∗V 1,n(γ)ii/2) max exp (−X iβ∗ + σ2∗V 1,n(γ)ii/2) (cid:107)σ2∗ F (M n, W n, γ∗) ii/2(cid:107)2 (cid:107)(ˆγn − γ)(cid:107)2 (cid:107)X i(cid:107)2(cid:107)(ˆβn,n − βn)(cid:107)2 (V 1,n(γ)ii/2) | ˆσ2 n − σ2|2 i i 1 Ei 1 Ei 1 Ei (cid:107) ˆV 3,n − V 3,n(cid:107)2 ≤ (cid:107) ˆσ2 n − σ2 (cid:107)2 (cid:107)V 2,n(cid:107)2 + (cid:107) ˆσ2 n(cid:107)2 (cid:107) ˆV 2,n − V 2,n(cid:107)2 + (cid:107) ˆV 1,n − V 1,n(cid:107)2 Using lemma 1.12.13 finally it is established that (cid:107) ˆV 3,n −1 − V −1 3,n(cid:107)2 = Op ( √ pn) 36 derivative wrt γ ∂V 1,n(γ) ∂γ V −1 1,n = −V −1 1,n ∂V −1 1,n(γ) ∂γ 1,n W n V −1 n W nM−1 1,n = V −1 = M−1 n − γ M−1 n W 2 n − γ W 2 nM−1 n + γ2 W 3 n Since gradient is a linear operator ∂(cid:0)V 1,n (cid:1) ∂γ i,i(γ) = (cid:18) ∂V 1,n(γ) (cid:19) ∂γ i,i (V 3,n) = ( σ2 V 1,n + V 2,n) Joint Convexity of Quasilikelihood To estabilish convexity of quasilikelihood, we need to show the directional derivative for every direction t ∈ R2 < t,∇(γ ,σ2)Sn ≥ 0 It can be established that the derivative of score function is monotonic ∂Sn ∂γ = X(cid:124) ∂(cid:2)V −1 3,n(γ)(cid:3) = X(cid:124) ∂(cid:2) σ2 V 1,n + V 2,n ∂γ (Y − Xβ) (cid:3)−1 ∂γ (Y − Xβ) 37 Derivative computation 3,n ∂V −1 ∂γ = −V −1 ∂V 3,n V −1 3,n 3,n ∂γ = −σ2 V −1 3,n ∂V 1,n ∂γ = −σ2 V −1 3,n − σ2 V −1 3,n Diag 2 3,n ∂γ ∂V 1,n (cid:18)(cid:0)V 2,n (cid:18) σ2 V 1,n i,i (cid:18) 2 (cid:20) (cid:20) = −σ2 V −1 3,n Diag exp ≤ −σ2 V −1 3,n (cid:18) (cid:18) Diag exp ≤ −σ2 V −1 σ2 2(V −1 1,n)i,i 3,nL(W, γ)V −1 3,n Remark 3. The Quasi Likelihood(cid:82) Y 3,n 3,n − V −1 V −1 (cid:1) V −1 ∂(cid:0)V 1,n i,i ∂V 2,n ∂γ V −1 3,n (cid:1) i,i (cid:19) V −1 3,n ∂γ (cid:19) (cid:1) (cid:0)V −1 (cid:0)V −1 1,n 1,nW V −1 (cid:1) 1,nW V −1 1,n i,i (cid:19) i,i (cid:21) V −1 3,n + V −1 1,n 1,nW V −1 (cid:21) + V −1 1,nW V −1 1,n V −1 3,n β Sn(β)dβ depends on convexity of matrix L(W, γ). However function of γ since adjacency matrix W is not positive semi definite because by Perron Frobeius theorem it has at least one positive eigenvalue and some eigenvalues are negative since trace(Wn) = 0 Joint convexity Lemma 1.12.14. The unpenalized quasi likelihood function to be jointly convex function of Xβ σ2 γ. The sufficient condition that score function is monotonic function of Xβ, σ2, γ Proof. Results that will be used in this proof are: • If V is non negative definite matrix then the function f : V → V is monotone 38 • The composition of two convex function is convex i.e f, g are both monotone implies f (g) is convex • If f, g are both monotone and f(x) is semi definite for all x f (cid:23) 0 and g (cid:23) 0 the productf g is monotone. V −1 1,n(γ) = (M−1 (cid:19) n −γW n) is monotonic function of γ. So V 1,n is monotonic function of γ by result 1 exp (−X iβ+σ2V 1,n(γ)ii/2) σ2V 1,n is jointly convex function of and σ result 3. V 2,n = Diag (cid:18) 1 Ei is jointly convex function of Xβ ,γ and σ by result 2 and 3. V 1,n + V 2,n is jointly convex function of Xβ ,γ andσ. V −1 3,n = (V 1,n + σ2 V 2,n)−1 is jointly convex function of Xβ ,γ and σ . 
1 (cid:4) n X(cid:124)(cid:2)σ2V3,n(γ)(cid:3)−1(Y − Xβ) (cid:125) is jointly convex function of Xβ ,γ and σ (cid:123)(cid:122) (cid:124) Sn Central Limit Theorem 1 We have link function such that E(Y i) = exp (X (cid:124) i β + σ2V 1,n(γ)ii/2) = µi = h−1(X (cid:124) i β) < ∞ and ∂2h−1 ∂β2 < ∞ for all β in parameter space The ∂h−1 ∂β These conditions are satisfied since we assume (cid:107)X h−1(X (cid:124) i β) = exp(σ2V 1,n(γ)ii/2)exp(X (cid:124) i β) (cid:124) i β(cid:107),∞ , max i,j (Xi,j) is bounded ?? and 2 there exists smooth variance function V V AR(Yi|X) = V (h−1(X (cid:124) i β)) This condition is sat- isfied since here Vi(r) = r + r2(eσ2V 1,n(γ)ii − 1) (cid:124)(cid:2)Vn(˜σ2, γ, ˜βn)(cid:3)−1 is finite 3 Each element of vector of ∂µ ˜β ∂ i.e (cid:107)X(cid:124) Diag(µi)V AR−1(Y )(cid:107)∞ = O(1) this is proved by following lemma. Lemma 1.12.15. (cid:107)X(cid:124) Diag(µi)V AR−1(Y )(cid:107)∞ < ∞ 39 Proof. Using assumptions which gives bounds on elements of X and parameters. Write D as diagonal matrix with Di = V ar(Yi) Vn(σ2, γ, βn) = D(cid:48) CORR(Y ) D using Woodbury matrix identity (A+A(cid:48)BA)−1 = A−1−(B−1 +A)−1 and taking A = Diag(µi) with diagonal entries 0 > µi > ∞ it is sufficient to prove that max of row sum of inverse of correlation matrix (cid:107)CORR(Y )−1(cid:107)1,∞ < C bounded by some constant using lemma below (cid:107)CORR(Y )(cid:107)1,∞ < (cid:4) 1 Peligrad’s result on CLT Define (Y (cid:48) i ) = Yi − µi(β) . F m 1 = Sigma field(Y (cid:48) i : 1 ≤ i ≤ m) Let Tn =(cid:80)n i=1 Y (cid:48) i and ν2 n = E(Tn)2 (cid:18) (cid:19) ρ(n) = sup m≥0 X∈L2(F m 1 ) Y ∈L2(F∞ m+n) |CORR(X, Y )| α(n) = sup m≥0 A∈F m B∈F∞ 1 m+n |P (A ∩ B) − P (A)P (B)| Peligard showed that for non stationary sequences {Y (cid:48) i }∞ i=1 CLT is obtainable provided some suf- ficient conditions and and lim n→∞ ρ(n) < 1 and lim n→∞ α(n) = 0 Using the methods we can show CLT holds under these weak sufficient condition. 1 E(Yi) = 0 E(Y 2 2 max1≤i≤n E|Y (cid:48) i ) < ∞ i |2+δ < ∞ Lyapunov condition which suffices for Linderberg Conditions (cid:19) (cid:18) 3 ν2 n = O nh(n) h(n) is slow varying function i.e lim n→∞ h(an) n = 1 for all a > 0 1.1 In our setup sup i V −1 1,n i,i+n = 0 implies under Gaussian setup α(n) → 0 thus making Ψi and Ψi+n ae independent hence Yi and Yi+n are independent as as n tends to infinity. hence α(n) → 0 when under probability space corresponds to process Yi 40 1.2 Corr(Yi, Yj) ≤ Corr(Ψi, Ψj) = √ V1,n i,j V1,n i,i V1,n j,j ≤ |γ| for all i, j so lim n→∞ ρ(n) < 1 2 δ = 1 is satisfied here 3 sup (µi + µ2 i σ2 i wi+(1+|γ|) ) ≤ h(n) ≤ sup 1 i (cid:0) exp(σ2V 1,n i,i)−1(cid:1) σ2V 1,n i,i µi+µ2 i 1 wi+(1−|γ|) h(n) is independent of n thus regular varying function of n. Implication of CLT The unpenalized score equation Sn(βn, σ2, γ) converges to Normal distribution when parameters assume values of true model. Thus if can show that solution of score equation i.e Sn( ˆβn, σ2, γ) = 0 is consistent to βn,0 then ( ˆβn − βn,0) converges weakly to Normal distribution. Further when conditions are satisfied Liang and Zeger the solutions of Sn( ˆβ∗ Zeger (1986) follows CLT i.e ( ˆβn − βn,0) −→ N n, ˆσ2, ˆγ) = 0Liang & Exact structure Of V If the process is stationary the exact form of V −1 is known through paper Toeplitz and Circulant Matrices: A review . When not stationary can we know analytical structure? 
For example, let Vₙ(γ) = (Mₙ⁻¹ − γWₙ)⁻¹, where the number of 1's in each row of Wₙ is at most k. We can perform the update
Vₙ(γₜ₊₁) = Vₙ(γₜ) + (γₜ₊₁ − γₜ) ∂V₁,ₙ(γ)/∂γ |_{γₜ},
with
∂V₁,ₙ(γ)/∂γ = −V₁,ₙ (∂V₁,ₙ⁻¹(γ)/∂γ) V₁,ₙ = V₁,ₙ Wₙ V₁,ₙ,
and
V₁,ₙ⁻¹ Wₙ V₁,ₙ⁻¹ = Mₙ⁻¹WₙMₙ⁻¹ − γMₙ⁻¹Wₙ² − γWₙ²Mₙ⁻¹ + γ²Wₙ³.
Since Wₙ has zero diagonal, the i-th diagonal element of this expansion is −2γ(Wₙ²)ᵢᵢ + γ²(Wₙ³)ᵢᵢ = −2γwᵢ₊ + γ²(Wₙ³)ᵢᵢ, which is bounded in magnitude by 2|γ|k² + γ²k(k − 1)/2 because each row of Wₙ has at most k nonzero entries.

Chapter 2
High Dimensional Sparse LDA for Spatio-temporal Data

2.1 Introduction
The mathematical clarity and simplicity of Fisher's linear discriminant analysis (LDA) has led to extensive applications in multimedia information retrieval such as speech and pattern recognition, as explained by Yu & Yang (2001). Used both for classification (Cox & Savoy, 2003; Gutman et al., 2013) and for dimension reduction (Mourao-Miranda et al., 2005; Rathi & Palani, 2012) in the context of image analysis and feature extraction, LDA has often encountered difficulties in high-dimensional problems. Technological advancements in medical imaging, specifically magnetic resonance imaging (MRI), have led to high-resolution images with dimensions as large as 256 × 256 × 198 ≈ 12M volumetric pixels, or voxels. Since the number of subjects available for such studies cannot feasibly exceed these dimensions, the within-class covariance matrix becomes singular. An early proposal, the two-step PCA–LDA algorithm, adds a preliminary dimension-reduction step that first reduces the input dimension using principal component analysis (PCA) (Wang et al., 2010; Ghosh, 2001). Although this technique constructs orthogonal features of lower dimension, the components produced by the PCA and LDA steps may be incompatible with each other, resulting in the loss of important information. The review by Fan & Lv (2010, Section 4.2) provides insight into the issues classical LDA encounters in high-dimensional problems; they show that dimension reduction is important for reducing the misclassification rate. Some proposals considered remedies such as imposing independence assumptions on the covariance structure, significantly lowering the number of parameters to be estimated and thereby circumventing the singularity (Bickel et al., 2004; Tibshirani et al., 2002; Fan & Fan, 2008). In their genetics applications, however, these techniques did not involve selecting features to facilitate the classification. Fan & Fan (2008) proposed the Features Annealed Independence Rule (FAIR), an improvement over the nearest shrunken centroid rule (Tibshirani et al., 2003), which is essentially equivalent to two-sample t-tests, in that FAIR imposes a relative importance ordering on the features that yields a better selection. This procedure keeps in check the noise accumulation previously unaccounted for, so that faint features are not subverted. To incorporate and account for significant correlation among the genes, Fan et al. (2012) proposed the regularized optimal affine discriminant (ROAD) and a few variations in the assumptions made. For the same microarray data, J. Shao et al. (2011) propose a sparse LDA (SLDA) that uses thresholding to obtain a sparse estimate of the covariance matrix. Another important contribution to LDA classification in genetics was made by Witten & Tibshirani (2011).
These authors penalize the discriminant vectors in Fisher's discriminant problem, which amounts to constraining the within-class covariance while penalizing the between-class component. Instead of separately estimating or penalizing the covariance or the means, Cai & Liu (2011) propose a sparse estimate, obtained by ℓ1 minimization, of the product of the inverse covariance and the difference in class means — a key ingredient of the classification rule. In other work, Mai et al. (2012) propose a penalized least-squares estimator with a lasso penalty to solve the high-dimensional LDA problem.

In longitudinal neuroimaging studies the data have naturally structured dependencies across space and time that may significantly inform the classification rule. This article explores statistical methodology for the simultaneous selection of regions and classification of disease status using structural brain magnetic resonance images (MRIs). The objective is to assist researchers, through region-of-interest (ROI) based analysis, in identifying specific voxels and brain regions that are susceptible to Alzheimer's disease. Yingjie & Maiti (2019) explored penalized LDA imposing a parametric spatial covariance on the within-subject images. We extend this work to a spatio-temporal covariance for longitudinal studies and explore estimation under a variety of situations, such as space–time separability and non-separability in Section 2.4 and covariance tapering in Section 2.8.1. All of the methodology in this article is developed under the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001). The algorithm used to obtain the penalized estimates is the one-step sparse estimation procedure of Zou & Li (2008), and the consistency and selection properties are studied under this one-step setup. We present a series of simulation results under a variety of setups in Section 2.11. A dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI; Petersen et al., 2010) is used to demonstrate the method; the results point to areas of the hippocampus that may play a vital role in the classification of healthy brains versus patients diagnosed with AD.

2.2 Review of classical linear discriminant analysis (LDA)
Consider a pT-dimensional discriminant problem between two classes C1 and C2. Let Y_{k1}, ..., Y_{k n_k} be observations from class C_k, where k ∈ {1, 2} and Y_{kj} ∈ R^{pT}. We further assume the Y_{kj} ∼ N_{pT}(μ_k, Σ(θ)) are independent and identically distributed for j = 1, ..., n_k. The mean vectors μ_k differ between the classes, but the classes share a common covariance Σ(θ) with parameter θ. Estimation is based on samples of sizes n1 and n2 from the two classes, with total sample size n = n1 + n2. Membership of a new test sample X in class C1 is determined by the LDA classifier
δ̂(X) = ( X − ½(μ̂1 + μ̂2) )ᵀ Σ̂⁻¹ (μ̂1 − μ̂2) > 0.   (2.1)
Alternatively, the classifier can be expressed in terms of the mean difference ∆ = μ1 − μ2, which provides insight into the discriminant vector:
δ̂(X) = ( X − ½(μ̂1 + μ̂2) )ᵀ Σ̂⁻¹ ∆̂ > 0,   (2.2)
where μ̂_k and Σ̂ denote the estimated means and covariance.
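To make the classification rule (2.1)–(2.2) concrete, the following is a minimal sketch (not the implementation used in this chapter) of the plug-in LDA rule with sample means and a pooled covariance estimate; the function and variable names are illustrative only.

```python
import numpy as np

def pooled_estimates(Y1, Y2):
    """Sample means and pooled covariance from n1 x pT and n2 x pT class samples."""
    mu1, mu2 = Y1.mean(axis=0), Y2.mean(axis=0)
    R1, R2 = Y1 - mu1, Y2 - mu2
    Sigma = (R1.T @ R1 + R2.T @ R2) / (len(Y1) + len(Y2))   # pooled covariance
    return mu1, mu2, Sigma

def lda_classify(X, mu1, mu2, Sigma):
    """Assign each row of X to class 1 if delta_hat(X) > 0, as in (2.1), else class 2."""
    w = np.linalg.solve(Sigma, mu1 - mu2)          # Sigma^{-1}(mu1 - mu2)
    scores = (X - 0.5 * (mu1 + mu2)) @ w
    return np.where(scores > 0, 1, 2)
```

When pT exceeds n the pooled Σ̂ is singular and the linear solve above fails, which is precisely the difficulty addressed by the penalized approach of Section 2.6.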
The maximum likelihood estimates (MLEs), which yield the sample means and covariance, are obtained by maximizing the log-likelihood over μ_k and θ,
(μ̂_k, θ̂) = argmax_{μ_k, θ} L(θ, μ1, μ2; Y) = −(n pT/2) log(2π) − (n/2) log|Σ(θ)| − ½ Σ_{k=1}^{2} Σ_{i=1}^{n_k} (Y_{k,i} − μ_k)ᵀ Σ⁻¹(θ) (Y_{k,i} − μ_k).   (2.3)
All the estimates obtained in this setup are consistent when pT < n. The Bayes classification rule is studied within the parameter space
Γ = { (∆, Σ(θ)) : ∆ᵀΣ⁻¹(θ)∆ > C_{pT}, l1 ≤ λ_min(Σ(θ)) ≤ λ_max(Σ(θ)) ≤ l2 },   (2.4)
where l1 and l2 are positive constants. If the new observation X belongs to C1, the conditional misclassification rate of δ̂(X) for unknown parameters Θ = (μ1, μ2, θ) is
W1(δ̂, Θ) = P(δ̂(X) ≤ 0 | X ∈ C1) = 1 − Φ(ψ1),   (2.5)
where
ψ1 = ∆̂ᵀΣ̂⁻¹∆̂ / ( 2 √(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂) ).   (2.6)
Similarly, for X belonging to C2 we obtain W2(δ̂, Θ) = 1 − Φ(ψ2). The overall (worst-case) misclassification rate is therefore
W(δ̂) = max_{Θ∈Γ} ½ ( W1(δ̂, Θ) + W2(δ̂, Θ) ).   (2.7)
Under the assumption that X is normally distributed, with Φ(·) denoting the standard Gaussian distribution function, the worst-case conditional classification error rate of the Bayes rule is
W(δ) = max_{Θ∈Γ} [ 1 − Φ( √(∆ᵀΣ⁻¹∆)/2 ) ] = 1 − Φ( √C_{pT}/2 ).   (2.8)
In the neuroimaging setup with longitudinal structural MRIs, as discussed in Section 2.1, we encounter a high-dimensional scenario with pT ≫ n, which renders the covariance estimate Σ̂(θ) singular. In what follows, we impose a spatio-temporal dependence between the registered images of each subject.

2.3 Spatio-temporal LDA
For the purposes of classification, the voxels in the MRI images of a particular subject are modeled as a Gaussian random field (GRF). These random fields exhibit decaying spatial covariance within each image at a given time point; hence the longitudinal MRIs of a given subject can be regarded as possessing spatio-temporal dependence. To begin with, we consider a separable covariance structure in which both the spatial covariance within each image and the temporal covariance across images of the same subject are assumed to be second-order stationary, isotropic and identical across subjects.
Consider a p·T-dimensional discriminant problem for a spatio-temporal process between two classes C1 and C2. Let {Y_{ki}(s, t) : (s, t) ∈ D × T}, for i = 1, ..., n_k in each class k ∈ {1, 2} of sizes n1 and n2 with n1 + n2 = n, denote a spatio-temporal process on the domain D × T ⊂ R^d × [0, ∞), where d is the spatial dimension. Also assume n1/n → π for some 0 < π < 1 as n → ∞. We express the process as
Y_{ki}(s, t) = μ_k(s, t) + ε_{ki}(s, t),   (2.9)
where μ_k is the mean effect of class k and the error term {ε_{ki}(s, t) : (s, t) ∈ D × T} is a Gaussian process with mean 0 and covariance Σ(θ)_{(s,t),(s′,t′)} = γ[(s, t), (s′, t′); θ] such that
γ[(s, t), (s′, t′); θ] = cov[Y(s, t), Y(s′, t′)] = γ[ ‖s − s′‖₂, |t − t′| ; θ ].   (2.10)
We additionally take the spatio-temporal sites to be non-random, with the distance between any pair of sites bounded below by a fixed number, ‖(s, t) − (s′, t′)‖ ≥ η > 0. Hence all of the statistical methodology described below is investigated under the increasing-domain framework.
Alternatively, the vectorized form of the model can be expressed as
Y_{k,i,j,l} = μ_{k,j,l} + ε_{k,i,j,l},   (2.11)
where j = 1, 2, ..., p indexes the spatial site s_j and l = 1, 2, ..., T indexes the time point t_l. Specifically, (Y_{k,i})_{pT×1} = (Y_{k,i,1,1}, ..., Y_{k,i,p,T})′, (μ_k)_{pT×1} = (μ_{k,1,1}, ..., μ_{k,p,T})′ and ε_{k,i} = (ε_{k,i,1,1}, ..., ε_{k,i,p,T})′ has a multivariate Gaussian distribution. Thus Y_{k,i} ∼ N_{pT}(μ_k, Σ(θ)), where Σ(θ) is the pT × pT covariance matrix.

2.4 Spatio-temporal covariance structure

2.4.1 Spatio-temporal covariance
In this section we begin by reviewing the block structure that any such covariance matrix retains. Let Σ_{PT×PT} be the covariance matrix formed by unfolding the covariance tensor Σ in the following ordered way:
Σ_{P×T×P×T}(a, a′, c, c′) = Σ_{PT×PT}(Pa + c, Pa′ + c′).
Let q = (P − 1)a + c and q′ = (P − 1)a′ + c′; then Σ(q, q′) = Cov(Y(s_c, t_a), Y(s_{c′}, t_{a′})). Therefore a fully separable model gives
Σ(q, q′)_{PT×PT} = γ1(s_c − s_{c′}; θ) γ2(t_a − t_{a′}; θ),
whereas a non-separable model results in
Σ(q, q′) = γ( √((s_c − s_{c′})² + (t_a − t_{a′})²) ; θ ).
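As an illustration of the separable construction above, the sketch below builds a small pT × pT covariance as a Kronecker product of a temporal and a spatial covariance, together with a non-separable alternative based on the combined space–time distance. The exponential form, grid size and range values are illustrative only, and the code assumes the vectorized data stack all spatial sites of the first time point, then the second, and so on.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Spatial sites on a small grid and T time points (illustrative values).
xx, yy = np.meshgrid(np.arange(5), np.arange(5))
space_coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)  # p = 25 sites
time_coords = np.array([[0.0], [1.0], [2.0], [3.0]])                    # T = 4 visits

def exp_cov(coords, rng, sigma2=1.0):
    # Isotropic exponential covariance gamma(d) = sigma2 * exp(-d / rng)
    return sigma2 * np.exp(-cdist(coords, coords) / rng)

Sigma_S = exp_cov(space_coords, rng=3.0)   # p x p spatial covariance
Sigma_T = exp_cov(time_coords, rng=1.0)    # T x T temporal covariance

# Fully separable model: Sigma = Sigma_T kron Sigma_S (pT x pT).
Sigma_sep = np.kron(Sigma_T, Sigma_S)

# Non-separable alternative: one isotropic covariance of the combined
# space-time distance sqrt(||s - s'||^2 + (t - t')^2), same site ordering.
p, T = len(space_coords), len(time_coords)
st = np.hstack([np.tile(space_coords, (T, 1)), np.repeat(time_coords, p, axis=0)])
Sigma_nonsep = np.exp(-cdist(st, st) / 3.0)
```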
All covariance models used in this article are assumed to be second-order stationary. Cressie & Huang (1999) describe in detail classes of stationary spatio-temporal covariance functions. Assume γ(·,·) is continuous and its spectral distribution possesses a spectral density f(ω, τ) ≥ 0; then, by Bochner's theorem,
γ(s, t) = ∫_{R^d} ∫_R e^{iω′s + iτt} f(ω, τ) dω dτ.   (2.15)
Additionally, if γ(·,·) is integrable, we get
f(ω, τ) = (2π)^{−(d+1)} ∫∫ e^{−is′ω − itτ} γ(s; t) ds dt = (1/2π) ∫ e^{−itτ} h(ω; t) dt,   (2.16)
where h(ω; t) := (2π)^{−d} ∫ e^{−is′ω} γ(s; t) ds = ∫ e^{itτ} f(ω, τ) dτ. A valid positive definite covariance can be achieved by modeling h(ω; u) = ρ(ω; u) k(ω), where the following are satisfied:
(C1) for each ω ∈ R^d, ρ(ω; ·) is a continuous autocorrelation function with ∫ ρ(ω; u) du < ∞ and k(ω) > 0;
(C2) ∫ k(ω) dω < ∞.
Matérn (1960) defines a class of space–time interaction models through the spectral density
f(ω, τ) = η ( α²β² + β²‖ω‖² + α²τ² + ε‖ω‖²τ² )^{−(ν + (d+1)/2)}   (2.17)
for all η, β, α, ν ≥ 0 belonging to a compact subspace and ε ∈ [0, 1], where ε represents the extent of separability between the time and space domains. Here α⁻¹ and β⁻¹ represent the spatial and temporal decay of the correlation respectively, η is the scale parameter and ν is the smoothness parameter. Additionally, it can be shown that this Matérn class of covariances satisfies the regularity conditions.

2.4.2 Irregular lattice points in space and time
Cressie & Lahiri (1996) give general conditions for uniform convergence of REML estimates on lattices as well as at irregularly spaced points. These conditions are satisfied for a large class of covariance structures, including the Matérn class. We assume, however, that the brain is embedded in an irregular spatial lattice and that the time points occur at irregular lags. Define the indexing set of time points {0, t1, ..., tn} with n → ∞. For each time point t ∈ T_n = {0, t1, ..., tn} we have a spatial domain S_t. Similarly we define the subset of odd indices T_{1n} = {0, t1, ..., t_{2k+1} : 0 ≤ k ≤ ⌊n/2⌋}. Here d = 3, and p_{i,t}, for i ∈ {1, 2, ..., d}, denotes the number of points in each direction, so the cardinality |S_t| = p_{1,t} p_{2,t} p_{3,t} = P_t. We take the index set containing the origin and its neighbors at unit distance to be L = {(l1, ..., ld) : Σ_i |l_i| ≤ 1}, and assume the spacing of the spatial lattice is fixed, denoted H = (h1, h2, h3), independent of n. Let Z denote the set of all integers (positive, negative and zero) and Z2 = {2i : i ∈ Z} the set of even integers. Let the generator of the indexing set be Z_{n,t} = {(i1, ..., i_j, ..., i_d) : 0 ≤ i_j ≤ p_i, 1 ≤ j ≤ d}, and let Z_{1n,t} = {(k1, ..., k_j, ..., k_d) : k_j = 2x_j + 1, 0 ≤ x_j ≤ 2⌊(p_i − 1)/2⌋ − 1, 1 ≤ j ≤ d} be the odd-integer subset of Z_{n,t}. The regular spatial lattice is therefore generated by S_t = L ∘ Z_{n,t} ∘ H and S_{1t} = L ∘ Z_{1n,t} ∘ H.

2.4.3 Well-separateness in the space and time domains
Although stationarity, isotropy and monotone decay as a function of distance are satisfied by Matérn covariance functions, these properties allow us to validate our results even when the spatial sites do not form a lattice. Henceforth we assume the spatial points lie in an irregular domain S_t for all t ∈ {0, t1, ..., t_{T_n}}, with only the restriction a ≤ inf_{s′∈S_t} ‖s′ − s‖₂ ≤ A for some constants a, A > 0 independent of time. Similarly, for the irregular time domain we require b ≤ inf_{t′∈T_n} ‖t′ − t‖ ≤ B for some constants b, B > 0. Therefore we have an increasing-domain setup.
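The space–time Matérn class above is specified through its spectral density; its purely spatial building block is the familiar Matérn correlation, which is also the parametrization used in the simulations of Section 2.11. A minimal sketch of evaluating that correlation, with illustrative parameter values, is given below.

```python
import numpy as np
from scipy.special import gamma as gamma_fn, kv

def matern_corr(d, rho, nu):
    """Matern correlation (2^{1-nu}/Gamma(nu)) (sqrt(2 nu) d/rho)^nu K_nu(sqrt(2 nu) d/rho)."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    out = np.ones_like(d)                     # correlation equals 1 at distance 0
    nz = d > 0
    x = np.sqrt(2.0 * nu) * d[nz] / rho
    out[nz] = (2.0 ** (1.0 - nu) / gamma_fn(nu)) * x ** nu * kv(nu, x)
    return out

# Example with the simulation values nu = 2, rho = 5 (see Section 2.11).
print(matern_corr([0.0, 1.0, 5.0, 10.0], rho=5.0, nu=2.0))
```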
2.5 Regularity Conditions
Denote Σ_{t=1}^T P_t = P̃_T and Σ_{t=1}^T s_t = s̃_T. Σ(θ) is assumed to be second-order stationary, isotropic and twice differentiable over all dimensions of space and time, for θ ∈ Ξ and all (s, t) ∈ D × T, where Ξ is the parameter space of θ. Define the projection covariance matrix
Π̃(θ) = Σ̃⁻¹(θ) − Σ̃⁻¹(θ) X̃1 ( X̃1ᵀ Σ̃⁻¹(θ) X̃1 )⁻ X̃1ᵀ Σ̃⁻¹(θ).
In general, consider any covariance matrix Σ(θ), θ ∈ Ξ, constructed from a covariance function γ(θ) with true parameter θ0. Denote the derivative of Σ(θ) with respect to θ_m by Σ_m(θ) = ∂Σ(θ)/∂θ_m and the second derivative by Σ_{mm′}(θ) = ∂²Σ(θ)/∂θ_m∂θ_{m′}, where θ = (θ1, ..., θ_m, ..., θ_r)′. We require the following assumptions:
(A1) n1/n → π.
(A2) Σ_{t=1}^T s_t = o(n).
(A3) Let V denote the r × r matrix with elements v_{m,m′} = trace( Π̃(θ) ∂Σ̃(θ)/∂θ_m Π̃(θ) ∂Σ̃(θ)/∂θ_{m′} ); then lim_{p∧T→∞} v_{m,m′}/(√v_{mm} √v_{m′m′}) exists and V is non-singular.
(A4) For any compact subset K ⊂ Θ, any θ ∈ K, finite nonzero constants ζ_{1,K}, ζ_{2,K}, and all m, m′ ∈ {1, 2, ..., r}:
(A4).1 lim_{p∧T→∞} λ_max(Σ(θ)) = O(1);
(A4).2 lim_{p∧T→∞} λ_max(Σ_m(θ0)) < ∞, where Σ_m(θ0) = ∂Σ(θ)/∂θ_m |_{θ=θ0};
(A4).3 lim_{p∧T→∞} λ_max(Σ_{mm′}(θ*)) < ∞, where Σ_{mm′}(θ*) = ∂²Σ(θ)/∂θ_m∂θ_{m′} |_{θ=θ*};
(A4).4 ‖Σ_m(θ*)‖_F² = O( (Σ_{t=1}^T P_t)^{1/2+δ} ) for some δ > 0;
(A4).5 lim_{p∧T→∞} λ_min(Σ(θ)) > ζ_{2,K} > 0;
(A4).6 for some subsequence u_n and δ > 0 with lim sup_n u_n/n ≥ 1 − δ, lim inf_{n→∞} λ_{r_n}(Σ_m(θ)) > ζ_{2,K} > 0 for all m, where λ1 ≥ λ2 ≥ ... ≥ λ_{P̃_T}.

2.5.1 Asymptotically optimal misclassification rate
Below, under the usual setup for classical LDA, we show the behavior of the misclassification error rate. Lemma 2.5.2 first provides the asymptotic matrix property required.

Theorem 2.5.1. Let θ̂_n be the restricted maximum likelihood (REML) estimate of θ0. Then ‖θ̂_n − θ0‖₂ = O_p( 1/√(Σ_{t=1}^T P_t n) ).
Proof. Assuming conditions (A1)–(A4) hold, Theorem 3.2 of Cressie & Lahiri (1996) shows that V^{−1/2}(θ̂_{REML} − θ0) ⇒ N(0, I). Since V_{i,j} = trace( Π̃(θ) ∂Σ̃(θ)/∂θ_m Π̃(θ) ∂Σ̃(θ)/∂θ_{m′} ) ≥ O(Σ_{t=1}^T P_t n) inf_i λ_i( Π̃(θ) ∂Σ̃(θ)/∂θ_m Π̃(θ) ∂Σ̃(θ)/∂θ_{m′} ), we infer that ‖θ̂_n − θ0‖₂ = O_p(1/√(Σ_{t=1}^T P_t n)), even in an irregular space–time domain. Later we show that the above assumptions hold for the Matérn covariance class. □

Lemma 2.5.2.
∆̂ᵀΣ⁻¹∆̂ / ( 2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂) ) − √(∆ᵀΣ⁻¹∆)/2 = O_P(PT/n) √(∆ᵀΣ⁻¹∆) + O_P(PT/n).
Proof. From assumptions (A4) we get Σ̂ = Σ + εE, where ε = O_p(√(1/(PT n))) and ‖E‖₂ < ∞, so the geometric series expansion is valid for large n. Thus Σ̂⁻¹ = Σ⁻¹ + O_p(ε)Σ⁻¹EΣ⁻¹ + o_p(ε) for large n. The result follows using ‖Σ⁻¹‖₂ < ∞ and ∆̂ ∼ N_{PT}(∆, (1/n1 + 1/n2)Σ). □

Theorem 2.5.3. Under assumptions (A1)–(A4), with Σ_{t=1}^T P_t = P̃_T = o(n) and C_{P̃_T} → C_∞, we have W(δ̂_{MLE}) → 1 − Φ(√C_∞/2).
Proof. By a Taylor series expansion,
Φ( ∆̂ᵀΣ⁻¹∆̂ / (2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) ) = Φ( √(∆ᵀΣ⁻¹∆)/2 ) + ( ∆̂ᵀΣ⁻¹∆̂/(2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) − √(∆ᵀΣ⁻¹∆)/2 ) φ( √(∆ᵀΣ⁻¹∆)/2 )
  + ½ ( ∆̂ᵀΣ⁻¹∆̂/(2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) − √(∆ᵀΣ⁻¹∆)/2 )² φ′( √(∆*ᵀΣ*⁻¹∆*)/2 )
for some intermediate point √(∆*ᵀΣ*⁻¹∆*)/2 between ∆̂ᵀΣ⁻¹∆̂/(2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) and √(∆ᵀΣ⁻¹∆)/2. Now, using Lemma 2.5.2 and the boundedness of φ and φ′,
we obtain
Φ( ∆̂ᵀΣ⁻¹∆̂ / (2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) ) = Φ( lim_n √(∆ᵀΣ⁻¹∆)/2 ) + o_p(1). □

Lemma 2.5.4.
∆̂ᵀΣ⁻¹∆̂ / ( 2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂) ) − √(∆ᵀΣ⁻¹∆)/2 = O_P(pT/n) √(∆ᵀΣ⁻¹∆) + O_P(pT/n).
Proof. As in Lemma 2.5.2, Σ̂ = Σ + εE, where ε = O_p(√(1/(pT n))) and ‖E‖₂ < ∞, so the geometric series expansion is valid for large n; thus Σ̂⁻¹ = Σ⁻¹ + O_p(ε)Σ⁻¹EΣ⁻¹ + o_p(ε) for large n. The result follows using ‖Σ⁻¹‖₂ < ∞ and ∆̂ ∼ N_{pT}(∆, (1/n1 + 1/n2)Σ). □

Theorem 2.5.5. Under assumption (A3), if pT = o(n) and C_{pT} → C_∞ as n → ∞, then W(δ̂_{MLE}) → 1 − Φ(√C_∞/2).
Proof. By the same Taylor expansion as in Theorem 2.5.3, together with Lemma 2.5.4 and the boundedness of φ and φ′,
Φ( ∆̂ᵀΣ⁻¹∆̂ / (2√(∆̂ᵀΣ̂⁻¹ΣΣ̂⁻¹∆̂)) ) = Φ( lim_{n→∞} √(∆ᵀΣ⁻¹∆)/2 ) + o_p(1),
and the result follows. □

With the application to neuroimaging in mind, we next investigate the behavior of the misclassification error rate when pT ≫ n. If we assume the covariance matrix Σ is known, with the unknown means of the LDA classifier estimated by MLE, the following theorem provides the motivation for modifying the current method.

Theorem 2.5.6. Under a known true covariance Σ, define the LDA classifier
δ̂_μ(X) = (X − μ̂)′ Σ⁻¹ (μ̂1 − μ̂2),   (2.18)
where μ̂1, μ̂2 are the MLEs and μ̂ = (μ̂1 + μ̂2)/2. Assume the high-dimensional setup pT/n → ∞ and C_{pT} → C_∞ with 0 < C_∞ < ∞. Then:
(1) for n1 ≠ n2 but n_k > n/4, k ∈ {1, 2}:
 (i) if √C_{pT}/(pT/n) → c for some c > 0, then W(δ̂_μ) → 0 but W(δ̂_μ)/(1 − Φ(√C_{pT}/2)) → ∞;
 (ii) if √C_{pT}/(pT/n) → 0, then W(δ̂_μ) → 1/2;
(2) for n1 = n2:
 (i) if √C_{pT}/(pT/n) → ∞, then W(δ̂_μ) → 0 but W(δ̂_μ)/(1 − Φ(√C_{pT}/2)) → ∞;
 (ii) if √C_{pT}/(pT/n) → c for some c > 0, then W(δ̂_μ) → 1 − Φ(c/4), a constant in (0, 1/2);
 (iii) if √C_{pT}/(pT/n) → 0, then W(δ̂_μ) → 1/2.
Proof. The key step is
(μ1 − μ̂)ᵀ Σ̂⁻¹ ∆̂ = ∆ᵀΣ⁻¹∆ (1 + O_p(1/√(npT))) + (p/(n1 n2))(n1 − n2)(1 + O_p(1/√(npT))),
which follows from the decomposition
(μ1 − μ̂)ᵀ Σ̂⁻¹ ∆̂ = ½ [ ∆ᵀΣ̂⁻¹∆̂ + (μ̂1 − μ1)ᵀΣ̂⁻¹(μ̂1 − μ1) − (μ̂2 − μ2)ᵀΣ̂⁻¹(μ̂2 − μ2) − 2∆ᵀΣ̂⁻¹(μ̂2 − μ2) ]
together with Σ̂⁻¹ = Σ⁻¹(1 + O_p(1/√(npT))) from the previous results; substituting these expansions into Φ(ψ1) gives the stated limits. □

In Section 2.1 we noted that a variety of approaches have been proposed to overcome this singularity. Fan & Lv (2010) further emphasized how sparsity assumptions and sparse estimation can lead to optimal misclassification rates. Similar to the spatial approach adopted by Yingjie & Maiti (2019), we extend these methods to the spatio-temporal setup and incorporate non-separability in the theoretical results that follow.

2.6 Penalized Linear Discriminant Analysis (pLDA)
In most imaging studies with relatively high-resolution information, the number of spatial points (pixels and/or voxels) is almost exponentially larger than the number of subjects available for classification. Thus pT < n is a rather unrealistic scenario.
The LDA classifier under investigation is therefore unsuitable when data acquisition results in pT ≫ n, since Σ̂ is singular. As for the misclassification rate, we established above that when pT/n → ∞ the misclassification rate is equivalent to random guessing even when the true covariance is known, owing to the accumulation of variance in the estimates μ̂_k, k ∈ {1, 2}. There is therefore a need for methods that accommodate the much more realistic scenario pT > n. Although under our setup we can reduce the number of estimated parameters to s + 2 ≪ n, for which an MLE solution exists, those estimates are unstable; we therefore propose a penalized LDA.

To describe the method, we extend the notation of Sections 2.3 and 2.4 as follows. Let ∆ = μ1 − μ2 = (∆1, ..., ∆_{pT})′ be the pT × 1 vector of differences in mean effect between classes C1 and C2. Define the signal set S = {ν : ∆_ν ≠ 0}, the set of true nonzero differences between the classes, and let s = |S| denote its cardinality; we assume s < n ≪ pT. To formalize the vector and matrix representation of the model and incorporate the penalty term, for Y_{k,i} ∼ N_{pT}(μ_k, Σ(θ)) define Z_i = Y_{1,i} − Ȳ for i = 1, ..., n1 and Z_i = Y_{2,i} − Ȳ for i = n1 + 1, ..., n − 1, where Ȳ = Σ_{k=1,2} Σ_{i=1}^{n_k} Y_{k,i} / n and n = n1 + n2. Let τ1 = n1/n and τ2 = n2/n. Then
Z_i ∼ N_{pT}( −τ2∆, ((n−1)/n) Σ(θ) ) for i = 1, ..., n1,
Z_i ∼ N_{pT}( τ1∆, ((n−1)/n) Σ(θ) ) for i = n1 + 1, ..., n − 1,
and Cov(Z_i, Z_j) = −(1/n) Σ(θ) for i ≠ j. Equivalently,
(Z1, Z2, ..., Z_{n−1})ᵀ ∼ N( X̃β, Σ̃(θ) ), with X̃ = X ⊗ I_{pT}, β = ∆, Σ̃(θ) = (I_{n−1} − (1/n) J_{n−1}) ⊗ Σ(θ),   (2.19)
where I_b denotes the b × b identity matrix, J_b the b × b matrix of ones and 1_b the b × 1 vector of ones. We also have Σ̃⁻¹(θ) = (I_{n−1} + J_{n−1}) ⊗ Σ⁻¹(θ) and X = ( −τ2 1_{n1} ; τ1 1_{n2−1} ), so that X̃ = X ⊗ I_{pT}. The joint log-likelihood of Z is
L(β, θ; Z) = −(npT/2) log(2π) − ½ (Z − X̃β)ᵀ Σ̃⁻¹ (Z − X̃β) − ½ log|det(Σ̃)|.   (2.20)
We seek a solution for a sparse true β0 = (β′_{1,0}, β′_{2,0})′ = (β′_{1,0}, 0)′, where β_{1,0} is an s-dimensional vector, β_{2,0} is a (pT − s)-dimensional vector and β0 is pT × 1. In this high-dimensional setup we estimate θ and β by maximizing the penalized log-likelihood
Q(β, θ; Z) = L(β, θ; Z) − n Σ_{t=1}^T Σ_{u=1}^p p_{λ_{t,n}}(|β_u|),   (2.21)
where p_{λ_{t,n}} denotes the penalty function with tuning parameter λ_{t,n} depending on n. To obtain well-behaved sparse estimators we consider the smoothly clipped absolute deviation (SCAD; Fan & Li, 2001) penalty,
p_{λ_{t,n}}(β) = λ_{t,n}|β| if |β| ≤ λ_{t,n};
p_{λ_{t,n}}(β) = −( |β|² − 2αλ_{t,n}|β| + λ_{t,n}² ) / (2(α − 1)) if λ_{t,n} < |β| ≤ αλ_{t,n};
p_{λ_{t,n}}(β) = (α + 1)λ_{t,n}²/2 if |β| > αλ_{t,n}.

2.6.1 Across-sample independence
Notice that the samples Z_i are not uncorrelated, owing to the covariance matrix (I_{n−1} − (1/n)J_{n−1}) ⊗ Σ(θ), but we can apply a linear transformation Z′_n = (H_n ⊗ I_{pT}) Z_n that removes the correlation across the samples in Z and hence, by Gaussianity, yields independence.
Take H_n = I_{n−1} + (1/(√n + 1)) J_{n−1}, so that H_n ( I_{n−1} − (1/n) J_{n−1} ) H_nᵀ = I_{n−1}, and the resulting design matrix is
X′ = H_n X = ( (−τ2 − τ1/(√n + 1)) 1_{n1} ; (τ1 − τ1/(√n + 1)) 1_{n2−1} ).
Note that X̃ᵀΣ̃⁻¹X̃ = X̃′ᵀΣ̃′⁻¹X̃′. Henceforth we refer to the transformed likelihood as our original equation and, with a slight abuse of notation, use the original symbols for the transformed variables. The joint log-likelihood of Z is then
L(β, θ; Z) = −(npT/2) log(2π) − ½ (Z − X̃β)ᵀ Σ̃⁻¹ (Z − X̃β) − ½ log|det(Σ̃)|.   (2.22)
Since we seek a sparse true β0 = (β′_{1,0}, 0)′, where β_{1,0} is s-dimensional and β0 is pT × 1, in this high-dimensional setup we estimate θ and β by maximizing the penalized log-likelihood
Q(β, θ; Z) = L(β, θ; Z) − n Σ_{t=1}^T Σ_{u=1}^p p_{λ_t}(|β_u|),   (2.23)
where p_λ denotes the penalty with tuning parameter λ_n depending on n. As before we use the SCAD penalty (Fan & Li, 2001),
p_{λ_n}(β) = λ_n|β| if |β| ≤ λ_n;
p_{λ_n}(β) = −( |β|² − 2αλ_n|β| + λ_n² ) / (2(α − 1)) if λ_n < |β| ≤ αλ_n;
p_{λ_n}(β) = (α + 1)λ_n²/2 if |β| > αλ_n.

2.6.2 REML estimation
Without loss of generality we can expand β0_{PT×1} = (β0_{1,P1×1}, β0_{2,P2×1}, ..., β0_{t,Pt×1}, ..., β0_{T,PT×1})ᵀ, and for each t we can further expand β0_{t,Pt×1} = (β0_{1,t,st×1}, 0_{(Pt−st)×1})ᵀ. Because of this zero pattern,
(X ⊗ I_{PT}) β0 = (X ⊗ Q) β0, where Q_{PT×PT} = ⊕_{t=1}^T diag( I_{st}, 0_{Pt−st} )
is a block-diagonal selector. In the context of this result let us redefine X̃ = X ⊗ Q; clearly rank(X̃) = nPT − Σ_{t=1}^T s_t. Suppose we obtain a matrix B_n with orthonormal columns such that B_n B_nᵀ = I − X̃ (X̃ᵀX̃)⁻ X̃ᵀ. The REML estimate is defined by
θ̂_{n,REML} = argmax_θ { −½ log det(Σ̃(θ)) − ½ log det(X̃ᵀ Σ̃(θ)⁻¹ X̃) − ½ Zᵀ Π̃(θ) Z }.
Define the oracle estimator β̂_t^{orc}, obtained when the set of zeros in β_{t,0} is already known: under that setup,
β̂_{1,t}^{orc} | θ̂ = ( X̃_{1,t}ᵀ Σ̃⁻¹(θ̂) X̃_{1,t} )⁻ X̃_{1,t}ᵀ Σ̃⁻¹(θ̂) Z_n,
where X̃_{1,t} collects the columns of X̃ corresponding to the nonzero part (generated by I_{st}). This leads to the estimating equation
trace( Π̃(θ̂_{n,REML}) ∂Σ̃(θ̂_{n,REML})/∂θ ) + Zᵀ (∂Π̃(θ̂_{n,REML})/∂θ) Z = 0,   (2.24)
where Π̃(θ) = Σ̃⁻¹(θ) − Σ̃⁻¹(θ) X̃1 (X̃1ᵀ Σ̃⁻¹(θ) X̃1)⁻ X̃1ᵀ Σ̃⁻¹(θ). The attractive features of REML are that the estimating equation (2.24) is unbiased and that the covariance estimate θ̂_{REML} and the mean estimate β̂ are asymptotically independent.

Theorem 2.6.1 (Oracle REML; Cressie & Lahiri, 1996). ‖θ̂_{n,REML} − θ0‖₂ = O_p( 1/√(Σ_{t=1}^T P_t n) ).
Proof. From the result Pr(β̂_t|θ̂ = β̂_t^{orc}|θ̂) → 1 and the assumption that {j : β̂_{t,j}^{orc} = 0} = {j : β_{t,j} = 0}, we know, with probability tending to one, the indices at which β0 is zero, and hence the linear subspace of the design matrix generated by X̃_{1,t} that is needed to construct the projected covariance matrix Π̃. So, with probability converging to 1, the result follows as stated in Cressie & Lahiri (1996). □

2.6.3 Validation of the REML assumptions
From Cressie & Lahiri (1996) we observe that assumptions (A1)–(A4) are sufficient to guarantee convergence of the covariance parameter estimate θ̂. In this section we prove that these assumptions hold for the Matérn covariance class over a suitably chosen parameter space.
Lemma 2.6.2. For every isotropic, stationary covariance function γ(h, t), the matrix Σ with entries Σ_{i,j} = γ(s_i − s_j, t_i − t_j), for fixed spatial points s_i, s_j ∈ S (P of them) and time points t_i, t_j ∈ T (T of them), is positive definite.
Proof. For any vector a, aᵀΣa = Var(aᵀX) ≥ 0, since cov(X_i, X_j) = γ(s_i − s_j, t_i − t_j). □

The conditions listed below are sufficient for assumption (A4), which guarantees convergence of the REML estimators even in an irregular time domain. Since we assume our covariance belongs to the Matérn class, which is isotropic and second-order stationary, we can establish the following lemma.

Lemma 2.6.3. Conditions (A4).1–(A4).3, (A4).5 and (A4).6 are satisfied provided the following hold for any isotropic, second-order stationary process, for any θ in a compact subset K and finite nonzero constants ζ_{1,K}, ζ_{2,K}, where γ(·) stands for γ0, γ_i and γ_{i,j}, denoting the covariance and its first and second derivatives respectively:
• lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′∈S_t, t′≤T_n} |γ(·)(s − s′, t − t′; θ)| < ζ_{1,K};
• lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′≠s∈S_t, t′≠t≤T_n} |γ0(s − s′, t − t′; θ)| < ζ_{2,K} γ0(0, 0; θ);
• for all i, lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′∈S_t, t′≤T_n} γ_i²(s − s′, t − t′; θ) > ζ_{2,K};
• for all i, with N(s) = {s′ : s + s′ ∈ S_t} and N(t) = {t′ : t + t′ ∈ T_n},
lim_n Σ_{s∈S_t, t≤T_n} |γ_i(s, t; θ)| Σ_{s′∈N(s), t′∈N(t)} |γ_i(s − s′, t − t′; θ)| < ζ_{2,K} lim_n Σ_{s′≠s∈S_{1t}, t′≠t∈T_{1n}} γ_i²(s′, t′; θ) for each fixed s ∈ S_{1,t}, t ≤ T_{1,n}.

Proof. Since a closed form of γ(·) is not available for all values of the separability parameter ε, we use the fact that the eigenvalues of Σ and ∂Σ/∂θ_i are continuous in ε, owing to the continuity of f(ε) and ∂f/∂θ_i as functions of ε; note also that the spectral density f(ε) is continuously differentiable in ε. Our strategy is therefore to prove the results for ε = 0 and ε = 1, for which closed forms are available:
γ0(‖s‖, t) = η π^{d/2} Γ(ν + d/2) (α‖s‖)^ν K_ν(α‖s‖) / (2^{ν−1} α^{2ν}) · π^{1/2} Γ(ν + 1/2) (β|t|)^ν K_ν(β|t|) / (2^{ν−1} β^{2ν}),  ε = 1;
γ0(‖s‖, t) = η π^{(d+1)/2} Γ(ν + (d+1)/2) ( √(α²‖s‖² + β²t²) )^ν K_ν( √(α²‖s‖² + β²t²) ) / (2^{ν−1} (α² + β²)^ν),  ε = 0.
Let k ≥ 0 be an integer and define E_k = {s′ : k < ‖s′ − s‖ ≤ k + 1} and N_k = |E_k|, the number of points in E_k. To each point s′ ∈ E_k we associate a disjoint ‖·‖₂-ball of volume (π^{d/2}/Γ(1 + d/2)) (a/2)^d, so the total volume occupied is N_k (π^{d/2}/Γ(1 + d/2)) (a/2)^d, while the volume of E_k is at most (π^{d/2}/Γ(1 + d/2)) 2((k+1)^d − k^d) ≤ (π^{d/2}/Γ(1 + d/2)) 2d(k+1)^{d−1}. Therefore N_k ≤ d(k+1)^{d−1} 2^d / a^d and, since γ0 is non-increasing in the distance between its arguments, sup_{s′∈E_k} γ0(s − s′, t − t′) ≤ γ0(k, t − t′). Hence
lim_n lim sup_{s∈S_t, t≤T_n} Σ_{s′∈S_t, t′≤T_n} |γ0(s − s′, t − t′; θ)| ≤ Σ_{k∈Z₊, k2∈Z₊} ( d(k+1)^{d−1} 2^d / a^d ) γ0(k, b k2; θ) ≤ (C_K/b) ∫_{R₊} ∫_{R₊} ( d(s+1)^{d−1} 2^d / a^d ) γ0(s, t) ds dt = O(1),
owing to the finite moments of the modified Bessel function of the second kind. The remaining conditions follow analogously from the uniform integrability of γ0 and its derivatives. □
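As a quick numerical check of Lemma 2.6.2 (and of assumption (A4).5), one can build the covariance matrix implied by an isotropic, stationary correlation on a set of irregular sites and verify that its smallest eigenvalue is positive. The sketch below uses an exponential correlation and arbitrary site coordinates purely for illustration.

```python
import numpy as np

def min_eigenvalue(coords, corr_fn, sigma2=1.0):
    """Smallest eigenvalue of Sigma_{ij} = sigma2 * corr_fn(||s_i - s_j||) on given sites."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    Sigma = sigma2 * corr_fn(d)
    return np.linalg.eigvalsh(Sigma).min()    # should be > 0 for a valid model

rng = np.random.default_rng(0)
sites = rng.uniform(0.0, 10.0, size=(50, 3))          # irregular spatial sites in R^3
print(min_eigenvalue(sites, lambda d: np.exp(-d / 3.0)))
```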
2.6.3.1 Tapered REML
Suppose we taper each spatial covariance matrix with a fixed taper range r_t = K_t √P_t, where the constant K_t is chosen by cross-validation. We solve the modified REML equation
trace( Π_tap(θ̂_{n,REML,tapered}) ∂Σ̃(θ̂_{n,REML})/∂θ ) + Zᵀ ( ∂Π_tap(θ̂_{n,REML,tapered})/∂θ ) Z = 0,   (2.25)
where
Π_tap = (Σ̃ ∘ K_Tap)⁻¹ − (Σ̃ ∘ K_Tap)⁻¹ X̃1 ( X̃1ᵀ (Σ̃ ∘ K_Tap)⁻¹ X̃1 )⁻ X̃1ᵀ (Σ̃ ∘ K_Tap)⁻¹.
In order to show convergence of the tapered REML estimator, i.e. ‖θ̂_{n,REML,tapered} − θ0‖₂ = O_p(1/√(Σ_{t=1}^T P_t n)), we must ensure that, even after tapering, all the conditions of Cressie & Lahiri (1996) remain intact. Hence we prove the following lemma, which guarantees that the tapered matrix is not "far" from the original covariance matrix.

Lemma 2.6.4. Assuming general condition 13 holds, we have, for each t:
• ‖Σ_t − Σ_{t,Taper}‖₁ = O_p(1/√P_t);
• ‖Σ_{k,t} − Σ_{k,t,Taper}‖₁ = O_p(1/√P_t);
• ‖Σ_{k,j,t} − Σ_{k,j,t,Taper}‖₁ = O_p(1/√P_t).
Proof.
‖Σ_t − Σ_{t,Taper}‖₁ = max_i Σ_{j : h_{ij} ≤ w_{p_t}} |γ(s − s′, t − t′; θ) − γ_taper(s − s′, t − t′; θ)| + max_i Σ_{j : h_{ij} ≥ w_{p_t}} |γ(s − s′, t − t′; θ)|
 ≤ (3/(K_t ρ)) Σ_{m=⌈w(p_t)/δ⌉}^{∞} Σ_{j∈B_i^m} |γ_taper(s − s′, t − t′; θ)| ≤ ( K_t ρ / w(p_t) ) ∫_{w_{p_t}}^{∞} x^d |γ(x; θ)| dx = O_p(1/√P_t),
and the first- and second-derivative terms are bounded in the same way. □

Therefore, based on equation (2.23), in order to estimate β̂ and θ̂ we take derivatives of the penalized likelihood with respect to each unknown parameter and solve iteratively until convergence.

2.6.4 Regularity conditions for the penalty
In order to capture properties of the subject-level images over time, we introduce β_{t,0,j}, the unknown true parameters of dimension p × 1 for each fixed time point t ∈ {1, 2, ..., T}. Analogously, under the sparsity assumption we let s_t denote the number of nonzero components at time point t, so that Σ_{t=1}^T s_t = s. Without loss of generality — a voxelwise analysis with subject-wise registration to a single template space is performed — each image contains the same number of spatial sites p, and for every subject images are acquired at T time points. Clearly, for separable covariances this provides the convenience of the Kronecker product. Σ(θ) is assumed to be second-order stationary, isotropic and twice differentiable over all dimensions of space and time for θ ∈ Ξ and (s, t) ∈ D × T, where Ξ is the parameter space of θ. As before, write Σ_m(θ) = ∂Σ(θ)/∂θ_m and Σ_{mm′}(θ) = ∂²Σ(θ)/∂θ_m∂θ_{m′}, with θ = (θ1, ..., θ_m, ..., θ_r)′ and true parameter θ0.
Listed below are the conditions under which consistent pLDA estimates are obtained.
(A3) a_{t,n} = max_{1≤j≤p} { p′_{λ_{t,n}}(|β_{t,0,j}|) : β_{0,j} ≠ 0 } = O(√n).
(A4) b_{t,n} = max_{1≤j≤p} { p′′_{λ_{t,n}}(|β_{t,0,j}|) : β_{0,j} ≠ 0 } = o(1).
(A5) √n λ_{t,n} / √s → ∞.
(A6) λ_{t,n} = o(1) as n → ∞.
(A7) s⁴/n → 0 as n → ∞.
(A8) min_{1≤i≤s} |β_{t,0,i}| / λ_{t,n} → ∞ as n → ∞.
(A9) lim inf_{n→∞} lim_{θ→0+} p′_{λ_{t,n}}(|θ|)/λ_{t,n} > 0.
(A10) ( Σ_{t=1}^T d_{p_t} )( Σ_{t=1}^T s_t ) / (n C_{pT}) → 0, where d_{p_t} = max_{i≤s_t} Σ_{k=s_t+1}^{p_t} σ²_{k,t}.
We can rearrange β0 and re-express it as β0_{pT×1} = (β0_{1,p×1}, β0_{2,p×1}, ..., β_{t,0,p×1}, ..., β_{T,0,p×1})ᵀ. For all the methods below, the only constraint required for identifiability is Σ_{t=1}^T s_t = o(n). Taking λ_{t,n} = λ_n equal across time groups, the results below hold in this special case; a small numerical sketch of the SCAD penalty used throughout is given next.
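For reference, the following sketch evaluates the SCAD penalty and its first derivative p′_λ — the quantities entering conditions (A3), (A4) and (A9). The value a = 3.7 is the choice commonly suggested by Fan & Li (2001) and is used here only for illustration.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001), evaluated elementwise on |beta|."""
    b = np.abs(beta)
    p1 = lam * b
    p2 = -(b ** 2 - 2.0 * a * lam * b + lam ** 2) / (2.0 * (a - 1.0))
    p3 = (a + 1.0) * lam ** 2 / 2.0
    return np.where(b <= lam, p1, np.where(b <= a * lam, p2, p3))

def scad_derivative(beta, lam, a=3.7):
    """p'_lambda(|b|) = lam * { 1(|b|<=lam) + (a*lam - |b|)_+ / ((a-1)*lam) * 1(|b|>lam) }."""
    b = np.abs(beta)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam) * (b > lam))
```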
Lemma 2.6.5. Assuming ‖θ̂ − θ‖₂ = O_p(√(1/(PT n))), for each t = 1, 2, ..., T we have ‖β̂_t − β_{t,0}‖₂ = O_p(√(s_t/n)) and β̂_{t,2} = 0_{P×1} with probability tending to 1, where β̂_t|θ̂ = argmax_{β_t} Q(β, θ̂; Z).
Proof. For each fixed t the proof consists of two steps. In the first part we show that ‖β̂_t − β_{t,0}‖₂ = O_p(√(s_t/n)); in the second part we show that β̂_{t,2} = 0_{P×1} with probability tending to 1. Without loss of generality write β_{0,t} = (β_{0,t,1}, 0)ᵀ, and suppose an oracle has already told us the positions of the zeros, so that we can construct an oracle estimator that has zeros in those positions:
β̂_t^{orc}|θ̂ = argmax_{β_{t,1} : β_{t,1}^c = 0} L(β, θ̂; Z).
By this construction we search for β̂_t^{orc} only in the space of vectors whose nonzero elements correspond to the nonzero part of the true β_{t,0}. In the second part we show that, for any maximizer of the penalized likelihood β̂_t|θ̂ = argmax_β Q(β, θ̂; Z), we have Pr(β̂_t|θ̂ = β̂_t^{orc}|θ̂) → 1. Furthermore, if the likelihood is concave, as in our Gaussian case, the maximizer β̂_t is unique. To prove the second part we use the stationarity condition of the KKT conditions referred to in Kwon & Kim (2012).

The first part is established by the following theorem.
Theorem 2.6.6. Assuming ‖θ̂ − θ‖₂ = O_p(√(1/(PT n))), for each t = 1, 2, ..., T, ‖β̂_t^{orc}|θ̂ − β_{t,0}‖₂ = O_p(√(s_t/n)).
Proof. Denote by u_t ∈ R^{p_n} the vector with entries 1 at the indices j such that β_{t,j} ≠ 0 and 0 elsewhere; without loss of generality write u for u_t. Similarly, X̃_t denotes the columns of the design matrix X̃ corresponding to β_{t,j} ≠ 0, so that X̃_t u_t = X̃ (u, 0, ..., 0)ᵀ. It is sufficient to show that for any ε > 0, with Σ_{t=1}^T s_t = s, ξ_{n,t} = O_p(√(s_t/n)) and ξ_n = O_p(√(s/n)),
Pr( sup_{‖u‖₂=C} L(β_{t,0} + u ξ_{n,t}, θ̂) − L(β_{t,0}, θ̂) < 0 ) > 1 − ε
for some C > 0, under the condition that for r ∈ {1, 2, ..., T} \ {t} we have ‖β̂_r − β_r‖₂ = O_p(√(s_r/n)); equivalently, we can prove that
Pr( sup_{‖u‖₂=C} L(β_{0,t} + u ξ_n, θ̂) − L(β_{0,t}, θ̂) < 0 ) > 1 − ε.
For term (1), with ψ_i denoting the i-th eigenvalue, we decompose
(1) = ½ uᵀ X̃_tᵀ Σ̃(θ̂)⁻¹ X̃_t u ξ_n² = ½ uᵀ X̃_tᵀ Σ̃(θ0)⁻¹ X̃_t u ξ_n²  (1a)  + ½ Σ_{r=1}^q uᵀ X̃_tᵀ Σ̃_r(θ*) (θ̂_r − θ_{0,r}) X̃_t u ξ_n²  (1b),
and by the eigenvalue inequality
(1a) ≤ O( ‖u‖₂² min_i ψ_i( X̃_tᵀ (I_{n−1} + J_{n−1}) X̃_t ) min_i ψ_i( Σ(θ̂)⁻¹ ) ) ξ_n² = O(n ξ_{n,t}²).
Similarly, using the Cauchy–Schwarz inequality, (1b) = O(n ξ_{n,t}² η_n).
For term (2) we use a Taylor expansion of Σ̃(θ̂)⁻¹ around θ0, writing θ̂ − θ0 = v η_n:
(2) = −(Z − X̃β)ᵀ Σ̃(θ0)⁻¹ X̃_t u_t ξ_n  (2a)  − Σ_{r=1}^q (Z − X̃β_t)ᵀ Σ̃_r(θ*) (θ̂_r − θ_{0,r}) X̃ u ξ_n  (2b).
For term (2a), using the Markov inequality (for any random variable Z, Z = O_p(√E(Z²))),
−(Z − X̃β)ᵀ Σ̃(θ0)⁻¹ X̃_t u_t ξ_n = O_p( √( min_i ψ_i(Σ(θ0)⁻¹) trace(X̃_tᵀ X̃_t) ) ‖u‖₂ ξ_n ) = O_p( √( (n1 n2/n) s_t ) ‖u‖₂ ξ_n ) = O_p( √(n s_t) ‖u‖₂ ξ_n ).
Similarly, for term (2b),
Σ_{r=1}^q (Z − X̃β)ᵀ Σ̃_r(θ*) X̃ u ξ_n η_n = O_p( √( min_i ψ_i( Σ̃_r(θ*) Σ̃(θ0) Σ̃_r(θ*) ) trace(X̃ᵀ X̃) ) ‖u‖₂ ‖v‖₂ ξ_n η_n ) = O_p( √(n s_t) ‖u‖₂ ξ_n η_n ) = o_p( √(n s_t) ‖u‖₂ ξ_n ).
We observe that term (1) dominates all the other terms; hence, for an appropriate constant C, the difference above is negative, which proves the claim. □

Lemma 2.6.7. Pr( min_{j≤q_n} |β̂_{t,j}|θ̂| > aλ_n ) → 1.
Proof. min_{j≤q_n} |β̂_{t,j}|θ̂| ≥ min_{j≤q_n} |β_{0,t,j}| − max_{j≤q_n} |β̂_{t,j}|θ̂ − β_{0,t,j}|. Using the result ‖β̂_{t,j}|θ̂ − β_{0,t,j}‖₂ = O_p(√(s_t/n)) and assumption (A6), we establish the result. □

Lemma 2.6.8. Pr( max_{q_n+1≤j≤p_n} |∂L(β̂_t^{orc}|θ̂, θ̂)/∂β_j| < nλ_n for all t ) → 1.
Proof. E_{β,θ}[ ∂L(β0, θ = θ̂)/∂β ] = 0 by conditional expectation. Using Lemma 9.1, we can write
∂L(β̂_t^{orc}|θ̂, θ̂)/∂β = −(Z − X̃ β̂_t^{orc}|θ̂)ᵀ Σ̃⁻¹(θ̂) X̃_{1,t} = −(Z − X̃β0)ᵀ Σ̃⁻¹(θ̂) X̃_{1,t} − (β̂_t^{orc}|θ̂ − β0)ᵀ X̃ᵀ Σ̃⁻¹(θ̂) X̃_{1,t}.
By the Gaussian concentration inequality, the probability that any of these quantities exceeds nλ_n is of order PT · O(e^{−λ²_{n,t}}). Hence proved. □

Lemma 2.6.9. If ‖β̂ − β0‖ = O_p( √( Σ_{t=1}^T s_t / n ) ), then ‖θ̂ − θ0‖ = O_p( 1/√( Σ_{t=1}^T P_t n ) ).
Proof. We will show that for arbitrary ε > 0,
Pr( sup_{‖v‖₂=C} Q(β̂, θ0 + v η_n) − Q(β̂, θ0) < 0 ) > 1 − ε, where η_n = O_p( 1/√(Σ_t P_t n) ).
Decompose
Q(β̂, θ0 + v η_n) − Q(β̂, θ0) = [ Q(β0, θ0 + v η_n) − Q(β0, θ0) ]  (1)  + [ Q(β̂, θ0 + v η_n) − Q(β0, θ0 + v η_n) − Q(β̂, θ0) + Q(β0, θ0) ]  (2).
Expanding term (1) by a Taylor series we obtain
(1) = η_n Σ_{r=1}^q v_r ∂Q(β0, θ0)/∂θ_r  (11)  + η_n² Σ_{r,r′=1}^{q,q} v_{r′} [ −t̃_{r,r′}(θ0) ] v_r  (12)  + η_n² Σ_{r,r′=1}^{q,q} v_{r′} [ ∂²Q(β0, θ*)/∂θ_r∂θ_{r′} + t̃_{r,r′}(θ0) ] v_r  (13),
where Σ̃_r(θ)⁻¹ denotes ∂Σ̃⁻¹(θ)/∂θ_r and
∂Q(β0, θ0)/∂θ_r = −(Z − X̃β0)ᵀ Σ̃_r(θ0)⁻¹ (Z − X̃β0) + trace( Σ̃(θ0) Σ̃_r(θ0)⁻¹ ).
Using the Markov inequality,
∂Q(β0, θ0)/∂θ_r = O_p( √( trace( Σ̃(θ0) Σ̃_r(θ0)⁻¹ Σ̃(θ0) Σ̃_r(θ0)⁻¹ ) ) ) = O_p( ‖Σ̃(θ0) Σ̃_r(θ0)⁻¹‖_F ) ≤ O_p(√(nPT)) λ_max(Σ(θ0)) λ_max(Σ_r(θ0)⁻¹) = O_p(√(nPT)),
so that (11) = O_p(1) ‖v‖₂.
Let t̃_{i,j}(θ) = trace( I_n ⊗ Σ⁻¹(θ0) Σ_i(θ0) Σ⁻¹(θ0) Σ_j(θ0) ) = n t_{i,j}(θ0), and write t_{i,j}(θ0) = t_{i,i}^{1/2} a_{i,j} t_{j,j}^{1/2} with a_{i,j} = t_{i,j}/√(t_{i,i} t_{j,j}). By assumption, lim_n A = (a_{i,j}) exists with lim_n λ_max(A) = O_p(1), and t_{i,i}(θ0) ≥ PT λ²_min(Σ⁻¹(θ0)) λ²_min(Σ_i(θ0)). Hence, for some constant K_n,
(12) ≤ −( nPT η_n² ‖v‖₂² / (2K_n) ) λ²_min(Σ⁻¹(θ0)) min_{i≤q} λ²_min(Σ_i(θ0)),
a negative term of exact order nPT η_n² ‖v‖₂².
For term (13), write
∂²Q(β0, θ*)/∂θ_r∂θ_{r′}  (131) = trace( Σ̃(θ*) Σ̃_{r,r′}(θ*)⁻¹ ) − (Z − X̃β)ᵀ Σ̃_{r,r′}(θ*)⁻¹ (Z − X̃β) + trace( Σ̃_r(θ*) Σ̃_{r′}(θ*)⁻¹ ).
Similar to expression (11),
(131) = O_p( ‖Σ̃(θ*) Σ̃_{r,r′}(θ*)⁻¹‖_F ) + trace( Σ̃_{r,r′}(θ*)⁻¹ ( Σ̃(θ*) − Σ̃(θ0) ) )
  ≤ O_p(√(nPT)) λ_max(Σ_{r,r′}(θ*)⁻¹) λ_max(Σ(θ0)) + n λ_max(Σ_{r,r′}(θ*)⁻¹) trace( Σ(θ0) − Σ(θ*) )
  ≤ O_p(√(nPT)) + n O_p(1) PT max_{i≤PT} ⟨ ∂γ_i/∂θ |_{θ**}, θ0 − θ* ⟩ ≤ O_p(√(nPT)) + n O_p(1) PT max_{i≤PT} ‖∂γ_i/∂θ |_{θ**}‖∞ ‖θ0 − θ*‖₂ = O_p(√(nPT)).
Using the identity Σ̃_r(θ) Σ̃_{r′}(θ)⁻¹ = −Σ̃_r(θ) Σ̃⁻¹(θ) Σ̃_{r′}(θ) Σ̃⁻¹(θ), we obtain
trace( Σ̃_r(θ0) Σ̃_{r′}(θ0)⁻¹ ) − trace( Σ̃_r(θ*) Σ̃_{r′}(θ*)⁻¹ ) = trace( Σ̃_r(θ0) ( Σ̃_{r′}(θ0)⁻¹ − Σ̃_{r′}(θ*)⁻¹ ) ) + trace( Σ̃_{r′}(θ*)⁻¹ ( Σ̃_r(θ0) − Σ̃_r(θ*) ) )
 ≤ n λ_max(Σ_r(θ0)) trace( Σ_{r′}(θ0)⁻¹ − Σ_{r′}(θ*)⁻¹ ) + n λ_max(Σ_{r′}(θ*)⁻¹) trace( Σ_r(θ0) − Σ_r(θ*) )
 = n O_p(1) PT max_{i≤PT} ⟨ ∂γ_i^{(r′,−1)}/∂θ |_{θ**}, θ0 − θ* ⟩ + n O_p(1) PT max_{i≤PT} ⟨ ∂γ_i^{(r)}/∂θ |_{θ**}, θ0 − θ* ⟩ = O_p(√(nPT)),
similarly to the second part of (131). Thus (13) = O_p(√(nPT)) η_n² ‖v‖₂² = O_p(1/√(nPT)) ‖v‖₂², and term (1) is dominated by (12).
As for expression (2), it is easy to check that the penalty terms cancel and we are left with
(β̂ − β0)ᵀ X̃ᵀ [ Σ̃⁻¹(θ0 + v η_n) − Σ̃⁻¹(θ0) ] (Z − X̃β0) + (β̂ − β0)ᵀ X̃ᵀ [ Σ̃⁻¹(θ0 + v η_n) − Σ̃⁻¹(θ0) ] X̃ (β̂ − β0)
 = η_n (β̂ − β0)ᵀ X̃ᵀ Σ_{r=1}^q Σ̃_r(θ*) v_r (Z − X̃β) + η_n (β̂ − β0)ᵀ X̃ᵀ Σ_{r=1}^q Σ̃_r(θ*) v_r X̃ (β̂ − β0)
 = O_p( ‖β̂ − β0‖₂ ‖θ̂ − θ0‖₂ ) √( n trace( Σ⁻¹(θ0) Σ_r(θ0)⁻¹ ) ) + O_p( n ‖β̂ − β0‖₂² ‖θ̂ − θ0‖₂ λ_max(Σ_r(θ*)⁻¹) ).
Hence (12) dominates all the remaining terms, and choosing ‖v‖ sufficiently large proves the statement. □

Theorem 2.6.10. min_{1≤j≤q_n} β̂_j^{orc}|θ̂ = O(n^{2c}).
Proof. This follows from min_{1≤j≤q_n} β̂_j^{orc}|θ̂ ≤ min_{1≤j≤q_n} β_j|θ + O_p( ‖β̂^{orc}|θ̂ − β|θ‖₂ ), together with the assumption on the signal and the result that O_p( ‖β̂^{orc}|θ̂ − β|θ‖₂ ) = o_P(n^{2c}). □
2.7 Algorithm and methodology to obtain optimal solutions
Based on the objective function (2.23), in order to estimate β̂ and θ̂ we take the derivatives of the penalized likelihood with respect to each of the unknown parameters and solve iteratively until convergence. Following the one-step sparse estimates of Zou & Li (2008), as a first-step initialization we replace Σ(θ̂) with the identity matrix I_{pT} and obtain the solution for β, denoted β̂^{(ini)}. Given β̂^{(ini)}, we obtain a solution θ̂^{(0)}, which is then used to obtain β̂^{(0)}, which in turn is used to obtain θ̂^{(1)}; finally we obtain β̂^{(1)} given θ̂^{(1)}. As for already established MLE-based methods under the SCAD penalty, it has been shown that consistency can be achieved with these one-step estimates, so there is no apparent need to iterate further to full convergence. Along similar lines, the theoretical consistency properties are established for each estimate under the assumption that the remaining parameters are held fixed, at any given iteration, at the solution obtained in the previous step.
Step (1): Set Σ = I_{pT} and solve β̂_t^{(ini)} = argmax_{β_t} Q(β_t, Σ = I_{pT}; Z).
Step (2): Set β_t = β̂_t^{(ini)} and solve θ̂^{(0)} = argmax_θ L(β̂_t^{(ini)}, θ; Z).
Step (3): Solve β̂_t^{(0)} = argmax_{β_t} Q(β, θ = θ̂^{(0)}; Z).
Step (4): Solve θ̂^{(1)} = argmax_θ L(β_t = β̂_t^{(0)}, θ; Z).
Step (5): Solve β̂_t^{(1)} = argmax_{β_t} Q(β, θ = θ̂^{(1)}; Z).
The final estimates of the spatio-temporal covariance and of the mean-difference parameters are therefore θ̂^{(1)} and β̂_t^{(1)} respectively; a schematic sketch of these five steps is given at the end of this section.

2.7.1 Asymptotic properties of the one-step estimates
In order to show consistency of all the estimators, we show that consistency is obtained under the assumption that the previous estimate of the parameter held fixed in the next iteration is itself consistent. Based on these notions, consider the following theorem.
Theorem 2.7.1. Assuming conditions (A1)–(A10) hold, for each fixed t ∈ {1, ..., T}:
(i) (Consistency) there exists an estimate β̂_t = (β̂_{1,t}, β̂_{2,t})ᵀ such that ‖β̂_t − β_{0,t}‖ = O_p(√(s_t/n));
(ii) (Sparsity) further, if (i) holds, then Q((β̂_1, 0)′, θ; Z) > max_{‖β̂_2‖₂ = C√(s/n)} Q((β̂_1, β̂_2)′, θ; Z).
Proof. See the Appendix.
In a similar way, we are able to establish existence and consistency results for θ̂ as well.
Theorem 2.7.2. Assuming conditions (A1)–(A10) hold, ‖θ̂ − θ0‖ = O_p(√(1/(pTn))) as p, T, n → ∞.
Since the asymptotic properties hold for every time component, these results combine to give the following.
Theorem 2.7.3. For β_{0,pT×1} = (β_{0,1,p×1}, β_{0,2,p×1}, ..., β_{0,t,p×1}, ..., β_{0,T,p×1})ᵀ and β̂_{pT×1} = (β̂_{1,p×1}, β̂_{2,p×1}, ..., β̂_{t,p×1}, ..., β̂_{T,p×1})ᵀ:
(i) (Consistency) there exists an estimate β̂ such that ‖β̂ − β0‖ = O_p(√(s/n)) as n → ∞;
(ii) (Sparsity) P(β̂_{t,2} = 0 for all t) → 1 as n → ∞.
All proofs of Theorems 2.7.1, 2.7.2 and 2.7.3 are provided in the Appendix. The covariance function is assumed to be stationary, and the results are based on the SCAD penalty, which yields nearly unbiased estimators.
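The following is a schematic sketch of Steps (1)–(5). The callables fit_beta (the SCAD-penalized mean update), fit_theta (the covariance-parameter update, e.g. tapered REML) and Sigma_of (which builds Σ(θ) from fitted parameters) are placeholders standing in for the estimators described above, not actual implementations.

```python
import numpy as np

def one_step_estimates(Z, X, fit_beta, fit_theta, Sigma_of, pT):
    """One-step alternating scheme of Section 2.7 (Zou & Li, 2008 style)."""
    beta_ini = fit_beta(Z, X, np.eye(pT))          # Step 1: beta with identity working covariance
    theta_0 = fit_theta(Z, X, beta_ini)            # Step 2: covariance parameters given beta_ini
    beta_0 = fit_beta(Z, X, Sigma_of(theta_0))     # Step 3: beta given Sigma(theta_0)
    theta_1 = fit_theta(Z, X, beta_0)              # Step 4: updated covariance parameters
    beta_1 = fit_beta(Z, X, Sigma_of(theta_1))     # Step 5: final one-step beta
    return beta_1, theta_1
```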
2.8 Computational Complexity
Thus far we have described a penalized LDA technique that simultaneously estimates the parameters of an underlying spatio-temporal covariance model and performs selection and estimation of the mean differences. However, once the vector Z is constructed, the corresponding matrix Σ̃(θ) has dimension (n−1)pT × (n−1)pT. Even assuming the subjects are independent, we still need the inverse of a pT × pT matrix, at a computational cost of O((pT)³). To ease this burden, we turn to tapering techniques.

2.8.1 Covariance Tapering
In spatial statistics, Kaufman et al. (2008) introduced covariance tapering to approximate the likelihood by replacing the spatial covariance with a positive-definite tapered version of the covariance matrix, and established the computational gain while maintaining the underlying theoretical properties. For very large datasets, Furrer et al. (2006) delineate the use of tapered covariance matrices. As the density of sites in an image is exponentially larger than the number of time points in our longitudinal imaging study, we taper only the spatial covariance of a separable spatio-temporal covariance matrix. Suppose we use a tapering function K_Tap(h, w) for the spatial domain only, with spectral density
f_K(ω) = (2π)^{−d} ∫_{R^d} exp(−iwᵀx) K_Tap(x, ω) dx
subject to the restriction
f_K(ω) ≤ M_ψ / (1 + ‖ω‖²/α²)^{ν + d/2 + ψ},
where ψ > max(1 − ν, d/4). The Wendland taper function of degree k satisfies this condition for k > max{1/2, ν + (d − 2)/4 + δ} for some δ > 0. We consider the Wendland-type taper for each time-fixed covariance matrix Σ_t, K_Tap(h, r_t) = [(1 − h/r_t)₊]², with k = 2 and d = 3.

2.8.2 One-way versus two-way tapering
One-way tapering yields a biased score function:
l1(θ) = −(nPT/2) log(2π) − (Z − X̃β)ᵀ (Σ̃ ∘ K_Tap)⁻¹ (Z − X̃β) − log(det(Σ̃ ∘ K_Tap)) − Σ_{t=1}^T Σ_{j=1}^P p_{λ_{t,n}}(|β0_{t,j}|).
Two-way tapering yields an unbiased score function:
l2(θ) = −(nPT/2) log(2π) − (Z − X̃β)ᵀ [ (Σ̃ ∘ K_Tap)⁻¹ ∘ K_Tap ] (Z − X̃β) − log(det(Σ̃ ∘ K_Tap)) − Σ_{t=1}^T Σ_{j=1}^P p_{λ_{t,n}}(|β0_{t,j}|).
It is established that, under certain assumptions, the two methods give asymptotically close estimates, ‖θ̂_{1,taper} − θ̂_{2,taper}‖₂ = o_p( 1/√(Σ_{t=1}^T P_t n) ).

2.8.3 Tapering range
We have established that the chosen tapering range guarantees estimation consistency for θ0 as n → ∞ under the increasing-domain setup. Furrer et al. (2016) and Chu et al. (2011) establish estimation consistency under sufficient conditions that are satisfied by our Matérn class of covariances, mainly because, for appropriate true parameters,
γ(s, t; θ) < A / (1 + ‖s‖^{3+α1} + |t|^{1+α2}) and ∂γ(s, t; θ)/∂θ < A / (1 + ‖s‖^{3+α1} + |t|^{1+α2})
for some α1 > 0 and α2 > 0. Similar results hold for non-separable covariances as well. Additionally, ∫_0^∞ ∫_0^∞ s^{d+1} t² γ(s, t) ds dt < ∞, since the spectral density is infinitely differentiable. Thus, with tapering range r_t = O(√P_t),
sup_{θ0∈Θ} |L_{θ0} − L_{θ_tap}| = (1/n) log( |Σ̃(θ0) (Σ̃(θ0) ∘ K_taper)⁻¹| ) + (1/n) (Z − Xβ0)ᵀ ( Σ̃⁻¹(θ0) − (Σ̃(θ0) ∘ K_taper)⁻¹ ) (Z − Xβ0) = o_p(1).
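A minimal sketch of the one-way spatial taper described above: the taper matrix K_Tap(h, r_t) = [(1 − h/r_t)₊]² is applied elementwise (a Hadamard product) to the spatial covariance, producing exact zeros beyond the taper range; in the two-way version the resulting inverse is tapered again inside the quadratic form. The coordinates and taper range below are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import cdist

def taper_matrix(coords, taper_range):
    """K_Tap(h, r) = [(1 - h/r)_+]^2, zero beyond the taper range (yields sparsity)."""
    h = cdist(coords, coords)
    return np.clip(1.0 - h / taper_range, 0.0, None) ** 2

def one_way_taper(Sigma, coords, taper_range):
    """One-way tapering: elementwise product Sigma o K_Tap of the spatial covariance."""
    return Sigma * taper_matrix(coords, taper_range)

# Illustration on a 10 x 10 spatial grid with exponential covariance, range r = 3.
xx, yy = np.meshgrid(np.arange(10), np.arange(10))
coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)
Sigma = np.exp(-cdist(coords, coords) / 3.0)
Sigma_tap = one_way_taper(Sigma, coords, taper_range=4.0)
print(np.mean(Sigma_tap == 0.0))    # fraction of entries zeroed out by the taper
```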
2.9 Misclassification Optimality
This theorem shows that, after penalized maximum likelihood estimation under tapering, the plug-in classifier attains the optimal rate.
Theorem 2.9.1. Suppose s_t/(n C_p) → 0. Then the worst-case classification error rate satisfies W(δ̂) → 1 − Φ(√C_0/2).
Proof. Refer to the appendix. □

2.10 MRI Data Preprocessing
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). For up-to-date information, see www.adni-info.org. To demonstrate the methodology we consider MRI images obtained from a two-year study using a 3T scanner. A total of 51 subjects (31 controls, 18 AD) were collected, and on average each subject has 4 visits. The preprocessing protocol from ADNI allows us to download N3 (nonparametric nonuniformity normalization) bias-corrected NIfTI images. For visualizing these images we use the freely available toolkit ITK-SNAP developed by Yushkevich et al. (2016). An affine registration with the skull on was performed for every subject, mapping its later visits to the first: visits two, three and four were affine-registered to visit one using ANTs registration (Avants et al., 2011), implemented in R through the ANTsR and extrantsr packages (Muschelli et al., 2018).
After this first set of registrations on the longitudinal dataset, skull-stripping was performed for each subject on the first visit only, extracting the brain tissue using the MASS tool developed by Doshi et al. (2013). This step was run in parallel, as it is known to be computationally heavy; if it is hard to implement, one may consider the simpler FSL BET tool (Popescu et al., 2012). The brain mask obtained from skull-stripping the first visit of each subject is then applied to all other visits of that subject. To perform voxel-wise analyses, a deformable registration to a template is performed; for the 3D 3T MRI scans we chose the SRI24 atlas (Rohlfing et al., 2008). This deformable registration was again performed with ANTs, with the transformations calculated from the first visit of each subject to the template. A forward transformation consists of a .mat file (for the initial affine registration) and a forward warp image (.nii.gz); these transformations may then be applied to the corresponding time points. A quality-control check may then be done in ITK-SNAP to ensure that all images have been registered to the same space. ROI labels can be obtained from www.nitrc.org/projects/sri24, which provides the segmentation of the ROIs and the corresponding label text file; for AD studies the hippocampus is labeled 37 (left) and 38 (right).

2.11 Simulation Studies
In the following simulation study we investigate a variety of conditions under which the proposed methods are employed. Multivariate Gaussian observations are generated on a lattice grid of p × p ∈ R² locations, where p ∈ {10, 15, 20}, over t = 1, ..., 4 time points. The mean of this multivariate Gaussian distribution is characterized in two ways. The first setting assumes a strong signal indicating a significant difference between the two classes C1 and C2, i.e. ∆ = (1, ..., 1, 2, ..., 2, 0, ..., 0)ᵀ, where s = 20 is the number of nonzero entries, the first 10 of which equal 1 and the next 10 equal 2. Similarly, a smaller difference between the two classes was considered with ∆_small = (0.75, ..., 0.75, 0.5, ..., 0.5, 0, ..., 0)ᵀ. A sketch of this data-generating design is given below.
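The sketch below generates one such simulated dataset: a sparse mean-difference vector with s = 20 nonzero entries, Gaussian samples for the two classes under a user-supplied covariance matrix, and a 50% training/validation split. Taking μ2 = 0 and μ1 = ∆ is one convenient (assumed) parametrization; all names are illustrative.

```python
import numpy as np

def simulate_two_classes(Sigma, n_per_class=50, small=False, seed=0):
    """Class C1 has mean delta, class C2 has mean 0, both with covariance Sigma."""
    rng = np.random.default_rng(seed)
    pT = Sigma.shape[0]
    delta = np.zeros(pT)
    delta[:10] = 0.75 if small else 1.0        # first 10 nonzero entries
    delta[10:20] = 0.5 if small else 2.0       # next 10 nonzero entries
    L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(pT))   # small jitter for stability
    Y1 = delta + rng.standard_normal((n_per_class, pT)) @ L.T
    Y2 = rng.standard_normal((n_per_class, pT)) @ L.T
    return Y1, Y2, delta

def split_half(Y, seed=0):
    """50% training / 50% validation split of one class sample."""
    idx = np.random.default_rng(seed).permutation(len(Y))
    half = len(Y) // 2
    return Y[idx[:half]], Y[idx[half:]]
```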
Under space and time separable assumption multiple spatial covariance functions are assumed such as: • Exponential covariance: The spatial dependence of the error terms are γ(d) = σ2(1 − c) exp(−d/r) where γ(d) = σ2 when d = 0 and d denotes the euclidean distance between two spatial points. r = {3, 10}, c = {0.2, 0.5} and σ = 1. • Polynomial correlation: The covariance function is γ(d) = ρd where ρ = 0.9. • Matern covariance: The covariance function is given by, (cid:0)(cid:113) (cid:0)(cid:113) (cid:1)νKν 2ν d ρ γ(d) = σ2 21−ν Γ(ν) 2ν d ρ (cid:1) where Kν(·) is the modified Bessel function of the second kind, ρ, ν > 0 and Γ is the gamma function. In this study, ν = 2 and ρ = 5. The covariance assumed for the four time points was exponential where γ(dt) = exp(−dt/rt) where dt denotes the euclidean distance between pairs of time points and rt = {0.5, 0.7, 1}. For the non-separable case we use the following covariance function, (cid:0) (cid:1) γ(d, dt) = e −dν t )0.5γµ (1+dλ 1+dλ t • Gneiting covariance: This non-separable space time covariance is, , where ν, λ ∈ [0, 2] and γ ∈ (0, 1). If γ = 0 then the model is separable. In the simulation study the values considered were ν = 0.8, λ = 1 and γ = 0.6. For all of the simulation studies we have 100, 225 and 400 locations on 4 time points. Therefore the total number of observations per subject is 400, 900 and 1600 observations respectively. The total number of non-zero components in the mean vector s = 20. So best case scenario for selection is when true positives (TP) is 20 and false positives (FP) is 0. A total of 100 subjects per study is generated with a 50% training and validation split. 78 2.11.1 Exponential Space Time Covariance (weak correlations) 2.11.1.1 With ∆ p × p 100 225 400 r=3 3.009 (0.0979339536) 2.996 (0.0736011373) 3.002 (0.0516560637) c=0.2 0.231 (0.0228625958) 0.234 (0.0175467671) 0.233 (0.0125129651) σ = 1 1 (4.234e-07) 1 (5.015e-07) 1 (2.954e-07) rt = 1 0.992 (0.0103475934) 0.995 (0.0069044814) 0.996 (0.0048776816) Table 2.1: Exponential separable space and time covariance estimation when ∆ is the mean vector p × p 100 225 400 TP 20.00 20.00 19.99 FP 47.14 111.02 189.68 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00202 0.00124 0.00097 Table 2.2: Estimation and selection of ∆ using the estimated covariance p × p 100 225 400 TP 19.87 19.83 19.70 FP 45.38 64.56 78.59 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00851 0.00467 0.00310 Table 2.3: Estimation and selection of ∆ under independence p × p 100 225 400 r=3 2.992 (0.0940290558) 3.005 (0.0735975377) 3 (0.0567520396) c=0.2 0.234 (0.0215993179) 0.232 (0.0166009694) 0.233 (0.0131230581) σ = 1 1 (5e-07) 1 (4.172e-07) 1 (4.156e-07) rt = 1 0.993 (0.0109421777) 0.993 (0.0080306034) 0.995 (0.0060978283) Table 2.4: Exponential separable space and time covariance estimation when ∆small is the mean vector 79 p × p 100 225 400 TP 17.66 16.13 12.99 FP 119.99 231.17 311.30 MSE Train1 Train2 Test1 Test2 0.79 0.73 0.66 0.79 0.73 0.67 0.79 0.73 0.67 0.79 0.73 0.67 0.00839 0.00506 0.00385 Table 2.5: Estimation and selection of ∆small using the estimated covariance p × p 100 225 400 TP 15.70 14.41 12.68 FP 45.50 55.40 54.21 MSE Train1 Train2 Test1 Test2 0.75 0.72 1.00 0.75 1.00 0.69 0.75 1.00 0.69 0.75 0.71 0.69 0.01081 0.00561 0.00345 Table 2.6: Estimation and selection of ∆small under independence 2.11.2 Exponential Space Time Covariance (strong correlations) 2.11.2.1 With ∆ p × p 100 225 400 
r=10 10.024 (0.6592026826) 9.918 (0.4420207073) 9.952 (0.3472843583) c=0.5 0.519 (0.0310983865) 0.523 (0.021304777) 0.522 (0.0165295756) σ = 1 1 (3.3601e-06) 1 (2.4351e-06) 1 (3.5861e-06) rt=0.5 0.47 (0.1196661211) 0.466 (0.1284748148) 0.475 (0.1094737503) Table 2.7: Exponential separable space and time covariance estimation when ∆ is the mean vector p × p 100 225 400 TP 20.00 16.13 12.99 FP 0.79 231.17 311.30 MSE Train1 Train2 Test1 Test2 1.00 0.73 0.66 0.79 0.73 0.67 1.00 0.73 0.67 1.00 0.73 0.67 0.00029 0.00506 0.00385 Table 2.8: Estimation and selection of ∆ using the estimated covariance p × p 10x10 15x15 20x20 TP 19.98 19.98 19.99 FP 47.08 61.14 78.91 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00487 0.00229 0.00154 Table 2.9: Estimation and selection of ∆ under independence 80 p × p 100 225 400 r= 10 10.056 (0.5445091578) 10.021 (0.415936012) 9.952 (0.3829308584) c=0.5 0.517 (0.0251724131) 0.519 (0.0190434918) 0.522 (0.0185292498) σ = 1 1 (2.5151e-06) 1 (5.481e-07) 1 (2.2068e-06) rt=0.5 0.466 (0.1292439526) 0.46 (0.136410963) 0.475 (0.1505265743) Table 2.10: Exponential separable space and time covariance estimation when ∆small is the mean vector p × p 100 225 400 TP 20.00 19.99 19.99 FP 29.41 231.17 102.89 MSE Train1 Train2 Test1 Test2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00036 0.00021 0.00015 Table 2.11: Estimation and selection of ∆small using the estimated covariance p × p 100 225 400 TP 19.98 19.98 19.99 FP 62.76 61.14 78.91 MSE Train1 Train2 Test1 Test2 0.99 0.98 0.98 1.00 0.97 0.98 0.00630 0.00375 0.00225 1.00 0.97 0.98 1.00 0.97 0.98 Table 2.12: Estimation and selection of ∆small under independence 2.11.3 Matern Space Covariance and exponential time (separable) 2.11.3.1 With ∆ p × p 100 225 400 ν = 2 1.998 (0.0143793046) 2.001 (0.0091730859) 2 (0.0070390667) rt=0.7 0.698 (0.0175455886) 0.701 (0.0108351911) 0.7 (0.0074068311) φ = 5 4.795 (0.1068378951) 4.808 (0.06846612) 4.799 (0.0494941786) Table 2.13: Matern covariance with separable exponential time when ∆ is the mean vector p × p 100 225 400 TP 15.18 14.11 12.73 FP MSE Train1 Train2 Test1 Test2 0.98 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 1.00 0.03 0.01 0.01 86.51 155.46 214.40 Table 2.14: Estimation and selection of ∆ using the estimated covariance 81 p × p 100 225 400 TP 14.49 17.32 10.19 FP MSE Train1 Train2 Test1 Test2 0.93 0.98 0.84 0.93 0.97 0.83 0.93 0.97 0.83 0.93 0.97 0.98 0.06 0.00 0.02 62.76 61.14 51.06 Table 2.15: Estimation and selection of ∆ under independence 2.11.4 Non separable space and time Gneiting covariance (separable) 2.11.4.1 With ∆ p × p 100 225 400 ν = 0.8 0.831 (0.0147602785) 0.832 (0.0108536628) 0.831 (0.0078036784) λ = 1 1.059 (0.0263549226) 1.057 (0.0162489845) 1.057 (0.0130021963) γ = 0.6 0.982 (0.0063041298) 0.983 (0.0039539343) 0.983 (0.0031399539) Table 2.16: Gneiting covariance with non-separable when ∆ is the mean vector p × p 100 225 400 TP 19.08 17.22 14.40 FP MSE Train1 Train2 Test1 Test2 0.96 0.96 0.95 0.97 0.96 0.95 0.97 0.96 0.95 0.03 0.01 0.01 0.97 0.96 0.95 111.03 178.08 178.88 Table 2.17: Estimation and Selection of ∆ using the estimated covariance 82 APPENDIX 83 APPENDIX Useful lemmas Lemma 2.11.1. If X ∼ N (0, 1) then, X = Op(1) Proof. Using Gaussian concentration inequality P (|X| > exists a δ = √−log(/2) . 2 √−log(/2) 2 ) ≤  for every  > 0 there (cid:4) Lemma 2.11.2. If X ∼ χ2(K) then, X = Op(K) Proof. 
Using the sub-exponential concentration inequality X−K is subexponential with parameters (4K, 2). Similarly for every  > 0 there exist a δ which implies P ( K > 0. |X−K| 4K > δ) <  for some constant (cid:4) (cid:113) 1 Theorem 2.11.3 (2.7.2). Assuming (cid:107)ˆθ − θ(cid:107)2 = Op( t = 1, 2..., T (cid:107) ˆβt − βt,0(cid:107)2 = Op((cid:112) st P T n ) , we want to prove that for each n ) and ˆβt 2 = 0P×1 with probability close to 1 where ˆβ = arg maxβ L(β, ˆθ; Z) Proof. For each fixed t the proof consists of two steps In first part we show that (cid:107) ˆβt − βt,0(cid:107)2 = Op((cid:112) st n ) . In the next part we show that ˆβt 2 = 0P×1 with probability close to 1 . To prove this second part we need an additional lemma 2.11.4 from the paper The proof of first part Denote vector ut ∈ Rpn with entries 1 corresponding to index j such that βt,j (cid:54)= 0 and 0 elsewhere, but without loss of generality , we denote ut as u.Similarly, Xt denotes the colomns of design matrix X corresponding to βt,j (cid:54)= 0. Hence we get identity that Xtut = X Ist×st 0 n ) , ξn = Op((cid:112) s show that for any  > 0 ,(cid:80)T the condition that for r ∈ {1, 2, ....T} − {t} we have (cid:107) ˆβr − βr(cid:107)2 = Op((cid:112) sr t=1 st = s, ξn,t = Op((cid:112) st 0 0 n )  u It is sufficient to n ) and for some C > 0 under 84 P r(cid:0) sup (cid:107)u(cid:107)2=C Q(βt,0 + uξn,t, ˆθ) − Q(βt,0, ˆθ)(cid:1) > 1 −  or equivalently we can prove that P r(cid:0) sup (cid:107)u(cid:107)2=C Q(β0,t + uξn, ˆθ) − Q(β0,t, ˆθ)(cid:1) > 1 −  Q(β0,t + uξn, ˆθ) − Q(β0,t, ˆθ) = L(β0,t + uξn, ˆθ) − L(β0,t, ˆθ) − n = L(β0,t + uξn, ˆθ) − L(β0,t, ˆθ) − n Pt(cid:88) st(cid:88) j=1 j=1 (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) − n (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) (pλn,t(|βt,0,j + uξn|) − pλn,t(|βt,0,j|)) pt(cid:88) st(cid:88) j=1 j=st+1 ≤ L(β0,t + uξn, ˆθ) − L(β0,t, ˆθ) − n forj = st + 1, ., pt pλn,t(|βt,0,j| = 0) = 0 pλn,t(|βt,0,j − uξn| > 0) = 0 = 1 2 − n st(cid:88) ˜Σ(ˆθ)−1Xtutξn −T ˜Σ(ˆθ)−1Xtut t X T uT t (pλn,t(|βt,0,j + uξn|) − pλn(|βt,0,j|)) j=1 = 1 2 − n = 1 2 (cid:124) ˜Σ(ˆθ)−1Xtutξn − (Z − Xβ)T ˜Σ(ˆθ)−1Xtut st(cid:88) j=1 t X T uT t (pλn(|βt,0,j + uξn|) − pλn(|βt,0,j|)) (cid:123)(cid:122) ˜Σ(ˆθ)−1Xtutξ2 uT t X T t (cid:124) − (Z − Xβ)T ˜Σ(ˆθ)−1Xtutξn,t (cid:125) (cid:123)(cid:122) 1 (cid:125) n,t 2 85 (cid:0)p(cid:48) st(cid:88) j=1 −n (cid:124) λn,t(|β0, j|)sgn(β0, j)utξn,t (cid:125) (cid:123)(cid:122) 3a n,t(1 + op(1)(cid:1) (cid:123)(cid:122) (cid:125) λn,t(|β0,j|)u2 t ξ2 (cid:124) + p(cid:48)(cid:48) 3b For term (1) we get using eigenvalue inequality result that ψi is i th eigenvalue uT t X T t ˜Σ(ˆθ)−1Xtutξn uT X T ˜Σ(θ0)−1Xtutξn (cid:125) = 1 2 (cid:124) (cid:125) + uT t X T t uT X T ˙˜Σ(θ∗)r(ˆθr − θ0,r)uT X T ξn (cid:123)(cid:122) 1a (cid:125) Proof. 1 2 (cid:124) 1 2 (cid:124) (cid:123)(cid:122) 1 q(cid:88) r=1 (cid:123)(cid:122) 1b i 1(a) ≤ O((cid:107)u(cid:107)2X T X min ψi( ˜Σ(ˆθ)−1))ξ2 n = O(min i ψi( ˜X T (In−1 + Jn−1)) ˜X) min i ψi( ˜Σ(ˆθ)−1))ξ2 n = O(nξ2 n,t)by eigenvalue inequality Similarly using CS inequality1(b) = O(nξ2 n,tηn) (cid:4) 86 ψi( ˜Σ(θ0)−1) trace( ˜X T ˜X)(cid:107)u(cid:107)2ξn) −τ21n1 τ11n2−1 (cid:107)u(cid:107)2ξn) (In−1 + Jn−1) (cid:4) (cid:4) For term (2) we use Taylor expansion of ˜Σ(ˆθ)−1 around θ0 such that ˆθ − θ0 = vηn Proof. 
−(Z − Xβ)T ˜Σ(ˆθ)−1Xtut = −(Z − Xβ)T ˜Σ(θ0)−1Xuξn (cid:124) (cid:123)(cid:122) 2a (cid:125) − q(cid:88) (cid:124) r=1 min i i = Op( (cid:123)(cid:122) 2b Z = Op ψi(Σ(θ0)−1) trace(X T term (2a) using Markov inequality that for any r.v (Zt − Xβt)T ˙˜Σ(θ∗)r(ˆθr − θ0,r)Xuξn (cid:125) (cid:113) (cid:112)E(Z2) −(Z − Xβ)T ˜Σ(θ0)−1Xtutξn = Op( (cid:118)(cid:117)(cid:117)(cid:117)(cid:117)(cid:117)(cid:116)min T −τ21n1 = Op((cid:112) n1n2 Similarly term (2b)(cid:80)q (cid:113) −n(cid:80)s r=1(Z − Xβ)T ˜Σ(θ∗)rXuξn ( ˜X T ˜X)(cid:107)u(cid:107)2(cid:107)v(cid:107)2ξnηn) = Op(q (|β0, j|)sgn(β0, j)uξn ≤ n n st(cid:107)u(cid:107)ξn) = Op( √ nst(cid:107)u(cid:107)ξn) ψi( ˜Σ(θ∗)r) min τ11n2−1 = Op( t Xt) s max √ min i j=1(p(cid:48) λn i For term (3a) using CS inequality and using assumption √ n1n2n) Term (3b)using CS inequality and using assumption p(cid:48)(cid:48) λn (|β0,j|)u2ξ2 nbn = Op(nξ2 n) p(cid:48) λn (|β0, j|)(cid:107)u(cid:107)2 j 2ξn = O(n sξn an) = Op( √ snξn) √ n ≤ n(cid:112)(s)ξ2 Here we observe that term 1 dominates all other term hence for appropriate constant C the value above is negative, hence proved. Lemma 2.11.4. This lemma proves sparsity of the estimator.SCAD penalized estimator demon- strates this oracle property which means as if number of zeros in parameter in known initially.Now, let ˆβt = ( ˆβ1 t , ˆβ2 t )T .For any given ˆβt satisfying (cid:107) ˆβ1 t − β1 n ) and assumptions A(1) to t,0(cid:107)2 = Op((cid:112) st A(12) then with high probability  ˆβ1 t 0  = max t (cid:107)≤C (cid:107) ˆβ2 Q √ st n Q  ˆβ1 t ˆβ2 t  87 Proof. It is sufficient to prove that for j = st, st + 1, ....pt ∂Q(βt) ∂βt,j < 0 for 0 < βt,j < C and ∂Q(βt) ∂βt,j > 0 for 0 > βt,j > −C (cid:114) st n (cid:114) st n (cid:113) ∂Q(βt; ˆθ) = −(Z − Xβ)T ˜Σ(ˆθ)−1Xj − nP (cid:48) λ(|βt,j|)sgn(βt,j) ∂βt,j −(Z − Xβ)T ˜Σ(ˆθ)−1Xj = −(Z − Xβ)T ˜Σ(θ)−1Xj − Q(cid:88) (Z − Xβ)T ˜Σk(θ∗)−1ukηn −(Z − Xβ)T ˜Σ(θ)−1Xj = Op( k=1 Xj ˜Σ−1Xj) = Op( √ n) by concentration inequality similarly, Q(cid:88) k=1 √ (Z − Xβ)T ˜Σk(θ∗)−1ukηn = Op( nηn) Collecting all terms we achieve that ∂Q(βt; ˆθ) ∂βt,j = nλn,t (cid:0)Op( √ st√ n λn,t) + sgn(βj)(cid:1) p(cid:48) λn,t (|βj|) λn,t √ n λn,t → 0 sgn(βj) dominates , hence proved st√ Since (cid:113)(cid:80)T st Lemma 2.11.5. (cid:107) ˆβ − β0(cid:107) = Op( n ) then (cid:107)ˆθ − θ0(cid:107) = Op( P r(cid:0) sup Proof. We will show that for arbitrary  > 0 Q( ˆβ, θ0 + vηn) − Q( ˆβ, θ0)(cid:1) > 1 −  here η = Op( (cid:107)v(cid:107)2=C By Taylor Series expansion around ˆθ 88 1 Ptn (cid:113) 1(cid:80)T (cid:113) 1(cid:80)T 1 Ptn ) (cid:4) ) Q( ˆβ, θ0 + vηn) − Q( ˆβ, θ0) = (Z − X ˆβ)T ˙˜Σ(θ0)r(Z − X ˆβ)vη2 + (cid:125) vT (Z − X ˆβ)T [ ¨˜Σ(θ∗)r,r(cid:48) − D](Z − X ˆβ)vη2 + 1 (cid:123)(cid:122) (cid:125) n q(cid:88) (cid:124) r=1 q(cid:88) (cid:124) (cid:124) r,r(cid:48)=1 (cid:123)(cid:122) 2 (cid:125) n (cid:123)(cid:122) 3 vT (Z − X ˆβ)T D(Z − X ˆβ)vη2 Lemma 2.11.6. limp∧T→∞λmaxΣ(θ∗) = O(1) Proof. limP,T→∞,∞λmaxΣ(θ∗) P T(cid:88) ≤ max ≤ ≤ ≤ |Σi,j| i j=1 j=1 |Σi,j| ∞(cid:88) (cid:90) ∞ (cid:90) ∞ (cid:90) ∞ (cid:90) ∞ (cid:90) (cid:90) ∞ (cid:90) ∞ ω=0 τ =0 s=0 t=0 | (cid:107)ω(cid:107)=0 τ =0 ≤ 2 |γ(s, t; θ)| (cid:90) exp(ω(cid:48)s + τ(cid:48)t)f (ω, τ )| R Rd η(α2β2 + β2(cid:107)ω(cid:107)2 + α2τ 2 + (cid:107)ω(cid:107)2τ 2)−(ν+ d+1 2 )dωdτ = O(1) for d > 2 ν > 0 Lemma 2.11.7. limP,T→∞,∞λmaxΣ(θ∗) ◦ KT aper = O(1) and limP,T→∞,∞λminΣ(θ∗) ◦ KT aper Proof. 
We use the result that min KT aper i,iλminΣ(θ∗) ≤ λmaxΣ(θ∗) ◦ KT aper ≤ max KT aper i,iλmaxΣ(θ∗) 89 (cid:4) (cid:4) (cid:4) Proof. By Taylor series expansion, Φ(X) = Φ(y) + (X − y)φ(y) + op(X − y) holds for almost surely a random variable X √ ˆ∆T Σ−1 ˆ∆ ˆ∆T ˆΣ−1Σ ˆΣ−1 ˆ∆ √ ∆T Σ−1∆ . 2 Now taking X = Notice that ˆ∆T Σ−1 ˆ∆ ∼ F ( ˜PT , n − 2) with non central parameter since ˆ∆ and ˆΣ are independent (cid:4) and y = 2 If we use weighted mixture of Matern Covariance function,we can show that all of the above conditions hold for covariance matrix ΣP T×P T (θ) Lemma 2.11.8. The expression of derivative of covariance function with respect to parameter θ is given below Proof. By Fubini’s theorem, we know that the differentiation and integral can be exchanged. ∂γ(s, t; θ) ∂f (ω, τ ; θ) ∂θ dωdτ exp(ω(cid:48)s + τ(cid:48)t) (cid:90) (cid:90) d + 1 2 ) Rd R exp(ω(cid:48)s + τ(cid:48)t)g(ω, τ )−(1+ν+ d+1 2 ) = K(θ) ∂θ (cid:90) (cid:90) Rd R = −(ν + =  2αη(β2 + τ 2) 2βη(α2 + τ 2) η((cid:107)ω(cid:107)2τ 2) (α2β2 + β2(cid:107)ω(cid:107)2 + α2τ 2 + (cid:107)ω(cid:107)2τ 2) log(α2β2+β2(cid:107)ω(cid:107)2+α2τ 2+(cid:107)ω(cid:107)2τ 2) ν+ d+1 2  dωdτ We can verify that each of the element of K(θ) has finite integral multiplied with a continuous (cid:4) function of θ. Since Θ is compact space in Rq, K(θ) is bounded . Lemma 2.11.9. The maximum eigenvalue of matrix ∂Σ ∂θ is bounded from above 90 Proof. We will show that (cid:107) ∂Σ ∂θ (cid:107)2 ≤ (cid:107) ∂Σ ∂θ (cid:107)1 =(cid:82) Rd (cid:82) ∂γ(s,t;θ) ∂θ < ∞ R Lemma 2.11.10. The maximum eigenvalue of matrix ∂2Σ ∂θ2 is bounded from above Proof. Similar to proof above either we can differentiate again and apply Fubini Theorem Lemma 2.11.11. (cid:107)Σ(cid:107)2 F = OpP T >> Op(P T ) 1 2 Proof. Since covarince function is isotropic and stationary (cid:107)Σ(cid:107)2 P T(cid:80) γ2(hij; θ ≤(cid:80) Using the fact that(cid:80) i min j i,j γ2(hij; θ) ≤ P T(cid:80) γ2(hij; θ) ≤(cid:82) ∞ i max γ2(hij; θ) 0 γ2(x; θ)dx < ∞ j i max j F =(cid:80) i,j γ2(hij; θ) Lemma 2.11.12. If λ1 ≥ λ2... ≥ λn be eigenvalues of positive semi definite matrix the con- √ (n−1) √ m(A)−v(A) 2v(A) (n−1) and lower bound 1 + (cid:4) (cid:4) (cid:4) (cid:4) (cid:4) can be have upper bound 1 + 2v(A)/ dition number kn = λmax λmin √ (n−1) √ m(A)−v(A)/ where m(A) = tr(A)/n and v(A) = tr(A2)/n − m2 (n−1) Proof. From lemma in Wolkowicz & Styan (1980) Lemma 2.11.13. Assuming general condition 13 holds, we have the results • (cid:107)Σt − Σt,T aper(cid:107)1 = Op( 1√ Pt ) • (cid:107)Σk,t − Σk,t,T aper(cid:107)1 = Op( 1√ Pt ) • (cid:107)Σk,j,t − Σk,j,t,T aper(cid:107)1 = Op( 1√ ) Pt Proof. (cid:107)Σt − Σt,T aper(cid:107)1 ≤ 2 Ktρ wp xd|γ(x : θ)|dx (cid:82) ∞ wp Similar result holds for derivative and double derivative of Σ(θ) proof for theorem 7.2 Proof. By using Lemma 8.1, 8.3 ,7.1 we have established all regularity conditions for tapered matrix ˜ΣT apered W ( ˆβP M LE) = 1 − Φ( n (n1−n2)(1+Op( ˜PT ∆T Σ−1∆)(1+Op( (cid:115) )+ )) ∆T Σ−1∆)(1+Op( 2 )+ ˜PT n (n1+n2)(1+Op( 1√ n ˜PT 1√ n ˜PT )) (cid:4) 1√ n ˜PT 1√ n ˜PT 91 Chapter 3 Random Projection for Tensor data 3.1 Introduction Given a vector x ∈ Rp×1, we can project it to lower dimensional space through linear map f : Rp×1 → Rd×1 defined by f (x) = ˜A(K)x 1√ k d where ˜Ap×d is random matrix formed with element sampled independently from special classes of distributions. iid∼D a ˜ai,j This linear projection also preserves pairwise distances with high probability. 
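As a quick illustration of this projection and the distance preservation it is meant to deliver, the sketch below (ours, with arbitrary dimensions) applies a Gaussian random map f(x) = (1/√d) Ãx to a handful of points and reports the average deviation of the pairwise squared-distance ratios from 1.

```python
# Minimal sketch of a Gaussian random projection and a check of distance preservation.
# Dimensions and the number of points are our choices, for illustration only.
import numpy as np

rng = np.random.default_rng(1)

p, d, n = 40_000, 500, 10                 # ambient dim, projected dim, number of points
X = rng.standard_normal((n, p))           # n points in R^p

A = rng.standard_normal((d, p))           # entries a_ij ~ N(0, 1)
Y = X @ A.T / np.sqrt(d)                  # f(x) = (1/sqrt(d)) A x for every point

# ratio of projected to original pairwise squared distances (should concentrate near 1)
ratios = []
for i in range(n):
    for j in range(i + 1, n):
        num = np.sum((Y[i] - Y[j]) ** 2)
        den = np.sum((X[i] - X[j]) ** 2)
        ratios.append(num / den)
print(np.mean(np.abs(np.array(ratios) - 1.0)))   # average deviation from isometry
```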
This phenomenon is due to JL lemma for some  > 0) and constant C independent of dimension.For k = 1 P(1 −  ≤ (cid:107)f (x) − f (x(cid:48))(cid:107)2 2 (cid:107)x − x(cid:48)(cid:107)2 2 ≤ 1 − ) ≤ 2 exp(− d 8 (2/2 − 3/3)) 92 Lemma 3.1.1. P(< f (x), f (y) > − < x, y >≥  < x, y >) ≤ 2 sup x∈Rp P((cid:107)f (x)(cid:107)2 − (cid:107)x(cid:107)2 ≥ (cid:107)x(cid:107)2) Proof. Sun et al. (2018) lemma A.1 (cid:4) Suppose there are n points in Rp×1 corresponding to data matrix X = [X1..Xi..Xn]1≤i≤n. We can chose d ≥ 2 log(n)−2 such that all pairwise distances belonging to each of n(n−1) 2 pairs are preserved with high probability. There an be various choices of distributions for random variable a. For example, a ∼ N(0, 1) or in sparse RP s = 3 and very sparse RP s = d respectively √ ai,j = 2 2 +1 with probability 1  −1 with probability 1 √ + √ + P. Li et al. (2006) ai,j = 0 s with probability 1 2s with probability 1 − 1 s s with probability 1 2s 3.2 Kronecker factors Supposedly ˜A(K)p×d could be decomposed as Kronecker product of k matrices ˜A(K) = A(1) ⊗ ..A(l).. ⊗ A(k) 93 where matrix A(l) is of order dl × pl for each l ∈ {1, 2, .., k}.Trivially p = (cid:81)k l=1 pl and dk. The elements of product matrix are given by ˜ai,j = a(1) i1,j1 a(2) i2,j2 ....a(l) il,jl ..a(k) ik,jk l=1(jl − 1)(cid:81)k where j = 1 +(cid:80)k For l = k the expressions(cid:81)k r=l+1 pr and i = 1 +(cid:80)k r=l+1 pr and(cid:81)k l=1(il − 1)dk−l−1 r=l+1 dr assumed to be 1. If we take each elements of the matrices {Al}1≤l≤k to be independently and identically dis- tributed of each other, the resulting elements of product matrices ˜Ai,j are no longer mutually independent but product of independent random variables, thus weekly dependent in the sense it is ρ mixing, and martingale with respect to filtration In this work, we show that even under this dependence structure we can achieve JL kind of for finite p and d. We further assume that E(a(l) il,jl ) = 0 and E[(a(l) il,jl )]2 = 1 for all l ∈ [k], j ∈ [pl], i ∈ [dl]. 3.2.1 Literature reviews Concentration inequalities for quadratic forms like ours is known as Hansen Wright inequality.For dependent variables these type of inequality are discussed in Adamczak et al. (2015) which certain concentration property for 1 Lipschitz function which is hard to verify in our cases.Samson et al. (2000) discussed the concentration property for 1 Lipschitz function for strong mixing sequence and Markov chain.However strong mixing conditions like φ mixing is also difficult to verify Another ap- proach would be use Herbst argument applicable to Log Sobolev measure as mentioned in Ledoux (1999).But proving such inequality is much more tedious in our case 94 Define a random variable Yi(x) =< ˜AK i , x > ,our reader may notice that for fixed i, Yi = (cid:80)(p1,p2,...pK ) given j = 1 +(cid:80)k j1=1,j2=1,...jK =1 a(1) i1,j1 a(2) i2,j2 l=1(jl − 1)(cid:81)k ....a(l) il,jl ..a(k) ik,jk xj r=l+1 pr form a martingale w.r.t filtration Fl and 1 ≤ l ≤ K. But concentration inequalities for martingale differences provides very poor bounds. Readers may also notice stationarity a random variables Yi although actual distribution is in- tractable. Yi is popularly known as Polynomial Gaussian Chaos in probability literatures. Statis- ticians recognize this expression as general U statistics. Its moments and tail bounds are widely studied in Adamczak et al. (2015) and Lata(cid:32)la et al. (2006). 
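A small sketch of the Kronecker construction of Section 3.2 may help fix ideas. It is ours, with factor sizes chosen arbitrarily; it checks numerically that applying the factors mode by mode reproduces the action of the full matrix Ã = A(1) ⊗ ... ⊗ A(k) while storing only Σ_l d_l p_l entries instead of (Π_l d_l)(Π_l p_l).

```python
# Sketch of a Kronecker-factored random projection (Section 3.2).  Illustrative only:
# the factor sizes and the Gaussian entries are our choices.
import numpy as np
from functools import reduce

rng = np.random.default_rng(2)

p_l, d_l = [10, 10, 10, 10], [4, 4, 4, 4]          # p = 10000, d = 256
factors = [rng.standard_normal((dl, pl)) for dl, pl in zip(d_l, p_l)]

x = rng.standard_normal(int(np.prod(p_l)))

# naive version: form the full Kronecker matrix (memory-heavy, shown only for comparison)
A_full = reduce(np.kron, factors)                  # shape (256, 10000)
y_full = A_full @ x / np.sqrt(A_full.shape[0])

# memory-efficient version: apply the factors mode by mode without forming A_full
Xt = x.reshape(p_l)                                # view x as a p1 x ... x pk tensor
for mode, A in enumerate(factors):
    Xt = np.tensordot(A, Xt, axes=([1], [mode]))   # contract the current mode with A
    Xt = np.moveaxis(Xt, 0, mode)                  # put the reduced mode back in place
y_kron = Xt.reshape(-1) / np.sqrt(np.prod(d_l))

print(np.allclose(y_full, y_kron))                 # the two embeddings coincide
# storage: sum(d_l[i] * p_l[i]) factor entries vs prod(p_l) * prod(d_l) for A_full
```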
But in above literatures,moment bound involve supreme of empirical processes which are hard to estimate. Also, exponential bounds given in these literature is depended upon various semi norms of vector x based partition of subset of {1, 2, ...K}. But it is desirable to derive inequalities in terms of L2 norm of x only. Taking all the above bottleneck into consideration, we follow approach by Schudy & Sviridenko (2012) where moment bounds are calculated through brute force combinatorial argument. To our aid we have result known as hyper- contractivity which provides bound of (cid:107)Y (cid:107)r through ((cid:107)Y (cid:107)2)r in Janson et al. (1997) theorem 5.10 and theorem 6.7. Below is the result stated 3.3 Weak dependence By convention, for a random variable X ∈ (R,BR, P), (cid:107)X(cid:107)p = (E(|X|)). Define filtration sigma filed FK = σ(A(l) il,jl : d ≤ il ≤ 1, pl ≤ jl ≤ 1; K ≤ l ≤ 1) Suppose there is a sequence of random variables (˜ai)n : 1 ≤ i ≤ h) rho ρ(n) = sup k≥1 ˜a∈Gh ˜b∈G∞ 1 h+n Gk 1 = σ(˜ai n→∞ ρ(n) → 0 lim i=1 and ˜an ∈ (R,BR, P),define sigma field |Cov(˜a, ˜b)|/(cid:107)˜a(cid:107)2(cid:107)˜b(cid:107)2 a sequence is ρ mixing if 95 0 for a ∼ N(0, 1) 0 for a ∼ very sparse with s = √ d In our case, ρ(|i − i(cid:48)|) = for i (cid:54)= i(cid:48) Proof. Define Yi(x) =< ˜AK i , x >. for i (cid:54)= i(cid:48) but sometimes j = j(cid:48) (cid:88) (cid:88) ˜ai,j(cid:48) ˜ai(cid:48),jxjxj(cid:48)) E(Yi, Y (cid:48) i ) = E( j j(cid:48) Now, for i (cid:54)= i(cid:48) with il = i(cid:48) = E(a(1) i1,j1 = E(a2 (1) i1,j1 ....a(l) il,jl ....a2 (l) il,jl a2 (2) i2,j2 a(2) i2,j2 ..a(k) ik,jk ..a(k) ik,jk l but ik = i(cid:48) k without loss of generality and j = j(cid:48) E(˜ai,j ˜ai(cid:48),j(cid:48)) a(2) ....a(l) i(cid:48) i(cid:48) 2,j2 l,jl ) = 1 E(a(k) ik,jk ..a(k) i(cid:48) k,jk )E(a(k) i(cid:48) k,jk a(1) i(cid:48) 1,j1 a(k) i(cid:48) k,jk ) = 0 ) Due to independence. Hence whole expression has expectation 0. Therefore covariance is 0. (cid:4) 3.4 Concentration Inequality For random variables following Gaussian distribution and very sparse distribution P. Li et al. (2006), we are able to prove this bound. We require some concentration bound for expression P(| 1 dk xT ˜A(K),T ˜A(K)x (cid:107)x(cid:107)2 2 − 1| > t) to decay at exponential else we would not attain the efficiency like JL lemma meaning the lower projected dimension will be much more than standard result for k=1,i.e we want d = O(log n). Theorem 3.4.1. JL lemma for Gaussian or Raedmacher For some constant m = m(K) depending on K P(| 1 dk Sdk(cid:107)x(cid:107)2 2 − 1| > t) < − d e m(K) t 2 K m(K) 96 Proof. Since (cid:107) Sdk CE(et N1 N2 d .. d (cid:107)x(cid:107)2dk(cid:107)2r 2r = O((cid:107) N1 d d .. Nk N2 d (cid:107)2r 2r) where Nq iid∼ N(0, 1) . Then it follows that E(e t S dk (cid:107)x(cid:107)2 dk ) < Nk d ) And using the result stated in Achlioptas (2003), we complete the proof. Also note (cid:4) that this inequality is consistent with Lata(cid:32)la et al. (2006). 3.4.1 Choice of d We choose d > m(K)logn t2/K using union bound so that distance between n points are preserved. 3.4.2 Memory efficiency Under Kronecker decomposition, we need to store(cid:80)k (cid:81)k l=1 pldk, this is huge reduction even after considering trade-off in the probabilistic bound. l=1 pld elements as compared to original matrix 3.4.2.1 Tensor type data Another application can be for tensor data x ∈ Rp1×p2×..pk we can also obtain lower dimensional embedding which preserves distance with high probability. 
yd1×d2×..dk = (A(1) ⊗ ..A(l).. ⊗ A(k))T xp1×p2×..pk V ec(y)d×1 = (A(1) ⊗ ..A(l).. ⊗ A(k))T V ec(x)p×1 1(cid:113)(cid:81)k l=1 dl 1√ d 3.4.3 Variance reduction through averaging Since all random projection are non adaptive methods. In most of the literature, iit is recommended that we generate several independent RPs and take their average. This ensemble technique also provide reduction in variance. Let { ˜Ah}1≤h≤H be H independent copies of RP big Kronecker matrices. WE can define new 97 ensemble RP as f∗,H (x) = 1√ H H(cid:88) h=1 ˜Ahx 3.5 Simulation result Our first experiment evaluates the quality of the isometry for maps We generate n = 10 independent vectors x1; ..; xn of sizes d = 2500; 10000; 40000 .We consider the following three RPs: 1. Gaussian RP; 2. Sparse RP ; 3. Very Sparse RP . For each, we compare the performance of RP, KRP with order 4 and d1 = d2=d3=d4. We evaluate the methods by repeatedly generating a RP and computing the reduced vector, and plot the ratio of the pairwise distance 1. Method Gaussian RP KRP(4) 0.1409 (0.0015) 0.1431 (0.0016) Sparse 0.1407 (0.0013) 0.1431 (0.0015) Very Sparse 0.1412 (00.0014) 0.1520 (0.0033) Table 3.1: Average of total deviation of ratios of pairwise distance between projected and actual data from 1 . (Variability) 3.6 Future scope We believe that such bounds can be achieved for wider class of random variables following identity (cid:107)Y (cid:107)r r−1 for all r as shown in Schudy & Sviridenko (2012). r ≤ (r)M(cid:107)Y (cid:107)r−1 3.7 Introduction to Tensors With the advancement of information and engineering technology, modern days data science prob- lems often come with gigantic size and increased complexity. On popular technique of storing 98 these data is use of multi-dimensional arrays, which preserves the data anatomy and contain multi- dimensional spatial and spatio-temporal correlations. A tensor is a multi-dimensional or d-way array, which is a generalization of data matrix in a higher dimensional situation. More formally, according to Kolda & Bader (2009) and Hackbusch (2012), a d-way tensor is an element of the tensor space generated by the tensor products of d vector spaces. Similar to the traditional vector based machine learning literature, the learning of tensor can also be divided into supervised learning and unsupervised learning. Unsupervised tensor learning generally involves the tensor decomposition and feature selection. Some theoretical results and applications about unsupervised tensor learning can be found in Kolda & Bader (2009), X. He et al. (2006), Chi & Kolda (2012), De Lathauwer et al. (2000), and Lu et al. (2008). The framework of supervised tensor learning has been proposed by Tao et al. (2005), in which one can learn a tensor based rule from training data for classification and regression. The tensor regression problem has been widely studied. Such examples include Zhou et al. (2013), Wimalawarne et al. (2016), L. Li & Zhang (2017), Lock (2018), and Raskutti et al. (2019). As an indispensable part of the supervised tensor learning problem, however, the tensor type data classification is under developed. Research on tensor classification includes Zhou et al. (2013), Pan et al. (2018), Signoretto et al. (2011), L. He et al. (2014), Q. Li & Schonfeld (2014), and Tan et al. (2012). 3.7.1 Limitation for Gaussian assumptions There are several major deficiencies under the current development. Firstly, the assumption about data distribution. The Gaussian assumption for tensor type of data, e.g., Pan et al. 
(2018) may not be adequate in many applications. Since the probability theory for tensor data has not been well established, probabilistic discriminant such as LDA and Bayes classifier may have limitations from theoretical foundation perspective. Secondly, the distance based methods, e.g., Lu et al. (2008) and Q. Li & Schonfeld (2014). These methods train classifier based on tensor Frobenius norm, which is 99 not suitable for high dimensional tensor data. It is known that the high dimensional data has issues with L2 norm and Frobenius norm. For example Domingos (2012) wrote “ If we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere”. Beyer et al. (1999) showed that the difference between the maximum and minimum distances to a given query point does not increase as fast as the nearest distance to any point in high dimensional space. As data dimension increases, these distance based method may fail. Signoretto et al. (2011) and L. He et al. (2014) performed classification with kernels however, there is no theoretical results about classification error established. 3.8 Preliminaries 3.8.1 Mathematical Background for Tensor We first introduce standard tensor notation and operations (e.g. Kolda & Bader (2009)) that are used in this paper. Numbers and scalars are denoted by lowercase letters such as x, y. Vectors are denoted by boldface lowercase letters, e.g. a. Matrices are denoted by boldface capital letters, e.g. A, B. A higher-dimensional tensor is a generalization of vector and matrix representation for higher order data, which are denoted by boldface Euler script letters such as X, Y. Notations for vector and tensor spaces will be sepcified when necessary. The order of a tensor is the number of dimensions, also known as ways or modes. For example, a scalar can be regarded as a order zero tensor, a vector can be a order one tensor, and a matrix can be a order two tensor. In general, a tensor can have d modes as long as d is an integer. The way of indicating entries of tensors is same as we do for vectors and matrices. The i-th entry of a vector x is xi, the (i, j)-th element of a matrix X is xi,j, and the (i1, ..., id)-th element of a d-way tensor X is xi1,...,id. The indices of a tensor i1, ..., id range from 1 to their captial version, e.g. ik = 1, ...., Ik for every mode k = 1, ...d. 100 Sub-arrays of a tensor are formed when a subset of the indices are fixed. Similar to matrices that have rows and columns, high-dimensional tensors have various types of sub-arrays. For example, by fixing every index but one in a d-way tensor, we can get one of its fibers, which are analogue of matrix rows and columns. Another type of frequently used tensor sub-arrays is slice, which is a two dimensional section of a tensor. A slice of a tensor can be defined by fixing all but two indices. We will use x:i2...id to denote one fiber of a d-way tensor, and use X::i3...id to denote one of its slices. 3.9 Kernelized Support Tensor Machine 3.9.1 Framework of the Classification Problem The classification problem for tensor data is a problem of learning a tensor from the training data. Let T = {(X1, y1), ...., (Xn, yn)} be the training set, where Xi ∈ RI1×I2...×Id are d-mode tensors, yi are labels. 
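The following toy numpy snippet (ours; sizes chosen only for illustration) spells out the indexing conventions of Section 3.8.1 and the shape of the training set T just defined.

```python
# Toy illustration of the tensor notation in Section 3.8.1 and of the training set T.
import numpy as np

rng = np.random.default_rng(5)

X = np.arange(3 * 4 * 5).reshape(3, 4, 5)   # a d = 3 way tensor with I1, I2, I3 = 3, 4, 5
print(X.ndim)                               # the order (number of modes) is 3
print(X[0, 1, 2])                           # one entry x_{i1, i2, i3} (numpy indices start at 0)

fiber = X[:, 1, 2]                          # a fiber x_{:, i2, i3}: every index fixed but one
slice_ = X[:, :, 2]                         # a slice X_{:, :, i3}: every index fixed but two

# a toy training set of n labelled 3-mode tensors, as in T = {(X_1, y_1), ..., (X_n, y_n)}
n = 6
train = [(rng.standard_normal((3, 4, 5)), 1 if i % 2 == 0 else -1) for i in range(n)]
```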
If we assume the training risk of a classifier f ∈ X ∗ is ˆRn(f ) = 1 I(f (Xi) (cid:54)= yi), the n(cid:80) n i=1 problem will be looking for a ˆf ˆf = {f : ˆRn(f ) = min ˆRn(f ), f ∈ H∗} (3.1) To solve this problem, we need to search all functions in the functional tensor space, which is challenging. However, the problem will be simplified if the function is an universal reproducing kernel. This will let us find the solution only on the Hilbert space embedded by continuous functions. We shall prove this claim by presenting a represent theorem for kernelized support tensor machine. Further, we would be able to prove the consistency of our classifier. 101 3.9.2 Support Tensor Machine Generalizing the support vector machine to tensor version is quite straightforward. The kernelized support tensor machine classifier is sign(f (X)) The function f is the optimal of the following objective function min f λ||f||∗ + 1 n L(yi, f (Xi)) n(cid:88) i=1 (3.2) (3.3) where ||f||∗ is the norm of the functional tensor space H∗, and L is a measurable loss function defined on (X ∗ × Y ). X ∗ is an algebraic tensor space. The optimal function is a measureable function f : X ∗ → R, where X ∗ = RI1 ⊗ RI2... ⊗ RId. The representer theorem in support vector machine says the solution of support vector machine can be written as a linear combination of kernel functions. As a result, one only needs to learn the coefficients in the linear combination instead of learning the whole function. Similarly, we can also propose a representer theorem for this tensor learning problem. Theorem 3.9.1. (Tensor Representer Theorem) Let K(·,·) be a fixed kernel coming from the K(X ∗ ×X ∗, R), H∗ be the corresponding Reproducing Kernel Hilbert Space. Let L be an arbitrary loss function. If the optimization function (3.3) has optimal solutions, then all the solutions can be written in the following way: n(cid:88) ˆf (X) = ˆβiK(Xi, X) i=1 102 The proof of this theorem is attached in our appendix. As a direct benefit of this result, one just needs to estimate those parameters ˆβi in order to get a optimal solution. 3.9.3 STM with random projection We can choose K(j) = A(j) where A(j) is random projection matrix thus r(cid:88) d(cid:89) r(cid:88) d(cid:89) K = ( A(j)(x(j) k,1, x(j) l ), ..., A(j)(x(j) k,n, x(j) l ))T k,l=1 j=1 k,l=1 j=1 . r is the CP rank of the algebraic tensor space, x(j) ik 3.9.4 Solving the STM Now we start discussing the way of estimating our Support tensor machine from a group of training tensor and their corresponding labels. We are going to consider only the cumulant-based tensor kernel functions introduced in the previous section in this part. The situation of naive tensor kernels are straightforward, and one can use traditional SVM method to estimate with only a slight modification. According to the representer theorem and the definition of cumulant-based kernel function, we assume that the solution of support tensor machine has the following explicit form. n(cid:88) r(cid:88) d(cid:89) ˆf (X) = βi i=1 k,l=1 j=1 = KT β K(j)(x(j) k,i , x(j) l ) (3.4) where β = (β1, ..., βn)T and K = ( r(cid:80) k,l=1 (cid:81)d j=1 K(j)(x(j) k,1, x(j) l ), ..., r(cid:80) k,l=1 (cid:81)d j=1 K(j)(x(j) k,n, x(j) l ))T . r is the CP rank of the algebraic tensor space, x(j) ik and x(j) k are components of tensor CP decomposition of the corresponding training and testing tensors. If the data do not come in CP form, we may need to perform a CP decomposition at first. 
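As a hedged illustration of this step, the sketch below CP-decomposes a toy training set and assembles the tensor-kernel Gram matrix of Section 3.9.2. The ALS step is delegated to tensorly's parafac; the Gaussian (RBF) choice for the component kernels K^(j), and all names and sizes, are our assumptions rather than the authors' implementation.

```python
# Hedged sketch: CP-decompose each training tensor and build the kernel
# kappa(X, X') = sum_{k,l=1}^r prod_{j=1}^d K_j(x_k^(j), x'_l^(j)).
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(3)

def cp_factors(tensor, rank):
    """Columns of the CP factor matrices, estimated by ALS via tensorly's parafac."""
    _, factors = parafac(tl.tensor(tensor), rank=rank)
    return factors                          # list of d matrices, each I_j x rank

def rbf(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def tensor_kernel(fac_a, fac_b, rank):
    """kappa(X, X') computed from the CP factors of the two tensors."""
    val = 0.0
    for k in range(rank):
        for l in range(rank):
            prod = 1.0
            for A, B in zip(fac_a, fac_b):  # product over the d modes
                prod *= rbf(A[:, k], B[:, l])
            val += prod
    return val

# toy training set: n random 3-way tensors, each decomposed at rank r
n, r, shape = 8, 2, (6, 5, 4)
train = [rng.standard_normal(shape) for _ in range(n)]
fac = [cp_factors(X, r) for X in train]

K = np.array([[tensor_kernel(fac[i], fac[j], r) for j in range(n)] for i in range(n)])
print(K.shape)                              # (n, n) Gram matrix used in (3.5)
```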
Plugging the solution (3.4) into the objective function (3.3), we get
$$ \min_{\beta}\ \lambda \beta^T K \beta + \frac{1}{n}\sum_{i=1}^{n} L\big(y_i,\ K(:, i)^T \beta\big) \tag{3.5} $$
where K(:, i) is the i-th column of the matrix K = [K_1, ..., K_n]. This problem can be solved directly with gradient descent, since all quantities except the primal vector β can be evaluated from the training set. The derivative with respect to β is
$$ 2\lambda K\beta + \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L}{\partial \beta}. \tag{3.6} $$
Setting equation (3.6) equal to zero and solving yields the estimate. In our application, we use the squared hinge loss L(y, X) = [max(0, 1 − y f(X))]². The training and prediction algorithms are described below as algorithm ?? and algorithm ??, respectively. In algorithm ??, the complexity of the training process is O(n² r² Σ_{j=1}^{d} I_j), which, under the low-rank assumption, is much smaller than the complexity O(n² Π_{j=1}^{d} I_j) of any vectorized method. In addition, we suggest saving the decomposed list of the training data when handling high-dimensional data in algorithm ??. The decomposed list is much smaller than the original data, which saves memory; keeping it instead of the original data also makes prediction faster, since one does not need to repeat the decomposition.

3.9.5 Estimation with Complete Tensor Data

As mentioned above, a CP decomposition is required to prepare the data for our algorithm. We can estimate the components of a rank-r tensor with this procedure and feed the estimates to our model. The estimated tensor decomposition is a unique approximation of the original data in most situations. Regarding this uniqueness, we take the proposition from Kolda & Bader (2009):

Proposition 1. Let X be a d-way tensor with rank r that can be expressed as X = Σ_{k=1}^{r} x_k^{(1)} ⊗ x_k^{(2)} ⊗ ... ⊗ x_k^{(d)} = [X^{(1)}, X^{(2)}, ..., X^{(d)}], where the columns of each X^{(j)} ∈ R^{I_j × r} are x_k^{(j)}, k = 1, ..., r; j = 1, ..., d. The CP decomposition of this tensor is unique if
$$ \sum_{j=1}^{d} R\big(X^{(j)}\big) \ \ge\ 2r + d - 1 \tag{3.7} $$
where R(X^{(j)}) is the column rank of the matrix X^{(j)}.

As a result, the probability of misclassification is identical, i.e.
$$ P(\hat{y} \ne y \mid X) = P(\hat{y} \ne y \mid \hat{X}), \tag{3.8} $$
where $\hat{X}$ is the tensor with the estimated CP decomposition. For more details on the uniqueness of the decomposition, we refer to Kolda & Bader (2009) and Sidiropoulos & Bro (2000). In our application, we use Alternating Least Squares (ALS) to estimate the decomposition.

The second issue about our model is the rank assumption. We assume all tensor data ideally come from the same tensor space with rank r. In most situations we do not know the rank r a priori; we prescribe the rank r when pre-processing the data for training, by finding the one that provides the best approximation in the tensor decomposition.

3.10 Statistical Property of STM

In the last part of our theoretical results, we want to highlight the performance and generalization ability of our classifier for tensors. In the general evaluation of a decision rule, one is interested in bounding its classification risk. For example, assume our data for classification are {(x, y) ∈ X* × Y}. For a decision rule f on data generated from X* × Y, the classification risk of this rule is defined as
$$ R(f) = \int_{X^* \times Y} \Pr\big(f(x) \ne y \mid x\big)\, \mu(dx)\, \mu(dy), \tag{3.9} $$
where µ are measures defined on X* and Y. The lower bound of this risk is the Bayes risk, which we denote by R*.
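Before turning to the risk analysis, here is a minimal sketch of the gradient-descent solver for (3.5)–(3.6) with the squared hinge loss described in Section 3.9.4. The Gram matrix K is assumed precomputed (for example as in the earlier kernel sketch); the step size, iteration count and toy data are our choices, not the authors' settings.

```python
# Minimal sketch of the gradient-descent solver for (3.5) with the squared hinge loss
# L(y, X) = [max(0, 1 - y f(X))]^2, where f(X_i) = K(:, i)^T beta.
import numpy as np

def fit_stm(K, y, lam=0.1, lr=0.01, n_iter=500):
    n = K.shape[0]
    beta = np.zeros(n)
    for _ in range(n_iter):
        margins = y * (K @ beta)                  # y_i * f(X_i)
        slack = np.maximum(0.0, 1.0 - margins)    # active part of the squared hinge loss
        grad = 2 * lam * (K @ beta) - (2.0 / n) * (K @ (y * slack))   # equation (3.6)
        beta -= lr * grad
    return beta

def predict(K_test_train, beta):
    """Labels for test points given their kernel values against the training set."""
    return np.sign(K_test_train @ beta)

# toy usage with a random positive semi-definite Gram matrix and random labels
rng = np.random.default_rng(4)
B = rng.standard_normal((10, 10))
K = B @ B.T
y = np.where(rng.standard_normal(10) > 0, 1.0, -1.0)
beta = fit_stm(K, y)
print(np.mean(predict(K, beta) == y))             # training accuracy of the sketch
```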
For simplicity, we consider the binary classification case where Y = {−1, 1}. In addition, we assume there is no noise in the problem such that ∀X ∈ X ∗, Pr(f (X = 1)|X) (cid:54)= Pr(f (X = −1)|X) (3.10) In other words, we will not consider the situation where posterior distribution does not provide a decision and one can only guess randomly. However, a classification rule learned directly from a given training set usually will not be able to reach the Bayes risk. In fact, we can only estimate the empirical risk for a rule when an observation with length n is given, which is n(cid:88) i=1 ˆR(f ) = 1 n I{f (xi) (cid:54)= yi} (3.11) {(xi, yi), i = 1, ..., n} is our training set. We always try to minimize this empirical risk with some methods, and find out the optimal classifier ˆf for this risk. This is what we called Empirical Risk Minimization(ERM) in general statistical learning problems. The evaluation of the method we followed to find ˆf depends on the bound δn = |R∗ − R( ˆf )| (3.12) If this quantity δ converges to zero as the sample size increase, then the classifier is consistent and 106 the method we follow is statistically sound. This quantity can be bounded by δn (cid:54) |R∗ − R∗(f )| + |R∗(f ) − R( ˆf )| (3.13) where R∗(f ) = inf f∈H R(f ) is the minimal risk of a collection of classifiers H. According to the result from Steinwart & Christmann (2008), we have the following theorem: Theorem 3.10.1. If K is a universal kernel on a compact subset of tensor space X ∗. The loss function L is Lipschitz continuous. Then R∗ = R∗(f ). One can see Steinwart & Christmann (2008) for the proof of this theorem. The intuition of this theorem is pretty simple since we have shown the universal approximation property of tensor based kernel functions. This result will always hold in the classification problems since the loss functions such as hinge loss are always Lipschitz continuous. The consistency problem turns out to be finding bounds for |R∗(f ) − R( ˆf )|. Our next result shows the convergence of this part. Theorem 3.10.2. Let V be a compact subspace of the algebraic tensor space X ∗. Let K be a cumulant-based kernel function for tensor that is universal on the V, and |K|∞ (cid:54) 1. If we assume the tensor space has dimension p = I1 × I2 × ... × Id < ∞. Then for all Borel probaility measure Pr on (X ∗ × Y) satisfying Pr(f (X = 1)|X) (cid:54)= Pr(f (X = −1)|X), we have R( ˆfn) → R∗(f ) n → ∞ (3.14) in probability. 107 APPENDIX 108 a(2) i2,j2 ....a(l) il,jl ..a(k) ik,jk = ˜ai,j (cid:107)(< ˜AK i , x >)(cid:107)2 j=1 x2 j (˜ai,j)2 2 = E(cid:80)P + (cid:80)P (cid:80)P Proof. Define a(1) i1,j1 j=1 xjx(cid:48) j(cid:48)=1 j(cid:54)=j(cid:48) j ˜ai,j ˜ai,j(cid:48) APPENDIX Main lemmas Lemma 3.10.3. E(< ˜AK i , x >) = 0 for any row i Lemma 3.10.4. (cid:107)(< ˜AK i , x >)(cid:107)2 = (cid:107)x(cid:107)2 for any row i Second term is zero due to conditional expectation property. First we try to estimate the r th moment of S2 d(x) = xT ˜A(K),T ˜A(K)x dK = (cid:80)d i=1 (cid:107) ˜A(K) i x(cid:107)2 2 dK (cid:4) in terms 1 = ( ˜A(K) Y 2 1 x)2, since {Yi}dk i=1 are exchangeable, stationary, weak ρ mixing sequence. Unfortunately, Lemma 3.10.5. Rosenthal Inequality for rho mixing sequences For distribution following sparse distribution P. Li et al. (2006) Assume E(Y ) = 0 and (cid:107)Y (cid:107)2r 2r < ∞ then there exists a positive constant C = Cr,ρ such that E max 1≤i≤dk (|Sdk|)2r ≤ C(dr(cid:107)x(cid:107)2r 2 + d(2r − 1)rK(cid:107)x(cid:107)2r 2 ) Proof. Q.-M. 
Shao (1995) theorem 1, using ρ = 0 and (cid:107)Y1(cid:107)2 using result that (cid:107)Y1(cid:107)2r 2r ≤ (2r − 1)rK(cid:107)Y1(cid:107)2r 2 (cid:107) 2 = (cid:107)x(cid:107)2 2 and exchangeability of Yi now (cid:4) Lemma 3.10.6. Hypercontractivity for Gaussian distribution Suppose Y is k degree Gaussian polynomial chaos in some Gaussian Hilbert space, then (cid:107)Sd(cid:107)2r 2r ≤ (2r − 1)Kr(cid:107)Sd(cid:107)2r 2 = (r − 1)K/2d(cid:107)Y (cid:107)2 2 = (cid:107)x(cid:107)2 2 Proof. Janson et al. (1997) theorem 5.10 and theorem 6.7 (cid:4) 109 Lemma 3.10.7. Hypercontractivity for Gaussian distribution Suppose Y is k degree Gaussian polynomial chaos in some Gaussian Hilbert space, then 2r ≤ (2r − 1)Kr(cid:107)Sdk(cid:107)2r 2 = (r − 1)K/2d(cid:107)Y (cid:107)2 2 = (cid:107)x(cid:107)2 (cid:107)Sdk(cid:107)2r 2 Proof. Janson et al. (1997) theorem 5.10 and theorem 6.7 (cid:4) Lemma 3.10.8. Hypercontractivity for very sparse distribution (cid:107)Sdk(cid:107)r ≤ Cr,K(cid:107)Sdk(cid:107)2 Proof. Janson et al. (1997) lemma 5.2 proves hypercontractivity for Raedmacher variable Z. Ob- serve that for any constant H , same will hold for Z(cid:48) = HZ. Define a new random variable V , a ∼ V Z(cid:48) any two point independent of Z(cid:48)such that P r(V = d1/4) = 1√ and P r(V = 0) = 1 − 1√ d d distribution is hyper contractive as well Janson et al. (1997) lemma 5.2. So, due to independence their product V Z(cid:48) is hyper contractive (cid:4) 110 BIBLIOGRAPHY 111 BIBLIOGRAPHY Achlioptas, D. (2003). Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66 (4), 671–687. Adamczak, R., et al. (2015). A note on the hanson-wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20 . Avants, B. B., Tustison, N. J., Song, G., Cook, P. A., Klein, A., & Gee, J. C. (2011). A reproducible evaluation of ants similarity metric performance in brain image registration. Neuroimage, 54 (3), 2033–2044. Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor” meaningful? In International conference on database theory (pp. 217–235). Bickel, P. J., Levina, E., et al. (2004). Some theory for fisher’s linear discriminant func- tion,naive bayes’, and some alternatives when there are many more variables than obser- vations. Bernoulli, 10 (6), 989–1010. Cai, T., & Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106 (496), 1566–1577. Chi, E. C., & Kolda, T. G. (2012). On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33 (4), 1272–1299. Chu, T., Zhu, J., Wang, H., et al. (2011). Penalized maximum likelihood estimation and variable selection in geostatistics. The Annals of Statistics, 39 (5), 2607–2625. Cox, D. D., & Savoy, R. L. (2003). Functional magnetic resonance imaging (fmri)“brain reading”: detecting and classifying distributed patterns of fmri activity in human visual cortex. Neuroimage, 19 (2), 261–270. Cressie, N., & Huang, H.-C. (1999). Classes of nonseparable, spatio-temporal stationary covariance functions. Journal of the American Statistical Association, 94 (448), 1330– 1339. Cressie, N., & Lahiri, S. N. (1996). Asymptotics for reml estimation of spatial covariance parameters. Journal of Statistical Planning and Inference, 50 (3), 327–341. De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). 
A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21 (4), 1253–1278. Domingos, P. M. (2012). A few useful things to know about machine learning. Commun. acm, 55 (10), 78–87. Doshi, J., Erus, G., Ou, Y., Gaonkar, B., & Davatzikos, C. (2013). Multi-atlas skull-stripping. Academic radiology, 20 (12), 1566–1576. 112 Fan, J., & Fan, Y. (2008). High dimensional classification using features annealed indepen- dence rules. Annals of statistics, 36 (6), 2605. Fan, J., Feng, Y., & Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74 (4), 745–771. Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96 (456), 1348–1360. Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20 (1), 101. Furrer, R., Bachoc, F., & Du, J. (2016). Asymptotic properties of multivariate tapering for estimation and prediction. Journal of Multivariate Analysis, 149 , 177–191. Furrer, R., Genton, M. G., & Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15 (3), 502–523. Ghosh, D. (2001). Singular value decomposition regression models for classification of tumors from microarray experiments. In Biocomputing 2002 (pp. 18–29). World Scientific. Gutman, B. A., Hua, X., Rajagopalan, P., Chou, Y.-Y., Wang, Y., Yanovsky, I., . . . others (2013). Maximizing power to track alzheimer’s disease and mci progression by lda-based weighting of longitudinal ventricular surface features. Neuroimage, 70 , 386–401. Hackbusch, W. (2012). Tensor spaces and numerical tensor calculus (Vol. 42). Springer Science & Business Media. He, L., Kong, X., Yu, P. S., Yang, X., Ragin, A. B., & Hao, Z. (2014). Dusk: A dual structure- In preserving kernel for supervised tensor learning with applications to neuroimages. Proceedings of the 2014 siam international conference on data mining (pp. 127–135). He, X., Cai, D., & Niyogi, P. (2006). Tensor subspace analysis. In Advances in neural information processing systems (pp. 499–506). Hiai, F., & Lin, M. (2017). On an eigenvalue inequality involving the hadamard product. Linear Algebra and its Applications, 515 , 313–320. Janson, S., et al. (1997). Gaussian hilbert spaces (Vol. 129). Cambridge university press. Kaufman, C. G., Schervish, M. J., & Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103 (484), 1545–1555. Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM review , 51 (3), 455–500. Kwon, S., & Kim, Y. (2012). Large sample properties of the scad-penalized maximum likelihood estimation on high dimensions. Statistica Sinica, 629–653. 113 Lata(cid:32)la, R., et al. (2006). Estimates of moments and tails of gaussian chaoses. The Annals of Probability, 34 (6), 2315–2331. Ledoux, M. (1999). Concentration of measure and logarithmic sobolev inequalities. In Seminaire de probabilites xxxiii (pp. 120–216). Springer. Li, L., & Zhang, X. (2017). Parsimonious tensor response regression. Journal of the American Statistical Association, 112 (519), 1131–1146. Li, P., Hastie, T. J., & Church, K. W. (2006). Very sparse random projections. 
In Proceedings of the 12th acm sigkdd international conference on knowledge discovery and data mining (pp. 287–296). Li, Q., & Schonfeld, D. (2014). Multilinear discriminant analysis for higher-order tensor data classification. IEEE transactions on pattern analysis and machine intelligence, 36 (12), 2524–2537. Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 13–22. Lock, E. F. (2018). Tensor-on-tensor regression. Journal of Computational and Graphical Statistics, 27 (3), 638–647. Lu, H., Plataniotis, K. N., & Venetsanopoulos, A. N. (2008). Mpca: Multilinear principal IEEE transactions on Neural Networks, 19 (1), component analysis of tensor objects. 18–39. Mai, Q., Zou, H., & Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99 (1), 29–42. Mat´ern, B. (1960). Spatial variation, volume 36 of. Lecture Notes in Statistics. Mourao-Miranda, J., Bokde, A. L., Born, C., Hampel, H., & Stetter, M. (2005). Classify- ing brain states and determining the discriminating activation patterns: support vector machine on functional mri data. NeuroImage, 28 (4), 980–995. Muschelli, J., Gherman, A., Fortin, J.-P., Avants, B., Whitcher, B., Clayden, J. D., . . . Crainiceanu, C. M. (2018). Neuroconductor: an r platform for medical imaging analysis. Biostatistics. Pan, Y., Mai, Q., & Zhang, X. (2018). Covariate-adjusted tensor classification in high dimensions. Journal of the American Statistical Association, 1–15. Petersen, R. C., Aisen, P., Beckett, L. A., Donohue, M., Gamst, A., Harvey, D. J., . . . others (2010). Alzheimer’s disease neuroimaging initiative (adni): clinical characterization. Neurology, 74 (3), 201–209. Popescu, V., Battaglini, M., Hoogstrate, W., Verfaillie, S. C., Sluimer, I., van Schijndel, R. A., . . . others (2012). Optimizing parameter choice for fsl-brain extraction tool (bet) on 3d t1 images in multiple sclerosis. Neuroimage, 61 (4), 1484–1494. 114 Raskutti, G., Yuan, M., Chen, H., et al. (2019). Convex regularization for high-dimensional multiresponse tensor regression. The Annals of Statistics, 47 (3), 1554–1584. Rathi, V., & Palani, S. (2012). Brain tumor mri image classification with feature selection and extraction using linear discriminant analysis. arXiv preprint arXiv:1208.2128 . Rohlfing, T., Zahr, N. M., Sullivan, E. V., & Pfefferbaum, A. (2008). The sri24 multichannel brain atlas: construction and applications. In Medical imaging 2008: Image processing (Vol. 6914, p. 691409). Samson, P.-M., et al. (2000). Concentration of measure inequalities for markov chains and φ-mixing processes. The Annals of Probability, 28 (1), 416–461. Schudy, W., & Sviridenko, M. (2012). Concentration and moment inequalities for polynomi- als of independent random variables. In Proceedings of the twenty-third annual acm-siam symposium on discrete algorithms (pp. 437–446). Shao, J., Wang, Y., Deng, X., Wang, S., et al. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of statistics, 39 (2), 1241–1265. Shao, Q.-M. (1995). Maximal inequalities for partial sums of ρ-mixing sequences. The Annals of Probability, 948–965. Sidiropoulos, N. D., & Bro, R. (2000). On the uniqueness of multilinear decomposition of n-way arrays. Journal of chemometrics, 14 (3), 229–239. Signoretto, M., De Lathauwer, L., & Suykens, J. A. (2011). A kernel-based framework to tensorial data analysis. Neural networks, 24 (8), 861–874. Steinwart, I., & Christmann, A. (2008). 
Support vector machines. Springer Science & Business Media. Sun, Y., Guo, Y., Tropp, J. A., & Udell, M. (2018). Tensor random projection for low memory dimension reduction. In Neurips workshop on relational representation learning. Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., & Zhuang, Y. (2012). Logistic tensor re- gression for classification. In International conference on intelligent science and intelligent data engineering (pp. 573–581). Tao, D., Li, X., Hu, W., Maybank, S., & Wu, X. (2005). Supervised tensor learning. In Data mining, fifth ieee international conference on (pp. 8–pp). Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99 (10), 6567–6572. Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2003). Class prediction by nearest shrunken centroids, with applications to dna microarrays. Statistical Science, 104–117. 115 Varah, J. M. (1975). A lower bound for the smallest singular value of a matrix. Linear Algebra and its Applications, 11 (1), 3–5. Wang, Y., Fan, Y., Bhatt, P., & Davatzikos, C. (2010). High-dimensional pattern regression using machine learning: from medical images to continuous clinical variables. Neuroimage, 50 (4), 1519–1535. Wimalawarne, K., Tomioka, R., & Sugiyama, M. (2016). Theoretical and experimental analyses of tensor-based regression and classification. Neural computation, 28 (4), 686– 715. Witten, D. M., & Tibshirani, R. (2011). Penalized classification using fisher’s linear discrim- inant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (5), 753–772. Wolkowicz, H., & Styan, G. P. (1980). Bounds for eigenvalues using traces. Linear algebra and its applications, 29 , 471–506. Yasui, Y., & Lele, S. (1997). A regression method for spatial disease rates: an estimating function approach. Journal of the American Statistical Association, 92 (437), 21–32. Yingjie, L., & Maiti, T. (2019). High dimensional discriminant analysis for spatially depen- dent data. In Progress. Yu, H., & Yang, J. (2001). A direct lda algorithm for high-dimensional data—with applica- tion to face recognition. Pattern recognition, 34 (10), 2067–2070. Yushkevich, P. A., Gao, Y., & Gerig, G. (2016). Itk-snap: An interactive tool for semi- automatic segmentation of multi-modality biomedical images. In Engineering in medicine and biology society (embc), 2016 ieee 38th annual international conference of the (pp. 3342–3345). Zhou, H., Li, L., & Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108 (502), 540–552. Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of statistics, 36 (4), 1509. 116