STATISTICAL ANALYSIS FOR NETWORK-BASED MODELS WITH APPLICATIONS TO GENETIC ASSOCIATION AND PREDICTION

By

Xiaoxi Shen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics — Doctor of Philosophy

2019

ABSTRACT

STATISTICAL ANALYSIS FOR NETWORK-BASED MODELS WITH APPLICATIONS TO GENETIC ASSOCIATION AND PREDICTION

By

Xiaoxi Shen

Network-based models are popular in statistical applications. The main advantage of using a network-based model is that its semantics and properties can be understood and evaluated rather straightforwardly. In this dissertation, we study the statistical properties of several network-based models and apply these models to genetic association studies and genetic risk prediction.

In Chapter 2, we propose a conditional autoregressive (CAR) model to account for possible heterogeneous genetic effects among individuals. In the proposed method, the genetic effect is considered as a random effect and a score test is developed to test the variance component of the genetic random effect. Through simulations, we compare the type I error and power performance of the proposed method with those of the generalized genetic random field (GGRF) and the sequence kernel association test (SKAT) methods under different disease scenarios. We find that our method outperforms the other two methods when (i) the rare variants have the major contribution to the disease, or (ii) the genetic effects vary in different individuals or subgroups of individuals. Finally, we illustrate the new method by applying it to the whole genome sequencing data from the Alzheimer's Disease Neuroimaging Initiative.

A kernel-based neural network (KNN) method is proposed in Chapter 3 for genetic risk prediction. KNN inherits the high-dimensional feature from classical kernel methods and the non-linear and non-additive features from neural networks. KNN summarizes a large number of genetic variants into kernel matrices and uses the kernel matrices as input matrices. Based on these kernel matrices, KNN builds a feedforward neural network to model the complex relationship between genetic variants and a disease outcome. Minimum norm quadratic unbiased estimation (MINQUE) is implemented in KNN to make parameter estimation feasible. Through theoretical proof and simulations, we demonstrate that KNN can attain lower average prediction error than linear mixed models (LMM). Finally, we illustrate KNN by an application to the sequencing data from the Alzheimer's Disease Neuroimaging Initiative.

Nowadays, neural networks are widely applied in machine learning and artificial intelligence. However, viewed as statistical models, neural networks have received relatively little research attention regarding their statistical properties, and these properties are studied in Chapter 4. A neural network can be viewed as a nonlinear regression model. However, if we consider it parametrically, the unidentifiability of the parameters makes it difficult to derive its asymptotic properties. Instead, we consider the estimation problem as a nonparametric regression problem and use results from sieve estimation to establish the consistency, rates of convergence and asymptotic normality of the neural network estimators. We also illustrate the validity of the theories via simulations.

I dedicate this dissertation to my parents, Xinmei Dai and Xiong Shen, and my friends, Chang, Carlos, Fengkan, Julia, Shunjie and many others.

ACKNOWLEDGMENTS

Here, I would like to express my deepest gratitude to my advisors Dr. Qing Lu and Dr.
Yuehua Cui for their support and guidance throughout my studies and research. Dr. Lu and Dr. Cui are extremely kind and knowledgeable. They always provide valuable insights and suggestions on how to improve my work. I would also like to extend my sincere thanks to my dissertation committee members, Dr. Lyudmila Sakhanenko and Dr. Yimin Xiao. Their comments and suggestions have been beneficial to my studies.

I am also grateful for the help I received from all the professors in the Department of Statistics and Probability. During my Ph.D. studies, I took most of the courses offered by our department. From each course, I learned various techniques to solve different research problems. For example, I learned how to consider a problem in a Bayesian way and how to interpret Bayesian results from the course I took with Dr. Tapabrata Maiti.

During my seven years at Michigan State University, I made lots of friends, and because of them I never felt lonely in these years. Many thanks to my senpais: Dr. Honglang Wang, Dr. Tao He, Dr. Bin Gao, Dr. Shunjie Guan and Dr. Jingyi Zhang. They have become successful faculty members and statisticians in big companies. My thanks also go to my friends: Chang Jiang, Shan Zhang, Xiaoran Tong, Kaixu Yang, Yuning Hao, Jinghang Lin, Yuan Zhou and Di Wu. Without your help, I could not be who I am now, and I sincerely wish all of you a wonderful future.

Last but not least, I would like to express my sincere thanks to my parents for their support and confidence in me. I would like to treat this dissertation as my unique gift to both of you.

TABLE OF CONTENTS

LIST OF TABLES  viii
LIST OF FIGURES  x
Chapter 1  Introduction  1
  1.1 Overview  1
  1.2 Linear Mixed Models for Genetic Association Studies  3
  1.3 Statistical Learning  5
    1.3.1 Gaussian Graphical Models  5
    1.3.2 Kernel Methods  6
    1.3.3 Neural Networks  10
  1.4 Objective and Organization  13
Chapter 2  A Conditional Autoregressive Model for Genetic Association Analysis Accounting for Genetic Heterogeneity  15
  2.1 Introduction  15
  2.2 Methods  17
    2.2.1 Motivation  17
    2.2.2 Model Setup  22
    2.2.3 Score Test for Variance Component  25
      2.2.3.1 γ As a Fixed Constant  26
      2.2.3.2 γ As an Unknown Nuisance Parameter  28
  2.3 Simulation Results  30
    2.3.1 Simulation I: Heterogeneous Genetic Effects Among Individuals or Subgroups  33
    2.3.2 Simulation II: Various Causal SNV Rates  37
    2.3.3 Simulation III: Misspecification of Weights  38
  2.4 Real Data Applications  44
  2.5 Discussion  48
Chapter 3  A Kernel-Based Neural Network for High-dimensional Genetic Risk Prediction Analysis  50
  3.1 Introduction  50
  3.2 Methodologies  52
    3.2.1 Quadratic Estimators for Variance Components  58
    3.2.2 MINQUE in KNN  61
  3.3 Predictions  66
  3.4 Including Fixed Effects  72
  3.5 Simulations  76
    3.5.1 Nonlinear Random Effect  76
    3.5.2 Nonadditive Effects  78
    3.5.3 Non-normal Error Distributions  80
  3.6 Real Data Application  82
  3.7 Discussion  85
Chapter 4  Asymptotic Properties of Neural Network Sieve Estimators  87
  4.1 Introduction  87
  4.2 Existence  92
  4.3 Consistency  96
  4.4 Rate of Convergence  108
  4.5 Asymptotic Normality  115
  4.6 Simulation Studies  131
    4.6.1 Parameter Inconsistency  132
    4.6.2 Consistency for Neural Network Sieve Estimators  134
    4.6.3 Asymptotic Normality for Neural Network Sieve Estimators  136
  4.7 Discussion  142
Chapter 5  Epilogue  144
APPENDICES  147
  Appendix A Technical Details and Supplementary Materials for Chapter 2  148
  Appendix B Technical Details and Supplementary Materials for Chapter 3  155
  Appendix C Technical Details and Supplementary Materials for Chapter 4  167
BIBLIOGRAPHY  171

LIST OF TABLES

Table 2.1: Empirical type I error rates under different weights at level α = 0.05 and α = 0.01 based on 1000 replicates. Each cell in the table contains the empirical type I error rate. GGRF, SKAT, CAR.FIX and CAR.SUP are the generalized genetic random field model of Li et al. (2014) [41], the sequence kernel association test of Wu et al. (2011) [78], the conditional autoregressive model with fixed nuisance parameter γ, and the conditional autoregressive model with maximum score test statistic, respectively.

Table 2.2: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates.
In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0005, 0.002, 0.001, 0.003, 0.0001, 0.0005, 0.002 for BETA and WSS; with σZ = 0.01, 0.05, 0.2, 0.1, 0.3, 0.01, 0.05, 0.2 for UW and LOG. The number in the parenthesis is the standard deviation. . . . Table 2.3: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates. In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0001, 0.0002, 0.005, 0.0003, 0.0005, 0.0001, 0.005 for BETA and WSS; with σZ = 0.01, 0.01, 0.02, 0.5, 0.03, 0.05, 0.01, 0.5 for UW and LOG. The number in the parenthesis is the standard deviation. . 33 36 37 Table 2.4: Top 10 hippocampus-associated genes detected by GGRF, SKAT and CAR. 47 Table 3.1: Average mean squared prediction error of KNN with product output kernel matrix, KNN with output kernel matrix as polynomial of order 2 and the BLUP based on LMM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Table 4.1: Comparison of the true parameters and the estimated parameters in a single-layer neural network with 2 hidden units. . . . . . . . . . . . . . . . 132 Table 4.2: Comparison of errors (cid:107) ˆfn − f0(cid:107)2 n and the least square errors Qn( ˆfn) after 20,000 iterations under different sample sizes. . . . . . . . . . . . . . . . . 136 Table 4.3: p-values for Shapiro-Wilks test and Kolmogorov-Smirnov test for nomality test. We use “NN" to denote the true function as a neural network described in (4.13); “TRI" to denote the true function as a trigonometric function described in (4.14) and “ND" to denote the true function as a continuous function having a non-differential point described in (4.15) . . . . . . . . . 138 Table A.1: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with En- torhinal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 viii Table A.2: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Ventricle.153 Table A.3: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Whole Brain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 ix LIST OF FIGURES Figure 1.1: Diagram of a neuron with main components (Figure from Wiki). . . . . . 10 Figure 1.2: A perceptron that mimics the structure of a neuron in a human brain. x1, . . . , xp are input features and x0 is a bias unit or intercept. Σ is known as a computation unit, where a linear combination of the features, including the bias unit and the parameters θ to be learned is calculated and then an activation function is applied. In this example, a standard sigmoid activation function σ(x) = (1 + e−x)−1 is applied. . . . . . . . . Figure 1.3: A neural network with one hidden layer. x1, . . . , xp are input features and x0 is a bias unit or intercept. Units in the dashed rectangle are known as hidden computation units, where a linear combination of features from previous layer and associated parameters is calculated and an activation function is applied to obtain outputs for these hidden units. Σ at the end is an output computation unit, in which a linear combination of the output features from the last hidden layers and associated weights is calcuated and an acitivation function is applied. . . . . . . . . . . . . . . . . . . . Figure 2.1: An illustration of a Gaussian graphical model. For the CAR model, the graph represents a network of the genetic effects among individuals. 
In the graph, each node stands for the gentic effect of an indiviudal and the ex- istence of an edge between two individuals indicates dependence between genetic effects between these two individuals. This is characterized by the precision matrix in the joint normal distribution of these genetic effects. . Figure 2.2: The Distribution of minor allele frequency of in sequencing variants on . . . . . . . . . . . . . . chromosome 17 from the 1,000 Genome project. Figure 2.3: Empirical Power Comparison of CAR, SKAT and GGRF by varying the levels of genetic heterogeneity among individuals under four different weights. The x-axis is the effect size σZ, used in the simulation. The effect sizes are chosen as 0.01, 0.05, 0.075, 0.1, 0.3, 0.5 for the UW and LOG weights, and the effect sizes are set as 0.0005, 0.00075, 0.001, 0.0015, 0.002 for the BETA and WSS weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 2.4: Comparison of empirical power of CAR, SKAT and GGRF with different causal SNV rates under the BETA weight and the WSS weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.0005, 0.001, 0.002, 0.005, respectively in each column from left to right. 11 12 20 32 35 39 x Figure 2.5: Comparison of empirical power of CAR, SKAT and GGRF with dif- ferent causal SNV rates under the UW weight and the LOG weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.05, 0.1, 0.2, 0.5 respectively in each column from left to right. . . . 40 Figure 2.6: Comparison of empirical power of CAR, SKAT and GGRF with misspec- ified weights when the true weight is BETA and WSS. In each column from left to right, we used BETA, LOG, UW and WSS weight, respectively. 42 Figure 2.7: Comparison of empirical power of CAR, SKAT and GGRF with misspec- ified weights when the true weight is UW and LOG. In each column from . left to right, we used BETA, LOG, UW and WSS weight, respectively. Figure 2.8: Histograms of the Phenotypes used for Real Data Analyses . . . . . . . . Figure 3.1: An illustration of the hierarchical structure of the kernel neural network model with L input kernel matrices K1(G), . . . , KL(G) and J hidden kernel matrices K1(U ), . . . , KJ (U ). The output layer is the prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . for the random effect f. Figure 3.2: The intuition under the assumption ˜σ2 R ≤ ˜τ ξ in Proposition 3.3.2. . . . . 43 45 59 70 Figure 3.3: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors. The left panel shows the results when a linear function is used and the right panel shows the results when a sine function is used. In the horizontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. 77 Figure 3.4: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model fo- cusing on the interaction effect. The vertical axis is scaled to 0-5 by removing some outliers to make the comparison visually clear. 
In the hor- izontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. . . . . . . . . . . . . . . 79 xi Figure 3.5: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model us- ing dominant coding (left figure) and recessive coding (right figure) for SNPs. In the horizontal axis, “1" corresponds to the LMM; “2" corre- sponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. . . 81 Figure 3.6: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model using t-distribution for error (left figure) and centered χ2 1-distribution for error (right figure). In the horizontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. 82 Figure 4.1: Comparison of the true function and fitted function under the simulation model (4.12). The black curve is the true function defined in (4.13) and the blue dashed curve is the fitted curve obtained after fitting the neural network model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Figure 4.2: Figures on comparison of the true function and the fitted function used in simulations. The top panel shows the scenario when the true function is a single layer neural network; the middle panel shows the scenario when the true function is a sine function and the bottom panel show the scenario when the true function is a continuous function having a non-differentiable point. As we can see from all the cases, the fitted curve becomes closer to the truth as the sample size increases. . . . . . . . . . . . . . . . . . . . 137 under 200 iterations and various sample sizes. The true function f0 is a single-layer neural network with 2 hidden units as defined in (4.13). . . . . . . . . . . . . . 139 i=1 Figure 4.3: Normal Q-Q plot for n−1/2(cid:80)n Figure 4.4: Normal Q-Q plot for n−1/2(cid:80)n Figure 4.5: Normal Q-Q plot for n−1/2(cid:80)n (cid:105) (cid:104) ˆfn(xi) − f0(xi) (cid:104) ˆfn(xi) − f0(xi) (cid:105) (cid:104) ˆfn(xi) − f0(xi) (cid:105) under 200 iterations and various sample sizes. The true function f0 is a trigonometric function as defined in (4.14). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 i=1 under 200 iterations and various sample sizes. The true function f0 is a continuous function having a non-differential point as defined in (4.15). . . . . . . . . . . . . 141 i=1 xii Figure B.1: The boxplots for linear mixed models (LMM) and kernel neural network (KNN) in terms of prediction errors. The left panel shows the results when an inverse logistic function is used and the right panel shows the results when a polynomial function of order 2 is used. 
In the horizontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. . . . . . . . . . . . . . . . 166 xiii Chapter 1 Introduction 1.1 Overview During the past decades, genome-wide association studies (GWAS) become a powerful tool for investigating the associations between human genome and diseases. There are many successful examples. For example, Scott et al. (2007)[62] identified type 2 diabetes (T2D)- associated variants in an intergenic region of chromosome 11p12, near the genes IGF2BP2 and CDKAL1 and the region of CDKN2A and CDKN2B. They also confirmed that variants near TCF7L2, SLC30A8, HHEX, FTO, PPARG and KCNJ11 are associated with T2D risk. Through case-control comparisons, the Wellcome Trust Case Control Consortium [14] identified 24 independent association signals at p < 5 × 10−7 for seven common diseases including bipolar disorder, type 1 and type 2 diabetes, etc. With the development of next- generation sequencing (NGS) technology, it is now possible to sequence the whole human genome with little cost. As pointed out in Goldstein et al. (2013)[24], GWAS primarily make use of markers that are intended to represent causal variation indirectly, whereas, NGS can directly identify the cause variants. Many novel genetic association analyses nowadays are based on sequencing data (see for example, Wu et al. (2011)[78] and Li et al. (2014)[41]) and have achieved various levels of success. As precision medicine will be one of the focus on healthcare in the future, another im- 1 portant topic in statistical genetics is risk prediction. Accurate risk prediction can enable targeted preventative treatments, including fitness regimens for patients at risk of cardio- vascular disease, or increased mammogram frequency for patients with high breast cancer risk [35]. Abraham and Inouye (2015) [1] mentioned that the main aim of risk prediction is maximization of predictive power, including the validity and robustness of model predic- tions, while a causal interpretation is not strictly necessary for a good predictive model. Therefore, a model that is capable of jointly estimating the predictive effects of all single- nucleotide variant (SNV) sets and achieves high and robust prediction accuracy is critically important for risk prediction research [72]. With the breakthrough in computational technologies, we are in the era of heading to artificial intelligence. Many learning methods based on network structure have been proposed for classification and regression. Among which, neural networks have been receiving more and more popularity due to its ability to capture nonlinear structures in the data. Due to the complex relationship between genetic variants and human diseases, statistical models that are capable for taking nonlinear relationship into account may perform better (e.g., reaching higher prediction accuracy) than current methods. As an example, Wang et al. (2016)[71] used a Deep Convolutional Neural Fields (DeepCNF) to predict protein secondary structure and achieves higher prediction accuracy than that of the best predictors’ accuracy used before. In this chapter, we will first review linear mixed models along with their popular ap- plications in genetic association studies in section 1.2. 
Then in section 1.3, we will briefly introduce some methods in statistical learning, including Gaussian graphical models, kernel methods and neural networks. 2 1.2 Linear Mixed Models for Genetic Association Studies Linear mixed models are powerful tools in statistics to model correlated continuous outcomes. The general formulation of a linear mixed model is y = Xβ + Zb + , (1.1) where y ∈ Rn is a vector of continuous response; X ∈ Rn×p is the design matrix for fixed effects and β ∈ Rp is the fixed effect coefficients; Z ∈ Rn×q is the design matrix for random effects and b ∈ Rq is the random effect and  ∈ Rn is a vector of random error. Typical assumptions on the distribution of random vectors b and  are normality assumptions: b ∼ Nq(0, σ2 b D),  ∼ Nn(0, σ2R), where D ∈ Rq×q and R ∈ Rn×n are the covariance matrices for b and  respectively. In most applications, R is chosen to be the identity matrix In. In terms of applications in genetic association studies, the fixed effect design matrix X is usually composed of clinical covariates such as age, gender, ethnicity groups etc. The random effect design matrix Z consists of genetic variants such as single-neucleotide polymorphisms (SNPs). Sequence Kernel Association Test (SKAT)[78], which is a commonly used method in genetic association analysis, is based on such model setup. Under this situation, to test whether there is an association between genetic variants and the response, it suffices to test whether the variance component with respect to the genetic variants, which is σ2 b in model 3 (1.1), is zero or not. That is, we need to test H0 : σ2 b = 0 vs H1 : σ2 b (cid:54)= 0. (1.2) In SKAT, this is accomplished based on a score-type test. One of the significant advantage of modeling genetic variants for testing using random effects rather than fixed effects is to reduce the dimensionality of the parameters to be tested. For example, to test whether a specific gene is associated with the response, if we model the genetic variants as fixed effect, multiple testing corrections need to be conducted and since a gene may contain thousands of SNPs, it may require an extremely small p-value to conclude statistical significance, which will lead to power loss. On the other hand, by regarding genetic effect as a random effect, we only need to test a few variance components. Note that we may write model (1.1) as follows: y = Xβ + a + , (1.3) where a ∈ Rn is a random vector with a ∼ Nn(0, σ2 b ZDZT ). As pointed out in Yang et al. (2011)[79], we may consider a as the total genetic effects of the individuals and assume a ∼ Nn(0, σ2 relationship matrix between individuals. Again, testing the association between genetic aA) for some positive definite matrix A and A is interpreted as the genetic variants and response is equivalent to test H0 : σ2 a = 0. We will also see later in this chapter that such a linear mixed model is equivalent to a semi-parametric model when the function in the non-parametric part belongs to a certain function space. 4 1.3 Statistical Learning Statistical learning has received more and more attention nowadays. Due to a wide range of topics in statistical learning, it is impossible to list all of them in several pages. We refer interested readers to Friedman et al. (2001)[20] for more details. Here, we will focus on Gaussian graphical models, kernel methods and neural networks. 1.3.1 Gaussian Graphical Models Graphical models are powerful tools in statistical learning. 
In short, graphical models use a graph-based representation as the basis for compactly encoding a complex distribution over a high-dimensional space. In short, a graphical model is a combination of multivariate statistics and graphical structure and it is indeed a probability distribution. The data is the nodal information in the graph with each node being treated as a random variable, while the edges represent a set of independencies that hold in the distribution. One of the benefits we can get from graphical models is that the number of parameters in the model could be reduced dramatically. Directed graphs and undirected graphs are two types of commonly used graphs. A directed graphical model based on a directed acyclic graph (DAG) is known as a Bayesian network and directed edges give causal relationships as well as the characterization of the conditional distributions among the random variables. An undirected graphical model is known as a Markov random field and the undirected edges give correlations between variables. A Gaussian graphical model is a particular case of a graphical model where the joint distribution of the random variables are Gaussian. Here we will focus on undirected Gaussian graphical models. Suppose X = [X1, . . . , Xn]T ∼ Nn(µ, Σ) and G = (X ,E) be a labeled 5 graph with X = {1, . . . , n} and E be the edge set such that there is no edge between Xi and Xj if and only if Xi ⊥ Xj|X−ij. Such property is known as the pairwise Markov property [58]. Due to the Gaussianity, such conditional independencies can be characterized by the zero elements in the precision matrix Q = Σ−1. Theorem 1.3.1 (Theorem 2.2 in [58]). Let X be normally distributed with mean µ and precision matrix Q being symmetric positive definite. Then for i (cid:54)= j, Xi ⊥ Xj|X−ij ⇔ Qij = 0. Theorem 1.3.1 implies that the nonzero pattern of Q determines G, so we can read off from Q whether Xi and Xj are conditionally independent given others. If Q is a completely dense matrix, then G is fully connected. 1.3.2 Kernel Methods A kernel arises as a similarity measure that can be though of as an inner product in a so- called feature space [60]. More specifically, let X be the input domain and F be the feature space and it connects the input domain X via a map Φ, i.e., Φ : X → F x (cid:55)→ x := Φ(x). Then a kernel K is defined to be the inner product in the feature space, that is, 6 K : X × X → C (x, x(cid:48)) (cid:55)→ K(x, x(cid:48)) :=(cid:10)Φ(x), Φ(x(cid:48))(cid:11) Therefore, a kernel function can be thought of as a generalized inner product. Since Φ can be a nonlinear function, a kernel function is able to characterized nonlinear features among the input variables. A particular kernel function that plays an important role in statistical learning is known as the reproducing kernel. Definition 1.3.1 (Reproducing Kernel [6]). Let E be a nonempty abstract set. A function K : E × E → C (s, t) (cid:55)→ K(s, t) is a reproducing kernel of a Hilbert space H if and only if (i) ∀t ∈ E, K(·, t) ∈ H; (ii) ∀t ∈ E, ∀φ ∈ H, (cid:104)φ, K(·, t)(cid:105) = φ(t) A Hilbert space H possessing a reproducing kernel K is called a Reproducing Kernel Hilbert Space (RKHS) and is denoted by HK. The second property in the previous definition is known as the reproducing property: the value of the function φ at the point t is reproduced by the inner product of φ with K(·, t). 7 From the reproducing property, we can easily see that (cid:104)K(·, t), K(·, t(cid:48))(cid:105) = K(t, t(cid:48)). 
The reproducing identity above forms the basis for the kernel trick, which is frequently used in statistical learning: "Given an algorithm which is formulated in terms of a positive definite kernel K, one can construct an alternative algorithm by replacing K by another positive definite kernel K̃." [60]

It has been shown in Liu et al. (2007) [42] that there is a close connection between kernel methods and linear mixed models. In fact, such a connection is due to the following Nonparametric Representer Theorem [59]:

Theorem 1.3.2 (Nonparametric Representer Theorem [59]). Suppose we are given a nonempty set X, a positive definite real-valued kernel K on X × X, a training sample (x1, y1), . . . , (xn, yn) ∈ X × R, a strictly monotonically increasing real-valued function g on [0, ∞), an arbitrary cost function c : (X × R²)ⁿ → R̄, and a class of functions

F = { f ∈ R^X : f(·) = Σ_{i=1}^∞ βi K(·, zi), βi ∈ R, zi ∈ X, ‖f‖_{H_K} < ∞ },

where ‖·‖_{H_K} is the norm in the reproducing kernel Hilbert space (RKHS) H_K associated with K, i.e., for any zi ∈ X, βi ∈ R (i ∈ N),

‖ Σ_{i=1}^∞ βi K(·, zi) ‖²_{H_K} = Σ_{i=1}^∞ Σ_{j=1}^∞ βi βj K(zi, zj).

Then any f ∈ F minimizing the regularized risk functional

c((x1, y1, f(x1)), . . . , (xn, yn, f(xn))) + g(‖f‖_{H_K})

admits a representation of the form

f(·) = Σ_{i=1}^n αi K(·, xi).

Here are some details from Liu et al. (2007) [42]. Consider a partial linear model

yi = xiᵀβ + h(zi) + εi,  εi ∼ i.i.d. N(0, σ²),

where xi ∈ R^q is the vector of clinical covariates for subject i, zi ∈ R^p is a vector of gene expressions in a pathway, and h lies in some RKHS H_K. The estimator for β and h obtained by minimizing the penalized quadratic loss function

J(h, β) = (1/2) Σ_{i=1}^n ( yi − xiᵀβ − h(zi) )² + (λ/2) ‖h‖²_{H_K}

is the same as the best linear unbiased predictor (BLUP) of the linear mixed effect model

y = Xβ + h + ε,  h ∼ N_n(0, τK),  ε ∼ N_n(0, σ²I_n),

where the (i, j)th element of K is K(zi, zj) and τ = λ⁻¹σ². Hence, to conduct hypothesis testing on H0 : h = 0, it is equivalent to test H0 : τ = 0. In fact, the normality assumptions here are not necessary since, as we shall see below, the equivalence comes from Henderson's mixed model equation [28], which does not depend on the normality assumptions.

1.3.3 Neural Networks

A neural network is a learning algorithm used to solve nonlinear classification and regression problems. The origins of neural networks lie in algorithms that try to mimic the function of a human brain. Figure 1.1 shows the structure of a neuron in a human brain.

Figure 1.1: Diagram of a neuron with main components (Figure from Wiki).

In a neuron, dendrites are "input wires" receiving signals from other locations, and the axon is an "output wire" which sends signals to other neurons. So basically, we have some inputs, some calculations are processed within the neuron, and the results are then sent out through the axon. Hence a neuron model (logistic unit) can be constructed as shown in Figure 1.2. Such a model is known as a perceptron. In this simplified structure, x1, . . . , xp are input units. The node represented by Σ is called a computation unit, which plays a role analogous to the body of the neuron. The arrows from the input units to the computation unit are "input wires", mimicking the dendrites of a neuron, while the arrow originating from the computation unit is an "output wire", analogous to the axon of a neuron.
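As a small illustration of the perceptron just described — only a sketch with made-up numbers, not code from the dissertation — the computation unit below forms the linear combination θᵀ(1, x), with the leading 1 playing the role of the bias unit, and passes it through the standard sigmoid activation.

```python
import numpy as np

def sigmoid(z):
    """Standard sigmoid activation sigma(z) = (1 + exp(-z))^{-1}."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, theta):
    """A single computation unit: prepend the bias unit x0 = 1, take the linear
    combination with the weights theta, then apply the sigmoid activation."""
    x_with_bias = np.concatenate(([1.0], x))
    return sigmoid(np.dot(theta, x_with_bias))

x = np.array([0.5, -1.2, 2.0])           # p = 3 input features (illustrative)
theta = np.array([0.1, 0.4, -0.3, 0.2])  # weights, the first entry being the bias weight
print(perceptron(x, theta))
```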
Figure 1.2: A perceptron that mimics the structure of a neuron in a human brain. x1, . . . , xp are input features and x0 is a bias unit or intercept. Σ is known as a computation unit, where a linear combination of the features, including the bias unit, and the parameters θ to be learned is calculated and then an activation function is applied. In this example, a standard sigmoid activation function σ(x) = (1 + e^{−x})⁻¹ is applied.

In this perceptron model, the computation unit first calculates the linear combination of the inputs (including the bias unit) x and the parameters (also known as weights) θ and then applies a sigmoid activation function σ(x) = (1 + e^{−x})⁻¹. Other activation functions typically used in neural networks are

• Hyperbolic Tangent (Tanh): a(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x});
• Rectified Linear Unit (ReLU): a(x) = x₊ = max{x, 0};
• Leaky ReLU: a(x) = max{εx, x} for some small positive number ε.

A neural network is simply a group of perceptrons combined together, as shown in Figure 1.3. This is an example of a neural network with one hidden layer, in that the computation units in the dashed rectangle are known as hidden units. Modern deep learning methods are based on deep neural networks, which are simply neural networks with multiple hidden layers. But in this dissertation, we will only focus on neural networks with one hidden layer.
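The following sketch (illustrative only; the random weights and function names are not from the dissertation) implements the activation functions listed above and the forward pass of a neural network with one hidden layer, i.e., a group of perceptrons whose outputs are linearly combined by an output unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, eps=0.01):
    return np.maximum(eps * z, z)

def one_hidden_layer_nn(x, W1, b1, w2, b2, activation=sigmoid):
    """Forward pass of a neural network with one hidden layer:
    hidden units h = a(W1 x + b1), output = w2^T h + b2."""
    h = activation(W1 @ x + b1)
    return w2 @ h + b2

rng = np.random.default_rng(1)
p, r = 4, 3                                   # p input features, r hidden units
x = rng.normal(size=p)
W1, b1 = rng.normal(size=(r, p)), rng.normal(size=r)
w2, b2 = rng.normal(size=r), 0.0
for a in (sigmoid, tanh, relu, leaky_relu):
    print(a.__name__, one_hidden_layer_nn(x, W1, b1, w2, b2, activation=a))
```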
By adding another hierarchy to a linear mixed model, we can show that it is equivalent to a nonlinear mixed effect model so that it may have even better prediction performance in genetic risk prediction than classical linear mixed model. This is the basic idea and objective of our proposed kernel neural network. Most researches on neural networks focus on improving the prediction accuracy of neural networks on testing data set. From a statistical point of view, a neural network is simply a nonlinear regression problem. Therefore, it is worthwhile to establish the asymptotic properties, including the consistency and asymptotic normality of neural network estimators since once we have establish these properties, we may conduct statistical inference and may 13 apply to genetic association analysis later on. However, as we will see, classical results on nonlinear regression cannot be applied directly to neural networks. Instead, we used some techniques in empirical processes and nonparametric regression to establish consistency and asymptotic normality of neural network sieve estimators. The remaining dissertation is organized as follows. In Chapter 2, we focus on a conditional autoregressive model with its application to genetic association analysis on sequencing data. In Chapter 3, we focus on a kernel-based neural network and discuss its performance in genetic risk prediction. In Chapter 4, we derive the asymptotic properties for neural network sieve estimators and in Chapter 5, we discuss possible extensions of these methods and future work. 14 Chapter 2 A Conditional Autoregressive Model for Genetic Association Analysis Accounting for Genetic Heterogeneity 2.1 Introduction Substantial evidence from a wide range of diseases (e.g., breast cancer and hearing loss) indicates that complex diseases are characterized by remarkable genetic heterogeneity [46]. Evolutionary studies also suggest that individually rare mutations generated from each gen- eration create vast genetic heterogeneity in human diseases and could collectively play a substantial role in causing diseases. The recently developed whole-genome sequencing tech- nology generates a deep catalog of genetic variants, especially those rare variants, and allows researchers to comprehensively investigate their role in human diseases. Although new tech- nology holds promise for uncovering novel disease-associated variants, the massive amount of sequencing data and low frequent of rare variants bring tremendous analytical challenges to sequencing data analysis. Further challenge comes from sequencing variants, especially those rare variants, which could be highly heterogeneous: (1) the same gene may harbor many (hundreds of even thousands) different rare mutations; and (2) the same variant may 15 have heterogeneous effects in different individuals or subgroups of individuals [46]. These degree of genetic heterogeneity often have been neglected in existing statistical frameworks, adding another layer of difficulty to the discovery process. Many new statistical methods have been proposed to deal with the joint association analysis of single nucleotide variants (SNVs), including rare variants. The burden test de- veloped by Morgenthaler and Thilly (2007), Li and Leal (2008), Madsen and Browning (2009)[40, 44, 49], as a pioneer in testing the genetic association on sequencing data, col- lapses all the genetic information through a weighted sum. The burden test performs well if the effects of SNVs are in the same direction and same magnitude. 
Nevertheless, it is subject to power loss if this assumption fails. Similarity-based methods have been proposed to address this issue. One of the most popular methods is the sequence kernel association test (SKAT) [78], which is a semi-parametric method. SKAT is closely related to the sum of squared score (SSU) test [52], and can detect both uni-directional and bi-directional genetic effects. More recently, a genetic random field model (GenRF) [27] was proposed for analyzing sequencing data with continuous phenotypes, and a generalized genetic random field model (GGRF) [41] was proposed to generalize the random field model to other types of phenotypes. Most of these methods are based on the idea that individuals with similar genotypes tend to have similar phenotypes. Compared with other similarity-based methods (e.g., SKAT), GenRF and GGRF have nice asymptotic properties and can be applied to small-scale sequencing studies without small-sample adjustment.

Most of the existing methods treat the disease under investigation as one unified phenotype with a homogeneous genetic cause. When genetic heterogeneity is present, the existing methods will likely yield attenuated estimates for genetic variants with heterogeneous effects, leading to low testing power. To account for genetic heterogeneity in sequencing studies, we propose a conditional autoregressive (CAR) model. Different from the previous GGRF model, which applies the conditional autoregressive model to the phenotypes directly, we use a linear mixed model with the genetic effect considered as a random effect to account for heterogeneous genetic effects. By using a score test for variance components, the method has the advantage of computational efficiency, since we only need to obtain estimators under the null hypothesis. On the other hand, it shares a nice asymptotic feature with GenRF and GGRF, which makes it appealing for small-sample studies. Simulation studies also showed that our proposed method can have high power when rare variants play an important role or when variants have different genetic effects among different individuals or subgroups of individuals. Therefore, CAR provides a powerful alternative approach to search for disease-associated variants, especially those that are rare or have heterogeneous effects.

The remainder of this chapter is arranged as follows. In section 2.2, we propose a conditional autoregressive model for genetic association analysis of sequencing data and a score test for statistical inference. In section 2.3, we conduct simulation studies to compare the performance of our method with two existing methods (i.e., SKAT and GGRF) under different scenarios. Finally, in section 2.4, we apply our model to the whole-genome sequencing data from the Alzheimer's Disease Neuroimaging Initiative.

2.2 Methods

2.2.1 Motivation

The linear mixed model has been commonly used to assess the association of a set of SNVs with a continuous phenotype. It models the effect of each SNV as a random effect and assumes that the genetic effects of all SNVs in the set (e.g., a gene) follow an arbitrary distribution. By testing the variance component of the random effect, it evaluates the joint effects of the SNVs in the set on the phenotype of interest. One of the most popular linear mixed models for sequencing studies is the sequence kernel association test (SKAT) developed by Wu et al. (2011) [78].
It is a semi-parametric method that uses a kernel function to deal with high-dimensional genetic data and uses a score-based variance component test to assess the association. While SKAT has many advantages, such as being robust to the direction and magnitude of genetic effects, it does not consider the heterogeneous effects of genetic variants among individuals or subgroups of individuals (e.g., gender and race groups). If the disease of interest undergoes heterogeneous genetic etiological processes (i.e., the genetic causes differ among individuals), the traditional linear mixed model (e.g., SKAT), which typically assumes the genetic effects are similar across all the samples, can suffer from power loss. To account for genetic heterogeneity in association analysis, we propose a conditional autoregressive model. Similar to SKAT, the genetic effect of an individual is considered as a random effect. In fact, by construction the CAR model is based on the assumption that similar genotypes lead to similar genetic effects, and it therefore accounts for the genetic heterogeneity among individuals.

SKAT [78], genome-wide complex trait analysis (GCTA) [79] and CAR are all based on the following linear mixed model:

y = Xβ + a + ε,

where a is the total genetic random effect of the individuals, X is the design matrix containing clinical covariates such as age, gender, etc., and ε is the random error. It is natural to assume that the total genetic effect a follows a multivariate normal distribution with a ∼ N_n(0, σ_a² Σ). A natural question is how to choose the covariance matrix Σ. In SKAT, Σ = GWGᵀ, where G is the matrix of all SNVs and W = diag{w1, . . . , wp} is a diagonal matrix containing the weights of the p genetic variants. In GCTA [79], the (i, j)-th element of Σ is defined to be

Σij = (1/p) Σ_{k=1}^p (gik − 2pk)(gjk − 2pk) / ( 2pk(1 − pk) ),

where pk is the frequency of the reference allele for the kth SNV. Both methods use a direct way to define the marginal covariance of the genetic effects between two subjects.

For the CAR model, we consider an indirect way to model the covariance of the genetic effects. Specifically, suppose we use a graph to represent the connections of the genetic effects among the individuals, as shown in Figure 2.1, where a circle represents a person's genetic effect and an edge represents a link between the genetic effects of two subjects. Due to the Gaussian assumption on the random effect a, the edge between ai and aj is absent if and only if ai ⊥ aj | a−(i,j), or equivalently p(ai | a−i) = p(ai | a−(i,j)). So it is worth investigating the conditional distribution of ai | a−i. A reasonable conditional model can be assumed as follows:

ai | aj, j ≠ i ∼ N( Σ_{j≠i} bij aj, τi² ).

To find the joint distribution of a, we first quote Brook's Lemma.

Lemma 2.2.1 ([10]). Let π(x) be the density for x ∈ Rⁿ and define Ω = {x ∈ Rⁿ : π(x) > 0}. Let x, x′ ∈ Ω. Then

π(x)/π(x′) = Π_{i=1}^n π(xi | x1, . . . , x_{i−1}, x′_{i+1}, . . . , x′_n) / π(x′_i | x1, . . . , x_{i−1}, x′_{i+1}, . . . , x′_n)
           = Π_{i=1}^n π(xi | x′_1, . . . , x′_{i−1}, x_{i+1}, . . . , x_n) / π(x′_i | x′_1, . . . , x′_{i−1}, x_{i+1}, . . . , x_n).

Figure 2.1: An illustration of a Gaussian graphical model. For the CAR model, the graph represents a network of the genetic effects among individuals.
In the graph, each node stands for the genetic effect of an individual, and the existence of an edge between two individuals indicates dependence between the genetic effects of these two individuals. This is characterized by the precision matrix in the joint normal distribution of these genetic effects.

Now we are able to transfer a conditional distribution to a joint distribution, which is established in Proposition 2.2.1; its proof can be found in Appendix A.

Proposition 2.2.1. Let a ∈ Rⁿ be a random vector with

ai | aj, j ≠ i ∼ N( Σ_{j≠i} bij aj, τi² ).

Then the joint density function of a is given by

π(a) ∝ exp{ −(1/2) aᵀ Δ⁻¹(I − B) a },

where B = [bij] is a symmetric matrix with bii = 0 and Δ = diag{τ1², . . . , τn²}. This shows that a ∼ N_n(0, (I − B)⁻¹Δ).

Proposition 2.2.1 shows that a ∼ N_n(0, (I − B)⁻¹Δ). The first thing we need to ensure is that Δ⁻¹(I − B) is symmetric; a simple sufficient condition is that Δ⁻¹B is symmetric, which becomes

bij / τi² = bji / τj²  for all i, j.

Given a similarity matrix S, this can be accomplished by setting bij = sij / Σ_{j≠i} sij and τi² = σ_a² / Σ_{j≠i} sij, and the CAR model becomes

ai | aj, j ≠ i ∼ N( (1 / Σ_{j≠i} sij) Σ_{j≠i} sij aj,  σ_a² / Σ_{j≠i} sij ).  (2.1)

In such a case, the joint distribution of a is

π(a) ∝ exp{ −(1/(2σ_a²)) aᵀ(D − S)a },

where D = diag{ Σ_{j≠1} s1j, . . . , Σ_{j≠n} snj }. However, one issue here is that D − S is singular, since (D − S)1_n = 0. Such impropriety can be remedied by redefining the precision matrix of a as D − γS, where γ is chosen to make D − γS nonsingular. In fact, as will be shown in Proposition 2.2.2, the nonsingularity of D − γS is guaranteed by choosing |γ| < 1. The proof of Proposition 2.2.2 can be found in Appendix A.

Proposition 2.2.2. Let D − γS be as defined in the main text. Then it is nonsingular if |γ| < 1.

In such a case, (2.1) becomes

ai | aj, j ≠ i ∼ N( (γ / Σ_{j≠i} sij) Σ_{j≠i} sij aj,  σ_a² / Σ_{j≠i} sij ),  (2.2)

which will be used in the model defined in section 2.2.2.

The main difference between the CAR model and the other methods is that the precision matrix of the random effect a, which is the inverse of the covariance matrix, is specified first. It is well known that the (i, j)-th element of the precision matrix is in fact the partial covariance between ai and aj given all the others. The generalized genetic random field (GGRF) model proposed by Li et al. (2014) [41] has a similar structure. A GGRF model is based on the idea that similar genotypes lead to similar phenotypes, while a CAR model can be interpreted as saying that similar genotypes lead to similar genetic effects, and the variation among phenotypes is related to the variation among the genetic effects of individuals. In section 2.2.2, a CAR model will be introduced to account for the genetic heterogeneity. In section 2.2.3, a linear score test for testing the variance component will be derived.

2.2.2 Model Setup

Suppose that there are n subjects. Let yi be a quantitative trait for the ith subject. To relate the phenotype with genetic variants, we consider the following linear mixed model, also known as the conditional autoregressive (CAR) model:

yi = xiᵀβ + ai + εi,  i = 1, . . . , n,
ai | aj, j ≠ i ∼ N( (γ / Σ_{j≠i} sij) Σ_{j≠i} sij aj,  σ_a² / Σ_{j≠i} sij ),
ε1, . . . , εn ∼ i.i.d. N(0, σ²),  (2.3)
where xi = [xi1, . . . , xip]ᵀ is the vector of covariates for the ith subject; β is the fixed effect of the covariates; ai is the genetic random effect and εi is the random error of the ith subject; γ measures the overall genetic correlation among all subjects, with a larger value of γ implying stronger overall genetic correlation among all samples; sij is the genetic similarity between the ith and the jth subjects, which is measured on a set of SNVs. sij does not need to be estimated; it is calculated based on some given similarity metric (e.g., the IBS kernel). σ_a² measures the variation of the genetic effects.

As we have seen in section 2.2.1, the joint distribution of a = [a1, . . . , an]ᵀ can be written as

a ∼ N_n(0, σ_a²(D − γS)⁻¹),

where N_n denotes an n-dimensional multivariate normal distribution, S is the genetic similarity matrix with zero diagonal and off-diagonal entries sij, and D is a diagonal matrix with diagonal elements being the row sums of S, i.e., D = diag{ Σ_{j≠1} s1j, Σ_{j≠2} s2j, . . . , Σ_{j≠n} snj }. Therefore, model (2.3) can be written in the following matrix form:

y = Xβ + a + ε,
a ∼ N_n(0, σ_a²(D − γS)⁻¹),
ε ∼ N_n(0, σ²I_n),  (2.4)

where y = [y1, . . . , yn]ᵀ is the phenotype and X = [x1, . . . , xn]ᵀ is the covariate matrix. Since all the genetic information is contained in the total genetic random effect a, in order to test the association of a SNV set with the response, it is sufficient to test H0 : σ_a² = 0.

Let Gi = [gi1, . . . , giK]ᵀ be the genotypes of K SNVs in a genetic region (e.g., a gene) for the ith subject, where gij, j = 1, . . . , K, is coded additively, i.e., taking values in {0, 1, 2}. Given the sets of SNVs for subjects i and j, a commonly used kernel function measuring genetic similarity is the weighted identity-by-state (IBS) function [78], which is defined by

s(Gi, Gj) = Σ_{k=1}^K ωk { 2 − |gik − gjk| },

where ωk is the prespecified weight for the kth variant, which is usually a function of the minor allele frequency (MAF). In our model, we consider the scaled version of the weighted IBS similarity defined by

sij = Σ_{k=1}^K ωk { 2 − |gik − gjk| } / ( 2 Σ_{k=1}^K ωk ).

For the weights ωk, we considered four different weight functions based on the minor allele frequencies (MAF), as considered in Li et al. (2014) [41]. Among these four weight functions, the unweighted scheme (UW) assigns the same weight to both common and rare variants; the weighted sum statistic type of weight (WSS), on the other hand, puts almost all the weight on rare variants and nearly no weight on common variants. The Beta distribution type of weights (BETA) and the logarithm of MAFs (LOG) lie between UW and WSS, with LOG putting more weight on common variants than BETA. A small computational sketch of the scaled IBS similarity and the corresponding precision matrix D − γS is given at the end of this subsection.

1. Unweighted (UW): ωk = 1, 1 ≤ k ≤ K.

2. Beta distribution type of weights (BETA): ωk = dbeta(MAFk, 1, 25)², 1 ≤ k ≤ K, i.e., the weight is the square of the probability density of the Beta distribution with parameters 1 and 25.

3. Weighted sum statistic type of weights (WSS): ωk = 1 / ( MAFk(1 − MAFk) ), 1 ≤ k ≤ K.

4. Logarithm of MAFs as weights (LOG): ωk = − log10(MAFk), 1 ≤ k ≤ K.
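Here is the computational sketch referred to above: it constructs the scaled weighted IBS similarity matrix S, the diagonal matrix D of its row sums, and the matrix D − γS that defines the precision of the genetic random effect, using BETA-type weights. The genotypes, sample sizes and the value of γ are illustrative assumptions, not settings taken from the chapter.

```python
import numpy as np
import scipy.stats as st

def scaled_ibs_similarity(G, weights):
    """Scaled weighted IBS similarity: s_ij = sum_k w_k (2 - |g_ik - g_jk|) / (2 sum_k w_k),
    with the diagonal set to zero as in the CAR specification."""
    n = G.shape[0]
    denom = 2.0 * weights.sum()
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = np.sum(weights * (2.0 - np.abs(G[i] - G[j]))) / denom
    return S

def car_precision(S, gamma):
    """D - gamma * S, which equals the precision matrix of the genetic random
    effect a up to the factor 1 / sigma_a^2."""
    D = np.diag(S.sum(axis=1))
    return D - gamma * S

rng = np.random.default_rng(3)
n, K = 50, 30
maf = rng.uniform(0.01, 0.5, size=K)
G = rng.binomial(2, maf, size=(n, K)).astype(float)   # additive coding 0/1/2
w_beta = st.beta.pdf(maf, 1, 25) ** 2                 # BETA-type weights
S = scaled_ibs_similarity(G, w_beta)
Q = car_precision(S, gamma=0.5)
print(np.all(np.linalg.eigvalsh(Q) > 0))   # nonsingular (positive definite) when |gamma| < 1
```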
2.2.3 Score Test for Variance Component

In order to perform the hypothesis testing, we also need to consider the nuisance parameter γ. In the following discussion, we consider two cases. In the first case, we treat γ as a fixed constant and use the linear score test proposed by Qu et al. (2013) [54]. In the second case, γ is treated as an unknown nuisance parameter. Based on the idea of Davies (1987) [17], a maximum statistic is proposed and the corresponding p-value is obtained via a simulation-based method.

To evaluate whether there is a genetic association between a SNV set and the trait of interest, we test H0 : σ_a² = 0. By introducing the ratio between the two variance components, λ = σ_a²/σ², this is equivalent to testing H0 : λ = 0 vs H1 : λ > 0.

2.2.3.1 γ As a Fixed Constant

When γ is fixed as a constant (e.g., the overall mean of SNV correlations), a linear score test procedure [54] can be used to test H0 : λ = 0 vs H1 : λ > 0. Based on model (2.4), we have the marginal distribution of y:

y ∼ N_n(Xβ, Ṽ),

where Ṽ = σ_a²(D − γS)⁻¹ + σ²I_n = σ²[ I_n + λ(D − γS)⁻¹ ] := σ²V(λ). Let θ = [λ, σ², βᵀ]ᵀ be the vector of unknown parameters, where σ² and β are nuisance parameters. The nuisance parameter β does not need to be estimated when we use the restricted log-likelihood method. We can find a q × n matrix K such that KX = 0 and KKᵀ = I_q, where q = n − rank(X). Such a matrix can be found by using the QR decomposition of X [47]. Since y ∼ N_n(Xβ, σ²V(λ)), we obtain

y* := Ky ∼ N_q(0, σ²KV(λ)Kᵀ),  (2.5)

where y* is known as the error contrast, since E[y*] = 0 does not depend on X. The restricted log-likelihood function, which provides an unbiased estimator of σ², can be formed as

ℓ(λ, σ² | y*) ∝ −(q/2) log σ² − (1/2) log|KV(λ)Kᵀ| − (1/(2σ²)) y*ᵀ(KV(λ)Kᵀ)⁻¹y*.  (2.6)

Moreover, by considering the profiled version of the restricted log-likelihood function in (2.6), we can get the profiled REML estimator of σ² for a given λ:

σ̃²(λ) = argmax_{σ²} ℓ(σ² | y*, λ) = (1/q) y*ᵀ(KV(λ)Kᵀ)⁻¹y*.

Back-substituting σ̃²(λ) into the restricted log-likelihood function (2.6), we get the profiled restricted log-likelihood function

ℓ_p(λ | y*) ∝ −(q/2) log[ y*ᵀ(KV(λ)Kᵀ)⁻¹y* ] − (1/2) log|KV(λ)Kᵀ|.  (2.7)

As pointed out in Stern and Welsh (2000) [66], the profiled restricted log-likelihood function is score and information unbiased, and hence it can be used for inference. The score statistic S(λ) can be calculated as follows:

S(λ) = ∂ℓ_p/∂λ
     = (q/2) · y*ᵀ(KV(λ)Kᵀ)⁻¹K (∂V(λ)/∂λ) Kᵀ(KV(λ)Kᵀ)⁻¹y* / [ y*ᵀ(KV(λ)Kᵀ)⁻¹y* ] − (1/2) tr[ (KV(λ)Kᵀ)⁻¹K (∂V(λ)/∂λ) Kᵀ ]
     = (q/2) · y*ᵀ(KV(λ)Kᵀ)⁻¹K(D − γS)⁻¹Kᵀ(KV(λ)Kᵀ)⁻¹y* / [ y*ᵀ(KV(λ)Kᵀ)⁻¹y* ] − (1/2) tr[ (KV(λ)Kᵀ)⁻¹K(D − γS)⁻¹Kᵀ ].

Under the null hypothesis H0 : λ = 0, V(λ) = V(0) = I_n, so that the score statistic under H0 can be expressed as

S(0) = (q/2) · y*ᵀK(D − γS)⁻¹Kᵀy* / (y*ᵀy*) − (1/2) tr[ (D − γS)⁻¹ ].

To test H0 : λ = 0 vs H1 : λ > 0, it is equivalent to test H0 : S(0) = 0 vs H1 : S(0) ≠ 0. The exact way to calculate the p-value of the test is given in Qu et al. (2013) [54]:

P(S(0) > s) = P( Σ_{j=1}^q λj Zj² > 0 ),

where s is the observed value of S(0) and λ1, . . . , λq are the eigenvalues of the matrix

B = K(D − γS)⁻¹Kᵀ − [ 2s/q + (1/q) tr((D − γS)⁻¹) ] I_q.

Z1², . . . , Zq² are independent chi-square-distributed random variables, and the p-value can be calculated by using the Davies method [16].
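A compact sketch of the fixed-γ score test described above follows. It computes the error contrast matrix K from a QR decomposition, the score statistic S(0), and the eigenvalues of B; for simplicity the tail probability of the chi-square mixture is approximated by Monte Carlo rather than by the exact Davies method used in the chapter, and the unweighted scaled IBS similarity and all simulated inputs are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def error_contrast(X):
    """A q x n matrix K with K X = 0 and K K^T = I_q, q = n - rank(X),
    obtained from the complete QR decomposition of X."""
    Q, _ = np.linalg.qr(X, mode="complete")
    r = np.linalg.matrix_rank(X)
    return Q[:, r:].T

def car_score_test_fixed_gamma(y, X, S, gamma, n_mc=20_000, seed=0):
    """Score statistic S(0) for H0: lambda = 0 with gamma held fixed, plus a
    Monte Carlo approximation of the mixture-of-chi-squares p-value."""
    K = error_contrast(X)                  # error contrast (q x n); unrelated to the kernel K(., .)
    q = K.shape[0]
    D = np.diag(S.sum(axis=1))
    M = np.linalg.inv(D - gamma * S)       # (D - gamma S)^{-1}
    ystar = K @ y
    s_obs = 0.5 * q * (ystar @ K @ M @ K.T @ ystar) / (ystar @ ystar) - 0.5 * np.trace(M)
    B = K @ M @ K.T - (2.0 * s_obs / q + np.trace(M) / q) * np.eye(q)
    lam = np.linalg.eigvalsh(B)            # eigenvalues defining the chi-square mixture
    chi2 = np.random.default_rng(seed).chisquare(1.0, size=(n_mc, q))
    return s_obs, float(np.mean(chi2 @ lam > 0.0))

# Toy data under the null: an intercept plus one covariate, and a pure-noise phenotype.
rng = np.random.default_rng(5)
n, p_snv = 100, 50
G = rng.binomial(2, rng.uniform(0.01, 0.5, size=p_snv), size=(n, p_snv)).astype(float)
S = (2.0 * p_snv - cdist(G, G, metric="cityblock")) / (2.0 * p_snv)   # unweighted scaled IBS
np.fill_diagonal(S, 0.0)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)
print(car_score_test_fixed_gamma(y, X, S, gamma=0.5))
```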
Moreover, γ only appears in the alternative model, therefore it is not even possible to evaluate the MLE of γ under H0. Instead, we follow the idea of Davies (1987) [17] and construct a maximum score statistic for the association test. More specifically, following the notations we used in section 2.2.3.1, we have Sγ(0) = y∗T q 2 K(D − γS)−1KT y∗ y∗T y∗ − 1 2 tr (D − γS)−1(cid:105) (cid:104) , where y∗ ∼ Nq(0, σ2KV (λ)KT ) with V (λ) = In + λ(D − γS)−1. Now, note that under σ y∗ ∼ Nq(0, Iq). We the null hypothesis H0 : λ = 0, we have y∗ ∼ Nq(0, σ2Iq) so that 1 28 form the maximum score test statistic as follow: T = sup γ∈Γ Sγ(0) qy∗T K(D − γS)−1KT y∗ − y∗T tr(cid:2)(D − γS)−1(cid:3) Iqy∗ σ y∗(cid:17) sup y∗(cid:19)T y∗(cid:19) 2y∗T y∗ (cid:18) 1 (cid:18) 1 γ∈Γ Σ(γ) σ σ = sup γ∈Γ (cid:16) 1 σ y∗(cid:17)T(cid:16) 1 1 = 2 := (2(cid:107)Z(cid:107)2 2)−1 sup γ∈Γ ZT Σ(γ)Z, (cid:16) where Z = 1 σ y∗ ∼ Nq(0, Iq), Σ(γ) = qK(D − γS)−1KT − tr(cid:2)(D − γS)−1(cid:3) Iq. γ ∈ (cid:17) (cid:17)∩(−1, 1). 1/µ(1), 1/µ(n) ensuring that the matrix D − γS is nonsingular (see [5]). Since the parameter γ can be , where µ(1) < µ(2) < ··· < µ(n) are the ordered eigenvalues of D interpreted as a measurement of correlation coefficient, we set Γ = 2 SD (cid:16) − 1 2 − 1 1/µ(1), 1/µ(n) To calculate the p-value, let t be the observed value of T , then (cid:32) (cid:32) (cid:32) ZT Σ(γ)Z > t (cid:33) 2)−1 sup (cid:33) ZT(cid:0)Σ(γ) − 2tIq (cid:1) Z > 0 γ∈Γ (cid:33) q(cid:88) λj(γ)Z2 j > 0 , P(T > t) = P = P = P (2(cid:107)Z(cid:107)2 sup γ∈Γ sup γ∈Γ i=1 where λ1(γ), . . . , λq(γ) are the eigenvalues of (Σ(γ) − 2tIq) = K(D − γS)−1KT − 1 q (cid:18)2t q + tr 1 q (D − γS)−1(cid:105)(cid:19) (cid:104) Iq 29 and Z2 1 , . . . , Z2 q are independent chi-square-distributed random variables with degrees of freedom 1. We use the following procedures to approximate this probability: 1. Partition the index set Γ as γ1 < γ2 < ··· < γM . Γ is an open interval, denoted by (L, U ). When we implement this step, we choose a small  > 0 and let γ1 = L +  and γM = U − ; 2. Generate an N × q matrix Υ with each element being a χ2 1 random variable; 3. For each γi, i = 1, . . . , M, Calculate λ1(γi), . . . , λq(γi), which are the eigenvalues of Σ(γi) − 2tIq; j=1 λj(γi)Υkj, k = 1, . . . , N, where Υkj is the (k, j)-th element of Calculate (cid:80)q Υ; P sup γ∈Γ q(cid:88) j=1 4. Approximate the p-value as (cid:110)  ≈ # λj(γ)Z2 j > 0 k : maxγi,i=1,...,M N j=1 λj(γi)Υkj > 0 (cid:80)q (cid:111) . 2.3 Simulation Results Simulation studies were conducted to compare the CAR model with the original GGRF model and the commonly used SKAT method. For the CAR model, we evaluated both scenarios where γ is fixed (denoted by CAR.FIX) and γ is treated as an unknown nuisance parameter (denoted by CAR.SUP). In the simulation, we compared the empirical type I error rates and empirical power of the three methods under various disease models with heterogeneous genetic effects. In addition, we compared the empirical power of the three 30 methods under different percentage of causal SNVs and under the situation when the weights were misspecified. In order to mimic the real structure of sequencing data, the genetic data used for simulation was simulated based on the real sequencing data of Chromosome 17: 7344328 - 8344327 from the 1,000 Genome project [13]. The minor allele frequencies (MAF) of SNVs in this region range from 0.046% to 49.954% with a distribution highly skewed to the right. Figure 2.2 summarizes the distribution of the MAF with MAF<0.05. 
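The simulation-based p-value approximation for the maximum score statistic described in Section 2.2.3.2 can be sketched as follows. This is illustrative code under our own naming and defaults (the grid size, the number of Monte Carlo draws, and the small offset ε are assumptions), not the code used for the reported results; it assumes a full-column-rank covariate matrix X and precomputed D and S.

```python
import numpy as np

def error_contrast_matrix(X):
    """q x n matrix Kc with Kc @ X = 0 and Kc @ Kc.T = I_q, from the full QR of X."""
    n, r = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")    # n x n orthogonal
    return Q[:, r:].T                          # columns orthogonal to the column space of X

def score_stat(ystar, Kc, D, S, gamma):
    """S_gamma(0) as in Sections 2.2.3.1-2.2.3.2 (scale-invariant in y*)."""
    q = ystar.shape[0]
    M = np.linalg.inv(D - gamma * S)           # (D - gamma S)^{-1}
    quad = ystar @ Kc @ M @ Kc.T @ ystar
    return 0.5 * q * quad / (ystar @ ystar) - 0.5 * np.trace(M)

def max_score_pvalue(y, X, D, S, n_grid=50, n_mc=10_000, eps=1e-3, rng=None):
    """Monte Carlo approximation of P(T > t) for T = sup_gamma S_gamma(0)."""
    rng = np.random.default_rng(rng)
    Kc = error_contrast_matrix(X)
    q = Kc.shape[0]
    ystar = Kc @ y
    # admissible range (1/mu_(1), 1/mu_(n)) of D^{-1/2} S D^{-1/2}, intersected with (-1, 1)
    d_isqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    mu = np.linalg.eigvalsh(d_isqrt @ S @ d_isqrt)
    lo = max(1.0 / mu[0], -1.0) + eps
    hi = min(1.0 / mu[-1], 1.0) - eps
    grid = np.linspace(lo, hi, n_grid)
    t = max(score_stat(ystar, Kc, D, S, g) for g in grid)      # observed sup statistic
    Upsilon = rng.chisquare(1, size=(n_mc, q))                 # chi^2_1 draws
    exceed = np.zeros(n_mc, dtype=bool)
    for g in grid:
        M = np.linalg.inv(D - g * S)
        Sigma = q * Kc @ M @ Kc.T - np.trace(M) * np.eye(q)
        lam = np.linalg.eigvalsh(Sigma - 2.0 * t * np.eye(q))  # eigenvalues of Sigma(gamma) - 2t I_q
        exceed |= (Upsilon @ lam) > 0                          # sup over the grid exceeds 0
    return exceed.mean()
```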
For each setting, we simulated 1,000 Monte Carlo replicates to calculate the empirical type I error rates and the empirical power of the three methods. In each replicate, we randomly selected a 30Kb segment from the region and used all the genetic variants in that segment for the association analysis. We first examined the type I error performances of the three methods. In the simulation, we considered two significance levels, α = 0.05 and α = 0.01, under the following null model, yi = εi ∼ N (0, 1). Table 2.1 summarizes the empirical type I error rates of the three methods under four different weights (i.e., UW, BETA, WSS and LOG). As we observe from the results, the empirical type I error rates of GGRF, CAR.FIX and CAR.SUP are well controlled at the level of 0.05 or 0.01, while the type I error rate for SKAT is conservative under WSS (for α = 0.05) and LOG (for α = 0.01). Because the score test of CAR and the test procedure of GGRF are exact test procedures without any asymptotic approximation, the type I error rates of the two methods are well controlled. 31 Figure 2.2: The Distribution of minor allele frequency of in sequencing variants on chromo- some 17 from the 1,000 Genome project. 32 Minor Allele FrequencyFrequency0.000.010.020.030.040.050200040006000 Table 2.1: Empirical type I error rates under different weights at level α = 0.05 and α = 0.01 based on 1000 replicates. Each cell in the table contains the empirical type I error rate. GGRF, SKAT, CAR.FIX and CAR.SUP are the generalized genetic random field model of Li et al. (2014) [41], the sequence kernel association test of Wu et al. (2011) [78], the conditional autoregressive model with fixed nuisance parameter γ, the conditional autoregressive model with maximum score test statistic, respectively α = 0.05 Methods GGRF SKAT UW BETA WSS LOG 0.036 0.047 0.041 0.042 0.052 CAR.FIX 0.051 CAR.SUP 0.041 0.048 0.054 0.038 0.057 0.055 0.041 0.040 0.052 0.053 α = 0.01 UW BETA WSS LOG 0.009 0.006 0.006 0.004 0.012 0.010 0.010 0.012 0.012 0.009 0.010 0.011 0.015 0.007 0.010 0.009 2.3.1 Simulation I: Heterogeneous Genetic Effects Among Individ- uals or Subgroups We first considered a complex disease scenario in which each individual has a different genetic effect, which we refer as the heterogeneous genetic effect. The following model was used to simulate the phenotype: K(cid:88) k=1 yi = w∗ kgi,kZi,k + εi, 1 ≤ i ≤ n, (2.8) where K is the number of genetic variants in a 30 Kb segment; gi,k is the genotype of the kth SNV for individual i, coded as additive (i.e., gi,k = 0 for genotype AA, gi,k = 1 for genotype Aa and gi,k = 2 for genotype aa). We set the percentage of causal SNVs as 50%. w∗ k = 0 if the kth genetic variant is not a causal variant and w∗ k = ωk if the kth genetic variant is a causal variant, where ωk is the weight defined in section 2. Zi,1, . . . , Zi,K , 1 ≤ i ≤ n are genetic effects for the ith individual, which follow a normal distribution with mean 0 and standard deviation σZ. ε1, . . . , εn are iid random variables distributed as N (0, 1). Note that 33 in (2.8), the coefficients of the genetic variants are unique for each individual, and therefore different individuals could have different genetic effects. Figure 2.3 summarizes the empirical power of three methods under different degrees of ge- netic heterogeneity. As we observe from the simulation results, the CAR model outperforms the other two methods with the increasing level of genetic heterogeneity (i.e., increasing σZ). 
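For reference, the heterogeneous-effect generating model (2.8) used in this simulation can be sketched in a few lines; this is illustrative code under our own naming, assuming a genotype matrix coded 0/1/2 and a weight vector constructed as in Section 2.2.2.

```python
import numpy as np

def simulate_heterogeneous_phenotype(G, weights, sigma_z, causal_rate=0.5, rng=None):
    """Simulate y_i = sum_k w*_k g_ik Z_ik + eps_i as in model (2.8) (illustrative)."""
    rng = np.random.default_rng(rng)
    n, K = G.shape
    weights = np.asarray(weights, dtype=float)
    causal = rng.choice(K, size=int(round(causal_rate * K)), replace=False)
    w_star = np.zeros(K)
    w_star[causal] = weights[causal]              # w*_k = w_k for causal SNVs, 0 otherwise
    Z = rng.normal(0.0, sigma_z, size=(n, K))     # individual-specific effects Z_{i,k}
    eps = rng.normal(0.0, 1.0, size=n)
    return (G * Z) @ w_star + eps
```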
For instance, when common variants (i.e., UW and LOG) have more contributions to the phenotype, our model has substantial power increase with the increased level of genetic heterogeneity (i.e., σZ), while the power of SKAT and GGRF remain low even with the increased genetic effect. When rare variants play an important role (i.e., BETA and WSS), GGRF and SKAT have some power increase with the increased genetic effect, but still have lower power than CAR. As also shown in Figure 2.3, CAR.FIX and CAR.SUP have very similar performance. Therefore, for simplicity, the further analysis is performed based on CAR.FIX, in which we fix the nuisance parameter γ as the overall mean of the correlation matrix of SNVs. Next, we considered a scenario where genetic heterogeneity exist among subgroups (e.g., ethnic groups). Under such case, individuals within a group had homogeneous genetic effects, but the genetic effects among groups were different. In this simulation, we partitioned the total number of individuals of 500 into 8 groups, with the number of individuals in each groups being 200, 100, 50, 25, 40, 35, 45 and 5, respectively. Within each group, we used the following model to simulate the phenotypes, so that the effect sizes were different for different groups. p(cid:88) k=1 yi = w∗ kgi,kZk + εi, 1 ≤ i ≤ n. (2.9) The only difference in model (2.9) from model (2.8) in the main text is that in model (2.9), 34 Figure 2.3: Empirical Power Comparison of CAR, SKAT and GGRF by varying the levels of genetic heterogeneity among individuals under four different weights. The x-axis is the effect size σZ, used in the simulation. The effect sizes are chosen as 0.01, 0.05, 0.075, 0.1, 0.3, 0.5 for the UW and LOG weights, and the effect sizes are set as 0.0005, 0.00075, 0.001, 0.0015, 0.002 for the BETA and WSS weights. 35 BETAWSSUWLOG0.0e+005.0e-041.0e-031.5e-032.0e-030.0e+005.0e-041.0e-031.5e-032.0e-030e+001e-012e-013e-014e-015e-010e+001e-012e-013e-014e-015e-010.000.250.500.751.000.000.250.500.751.00Standard DeviationEmpirical PowerMethodGGRFSKATCAR.FIXCAR.SUP Zk does not depend on the individual i and we also assume that Z1, . . . , Zn ∼ N (0, σ2 Z ) so that all the individual have the same genetic effect. We set the effect sizes σZ as 0.0001, 0.0005, 0.002, 0.001, 0.003, 0.0001, 0.0005 and 0.002 for the eight groups for the BETA and WSS weights, and 0.01, 0.05, 0.2, 0.1, 0.3, 0.01, 0.05 and 0.2 for the UW and LOG weights. Table 2.2 summarizes the results of the simulation. Consistent with previous findings, CAR has higher power than the other two methods for all four scenarios. SKAT has good power under the WSS weight, but has reduced power under the other weights. While all three methods have low power when common variants play an important role (i.e., UW and LOG), CAR still performs better than SKAT and GGRF. Table 2.2: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates. In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0005, 0.002, 0.001, 0.003, 0.0001, 0.0005, 0.002 for BETA and WSS; with σZ = 0.01, 0.05, 0.2, 0.1, 0.3, 0.01, 0.05, 0.2 for UW and LOG. The number in the parenthesis is the standard deviation. 
Methods    GGRF                SKAT                CAR
UW         0.298 (1.505E-02)   0.328 (1.497E-02)   0.378 (1.548E-02)
BETA       0.180 (1.243E-02)   0.454 (1.622E-02)   0.817 (1.248E-02)
WSS        0.189 (1.226E-02)   0.805 (1.265E-02)   0.952 (0.670E-02)
LOG        0.148 (1.140E-02)   0.210 (1.287E-02)   0.445 (1.611E-02)

Since rare mutations occur in a small number of individuals and could have larger effects than common variants, we also modified the effect sizes of the eight groups so that groups with a small number of individuals tended to have higher effect sizes. Specifically, we set σZ to 0.0001, 0.0001, 0.0002, 0.005, 0.0003, 0.0005, 0.0001 and 0.005 for the eight groups under the BETA and WSS weights, and to 0.01, 0.01, 0.02, 0.5, 0.03, 0.05, 0.01 and 0.5 under the UW and LOG weights. Under this setting, the groups of sizes 25 and 5 had the highest effect sizes. From Table 2.3, we find that all three methods lose some power compared with the previous case, but the overall conclusion is the same. CAR still outperforms the other two methods under all four scenarios. SKAT performs well when WSS is used, while GGRF has a good performance under the UW weight.

Table 2.3: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates. In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0001, 0.0002, 0.005, 0.0003, 0.0005, 0.0001, 0.005 for BETA and WSS, and with σZ = 0.01, 0.01, 0.02, 0.5, 0.03, 0.05, 0.01, 0.5 for UW and LOG. The number in parentheses is the standard deviation.

Methods    GGRF                SKAT                CAR
UW         0.201 (1.263E-02)   0.217 (1.288E-02)   0.382 (1.576E-02)
BETA       0.149 (1.093E-02)   0.374 (1.534E-02)   0.672 (1.472E-02)
WSS        0.154 (1.161E-02)   0.618 (1.545E-02)   0.769 (1.342E-02)
LOG        0.115 (1.024E-02)   0.149 (1.096E-02)   0.428 (1.544E-02)

2.3.2 Simulation II: Various Causal SNV Rates

Since the percentage of causal SNVs in a genetic region (e.g., a gene) also has an impact on the power of the association test, in Simulation II we examined the performance of GGRF, SKAT and CAR by varying the percentage of causal SNVs. As in Simulation I, the phenotypes were simulated using model (2.8). In this simulation, we varied the percentage of causal SNPs and investigated the effect of 50%, 40%, 30%, 20%, 10%, 5% and 1% causal SNV rates on the power performance of the three methods.

Figures 2.4 and 2.5 illustrate the empirical power of GGRF, SKAT and CAR under different causal SNV rates and the four weights. Figure 2.4 summarizes the results under the WSS and BETA weights, i.e., the weights focusing more on rare variants. Overall, the CAR model outperforms the other two methods. As the causal SNV rate increases, the empirical power of CAR increases substantially. A similar trend can also be found for SKAT, especially when the WSS weight is used. For the BETA weight, SKAT also attains high empirical power, but not as high as that of CAR. GGRF has lower empirical power than CAR and SKAT, and its empirical power increases slowly with the causal SNV rate. We also find that under the WSS weight, both SKAT and CAR attain decent performance when the genetic causes are heterogeneous (i.e., increasing σZ) and the causal SNV rates are moderately high.

Figure 2.5 summarizes the results under the LOG and UW weights, i.e., the weights focusing more on the effects of common variants. As in the rare-variant case, the CAR model performs the best in all scenarios.
Even the empirical power of CAR under the LOG and UW weights is not high as those under the BETA and WSS weights, it can still reach high power when the causal SNV rate is high. The CAR model could also gain more substantial power than the other two methods with the increased levels of genetic heterogeneity (i.e., increased σZ). Neither SKAT nor GGRF performs as well as they do under the WSS or BETA weights, even with high causal SNV rate and high genetic effects. As we can see from Figure 2.5, if the common variants play an important role in disease phenotypes and have heterogeneous effects, both SKAT and GGRF suffer from extreme power loss, while the CAR model could still obtain moderate or high power. 2.3.3 Simulation III: Misspecification of Weights In practice, we do not know whether common variants or rare variants play a more important role in the disease process. Since the performance of GGRF, SKAT and CAR depend on the prespecified weights, it is important to investigate whether these models could still obtain reasonable power when the weights are misspecified. We used similar simulation settings as before except that we randomly selected 50 SNV as the causal variants in this particular simulation. We first focused on rare variants, and simulated phenotypes using either the WSS weight or the BETA weight. When we applied GGRF, SKAT and CAR to the simulation data, all 38 Figure 2.4: Comparison of empirical power of CAR, SKAT and GGRF with different causal SNV rates under the BETA weight and the WSS weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.0005, 0.001, 0.002, 0.005, respectively in each column from left to right. 39 5e-040.0010.0020.005BETAWSS010203040500102030405001020304050010203040500.000.250.500.751.000.000.250.500.751.00Causal SNV Rate (%)Empirical PowerMethodGGRFSKATCAR Figure 2.5: Comparison of empirical power of CAR, SKAT and GGRF with different causal SNV rates under the UW weight and the LOG weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.05, 0.1, 0.2, 0.5 respectively in each column from left to right. 40 0.050.10.20.5UWLOG010203040500102030405001020304050010203040500.000.250.500.751.000.000.250.500.751.00Causal SNV Rate (%)Empirical PowerMethodGGRFSKATCAR the weights (UW, BETA, WSS and LOG) were used so that we were able to evaluate their performances under both misspecified and correctly specified weights. Figure 2.6 summarizes the results when the underlying weights in the simulation are WSS and BETA. From Figure 2.6, since WSS put extremely high weights to the rare variants, all the three methods attain high empirical power when the weights are specified as WSS or BETA. As expected, all the methods have the highest empirical power when the weight func- tion is correctly specified (i.e., WSS). We also find that neither SKAT nor GGRF performs well when the weight is misspecified as UW or LOG. On the other hand, as we can see from the figures, as long as there is genetic heterogeneity, CAR has a good power performance even though the weight is misspecified. Same conclusions hold for BETA. Both GGRF and CAR will obtain highest power when the BETA weight is applied, while SKAT reaches highest power when WSS is used. CAR outperforms SKAT and GGRF in all cases. With the increased genetic heterogeneity, CAR remains high power even with misspecified weights. 
To conclude, in the case when the disease is mainly caused by rare variants, both SKAT and CAR can have high power, but for SKAT, we need to use the right weights (BETA or WSS). Next, we focused on common variants and simulated phenotypes using UW and LOG. Figure 2.7 summarizes the results based on UW and LOG. In the first row, the true weight used in the simulation is UW. As expected, CAR attains high power when the specified weight functions focus on common variants (i.e., UW and LOG). With the increased heterogeneity, CAR can obtain high power with the LOG or UW weights. However, it could suffer from power loss when the weight is misspecified (i.e., WSS an BETA). Overall, CAR outperforms SKAT and GGRF, even when the weight is misspecified. The same conclusion holds for the LOG weight. Since LOG also puts some weight on 41 Figure 2.6: Comparison of empirical power of CAR, SKAT and GGRF with misspecified weights when the true weight is BETA and WSS. In each column from left to right, we used BETA, LOG, UW and WSS weight, respectively. 42 BETALOGUWWSSBETAWSS0.0010.0020.0030.0040.0050.0010.0020.0030.0040.0050.0010.0020.0030.0040.0050.0010.0020.0030.0040.0050.000.250.500.751.000.000.250.500.751.00Standard DeviationEmpirical PowerMethodCARGGRFSKAT Figure 2.7: Comparison of empirical power of CAR, SKAT and GGRF with misspecified weights when the true weight is UW and LOG. In each column from left to right, we used BETA, LOG, UW and WSS weight, respectively. 43 BETALOGUWWSSUWLOG0.10.20.30.40.50.10.20.30.40.50.10.20.30.40.50.10.20.30.40.50.000.250.500.751.000.000.250.500.751.00Standard DeviationEmpirical PowerMethodCARGGRFSKAT rare variants, we find that SKAT has a good performance as compared with its performance under the UW weight. However, SKAT attains highest power by using the WSS weight. This can be explained by the fact that WSS can help to catch the heterogeneous genetic effect. As expected, CAR attains the highest power when the true weight (i.e., LOG) is specified. When LOG is the underlying weight, we can see from Figure 2.7 that there is no significant difference of the four weights used in the CAR model. Overall, GGRF has lower power than CAR and SKAT. 2.4 Real Data Applications We applied our method to the whole genome sequencing data from Alzheimer’s Disease Neu- roimaging Initiative (ADNI) and performed a genome-wide gene based association analysis. A total of 808 samples at the screening and baseline of the ADNI1 and ADNI2 studies have the whole genome sequencing data, from which we extracted 21069 genes based on the GRCh 37 assembly. ADNI also provides pre-calculated volumes of cortical regions. We chose four of them as the phenotypes of interests, which are hippocampus, entorhinal, whole brain and ventricles. The motivation of choosing these four volumes as the phenotypes is from previous biological findings. More specifically, the hippocampus, a brain area playing an important role in learning and memory, is especially vulnerable to damage at early stages of Alzheimer’s disease (AD) [51]. The volume of hippocampus changes over time and could have a large impact on Alzheimer’s disease [61]. The entorhinal cortex is also crucial in declarative mem- ories. In mild AD patients, the loss in the entorhinal volume is evident. The entorhinal cortex is also highly correlated with the severity of the disease [36]. Similarly, the whole brain volume decreases significantly in patients with AD [67]. 
The brain ventricles also play 44 Figure 2.8: Histograms of the Phenotypes used for Real Data Analyses an important role in AD and it is well known that ventricular volume is significantly higher in AD patients [19]. Figure 2.8 plots the histograms of the four phenotypes used in the analysis. As we can see from the histograms, hippocampus, entorhinal and whole brain are nearly normally dis- tributed. For Ventricles, we first applied a normal quantile transformation to the phenotype and then applied our method. We restricted our analyses to individuals with caucasian ancestry due to the small sample 45 size of non-caucasion samples and the issue of population stratification. In the analysis, age, gender, years of education, marriage status, and APoEε4 were used as covariates. After removing individuals with missing phenotypes or covariates, a total of 588 subjects remained for the analysis. In this analysis, we also compared our method with SKAT and GGRF. The BETA weight was chosen for all the three methods, and the linear kernel was used in SKAT. Table 2.4 summarizes the top 10 genes (based on the p-values) associated with hippocam- pus found by the three methods (CAR, GGRF and SKAT). As we can see from the table, the results of GGRF and SKAT are similar and are different from that of CAR. This could be due to the fact that both methods are similar in terms of assuming genetic homogeneity, while CAR is developed based on a different model, which accounts for genetic heterogeneity. Nonetheless, all the three methods found gene RPL27 on chromosome 17 associated with hippocampus. In addition to hippocampus, Appendix A also provides results of the top 10 genes related to entorhinal, ventricle and whole brain. Given the limited sample size of the study, none of the association reached statistical significance after multiple-testing ad- justment. Nevertheless, few genes (e.g., RPL27 ) have been identified by all three methods, which might worth further investigation. In addition, some of the genes (e.g. LINC01449 ) was detected by the CAR model only, which could also be further evaluated for potential genetic heterogeneity. 46 Table 2.4: Top 10 hippocampus-associated genes detected by GGRF, SKAT and CAR. CAR CHR Gene Name LINC01449 7 17 10 17 2 18 1 14 15 1 GGRF LOC400997 P -value CHR Gene Name 4.98E-05 CDKN1C 8.09E-05 RPL27 8.79E-05 8.81E-05 1.21E-04 1.60E-04 1.71E-04 1.81E-04 SECISBP2L 2.88E-04 2.93E-04 RGS9 MAPK8 RPL27 WTH3DI PQLC1 NTNG1 PSMC1 IFI35 AJAP1 GRIN2B NLRP11 EFCAB1 11 17 2 17 1 12 19 8 1 1 AK4 MROH9 F5 P -value CHR 17 6.37E-05 17 7.87E-05 1.35E-04 2 4 1.44E-04 17 1.98E-04 12 2.83E-04 3.03E-04 1 19 4.10E-04 10 4.69E-04 5.43E-04 2 SKAT Gene Name RPL27 IFI35 LOC400997 MMRN1 RUNDC1 CCDC59 AK4 ISYNA1 C10orf107 LOC101929260 P -value 4.04E-05 6.15E-05 1.57E-04 2.03E-04 2.53E-04 3.56E-04 3.61E-04 3.99E-04 4.48E-04 5.29E-04 47 2.5 Discussion We have proposed a conditional autoregressive model for genetic association analysis of se- quencing data. Our simulations show that the CAR model can obtain high power under the scenario (i) when rare variants are related to the phenotypes and (ii) when genetic variants have different genetic effects among individuals or subgroups of individuals. Moreover, we derive the exact form of the test statistic, which makes the method computationally efficient for large-scale sequencing data analysis. 
Unlike SKAT, which relies on the asymptotic distribution of its test statistic, the CAR test is based on the exact distribution of its test statistic under the null hypothesis and is therefore not conservative. Consequently, no additional small-sample adjustment is needed for the CAR model.

In our simulations, γ is chosen as the average of the genetic correlation coefficients among all individuals. Alternatively, we could also use the average pairwise linkage disequilibrium (LD) correlation coefficient. Simulation results show no substantial difference between the two choices (results not shown); however, calculating the LD correlation coefficient is more time consuming than estimating the conventional correlation coefficient. It is also interesting to consider the model with γ = 1, i.e., assuming the genetic random effects are highly correlated. As mentioned in Rue and Held (2005) [58], when γ = 1 the genetic random effect is modeled by an intrinsic random field. In this case, the joint distribution of the genetic random effects is not a proper distribution, so a traditional frequentist approach may not work well. One potential solution is the Bayesian method proposed in Rue and Held (2005) [58].

The work introduced in this chapter focuses on continuous phenotypes with a normal distribution. Nevertheless, the model can be extended to phenotypes following distributions in the exponential family by using a generalized linear mixed model (GLMM). One potential challenge of extending the CAR model to a GLMM is that the test statistic and the likelihood function may no longer have closed forms, which requires alternative approaches (e.g., Monte Carlo methods) to evaluate both numerically. This is future work worth further study.

Chapter 3

A Kernel-Based Neural Network for High-dimensional Genetic Risk Prediction Analysis

3.1 Introduction

Neural-network-based (NN) methods, such as deep learning, have made impressive advances in areas such as image recognition and natural language processing [38]. There is increasing interest in using NN methods in genomic studies, especially in regulatory genomics, where NN methods and software have been developed to improve the capacity of modeling DNA and RNA targets of regulatory proteins [53]. Even though NN methods also hold great promise in revealing the complex relationship between human diseases and genetic variants, little research has been done in this field. One of the key challenges is the massive size of genomic data, which often include millions of genetic variants. Directly applying classic NN methods to such a large number of genetic variants requires the estimation of millions of parameters, which is computationally difficult and likely causes serious over-fitting. Instead of modeling the individual effects of millions of genetic variants, we can model the genetic effect of these variants as a random effect by adopting a linear mixed model (LMM) framework. LMMs have been widely used in genetic data analysis, from complex segregation analysis [50] to recent genome-wide complex trait analysis [79] and set-based association analysis [78]. Based on the LMM, we propose a method, referred to as the kernel neural network (KNN), for high-dimensional genetic data analysis. As we show later in this chapter, under certain conditions KNN is equivalent to a nonlinear mixed effect model, so that a linear mixed model is a special case of KNN.
In addition to its ability to handle high-dimensional data, KNN inherits properties from the neural network, which allows the method to consider nonlinear and nonadditive effects. Due to the complexity of such method, it is difficult to estimate the parameters from KNN using the conventional approach. Moreover, the pa- rameters in KNN are unidentifiable. To address such issue, instead of using the restricted maximum likelihood estimator (REML) [15], we use the minimum quadratic unbiased esti- mator (MINQUE) proposed by Rao (1970, 1971, 1972)[55, 56, 57] to estimate the “variance components". We show both theoretically and empirically that MINQUE has nice properties. The remaining chapter is arranged as follows: Section 2 provides the description of the KNN model and the MINQUE estimation procedure; Section 3 provides theoretical compari- son of prediction performance between KNN and LMM and Section 4 discusses the inclusion of fixed effects; results from simulation study and a real data application are presented in Sections 5 and 6, respectively. Before we proceeding to the main text, we first summarize notations that will be frequently used in this paper. Notations: Throughout the chapter, capital bold italic letters A, B, . . . , Γ, Θ, . . . will be used to denote matrices; small bold italic letters a, b, . . . , α, β, . . . will be used to denote vectors and other small letters will be used to denote scalars. In will be used to denote an n × n identity matrix and the symbol “(cid:45)" will be used to denote asymptotically less than. 51 3.2 Methodologies Kernel methods are widely used in recent machine learning area due to its capability of capturing nonlinear features from the data so that the prediction error can be diminished. As has been mentioned in Shawe-Taylor et al. (2004)[63], given a kernel and a training set, the kernel matrix acts as an information bottleneck, as all the information available to a kernel algorithm must be extracted from this matrix. On the other hand, linear mixed effect models are also widely used in the area of genetic risk prediction. Therefore, it seems natural to combine these two methods together. A naive way is to change the covariance matrix of the random effect in the linear mixed model to a kernel matrix as was did in [78] and [79]. As we have seen in section 1.3.2 that for a linear mixed model with a kernel matrix as the covariance matrix of the random effect is equivalent to a semiparametric regression model. Now, we move a little bit further. By extending the Representer Theorem, we will see that a nonlinear mixed effect model is equivalent to a hierarchical linear mixed model. Specifically, consider m i.i.d. random features u1, . . . , um and the following nonlinear mixed effect model: yi = xT i β + f (ui1, . . . , uim) + i, i = 1, . . . , n, (3.1) where 1, . . . , n are i.i.d. N (0, φ) random variables and f ∈ HK for some RKHS HK. We consider the following loss function: (cid:104) (cid:105) (y − Xβ − f (u1, . . . , um))T (y − Xβ − f (u1, . . . , um)) L(β, f ) = 1 2φ EU + λ 2 . (cid:107)f(cid:107)2HK (3.2) The first thing we are going to do is to extend Theorem 1.3.2 to meet with our need. Theorem 3.2.1 (Extended Representer Theorem). For a fixed kernel K, let HK be the 52 corresponding RKHS and let X1, . . . , Xn be random variables. 
Then for a convex function L : Rn → R and non-decreasing Ω : R → R, if the optimization problem can be expressed as (cid:17)(cid:111) (cid:16)(cid:107)f(cid:107)2HK , J(f∗) = min f∈HK J(f ) = min f∈HK then the solution can be expressed as (cid:110)E [L(f (X1), . . . , f (Xn))] + Ω n(cid:88) i=1 αiE [K(·, Xi)] f∗ = Proof. Consider the subspace S ⊂ HK given by S = span{E [K(·, Xi)] : i = 1, . . . , n} . Since S is a finite dimensional space, it is therefore closed. The projection theorem then implies HK = S ⊕ S⊥, i.e., for every f ∈ HK, we can uniquely write f = fS + f⊥, where fS ∈ S and f⊥ ∈ S⊥. Noting that (cid:104)f⊥, E [K(·, Xi)](cid:105) = 0 for each i, the reproducing property implies E [f (Xi)] = E [(cid:104)f, K(·, Xi)(cid:105)] = (cid:104)f, E[K(·, Xi)](cid:105) = (cid:104)fS , E [K(·, Xi)](cid:105) = E [fS (Xi)] = fS (Xi) On the other hand, Pythagoras Theorem implies that (cid:107)f(cid:107)2HK = (cid:107)fS(cid:107)2HK + (cid:107)f⊥(cid:107)HK ≥ (cid:107)fS(cid:107)2HK . 53 Since Ω is non-decreasing, we can know that Ω from Jensen’s inequality that (cid:16)(cid:107)f(cid:107)2HK (cid:17) . Then it follows (cid:16)(cid:107)fS(cid:107)2HK (cid:17) ≥ Ω (cid:17) (cid:16)(cid:107)f(cid:107)2HK (cid:17) (cid:16)(cid:107)fS(cid:107)2HK (cid:17) (cid:16)(cid:107)fS(cid:107)2HK (cid:17) (cid:16)(cid:107)fS(cid:107)2HK . J(f ) = E [L(f (X1), . . . , f (Xn))] + λΩ ≥ L (E[f (X1)], . . . , E [f (Xn)]) + λΩ = L (fS (X1), . . . , fS (Xn)) + λΩ = E [L (fS (X1), . . . , fS (Xn))] + λΩ Hence, J(f ) is minimized if f ∈ S and we can express the minimizer as n(cid:88) f∗(·) = αiE[K(·, Xi)]. i=1 Let W i be the ith row of the random matrix U = [u1 ··· um], based on Theorem 3.2.1, the general solution for f (·) in (3.2) can be expressed as n(cid:88) f (·) = αiE [K(·, W i)] , i=1 where α = [α1, . . . , αn]T are unknown parameters. Substituting this back into (3.2), we 54 have J(α, β) = 1 2φ EU (cid:104) (cid:105) (y − Xβ − EU [K(U )]α)T (y − Xβ − EU [K(U )]α) (y − Xβ − EU [K(U )]α)T (y − Xβ − EU [K(U )]α) (cid:42) n(cid:88) i=1 + λ 2 (cid:88) n(cid:88) i=1 j=1 + λ 2 αiE[K(·, W i)], n(cid:88) αjE(cid:2)K(·, W j)(cid:3)(cid:43) (cid:105) αiαjE(cid:104)(cid:10)K(·, W i), K(·, W j)(cid:11) j=1 HK HK = 1 2φ = 1 2φ (y − Xβ − EU [K(U )]α)T (y − Xβ − EU [K(U )]α) + αT EU [K(U )]α, λ 2 where K(U ) is an n×n matrix whose (i, j)th element if K(W i, W j). Differentiating J(α, β) with respect to α and β, we get = −φ−1XT (y − Xβ − EU [K(U )] α) = −φ−1EU [K(U )]T (y − Xβ − EU [K(U )]α) + λEU [K(U )] α ∂J ∂β ∂J ∂α By setting the second equation to 0, we get by the symmetry and positive definiteness of EU [K(U )], λEU [K(U )]α + φ−1EU [K(U )]T EU [K(U )]α = φ−1EU [K(U )]T (y − Xβ) ⇒α = (λφ)−1(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] (y − Xβ). 55 Similarly, for the first equation, we have XT Xβ = XT (y − EU [K(U )]α) = XT(cid:16) (cid:17) y − (λφ)−1EU [K(U )](In + (λφ)−1EU [K(U )])−1(y − Xβ) = XT (In + (λφ)−1EU [K(U )])−1y + (λφ)−1XT EU [K(U )](In + (λφ)−1EU [K(U )])−1Xβ, which implies that XT(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] Xβ = XT(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] y. 
Hence, we get (cid:20) XT(cid:16) ˆα = (λφ)−1(cid:16) ˆβ = (cid:17)−1 In + (λφ)−1EU [K(U )] (cid:17)−1 In + (λφ)−1EU [K(U )] X (y − X ˆβ), (cid:21)−1 XT(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] y so that ˆf = EU [K(U )] ˆα = (λφ)−1EU [K(U )] (cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] (y − X ˆβ) Through some simple calculations, it can be seen that ˆβ and ˆα is the solution to the following equation: XT R−1X R−1X R−1 + τ−1 (EU [K(U )])−1 XT R−1 56   = β f XT R−1y R−1y  , where τ = λφ, R = φIn and this equation corresponds exactly to the Henderson’s mixed model equation of the linear mixed model y = Xβ + f + , (3.3) where E[f ] = 0, Var[f ] = τEU [K(U )] and E[] = 0, Var[] = φIn. This is equivalent to the following hierarchical model. In the model, we consider a more general setup that the matrix K(U ) is a positive linear combination of different kernel matrices. y|f ∼ Nn (Xβ + f , φIn) 0, J(cid:88) 0, j=1 τjKj(U ) L(cid:88)   , f|u1, . . . , um ∼ Nn (3.4) u1, . . . , um ∼ i.i.d. Nn ξlKl(G) l=1 where U = [u1,··· , um] ∈ Rn×m; Kj(U ), j = 1, . . . , J are n×n kernel matrices constructed based on the latent variables u1, . . . , um and Kl(G), l = 1, . . . , L are kernel matrices con- structed based on the genetic variables. For instance, if we have p genetic variants (p can be greater than n), we can define K(G) = p−1GGT , which is known as the product kernel or linear kernel. To see the connection between model (3.4) and model (3.3), we note that E[f ] = EU (E[f|u1, . . . , um]) = 0 Var[f ] = EU [Var (f|u1, . . . , um)] + Var [E (f|u1, . . . , um)] J(cid:88) (cid:2)Kj(U )(cid:3) .  J(cid:88)  = = EU τjKj(U ) τjEU j=1 j=1 57 This is the same as we obtained from model (3.3). As an illustration, the basic hierarchical structure of the model can be seen from Figure 3.1. Due to the similarity in the network structure as in the case of popular neural networks, we thus call our model a kernel neural network (KNN). KNN has several nice features. First, by considering genetic effects as random effects, the method can simultaneously deal with millions of variants. This addresses the limitation of the fixed-effect conventional neural network, which is computationally prohibitive on such a large number of variables. Second, by using random genetic effects, the model complexity is also greatly reduced, as we no longer require to estimate a large number of fixed genetic effects. Third, KNN allows for a large number of hidden units without increasing model complexity. Finally, using hidden units, KNN can capture non-linear and non-additive effects, and therefore is able to model complex functions beyond linear models. In the remaining part of this section and section 3.3, we will focus on the scenario where there is no fixed effects, that is β = 0. In section 3.4, we will consider the general estimation procedure when β (cid:54)= 0 and we will also calculate the prediction error. 3.2.1 Quadratic Estimators for Variance Components Popular estimation strategies for variance components in linear models are the maximum likelihood estimator (MLE) and the restricted maximum likelihood estimator (REML) [15]. However, both methods depend on the marginal distribution of y. In our KNN model, it is generally difficult to obtain the marginal distribution of y, which involves high dimensional integration with respect to u1, . . . , um. Moreover, the ui’s are embedded in the kernel matrices Kj(U ), j = 1, . . . , J, which makes the integration even more complicated. 
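To make the hierarchical structure in (3.4) concrete, a single draw from the model with J = 1 and one output kernel can be generated as follows. This is a minimal illustrative sketch (the function name and the order-2 polynomial choice are ours), not part of the estimation procedure.

```python
import numpy as np

def simulate_knn(X, beta, K_G, xi, tau, phi, m, output_kernel="product", rng=None):
    """Draw one sample (y, f, U) from the hierarchical KNN model (3.4) with J = 1."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    Sigma = sum(x * K for x, K in zip(xi, K_G))                  # sum_l xi_l K_l(G)
    U = rng.multivariate_normal(np.zeros(n), Sigma, size=m).T   # n x m latent features u_1..u_m
    if output_kernel == "product":
        K_U = U @ U.T / m                                        # product (linear) output kernel
    else:
        K_U = (1.0 + U @ U.T / m) ** 2                           # polynomial output kernel, order 2
    f = rng.multivariate_normal(np.zeros(n), tau * K_U)          # f | U ~ N(0, tau K(U))
    y = X @ beta + f + rng.normal(0.0, np.sqrt(phi), size=n)     # y | f ~ N(X beta + f, phi I_n)
    return y, f, U
```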
On the other hand, given the KNN model describes above, we can easily obtain the 58 Figure 3.1: An illustration of the hierarchical structure of the kernel neural network model with L input kernel matrices K1(G), . . . , KL(G) and J hidden kernel matrices K1(U ), . . . , KJ (U ). The output layer is the prediction for the random effect f. 59 conditional distribution of y|u1, . . . , um is given by J(cid:88) y|u1, . . . , um ∼ Nn 0, τjKj(U ) + φIn  . Then the marginal mean and variance of y can be obtained via conditioning arguments: j=1 E [y] = E (E [y|u1, . . . , um]) = 0. Var[y] = E [Var(y|u1, . . . , um)] + Var [E(y|u1, . . . , um)]   J(cid:88) J(cid:88) J(cid:88) j=1 j=1 = E = := τjKj(U ) + φIn τjE[Kj(U )] + φIn τjE[Kj(U )], (3.5) (3.6) j=0 where τ0 = φ and K0(U ) = In. Given the marginal mean and covariance matrix, the MInimum Quadratic Unbiased Estimator (MINQUE) proposed by Rao (1970, 1971, 1972)[55, 56, 57] can be used to estimate the variance components. The basic idea of MINQUE is to use a quadratic form yT Θy to estimate a linear combination of variance components. The MINQUE matrix Θ is obtained by minimizing a suitable matrix norm, which is typically chosen to be the Frobenius norm, of the difference between Θ and the matrix in the quadratic estimator by assuming that we know the random components in the linear mixed models. The constraint in the optimization is to guarantee the unbiasedness of the estimators. One advantage of MINQUE is that it has a closed form solution provided by Lemma 3.4 in Rao (1971)[56] so that it can be computed efficiently. However, MINQUE can also provide a negative estimator for a single variance component. When this occurs, we simply set the 60 negative estimators to be zero, as we usually do for MLE or REML of variance components, except for the error variance component. If the MINQUE estimator for the error variance component becomes negative, we project the MINQUE matrix Θ onto the positive semi- definite cone S+ n . Specifically, the modified estimator for error variance component becomes max{λ1, 0} ˆτ0 = yT O ...   OT y, max{λn, 0} where Odiag{λ1, . . . , λn}OT is the eigen-decomposition of Θ. 3.2.2 MINQUE in KNN For the ease of theoretical justifications, throughout the remaining of the chapter, we illus- (cid:104) 1 g (cid:105) trate the method with one hidden kernel matrix (i.e. J = 1) with the form of Kij(U ) = nwT i wj , where w1, . . . , wn are the rows of the matrix U and g[A] means that the function g is applied to each element in the matrix A. We start by considering the simplest case g(x) = x, in which case, the hidden kernel matrix becomes K(U ) = 1 variance for y. Since U U T ∼ Wn(m,(cid:80)L m U U T . In this case, we have an explicit form of the marginal l=1 ξlKl(G)), where Wn(m, A) is a Wishart distri- bution with m degrees of freedom and covariance matrix A, we have L(cid:88) V := Var[y] = τEU [K(U )] + φI = τ ξlKl(G) + φIn. (3.7) l=1 From equation (3.7), we can see that there is an identifiability issue if we directly estimate τ and ξl’s. To solve this issue, we reparameterize the covariance matrix by letting θl = τ ξl, 61 θ0 = φ, K0(G) = In and rewrite Var[y] as L(cid:88) l=0 L(cid:88) l=0 θlSl(G)ST l (G), (3.8) V = θlKl(G) = where S0(G), . . . , SL(G) are the Cholesky lower triangles for the kernel matrices K0(G), . . . , KL(G), respectively. Then the parameters θ0, . . . , θL can be estimated via MINQUE. 
Specifically, ˆθ0 = yT ˆA0y ˆθi = yT Aiy ∨ 0, where  L(cid:88) l=0 L(cid:88) l=0 ηl −1 Kl(G) Ai = Kl(G) i = 1, . . . , L, −1 Kl(G)  L(cid:88) l=0 , i = 0, . . . , L, and η0, . . . , ηL are the solutions to Γη = ei, where ei is a vector of zero except that the (i + 1)th elements is 1 and   L(cid:88) l=0 −1 Kl(G) Ki(G) −1 Kl(G)  L(cid:88) l=0  . Kj(G) Γij = tr Moreover, ˆA0 = PS+ K(U ) = g (cid:104) 1 m U U T(cid:105) A0 as mentioned above. For a general kernel matrix of the form n , we focus on the case where g satisfies the following property, which we called the Generalized Linear Separable Condition:  k(cid:88) α=1  = g cαxα k(cid:48)(cid:88) α=1 πα(c1, . . . , ck)γα(x1, . . . , xk), (3.9) where c1, . . . , ck ∈ R are coefficients and π1, . . . , πk(cid:48), γ1, . . . , γk(cid:48) are some functions. Exam- 62 ples of kernel functions satisfying the condition are polynomial kernels. We know that , wT s wt m (cid:33) (cid:32)  ∼ i.i.d. N2 Kst(U ) = f s, t = 1, . . . , n (3.10) and wi1  , . . . , 0  , σii σij where σii, σjj, σij are the corresponding elements in Σ =(cid:80)L wim  0 σij σjj wj1 wjm   , (3.11) l=1 ξlKl(G). We can then use Taylor expansion to obtain an approximation of E[Kij(U )], which is given in Lemma 3.2.1. The proof of Lemma 3.2.1 can be found in Appendix B. Lemma 3.2.1. Let the random vectors wi, wj ∈ Rm be as in (3.11) and the Kij(U ) as defined in (3.10). Then if(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)g(cid:48)(cid:48) P(cid:16)(cid:12)(cid:12)(cid:12)Kij(U ) − ˆKij(U ) (cid:12)(cid:12)(cid:12) > δ for some M > 0 and all λ ∈ [0, 1], we have where ˆKij(U ) = g(σij) + g(cid:48)(σij) wT m − σij i wj (cid:32) λσij + (1 − λ) (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ M, wT i wj m a.s., (3.12) (cid:40) −m (cid:17) ≤ 4 exp (cid:32) (cid:33)(cid:41) , (cid:114) 2δ M δ 20M sij ∧ δ M σ2 ij ∧ 1 4|σij| 1 ∧ (cid:32) (cid:33) . Lemma 3.2.1 and Remark B.0.1 in Appendix B show that when we use ˆK(U ) = [ ˆKij(U )] to approximate K(U ) and when the number of hidden features u1, . . . , um is large enough, the approximation will be sufficiently small. Hence, we can write K(U ) as follow: 63 K(U ) = ˆK(U ) + oP (1) = g[Σ] + g(cid:48)[Σ] (cid:12) (cid:19) U U T − Σ (cid:18) 1 m + oP (1), (3.13) where (cid:12) means the Hadamard product of two matrices. Moreover, the Strong Law of Large Numbers implies 1 m U U T → Σ a.s. Therefore, equation (3.13) can be further written as K(U ) = g[Σ] + oP (1), i.e., K(U ) P−→ f [Σ] as m → ∞ element-wisely. Following from the version of Dominated Convergence Theorem based on convergence in probability, the following lemma can be easily shown. Lemma 3.2.2. Under the assumptions of Lemma 3.2.1, if g(cid:48)(cid:48)(cid:0)ηij (cid:33)2 = o(1), g(cid:48)(cid:48)(cid:0)ηij L1(P), then 1 (cid:1)(cid:32) − σij wT i wj m E 2 (cid:1)(cid:32) (cid:33)2 ∈ wT m − σij i wj where ηij = λσij + (1 − λ) wT m for some λ ∈ [0, 1]. i wj Based on Lemma 3.2.2, the marginal variance-covariance matrix V can be written as V = τE [K(U )] + φIn + φIn πl(ξ1, . . . , ξL)γl [K1(G), . . . , KL(G)] + φIn (cid:39) τ g[Σ] + φIn (cid:39) τE(cid:104) ˆK(U ) (cid:105) L(cid:48)(cid:88) L(cid:48)(cid:88) = τ l=1 = θlSl(G)ST l (G), l=0 64 where θ0 = φ, θl = τ πl(ξ1, . . . , ξL), l = 1, . . . , L(cid:48) and S0(G) = In, Sl(G) is the Cholesky lower triangle for the matrix γl [K1(G), . . . , KL(G)], l = 1, . . . , L. As an example, we may consider g(x) = (1 + x)2, which corresponds to the output polynomial kernel. 
g[Σ] = p GGT )∧(cid:13)2, where the symbol ∧(cid:13)2 means the elementwise square. In this case, L = 1 1 (J n +ξ1 and K1(G) = p−1GGT . Then note that g[Σ] = J n + 2ξ1 p J (cid:12) GGT + = J n + 2ξ1 1 p GGT + ξ2 1 ξ2 1 p2 (GGT )∧(cid:13)2 p2 (GGT )∧(cid:13)2. 1 This shows that for polynomial output kernel and one input product kernel, L(cid:48) = 3 with 1 and γ1[K1(G)] = J n, γ2[K1(G)] = K1(G) = π1(ξ1) = 1, π2(ξ1) = 2ξ1, π3(ξ1) = ξ2 p−1GGT , γ3[K1(G)] = p−2(GGT )∧(cid:13)2. The parameters θ0, . . . , θL can be estimated via MINQUE as well. Based on the above discussion, we can see that the estimation of the variance components in KNN through MINQUE is an approximation. What we basically do here is to use a complex mixed model to approximate the KNN. 65 3.3 Predictions In this section, we make a theoretical comparison of prediction performance between KNN and LMM. Based on our model, the best predictor for f is given by ˆy = E [f|y] = EU [E (f|y, u1, . . . , um)] j=1   J(cid:88)   J(cid:88)   J(cid:88) j=1 j=1 = EU = EU := EU j=1  J(cid:88)  J(cid:88)  J(cid:88) j=1 j=1 τjKj(U ) τjKj(U ) + φIn φ−1τjKj(U ) φ−1τjKj(U ) + In ˜τjKj(U ) ˜τjKj(U ) + In −1 y −1 y −1 y, where ˜τj = τjφ−1, j = 1, . . . , m. The prediction error based on ˆy is given by R = (y − ˆy)T (y − ˆy) In − EU = yT   J(cid:88) In − EU j=1  J(cid:88)   J(cid:88) j=1 j=1  J(cid:88) j=1 ˜τjKj(U ) ˜τjKj(U ) + In −1 T −1  y. ˜τjKj(U ) ˜τjKj(U ) + In 66 In − EU ˜τjKj(U ) + In Note that =EU =EU we have j=1 j=1 ˜τjKj(U )   J(cid:88)  J(cid:88)   J(cid:88) ˜τjKj(U ) + In − J(cid:88) −1 ,   J(cid:88)   J(cid:88) EU ˜τjKj(U ) + In R = yT j=1 j=1 j=1 −1  J(cid:88) j=1 −1 ˜τjKj(U ) ˜τjKj(U ) + In −1 2 y. ˜τjKj(U ) + In j=1 Direct evaluation of the prediction error R is complicated. Instead, we approximate it based on asymptotic results. Same as the above, we focus on the case where J = 1 and K(U ) = g . The proof of the following Lemma can be found in Appendix B. (cid:104) 1 m U U T(cid:105) Lemma 3.3.1 (Approximation of Prediction Error). (i) When g(x) = x, then as m → ∞, R (cid:39) yT  L(cid:88) l=1 −2 ˜τ ξKl(G) + In y. (ii) When g is continuous and g[Σ] ∈ Sn +, then as m → ∞, ˜τ g  L(cid:88) R (cid:39) yT ξKl(G)  + In −2 y. l=1 Now we compare the average prediction error between kernel neural network and linear mixed model. For a linear mixed model, the prediction error using best predictor can be 67 obtained as follow. The proof can be found in Appendix B. Proposition 3.3.1 (Prediction Error for a Linear Mixed Model). Consider the linear mixed effect model (cid:17) ; (cid:16) y = f + ; f ∼ Nn RΣ  ∼ Nn (0, φIn) . 0, σ2 RΣ(cid:0)˜σ2 RΣ + In (cid:1)−1 y The prediction error based on quadratic loss and the best predictor ˆy = E[f|y] = ˜σ2 (˜σ2 R = σ2 Rφ−1) is given by (cid:17)−1 , ˜σ2 Rλi(Σ) + 1 n(cid:88) (cid:16) i=1 P ELM M = φ where P ELM M is the average prediction error for the linear mixed model and λ1(Σ), . . . , λn(Σ) are the eigenvalues of Σ. Proposition 3.3.2. Assuming that σ2 = φ and ˜σ2 R ≤ ˜τ min1≤l≤L ξl, we have P EKN N (cid:45) P ELM M , where P EKN N stands for average prediction error for kernel neural network. (cid:2)(˜τ K(U ) + In)−1(cid:3). Then for the kernel neural network, the average Proof. Let A = EU 68 prediction error is given by P EKN N = E(cid:104) (cid:17)(cid:105) yT A2y = EU (cid:104)E(cid:16) tr A2 (˜τ K(U ) + In) yT A2y|u1, . . . 
, um (cid:105) (cid:16) (cid:104) (cid:26)(cid:16)EU (˜τ K(U ) + In)−1(cid:105)(cid:17)2 EU [˜τ K(U ) + In] (cid:104)  ˜τ L(cid:88)  ˜τ L(cid:88) ˜τ min n(cid:88) (cid:17)(cid:105) −2˜τ −1  + 1  L(cid:88) −1 L(cid:88) l=1 ξlKl(G) + In ξlKl(G) + In l=1 l=1 1≤l≤L ξlλi Kl(G) = φEU = φtr (cid:39) φtr = φtr ≤ φ ξlKl(G) + In (cid:27)   Under the assumptions in this proposition and the linear mixed model with Σ =(cid:80)L l=1 i=1 l=1 Kl(G), we have φ(cid:0)˜τ min1≤l≤L λi(Σ) + 1(cid:1)−1 σ2(cid:0)˜σ2 Rλi(Σ) + 1(cid:1)−1 = φ σ2 ˜σ2 Rλi(Σ) + 1 ˜τ min1≤l≤L ξlλi(Σ) + 1 ≤ 1, which implies that P EKN N (cid:45) P ELM M . Remark 3.3.1. The result for Proposition 3.3.2 can be illustrated by using Figure 3.2. As shown in the Figure 3.2 for the case of L = 1, there are two paths from the kernel matrix based on G to the response y. One is the kernel neural network path (solid line) and the other is the linear mixed model path (dash-dotted line). The intuition behind the assumption R ≤ ˜τ ξ is that the kernel neural network should explain more variations than the linear ˜σ2 mixed model as it has two portions. We then extend the result to K(U ) = g , where g is as described in Lemma (cid:104) 1 m U U T(cid:105) 69 Figure 3.2: The intuition under the assumption ˜σ2 R ≤ ˜τ ξ in Proposition 3.3.2. 3.3.1(ii). Proposition 3.3.3. Under the above notations, assuming that σ2 = φ, ˜σ2 (cid:105)(cid:17) ≥ 0 with |g(cid:48)(cid:48)(x)| ≤ M for some M > 0 and all x between R ≤ ˜τ min1≤l≤L ξl (cid:16) (cid:104)(cid:80)L and λ1 (g − ι) l=1 ξlKl(G) mini,j σij and maxi,j vT i vj m , we have P EKN N (cid:45) P ELM M , where λ1(Σ) is the smallest eigenvalue of the matrix Σ. 70 Proof. Note that P EKN N = φtr (cid:26)(cid:16)E(cid:104) (cid:27) (˜τ K(U ) + In)−1(cid:105)(cid:17)2 E [˜τ K(U ) + In]  ˜τ g  L(cid:88) n(cid:88) (cid:104)(cid:80)L (cid:16) n(cid:88) (cid:16) −1 (cid:105)  + In (cid:105)(cid:17) l=1 ξlKl(G) ξlKl(G) + 1 i=1 ˜τ λi (cid:104)(cid:80)L 1 + min1≤l≤L ξl l=1 ξlKl(G) i=1 ˜τ λi (g − ι) l=1 1 g (cid:39) φtr = φ ≤ φ (cid:80)L l=1 Kl(G) (cid:17) , + 1 where ι : Σ (cid:55)→ Σ is the identity map. Corollary 4.3.15 in Horn and Johnson (2012)[32] implies that P EKN N (cid:45) φ n(cid:88) i=1 1 ˜τ min1≤l≤L ξlλi(Kl(G)) + ˜τ λ1 (cid:16) (g − ι) (cid:104)(cid:80)L (cid:105)(cid:17) l=1 ξlKl(G) + 1 Corollary 3.3.1. If g (cid:104)(cid:80)L l=1 ξlKl(G) (cid:105) −(cid:80)L l=1 ξlKl(G) is positive semidefinite, then P EKN N (cid:45) P ELM M . Example 3.3.1 (Polynomial Kernels). For a polynomial kernel of degree d, i.e., Kij(U ) = (cid:32) wT i wj m c + (cid:33)d , we have g(x) = (c + x)d =(cid:80)d k=0 (g − ι)(x) = cd + (dcd−1 − 1)x + (cid:0)d k (cid:1)cd−kxk so that d(cid:88) (cid:18)d (cid:19) cd−kxk. k k=2 71 Theorem 4.1 in Hiai (2009)[29] states that for a real function on (−α, α), 0 < α ≤ ∞, it is Schur positive1 if and only if it is analytic and g(k)(0) ≥ 0 for all k ≥ 0. Since g − ι is a polynomial function, it is clearly analytic. We can then expand g(x) using Taylor expansion around 0 and obtain (cid:18)d (cid:19) k Hence, we have cd−k = g(k)(0) k! ⇒ g(k)(0) = d! (d − k)! cd−k, k = 0, . . . , d.  0 (g − ι)(k)(x) = dcd−1 − 1 (d−k)!cd−k d! if k = 1 if k ∈ {0, 1, . . . , d} \ {1} k ≥ d + 1 . To make g − ι Schur positive, we only need to require c ≥ d−1(cid:113) 1 d ≥ 0 so that the minimum eigenvalue condition of Proposition 3.3.3 holds. 
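Before turning to models with fixed effects, the variance-component estimation of Sections 3.2.1-3.2.2 can be summarized in code. The sketch below implements MINQUE with unit prior weights for a covariance of the form V = Σ_l θ_l K_l(G) with K_0(G) = I_n, including the truncation at zero and the positive semidefinite projection described in Section 3.2.1; the function name and interface are ours, and the code is illustrative rather than the implementation used for the reported results.

```python
import numpy as np

def minque_unit_weights(y, kernels):
    """MINQUE with unit prior weights for V = sum_l theta_l K_l, K_0 = I_n (illustrative).

    Negative estimates of theta_1..theta_L are truncated at 0; if the error-variance
    estimate is negative, the MINQUE matrix A_0 is projected onto the PSD cone S_n^+.
    """
    n = y.shape[0]
    Ks = [np.eye(n)] + list(kernels)                  # K_0 = I_n, then K_1(G), ..., K_L(G)
    L1 = len(Ks)
    Vw_inv = np.linalg.inv(sum(Ks))                   # (sum_l K_l)^{-1}
    W = [Vw_inv @ K @ Vw_inv for K in Ks]
    Gamma = np.array([[np.trace(W[i] @ Ks[j]) for j in range(L1)] for i in range(L1)])
    theta = np.empty(L1)
    for i in range(L1):
        eta = np.linalg.solve(Gamma, np.eye(L1)[i])   # solve Gamma eta = e_i
        A_i = Vw_inv @ sum(e * K for e, K in zip(eta, Ks)) @ Vw_inv
        theta[i] = y @ A_i @ y
        if i == 0 and theta[0] < 0:                   # project A_0 onto the PSD cone
            lam, O = np.linalg.eigh(A_i)
            theta[0] = y @ (O * np.maximum(lam, 0.0)) @ O.T @ y
        elif i > 0:
            theta[i] = max(theta[i], 0.0)
    return theta
```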
3.4 Including Fixed Effects In the previous discussions, we focus on the case where the marginal distribution of the response variable y has mean 0. In many applications, as the one we see in the ADNI real data application, there are many important covariates which may have large effects to the response variable. In this part, we are going to extend the proposed KNN model to take the 1For a real function f on (−α, α) and for n ∈ N, it is Schur-positive of order n if f [A] is positive semidefinite for all positive semidefinite A ∈ Mn(R) with entries in (−α, α). 72 covariates into account. As we have mentioned in (3.4), the general structure of KNN is f|u1, . . . , um ∼ Nn τjKj(U ) (3.14) y|f ∼ Nn (Xβ + f , φIn)  0, J(cid:88) 0, j=1 L(cid:88) l=1 u1, . . . , um ∼ i.i.d. Nn ξlKl(G) Similar to the situation considered before, given the latent variables u1, . . . , um, we have Xβ, J(cid:88) y|u1, . . . , um ∼ Nn τjKj(U ) + φIn and then the marginal mean and covariance matrix of y can be obtained as follow: j=1  .   E [y] = EU [E [y|u1, . . . , um]] = EU [Xβ] = Xβ Var [y] = VarU [E [y|u1, . . . , um]] + EU [Var [E [y|u1, . . . , um]]] = VarU [Xβ] + EU τjKj(U ) + φIn  J(cid:88) (cid:2)Kj(U )(cid:3) , j=1 J(cid:88) j=0 = τjEU where τ0 = φ and K0(U ) = In. Again, we focus on the case where J = 1 and K(U ) is of the form g with g satisfying the generalized linear separable condition so that mU U T(cid:105) (cid:104) 1 after suitable reparameterization, the variance components become estimable. A natural way to obtain the estimators for β and the variance components θ is to first obtain a “good" estimates for the variance components ˆθ and then plug it into the Aitken equation to obtain the estimates for fixed-effect parameters ˆβ. Kackar and Harville (1981)[37] showed that as 73 long as ˆθ is even and translation-invariant, pT ˆβ + qT ˆf is an unbiased prediction for the quantity pT β + qT f. Let R be an r × n matrix with r = n − rank(X) such that RX = O and RRT = Ir. Such a matrix can be found using the QR decomposition of X [47]. The estimators for the variance components can then be estimated based on the transformed model: ˜f|u1, . . . , um ∼ Nr τjRKj(U )RT u1, . . . , um ∼ i.i.d. Nn ξlKl(G)   , ˜y|˜f ∼ Nr(˜f , φIr) 0, J(cid:88) 0, j=1 L(cid:88) l=1 where ˜y = Ry and ˜f = Rf, which is the same model framework we mainly discussed in section 3.2. As we have seen, the estimators for the variance components after reparame- terization is of the form ˜yT Θ˜y = yT RT ΘRy. Since it is a quadratic form, clearly it is an even function in y. To see that it is translation invariant, we note that for any c ∈ C(X), we can know that c = Xb for some vector b ∈ Rr and we have (y − c)T RT ΘR(y − c) = (y − Xb)T RT ΘR(y − Xb) = yT RT ΘRy − 2yT RT ΘRXb + bT XT RT ΘRXb = yT RT ΘRy, where the last equality follows since RX = O as defined. Therefore, we can know that the obtained estimators for variance components are also translation-invariant and based on the results in Kackar and Harville (1981)[37], the obtained estimator for β by plugging in the 74 variance component estimators is unbiased. For the prediction error, we note that when the covariates present, the predictor for y is given by ˆy = X ˆβ + ˆf = X ˆβ + E[f|y]. 
Based on the result from linear mixed models, we know that   J(cid:88)   J(cid:88) j=1 j=1 ˆf = E = E  J(cid:88)  J(cid:88) j=1 j=1 −1 (y − Xβ) −1(cid:20) (cid:16) In − X ˜τjKj(U ) + In ˜τjKj(U ) + In (cid:17)− (cid:21) y XT V −1 XT V −1X ˜τjKj(U ) ˜τjKj(U ) XT V −1X j=1 τjE[Kj(U )] + φIn and then (cid:21) y XT V −1 (cid:17)−  J(cid:88)  J(cid:88) −1(cid:20) j=1 j=1 (cid:16) ˜τjKj(U ) ˜τjKj(U ) + In ˜τjKj(U ) ˜τjKj(U ) + In (cid:17)− (cid:21) y XT V −1X (cid:16) XT V −1 (cid:17)− In − X XT V −1X XT V −1 (cid:21) y (cid:16) In − X −1(cid:20) −1 (cid:20) (cid:17)− (cid:21) j=1 (cid:20) − E y − ˆy = In − X where V = Var[y] =(cid:80)J (cid:16)   J(cid:88)  In − E  J(cid:88)   J(cid:88)   J(cid:88) In − E = E j=1 j=1 = j=1 ˜τjKj(U ) + In In − X XT V −1X XT V −1 y, where the last equality follows by noting that  J(cid:88) j=1 ˜τjKj(U ) ˜τjKj(U ) + In as shown above. Therefore, by letting A = E 75 −1 = E (cid:20)(cid:16)(cid:80)J   J(cid:88) j=1 −1 ˜τjKj(U ) + In (cid:17)−1(cid:21) and P V = j=1 ˜τjKj(U ) + In XT V −1, the prediction error is obtained as follow: (y − ˆy)T (y − ˆy) = yT(cid:16) (cid:105) When J = 1, as we have shown in the proof of Lemma 3.3.1, E(cid:104) (cid:104)(cid:80)L (cid:16) (cid:17)−1 ˜τ f l=1 ξlKl(G) + In (cid:17) ˆA In − P T V 2 (In − P V ) y. (˜τ K(U ) + In)−1(cid:105) (cid:39) so that we can estimate the prediction error by using simple plug-in estimators. (y − ˆy)T (y − ˆy) (cid:39) yT(cid:16) (cid:92) (cid:17)ˆ˜τ f  L(cid:88) l=1  + In ˆξlKl(G) −2(cid:16) (cid:17) y. In − P ˆV In − P T ˆV 3.5 Simulations In this section, we conducted some simulations to compare the prediction performance of KNN with MINQUE estimation to the prediction performance of the Best Linear Unbiased Estimator (BLUP) in linear mixed models. All the simulations are based on 100 individuals with 500 Monte Carlo iterations. 3.5.1 Nonlinear Random Effect As we have seen in section 3.2 that a KNN is equivalent to a nonlinear fixed effect model and we have also shown that the prediction error for KNN will be smaller than that of linear mixed models. Here, we use a simulation to validate previous observations through some simulation studies. We used the following model to simulate the response: (cid:16) X XT V −1X (cid:17)− (cid:18) (cid:19) 0, GGT 1 p , (3.15) y = 1n + 2ζ + f (u) + , u ∼ Nn 76 where G is an n×p matrix containing the genetic information (SNP) and ζ,  ∼ i.i.d. Nn(0, In). Here ζ is served as the fixed effect in the formulation of a KNN. In this simulation, four types of functions f are considered, which are linear (f (x) = x), sine (f (x) = sin(2πx)), inverse logistic (f (x) = 1/(1 + e−x)) and polynomial function of order 2 (f (x) = x2). When applying the kernel neural network, we set L = J = 1 and choose K(G) and K(U ) as either product kernel or polynomial kernel of order 2. Figure 3.3 shows the results when f is a linear or a sine function. The cases where f is an inverse logistic function or a polynomial function of order 2 are summarized in Appendix B. Figure 3.3: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors. The left panel shows the results when a linear function is used and the right panel shows the results when a sine function is used. 
Figure 3.3: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors. The left panel shows the results when a linear function is used and the right panel shows the results when a sine function is used. On the horizontal axis, "1" corresponds to the LMM; "2" corresponds to the KNN with product input kernel and product output kernel; "3" corresponds to the KNN with product input and polynomial output; "4" corresponds to the KNN with polynomial input and product output; and "5" corresponds to the KNN with polynomial input and polynomial output.

As we can observe from the boxplots, when the output kernel is chosen to be the product kernel, the performance of KNN is similar to that of LMM, although when the input kernel is chosen to be the polynomial kernel, KNN attains a slightly better prediction error, which we think is not significantly different from that of LMM. However, KNN performs significantly better than LMM when the output kernel is chosen to be the polynomial kernel. As one can tell from the boxplots, when both the input and output kernels are chosen to be polynomial, the KNN has the best performance in terms of the prediction error, and this is consistent across all nonlinear functions simulated in this section.

3.5.2 Nonadditive Effects

In this simulation, we explore the performance of both methods under non-additive effects. We conducted two simulations in terms of two different types of non-additive effects. In the first simulation, we mainly focus on the interaction effect and generate the response using the following model:

  y = f(G) + ε,

where G = [g_1, ..., g_p] ∈ R^{n×p} is the SNP data and ε ∼ N_n(0, I_n). When applying both methods, the mean is adjusted so that the response has marginal mean 0. In the simulation, we randomly pick 10 causal SNPs, denoted by g_{i_1}, ..., g_{i_10}, and consider the following function,

  f(G) = Σ_{1 ≤ j_1 < ···}

... 0.9, and then the covariance kernel matrix and the normalized identity-by-state (IBS) kernel matrix were constructed for analysis. Specifically, the (i, j)th element of each of the two kernel matrices is defined as follows:

  K_cov(g_i, g_j) = (1/(p − 1)) Σ_{k=1}^p ( g_{ik} − (1/p) Σ_k g_{ik} )( g_{jk} − (1/p) Σ_k g_{jk} ),
  K_ibs(g_i, g_j) = (1/(2p)) Σ_{k=1}^p [ 2 − |g_{ik} − g_{jk}| ],

where p is the number of SNPs in all expressions.

Four volume measures of cortical regions, namely the hippocampus, ventricles, entorhinal and whole brain volumes, were used as phenotypes of interest. We chose these four cortical regions since they play important roles in Alzheimer's disease (AD). The loss in the volumes of the whole brain, hippocampus and entorhinal cortex and the increase in the ventricular volume can be detected among AD patients. When we applied both the KNN method and the LMM method, we only included the subjects having both genetic information and phenotypic information, which results in 513 individuals for hippocampus, 564 individuals for ventricles, 516 individuals for entorhinal and 570 individuals for whole brain volume. The response variable was chosen to be the natural logarithm of the volumes of the four cortical regions and the covariates were chosen as age, gender, education status and APOE4. The KNN is then based on the model

  y | u_1, ..., u_m ∼ N_n(Xβ, τ K(U) + φI_n),
  u_1, ..., u_m ∼ N_n(0, ξ_1 K_cov + ξ_2 K_ibs),

and the LMM is based on the following model assumption:

  y ∼ N_n(Xβ, τ_1 K_cov + τ_2 K_ibs + τ_3 I_n).        (3.16)

The restricted maximum likelihood estimates of τ_i, i = 1, 2, 3, were then calculated based on the Fisher scoring method and the BLUP for the LMM was computed based on these estimates.
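The two kernel matrices defined above can be computed directly from a 0/1/2-coded genotype matrix; the sketch below is a straightforward numpy implementation of the two displayed formulas (suitable for moderate n and p, since the IBS computation forms an n × n × p array).

```python
import numpy as np

def covariance_kernel(G):
    """K_cov(g_i, g_j) = (1/(p-1)) * sum_k (g_ik - mean_i)(g_jk - mean_j)."""
    p = G.shape[1]
    Gc = G - G.mean(axis=1, keepdims=True)       # center each individual's genotypes
    return Gc @ Gc.T / (p - 1)

def ibs_kernel(G):
    """K_ibs(g_i, g_j) = (1/(2p)) * sum_k [2 - |g_ik - g_jk|]."""
    p = G.shape[1]
    abs_diff = np.abs(G[:, None, :] - G[None, :, :]).sum(axis=2)
    return (2.0 * p - abs_diff) / (2.0 * p)

# G: n x p genotype matrix coded 0/1/2 (simulated here for illustration)
G = np.random.default_rng(1).binomial(2, 0.3, size=(50, 200)).astype(float)
K_cov, K_ibs = covariance_kernel(G), ibs_kernel(G)
print(K_cov.shape, K_ibs.shape, K_ibs.max() <= 1.0)
```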
Similarly, when we applied the KNN method, the two kernel matrices K_cov and K_ibs were used as the input kernel matrices and the output kernel matrix was chosen to be either the product kernel or the polynomial kernel of order 2. The average mean squared prediction errors of both methods are summarized in Table 3.1.

Table 3.1: Average mean squared prediction error of KNN with product output kernel matrix, KNN with output kernel matrix as polynomial of order 2, and the BLUP based on LMM.

              Hippocampus   Ventricles   Entorhinal   Whole Brain Volume
  KNN(prod)   2.01e-06      2.44e-03     1.01e-05     1.25e-07
  KNN(poly)   4.17e-05      1.98e-03     1.06e-04     3.04e-08
  LMM         2.11e-02      2.24e-01     4.05e-02     3.68e-11

As we can see from the table, both KNN and LMM achieve small prediction errors. However, we would argue that the prediction errors from KNN are more realistic. It is interesting to note that the average prediction errors of LMM for entorhinal and whole brain volume are extremely close to zero. We checked the estimates of the variance components in this case and noticed that the variance component associated with the identity matrix, τ_3 in (3.16), is very close to zero. Since the BLUP of the random effect in this case is given by

  BLUP = Xβ̂ + (τ_1 K_cov + τ_2 K_ibs)(τ_1 K_cov + τ_2 K_ibs + τ_3 I_n)^{-1} (y − Xβ̂),

if τ_3 is very close to 0, the BLUP is very close to the response y, which leads to the extremely low prediction error. For KNN, on the other hand, when the estimate of the error variance becomes negative, we project the MINQUE matrix onto the positive semidefinite cone so that we always obtain a nonnegative estimate of the error variance component, which makes the calculation of the prediction error more reasonable.
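The projection step mentioned above can be carried out with a single eigendecomposition. The sketch below shows the generic Frobenius-norm projection of a symmetric matrix onto the positive semidefinite cone (negative eigenvalues truncated at zero); it is an illustration of the device rather than the exact MINQUE implementation used here.

```python
import numpy as np

def project_to_psd(A):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone:
    keep the eigenvectors and clip negative eigenvalues at zero."""
    A_sym = (A + A.T) / 2.0                      # symmetrize against round-off
    eigval, eigvec = np.linalg.eigh(A_sym)
    return (eigvec * np.clip(eigval, 0.0, None)) @ eigvec.T

# example: a symmetric matrix with eigenvalues 5 and -1
A = np.array([[2.0, 3.0],
              [3.0, 2.0]])
print(np.linalg.eigvalsh(project_to_psd(A)))     # all eigenvalues are now >= 0
```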
3.7 Discussion

In this chapter, a kernel-based neural network model was proposed for prediction analyses. The kernel-based neural network can be thought of as an extension of the linear mixed model, since it reduces to a linear mixed model when the product kernel matrix is chosen as the output kernel matrix and after reparameterization. A modified MINQUE strategy is used to obtain the estimators of the variance components in the KNN model when the output kernel satisfies the generalized linear separable condition. Empirical simulation studies and the real data application show that the KNN model can achieve better performance in terms of the mean squared prediction error when the output kernel matrix is chosen as the polynomial kernel matrix. This is analogous to the popular neural network model, where better prediction accuracy can be achieved when nonlinear activation functions are applied.

Many extensions can be made to make the KNN model more flexible and give it broader applications. First, it may be possible to conduct base kernel matrix selection. Although we do not consider how to choose the number of base kernel matrices L in this paper, too many kernel matrices will certainly increase the amount of redundant information, so it may be beneficial to propose a criterion for choosing the base kernel matrices. Another possible extension of the KNN model is more challenging: the theoretical properties discussed in this paper mainly focus on the case where only one output kernel matrix is used and the kernel function has a specific form. It is natural to consider more general kernel functions, but the estimation procedure for the variance components would then be more complex. Moreover, it is also advisable to consider the performance of KNN when deep network structures are applied.

Chapter 4

Asymptotic Properties of Neural Network Sieve Estimators

4.1 Introduction

With the widespread usage of machine learning and artificial intelligence, neural networks have regained their popularity. More and more learning machines are based on deep neural networks and have achieved great classification and prediction accuracy. We refer interested readers to Goodfellow et al. (2016)[25] for more background and details. In classical statistical learning theory, the consistency and the rate of convergence of the empirical risk minimization principle are of great interest. Many upper bounds have been obtained for the empirical risk and the sample complexity based on the growth function and the Vapnik-Chervonenkis dimension (see, for example, [70, 4, 18]). However, not many studies focus on the asymptotic properties of neural networks. As Thomas J. Sargent said, "artificial intelligence is actually statistics, but in a very gorgeous phrase, it is statistics." So it is natural and worthwhile to explore whether a neural network model possesses nice asymptotic properties, since if it does, it may be possible to conduct statistical inference based on this model. Throughout this chapter, we will focus on the asymptotic properties of neural networks with one hidden layer.

In the language of statistics, fitting a neural network with one hidden layer is simply a parametric nonlinear regression problem:

  y_i = α_0 + Σ_{j=1}^r α_j σ(γ_j^T x_i + γ_{0,j}) + ε_i,

where ε_1, ..., ε_n are i.i.d. random errors with E[ε] = 0 and E[ε^2] = σ^2 < ∞, and σ(·) is an activation function, for example σ(z) = 1/(1 + e^{−z}), which will be the main focus in this paper. White and Racine (2001)[75] obtained the asymptotic distribution of the resulting estimators under the assumption that the true parameters are unique. In fact, the authors implicitly assumed that the number of hidden units r is known. However, even if we assume that we know the number of hidden units, it is difficult to establish the asymptotic properties of the parameter estimators. In section 4.6.1, we conduct a simulation based on a single-layer neural network with 2 hidden units. Even for such a simple model, the simulation results suggest that reaching consistency is highly unlikely. Moreover, since the number of hidden units is usually unknown in practice, such an assumption can easily be violated. For example, as pointed out in Fukumizu (1996)[21] and Fukumizu et al. (2003)[22], if the true function is f_0(x) = ασ(γx), that is, the true number of hidden units is 1, and we fit the model using a neural network with two hidden units, then any parameter θ = [α_0, α_1, ..., α_r, γ_{0,1}, ..., γ_{0,r}, γ_1^T, ..., γ_r^T]^T in the high-dimensional set

  {θ : γ_1 = γ, α_1 = α, γ_{0,1} = γ_{0,2} = α_2 = α_0 = 0} ∪ {θ : γ_1 = γ_2 = γ, γ_{0,1} = γ_{0,2} = α_0 = 0, α_1 + α_2 = α}

realizes the true function f_0(x). Therefore, when the number of hidden units is unknown, the parameters in this parametric nonlinear regression problem are unidentifiable. Theorem 1 in Wu (1981)[77] showed that a necessary condition for the weak consistency of nonlinear least squares estimators is that

  Σ_{i=1}^n [f(x_i, θ) − f(x_i, θ')]^2 → ∞, as n → ∞,

for all θ ≠ θ' in the parameter space, as long as the distribution of the errors has finite Fisher information.
Such a condition implies that when the parameters are not identifiable, the resulting nonlinear least squares estimators will be inconsistent, which hinders further exploration of the asymptotic properties of the neural network estimators. Liu and Shao (2003)[43] and Zhu and Zhang (2006)[80] proposed techniques to conduct hypothesis testing under loss of identifiability. However, their theoretical results are hard to implement in real-world applications.

Even though a function can have different neural network parametrizations, the function itself can be considered as unique. Moreover, due to the Universal Approximation Theorem [33], any continuous function on a compact support can be approximated arbitrarily well by a neural network with one hidden layer. So it seems natural to consider a nonparametric regression setting and approximate the function class through a class of neural networks with one hidden layer. Specifically, suppose that the true nonparametric regression model is

  y_i = f_0(x_i) + ε_i,

where ε_1, ..., ε_n are i.i.d. random variables defined on a complete probability space (Ω, A, P) with E[ε] = 0, Var[ε] = σ^2 < ∞; x_1, ..., x_n ∈ X ⊂ R^d are vectors of covariates with X being a compact set in R^d; and f_0 is an unknown function to be estimated. We assume that f_0 ∈ F, where F is the class of continuous functions with compact support. Clearly, f_0 minimizes the population criterion function

  Q̄_n(f) = E[ (1/n) Σ_{i=1}^n (y_i − f(x_i))^2 ] = (1/n) Σ_{i=1}^n (f(x_i) − f_0(x_i))^2 + σ^2.

A least squares estimator of the regression function can be obtained by minimizing the empirical squared error loss Q_n(f):

  f̂_n = argmin_{f∈F} Q_n(f) = argmin_{f∈F} (1/n) Σ_{i=1}^n (y_i − f(x_i))^2.

However, if the class of functions F is too rich, the resulting least squares estimator may have undesirable properties such as inconsistency [68, 65, 64]. Instead, we may optimize the squared error loss over some less complex function space F_n, which approximates F while the approximation error tends to 0 as the sample size increases. In the language of Grenander (1981)[26], such a sequence of function classes is known as a sieve. More precisely, we consider a sequence of function classes

  F_1 ⊆ F_2 ⊆ ··· ⊆ F_n ⊆ F_{n+1} ⊆ ··· ⊆ F

approximating F in the sense that ⋃_{n=1}^∞ F_n is dense in F, that is, for each f ∈ F there exists π_n f ∈ F_n such that d(f, π_n f) → 0 as n → ∞, where d(·,·) is some pseudo-metric defined on F. With some abuse of notation, an approximate sieve estimator f̂_n is defined to be any f̂_n ∈ F_n satisfying

  Q_n(f̂_n) ≤ inf_{f∈F_n} Q_n(f) + O_p(η_n),        (4.1)

where η_n → 0 as n → ∞.

Throughout the rest of the chapter, we focus on the sieve of neural networks with one hidden layer and sigmoid activation function. Specifically, we let

  F_{r_n} = { α_0 + Σ_{j=1}^{r_n} α_j σ(γ_j^T x + γ_{0,j}) : γ_j ∈ R^d, α_j, γ_{0,j} ∈ R,
              Σ_{j=0}^{r_n} |α_j| ≤ V_n for some V_n > 4, and max_{1≤j≤r_n} Σ_{i=0}^d |γ_{i,j}| ≤ M_n for some M_n > 0 },        (4.2)

where r_n, V_n, M_n ↑ ∞ as n → ∞. Such a method has been discussed in many works (see, for example, [73] and [74]). In those papers, consistency of the neural network sieve estimators has been established under a random design. However, there are few results on the asymptotic distribution of the neural network sieve estimators, which will be established in this paper. Moreover, throughout this paper, we focus on the fixed design.
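To make the sieve (4.2) concrete, the sketch below evaluates a single-hidden-layer sigmoid network on a fixed design and checks the two constraints that define F_{r_n}; the particular values of r_n, V_n and M_n, and the design itself, are placeholders. The quantity n^{-1} Σ_i f(x_i)^2 computed at the end is the squared pseudo-norm ‖f‖_n^2 used throughout the rest of this chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sieve_network(X, alpha0, alpha, Gamma, gamma0):
    """f(x) = alpha0 + sum_j alpha_j * sigmoid(gamma_j^T x + gamma_{0,j}).
    X: n x d design, alpha: (r,), Gamma: r x d, gamma0: (r,)."""
    return alpha0 + sigmoid(X @ Gamma.T + gamma0) @ alpha

def in_sieve(alpha0, alpha, Gamma, gamma0, V_n, M_n):
    """Membership in F_{r_n}: sum_{j=0}^{r_n} |alpha_j| <= V_n and
    max_j (|gamma_{0,j}| + sum_i |gamma_{i,j}|) <= M_n."""
    outer_l1 = np.abs(alpha0) + np.abs(alpha).sum()
    inner_l1 = (np.abs(gamma0) + np.abs(Gamma).sum(axis=1)).max()
    return outer_l1 <= V_n and inner_l1 <= M_n

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))        # fixed design on a compact set
alpha0, alpha = 0.5, np.array([1.0, -0.8])       # r_n = 2 hidden units (illustrative)
Gamma, gamma0 = rng.normal(size=(2, 3)), np.zeros(2)
fx = sieve_network(X, alpha0, alpha, Gamma, gamma0)
print(in_sieve(alpha0, alpha, Gamma, gamma0, V_n=10.0, M_n=5.0), np.mean(fx ** 2))
```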
Hornik et al. (1989)[33] showed that ⋃_n F_{r_n} is dense in F under the sup-norm. When considering the asymptotic properties of the sieve estimators, however, we use the pseudo-norm ‖f‖_n^2 = n^{-1} Σ_{i=1}^n f^2(x_i) (see Proposition C.0.1 in Appendix C) defined on F and F_{r_n}.

In section 4.2, we discuss the existence of neural network sieve estimators. The weak consistency and the rate of convergence of the neural network sieve estimators are established in section 4.3 and section 4.4, respectively. Section 4.5 focuses on the asymptotic distribution of the neural network sieve estimators. Several simulation results are presented in section 4.6.

Notation: Throughout the rest of the chapter, bold alphabetic letters and Greek letters denote vectors. C(X) is the set of continuous functions defined on X. The symbol ≲ means "is bounded above up to a universal constant" and a_n ∼ b_n means a_n/b_n → 1 as n → ∞. For a pseudo-metric space (T, d), N(ε, T, d) is its covering number, that is, the minimum number of ε-balls needed to cover T. Its natural logarithm is the entropy number and is denoted by H(ε, T, d).

4.2 Existence

A natural question to ask is whether or not the sieve estimator based on neural networks exists. Before we enter the main discussion, we first look at some simple properties of F_{r_n}. Proposition 4.2.1 shows that the sigmoid function is a Lipschitz function with Lipschitz constant L = 1/4.

Proposition 4.2.1. A sigmoid function σ(z) = e^z/(1 + e^z) is a Lipschitz function on R with Lipschitz constant 1/4.

Proof. For all z_1, z_2 ∈ R, we know that σ(z) is continuous on [z_1, z_2] and differentiable on (z_1, z_2). Note that

  σ'(z) = σ(z)(1 − σ(z)) ≤ 1/4  for all z ∈ R.

By the Mean Value Theorem, we know that

  σ(z_1) − σ(z_2) = σ'(λz_1 + (1 − λ)z_2)(z_1 − z_2)

for some λ ∈ [0, 1]. Hence

  |σ(z_1) − σ(z_2)| = |σ'(λz_1 + (1 − λ)z_2)| |z_1 − z_2| ≤ (1/4)|z_1 − z_2|,

which means that σ(z) is a Lipschitz function on R with Lipschitz constant 1/4.

The second proposition provides an upper bound for the envelope function sup_{f∈F_{r_n}} |f|.

Proposition 4.2.2. For each fixed n,

  sup_{f∈F_{r_n}} ‖f‖_∞ ≤ V_n.

Proof. For any f ∈ F_{r_n} with n fixed, note that for all x ∈ X we have

  |f(x)| = | α_0 + Σ_{j=1}^{r_n} α_j σ(γ_j^T x + γ_{0,j}) | ≤ |α_0| + Σ_{j=1}^{r_n} |α_j| σ(γ_j^T x + γ_{0,j}) ≤ Σ_{j=0}^{r_n} |α_j| ≤ V_n.

Since the right-hand side does not depend on x and f, we get

  sup_{f∈F_{r_n}} ‖f‖_∞ = sup_{f∈F_{r_n}} sup_{x∈X} |f(x)| ≤ V_n.

Now we quote a general result from White and Wooldridge (1991)[76]. The theorem tells us that under some mild conditions, there exists a sieve approximate estimator and that such an estimator is also measurable.

Theorem 4.2.1 (Theorem 2.2 in White and Wooldridge (1991)[76]). Let (Ω, A, P) be a complete probability space and let (Θ, ρ) be a pseudo-metric space. Let {Θ_n} be a sequence of compact subsets of Θ. Let Q_n : Ω × Θ_n → R̄ be A ⊗ B(Θ_n)/B(R̄)-measurable, and suppose that for each ω ∈ Ω, Q_n(ω, ·) is lower semicontinuous on Θ_n, n = 1, 2, .... Then for each n = 1, 2, ..., there exists θ̂_n : Ω → Θ_n, A/B(Θ_n)-measurable, such that for each ω ∈ Ω, Q_n(ω, θ̂_n(ω)) = inf_{θ∈Θ_n} Q_n(ω, θ).

Note that

  Q_n(f) = (1/n) Σ_{i=1}^n (y_i − f(x_i))^2 = (1/n) Σ_{i=1}^n (f_0(x_i) + ε_i − f(x_i))^2
         = (1/n) Σ_{i=1}^n (f(x_i) − f_0(x_i))^2 − (2/n) Σ_{i=1}^n ε_i (f(x_i) − f_0(x_i)) + (1/n) Σ_{i=1}^n ε_i^2.
Since the randomness only comes from i’s, it is clear that Qn is a measurable function and for a fixed ω, Qn is continuous in f. Therefore, to show the existence of the sieve estimator, it suffices to show that Frn is compact in C(X ), which is proved in the following lemma. Lemma 4.2.1. Let X be a compact subset of Rd. Then for each fixed n, Frn is a compact set. Proof. For each fixed n, let θn = [α0, . . . , αrn, γ0,1, . . . , γ0,rn, γT [−Mn, Mn]rn(d+1) := Θn. For n fixed, Θn is a bounded closed set and hence it is a compact 1 , . . . , γT rn]T belong to [−Vn, Vn]rn+1× 94 set in Rrn(d+2)+1. Consider a map H : (Θn,(cid:107) · (cid:107)2) → (Frn,(cid:107) · (cid:107)n) θn (cid:55)→ H(θn) = α0 + (cid:16) αjσ rn(cid:88) j=1 (cid:17) γT j x + γ0,j Note that Frn = H(Θn). Therefore, to show that Frn is a compact set, it suffices to show that H is a continuous map due to the compactness of Θn. Let θ1,n, θ2,n ∈ Θn, then α (cid:12)(cid:12)(cid:12)α (cid:12)(cid:12)(cid:12)α  rn(cid:88) (1) j σ (1)T j γ = n α i=1 1 n (1) 0 + rn(cid:88) (cid:107)H(θ1,n) − H(θ2,n)(cid:107)2 n(cid:88) n(cid:88) n(cid:88) 0 − α (2) 0 (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) + 0 − α (2) 0 j=1 i=1 (1) (1) 1 n ≤ 1 n = (cid:18) rn(cid:88) rn(cid:88) j=1 i=1 i=1 n(cid:88)  rn(cid:88) (cid:18) Vn j=0 4 ≤ 1 n ≤ ≤ |α (1) j − α (2) j j=0 |α (1) j − α (2) j | + (cid:19)2 (cid:12)(cid:12)(cid:12)(cid:12)α |α (1) j (1) j σ (1) 0,j | γ γ xi + γ (1)T j (1)T j (cid:18) (cid:12)(cid:12)(cid:12)(cid:12)σ (cid:18) (cid:12)(cid:12)(cid:12)(cid:12)(cid:16) rn(cid:88) j − γ rn(cid:88) j=1 (1) γ (cid:13)(cid:13)(cid:13)γ j=1 j=1 | + Vn 4 (1 ∨ (cid:107)x(cid:107)∞) [rn(d + 1)](cid:107)θ1,n − θ2,n(cid:107)2 2. α (2) j σ (2)T j γ xi + γ (cid:19) − α rn(cid:88) j=1 (2) 0 − (cid:19) (cid:19) (1) 0,j xi + γ − α (2) j σ (2)T j γ xi + γ xi + γ (1) 0,j − σ (2)T j xi + γ (2) 0,j (cid:18) (cid:18) (cid:18) (cid:18) γ (2) 0,j (cid:19)2 2 (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:19)(cid:21)2 (2) 0,j (1) j − α |α (cid:17)T xi (2) j (2) j (cid:12)(cid:12)(cid:12)(cid:12) + (cid:13)(cid:13)(cid:13)1 |σ (cid:12)(cid:12)(cid:12)γ (cid:12)(cid:12)(cid:12)γ + (2)T j γ xi + γ (2) 0,j (1) 0,j − γ (2) 0,j (cid:12)(cid:12)(cid:12) 2 2 (cid:12)(cid:12)(cid:12) (1 ∨ (cid:107)x(cid:107)∞) Vn 4 (1) j − γ (2) j (1) 0,j − γ (2) 0,j Hence, for any  > 0, by choosing δ = / (cid:16) Vn 4 (1 ∨ (cid:107)x(cid:107)∞)(cid:112)rn(d + 1) (cid:17) , we observe that when 95 (cid:107)θ1,n − θ2,n(cid:107)2 < δ, we have (cid:107)H(θ1,n) − H(θ2,n)(cid:107)n < , which implies that H is a continuous map and hence Frn is a compact set for each fixed n. As a simple corollary of Lemma 4.2.1 and Theorem 4.2.1, we can easily obtain the exis- tence of sieve estimator. Corollary 4.2.1. Under the notations above, for each n = 1, 2, . . ., there exists ˆfn : Ω → Frn, A/B(Frn)-measurable such that Qn( ˆfn(ω)) = inff∈Frn Qn(f ). 4.3 Consistency In this section, we are going to show the result on the consistency of the neural network (cid:83) sieve estimator. The consistency result leans heavily on the following Uniform Law of Large Numbers. We start by considering a simple case with Vn ≡ V for all n. In such a case, n Frn is not dense in F but instead in a subset of F containing functions satisfying a certain smoothness condition. Lemma 4.3.1. Let 1, . . . , n be i.i.d. sub-Gaussian random variables with sub-Gaussian parameter σ0. Then if [rn(d + 2) + 1] log[rn(d + 2) + 1] = o(n), we have |Qn(f ) − Qn(f )| p∗−→ 0. 
sup f∈Frn 96 Proof. For any δ > 0, we have (cid:32) (cid:32) (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n sup f∈Frn sup f∈Frn n(cid:88) i=1 P∗ =P∗ ≤P :=(I) + (II). (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 i − σ2 2 (cid:33) n(cid:88) (cid:32) i=1 |Qn(f ) − Qn(f )| > δ (cid:33) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > δ i (f (xi) − f0(xi)) i − σ2 − 2 (cid:33) 2 1 n + P∗ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > δ 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn i(f (xi) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 For (I), by Weak Law of Large Numbers, we know that there exists N1 > 0 such that for all n ≥ N1 we have (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (cid:33) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (I) = P i − σ2 2 δ 2 < δ 2 . Now, we are going to evaluate (II). From the sub-Gaussianity of 1, . . . , n, we know that i(f (xi)− f0(xi)) is also sub-Gaussian with mean 0 and sub-Gaussian parameter σ0|f (xi)− f0(xi)|. Hence, by Hoeffding inequality (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 P i(f (xi) − f0(xi)) = P i(f (xi) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n(cid:88) (cid:40) i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) nδ 4 (cid:41) . ≤ 2 exp − (cid:80)n i=1(f (xi) − f0(xi))2 n2δ2 32σ2 0 (cid:107)f(cid:107)n ≤ V . Hence, based on Corollary 8.3 From Proposition 4.2.2, we know that supf∈Frn in van de Geer (2000)[68], (II) will have an exponential bound if there exists some constant C and δ > 0, σ > 0 satisfying V > δ/σ and (cid:32)(cid:90) V δ/(8σ) √ nδ ≥ 2C H1/2(u,Frn,(cid:107) · (cid:107)n)du ∨ V 97 (cid:33) . (4.3) Now, we are going to show that (4.3) holds in our case. It follows from Theorem 14.5 in Anthony and Barglett (2009)[4] that an upper bound of the covering number for Frn is given by (cid:17)2 N (,Frn,(cid:107) · (cid:107)∞) ≤ 4e[rn(d + 2) + 1] (cid:16) 1 (cid:16) 1 (cid:17) 4 V − 1 rn(d+2)+1 where ˜Arn,d,V =(cid:0)e[rn(d + 2) + 1]V 2/(V − 4)(cid:1)rn(d+2)+1. By letting 4 V  := ˜Arn,d,V −[rn(d+2)+1], Arn,d,V = log ˜Arn,d,V − [rn(d + 2) + 1] = [rn(d + 2) + 1] log e[rn(d + 2) + 1]V 2 V − 4 (cid:18) (cid:19) − 1 = [rn(d + 2) + 1] log [rn(d + 2) + 1]V 2 V − 4 , and note that V 2−eV +4e ≥ 0 for all V , we get log [rn(d+2)+1]V 2 V −4 ≥ log V 2 V −4 ≥ log e(V −4) V −4 = 1 and then H(,Frn,(cid:107) · (cid:107)∞) = log N (,Frn,(cid:107) · (cid:107)∞) 1  = log ˜Arn,d,V + [rn(d + 2) + 1] log ≤ Arn,d,V + [rn(d + 2) + 1] ≤ Arn,d,V (cid:19) (cid:18) 1 + 1  . 1  (since log x ≤ x − 1 for all x > 0) Note that n(cid:88) i=1 (cid:107)f(cid:107)2 n = 1 n f 2(xi) ≤ (cid:18) sup x (cid:19)2 |f (x)| = (cid:107)f(cid:107)2∞, 98 δ/(8σ) (cid:90) V H1/2(,Frn,(cid:107) · (cid:107)n)d ≤ A we have H(,Frn,(cid:107) · (cid:107)n) ≤ H(,Frn,(cid:107) · (cid:107)∞). Then (cid:18) (cid:90) V (cid:34)(cid:90) 1 (cid:18) (cid:20)√ (cid:90) 1 (cid:104) 1/2 rn,d,V 1/2 rn,d,V 1/2 rn,d,V ≤ A = A 0 0 0 (cid:19)1/2 (cid:19)1/2 d 1 + 1  1 + 1  − 1 2 d +  √ d + √ (cid:105) 2 + 2 2(V − 1) 2 √ 2 ≤ A 1/2 rn,d,V √ = 2 2A 1/2 rn,d,V V. (cid:19)1/2 (cid:35) d 1  (cid:90) V 1 (cid:18) (cid:21) 1 + 2(V − 1) √ Clearly 2 o(n), we get for any δ > 0 there exists N2 > 0 such that for all n ≥ N2, rn,d,V V ≥ V and under the assumption that [rn(d + 2) + 1] log[rn(d + 2) + 1] = 2A 1/2 (cid:18) 1 n √ 4 2V (cid:19)1/2 Arn,d,V < δ 4 , i.e. (4.3) holds with C = 1 and n ≥ N2. 
Hence, based on Corollary 8.3 in van de Geer (2000)[68], we get for n ≥ N2, i(f (xi) − f0(xi)) δ 4 ∧ 1 n i ≤ σ2 2 ≤ exp − nδ2 64V 2 . (4.4) (cid:33) (cid:26) (cid:27) i=1 99 P∗ (cid:32) Since(cid:82) V sup f∈Frn n(cid:88) i=1 n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:32) P∗ n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > n(cid:88) i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 0 H1/2(,Frn,(cid:107) · (cid:107)n)d < ∞, we may take σ → ∞ in (4.4) to get i(f (xi) − f0(xi)) ≤ exp (cid:26) − nδ2 64V 2 (cid:27) . Let N3 = 64V 2 δ2 log 2 δ , then for n ≥ max{N2, N3}, we have (cid:32) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (II) = P∗ sup f∈Frn i(f (xi) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 ≤ δ 2 . Thus we conclude that for any δ > 0, by taking n ≥ max{N1, N2, N3}, we have (cid:32) P∗ sup f∈Frn (cid:33) |Qn(f ) − Qn(f )| > δ < δ, which proves the desired result. Remark 4.3.1. Lemma 4.3.1 shows that if we have a fixed number of features, the desired Uniform Law of Large Numbers holds when the number of hidden units in the neural network sieve does not grow too fast. Now, we are going to extend the result to a more general case. In Lemma 4.3.1, we assumed that the errors 1, . . . , n are i.i.d. sub-Gaussian and Vn ≡ V . lemma, we are going to relax both restrictions. In the following Lemma 4.3.2. Under the assumption that [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1] = o(n), as n → ∞, we have |Qn(f ) − Qn(f )| p∗−→ 0, as n → ∞. sup f∈Frn 100 Proof. As in the proof of Lemma 4.3.1, it suffices to show that i(f (xi) − f0(xi)) → 0, as n → ∞. (4.5) (cid:32) P∗ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn (cid:34) E∗ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn (cid:33) δ 4 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) i(f (xi) − f0(xi)) → 0, as n → ∞. By Markov’s inequality, (4.5) holds if we can show Now, since E[] = 0 and note that for each f ∈ Frn, it has its corresponding parametrization θn. Since θn is in a compact set, we know that there exists a sequence θn,k → θn as k → ∞ with θn,k ∈ Qrn(d+2)+1 ∩ ([−Vn, Vn]rn+1 × [−Mn, Mn]rn(d+1)). Each θn,k corresponds to a function fk ∈ Frn. By continuity, we have fk(x) → f (x) for each x ∈ X . By Example 2.3.4 in van der Vaart and Wellner (1996)[69], we know that Frn is P -measurable and by symmetrization inequality we have (cid:34) E∗ sup f∈Frn (cid:34) i(f (xi) − f0(xi)) n(cid:88) ξii (f (xi) − f0(xi)) (cid:35) , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤2EEξ sup f∈Frn i=1 (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 i=1 n n(cid:88) i=1 1 n 101 where ξ1, . . . , ξn are i.i.d. Rademacher random variables independent of 1, . . . , n. Based on the Strong Law of Large Numbers, there exists N1 > 0, such that for all n ≥ N1, 2 i < σ2 + 1, a.s. For fixed 1, . . . , n, (cid:80)n i=1 ξii(f (xi) − f0(xi)) is a sub-Gaussian process indexed by f ∈ Frn. Suppose that (Ξ,C, µ) is the probability space on which ξ1, . . . , ξn are defined and let Y (f, ω) =(cid:80)n i=1 ξi(ω)i(f (xi) − f0(xi)) with f ∈ Frn and ω ∈ Ξ. As we have shown above, we have fk → f and by continuity, Y (fk, ω) → Y (f, ω) for any ω ∈ Ξ. This shows that {Y (f, ω), f ∈ Frn} is a separable sub-Gaussian process. 
Hence Corollary 2.2.8 in van der Vaart and Wellner (1996)[69] implies that there exists a universal constant K and for any f∗ n ∈ Frn with n ≥ N1, (cid:34) n(cid:88) (cid:35) ξii(f (xi)) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) n 1√ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) i=1 i=1 n n n i=1 ≤ Eξ = Eξ ≤ Eξ Eξ sup f∈Frn = Eξ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n(cid:88) i=1 i=1 sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) ξii(f (xi) − f0(xi)) (cid:118)(cid:117)(cid:117)(cid:116)log N (cid:16) 1 (cid:17) (cid:90) ∞ 2η,Frn, d (cid:118)(cid:117)(cid:117)(cid:116)log N (cid:16) 1 (cid:17) (cid:90) 2Vn 2 η,Frn, d (cid:118)(cid:117)(cid:117)(cid:117)(cid:116)log N (cid:18) (cid:90) 2Vn + K + K n n 0 0 2 dη dη ξii(f∗ n(xi) − f0(xi)) ξii(f∗ n(xi) − f0(xi)) ξii(f∗ n(xi) − f0(xi)) + K 0 (cid:19) dη, √ η,Frn,(cid:107) · (cid:107)∞ 1 σ2+1 n where the second equality follows by Proposition 4.2.2 and for f, g ∈ Frn, (cid:34) n(cid:88) (cid:18) 1√ (cid:32) n(cid:88) i=1 1 n i=1 d(f, g) = = n i(f (xi) − f0(xi)) − 1√ (cid:33)1/2 n i (f (xi) − g(xi))2 2 (cid:19)2(cid:35)1/2 i(g(xi) − f0(xi)) so that the last inequality follows by noting that (cid:32) (cid:33)1/2 2 i . n(cid:88) i=1 1 n d(f, g) ≤ (cid:107)f − g(cid:107)∞ 102 Now we are going to evaluate these two terms. For the first term, for n ≥ N1, by Cauchy- Schwarz inequality, we have (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 Eξ ξii(f∗ n(xi) − f0(xi)) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ 1 n (cid:32) ≤(cid:112) ≤ 1 n n(cid:88) |i||f∗ n(xi) − f0(xi)| (cid:33)1/2(cid:32) n(cid:88) n(cid:88) (f∗ n(xi) − f0(xi))2 i=1 2 i 1 n |f∗ n(x) − f0(x)|, a.s. i=1 i=1 σ2 + 1 sup x∈X (cid:33)1/2 By choosing f∗ by Hornik et al. (1989)[33], we know that supx∈X |f∗ for any ζ > 0, there exists N2 > 0, such that for all n ≥ N2, n = πrnf0 and from the universal approximation theorem of neural network n(xi) − f0(xi)| → 0 as n → ∞ so that |f∗ n(xi) − f0(xi)| < ζ√ σ2 + 1 . sup x∈X Therefore, by choosing n ≥ N1 ∨ N2, we get (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 Eξ ξii(f∗ n(xi) − f0(xi)) (cid:35) < ζ a.s. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) For the second term, we use the same bound from Theorem 14.5 in Anthony and Bartlett (2009)[4] as we did in the proof of Lemma 4.3.1: (cid:18) N (cid:19) √ 1 σ2 + 1 2 η,Frn,(cid:107) · (cid:107)∞ 8 √ ≤ σ2 + 1e[rn(d + 2) + 1] (cid:16) 1 (cid:17) 4 Vn − 1 := ˜Brn,d,Vnη−[rn(d+2)+1], η (cid:16) 1 4 Vn (cid:17)2 rn(d+2)+1 103 (cid:16) √ 2 where ˜Brn,d,Vn = σ2 + 1e[rn(d + 2) + 1]V 2 n /(Vn − 4) (cid:17)rn(d+2)+1 . By letting ≤ 2[rn(d + 2) + 1] log Vn − 4 √ where by choosing N3 so that rn(d + 2) + 1 ≥ 2 noting that V 2 √ σ2 + 1). 
We also have for log(2 n − Vn + 4 ≥ 0 for all Vn so that log [rn(d+1)+1]V 2 (cid:18) Vn−4 (cid:19) (cid:18) n H √ 2 1 σ2 + 1 η,Frn,(cid:107) · (cid:107)∞ = log N √ 2 1 σ2 + 1 η,Frn,(cid:107) · (cid:107)∞ σ2 + 1, the last inequality follows by √ ≥ log 2 (cid:19) σ2+1(Vn−4) Vn−4 = 1 η Brn,d,Vn = log ˜Brn,d,Vn − [rn(d + 2) + 1] = [rn(d + 2) + 1] log (cid:32) (cid:18) (cid:33) − 1 (cid:19) (cid:112) √ σ2 + 1e[rn(d + 2) + 1]V 2 2 n Vn − 4 [rn(d + 2) + 1]V 2 n Vn − 4 [rn(d + 2) + 1]V 2 n , for all n ≥ N1 ∨ N3, = [rn(d + 2) + 1] log + log(2 σ2 + 1) = log ˜Brn,d,Vn + [rn(d + 2) + 1] log ≤ Brn,d,Vn + [rn(d + 2) + 1] ≤ Brn,d,Vn (cid:18) (cid:19) 1 + 1 η 1 η 104 and hence for all n ≥ N1 ∨ N3, (cid:18) (cid:90) 2Vn H1/2 √ 2 0 (cid:19) dη dη dη + √ 1 σ2 + 1 1 + η,Frn,(cid:107) · (cid:107)∞ (cid:19)1/2 (cid:18) (cid:90) 2Vn (cid:34)(cid:90) 1 (cid:19)1/2 (cid:18) (cid:20)√ (cid:90) 1 1 + 1 η 1 η 0 0 η−1/2dη + ≤ B 1/2 rn,d,Vn = B 1/2 rn,d,Vn ≤ B 1/2 rn,d,Vn √ ≤ 4 2B 2 0 Vn, 1/2 rn,d,Vn (cid:19)1/2 (cid:35) dη 1 η (cid:90) 2Vn (cid:18) 1 1 + (cid:21) 2(2Vn − 1) which implies that (cid:18) (cid:118)(cid:117)(cid:117)(cid:117)(cid:116)H (cid:90) 2Vn 0 √ 1 σ2+1 2 η,Frn,(cid:107) · (cid:107)∞ n (cid:19) √ (cid:114) 2n−1/2B dη ≤ 4 ∼ 8 1/2 rn,d,Vn Vn [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) , n where the last part follows by noting that log V 2 in the Lemma, there exists N4 > 0, such that for all n ≥ N4, we have Vn−4 ∼ log Vn. Under the assumption given n (cid:114) [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) n < ζ 8 . Therefore, by choosing n ≥ N1 ∨ N2 ∨ N3 ∨ N4, we get (cid:104) i.e. Eξ supf∈Frn (cid:12)(cid:12)(cid:12) 1 n (cid:12)(cid:12)(cid:12)(cid:105) → 0 a.s.. Moreover, based on what we Eξ n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) (cid:80)n i=1 ξii(f (xi) − f0(xi)) sup f∈Frn i=1 n ξii(f (xi) − f0(xi)) (cid:35) < 2ζ a.s., (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 105 have shown, we know that for n sufficiently large, (cid:34) Eξ sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 ξii(f (xi) − f0(xi)) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤(cid:112) σ2 + 1(cid:107)πrnf0 − f0(cid:107)∞+ √ 2KB 4 1/2 rn,d,Vn n−1/2Vn → 0, a s. (cid:104)√ σ2 + 1(cid:107)πrnf0−f0(cid:107)∞+ σ2 + 1(cid:107)πrnf0 − f0(cid:107)∞ + 4 n−1/2Vn → 0 < ∞, by Generalized Dominated Convergence Theorem, we n−1/2Vn 1/2 rn,d,Vn 2KB (cid:105) √ √ = and since E √ 1/2 4 rn,d,Vn 2KB know that E∗ (cid:34) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) n n(cid:88) i=1 sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn ≤ 2EEξ i(f (xi) − f0(xi)) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ξii (f (xi) − f0(xi)) (cid:35) → 0, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) which completes the proof. Based on the above lemmas, we are ready to state the theorem on the consistency of neural network sieve estimators. Theorem 4.3.1. Under the notation given above, if [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1] = o(n), as n → ∞, (4.6) then (cid:107) ˆfn − f0(cid:107)n p−→ 0. 106 Proof. Since Q is continuous at f0 ∈ F and Q(f0) = σ2 < ∞, we note that for any  > 0, inf f :(cid:107)f−f0(cid:107)n≥ Qn(f ) − Qn(f0) = inf f :(cid:107)f−f0(cid:107)n≥ 1 n i=1 n(cid:88) (f (xi) − f0(xi))2 ≥ 2 > 0. Hence, it follows from Lemma 4.2.1, Lemma 4.3.2 and Corollary 2.6 in White and Wooldridge (1991)[76] that (cid:107) ˆfn − f0(cid:107)n p−→ 0. Remark 4.3.2. We discuss the condition (4.6) in Theorem 4.3.1 via some simple examples here. If αj = O(1) for j = 1, . . . 
, rn, then Vn = O(rn) and [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) = O(r3 n log rn). (cid:16) (n/ log n)1/3(cid:17) rn = o Therefore, a possible growth rate for the number of hidden units in a neural network is . On the other hand, if we have a slow growth rate for the number of hidden units in the neural network, such as rn = log Vn, then we have [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) = O((Vn log Vn)2). (cid:17) (cid:16) Hence, a possible growth rate for the upper bound of the weights from the hidden layer to the output layer is Vn = o n1/2/ log n . 107 4.4 Rate of Convergence To obtain the rate of convergence for neural network sieves, we are going to apply Theorem 3.4.1 in van der Vaart and Wellner (1996)[69]. Lemma 4.4.1. Let f∗ and δ > 8(cid:107)f∗ n − f0(cid:107)n, we have n = πrnf0 ∈ Frn. Then under the notations given above, for every n δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn Qn(f∗ n) − Qn(f ) (cid:46) −δ2. Proof. First, we note that Qn(f∗ n) − Qn(f ) = 1 n = (cid:107)f∗ i=1 n(cid:88) (f∗ n(xi) − f0(xi))2 + σ2 − 1 n n(cid:88) i=1 (f (xi) − f0(xi))2 − σ2 n − f0(cid:107)2 n − (cid:107)f − f0(cid:107)2 n. So to show the result, we need to provide an upper bound for Qn(f∗ (cid:107)f − f∗ n(cid:107)n. Due to the fact that (cid:107) · (cid:107)n is a pseudo-norm, the triangle inequality gives n) − Qn(f ) in terms of (cid:107)f − f∗ n(cid:107)n ≤ (cid:107)f − f0(cid:107)n + (cid:107)f∗ = (cid:107)f − f0(cid:107)n − (cid:107)f∗ n − f0(cid:107)n n − f0(cid:107)n + 2(cid:107)f∗ n − f0(cid:107)n. Therefore, we have (cid:107)f − f0(cid:107)n − (cid:107)f∗ n − f0(cid:107)n ≥ (cid:107)f − f∗ n(cid:107) − 2(cid:107)f∗ n − f0(cid:107)n, 108 so that for every f such that (cid:107)f − f∗ n(cid:107)2 n ≥ 16(cid:107)f∗ n − f0(cid:107)2 n, i.e., (cid:107)f − f∗ n(cid:107)n ≥ 4(cid:107)f∗ n − f0(cid:107)n, (cid:107)f − f0(cid:107)n − (cid:107)f∗ n − f0(cid:107)n ≥ (cid:107)f − f∗ n(cid:107)n − 1 2 (cid:107)f − f∗ n(cid:107)n = (cid:107)f − f∗ n(cid:107)n ≥ 0. 1 2 By squaring both sides, we obtain (cid:107)f − f∗ n(cid:107)2 n ≤ (cid:107)f − f0(cid:107)2 ≤ (cid:107)f − f0(cid:107)2 = (cid:107)f − f0(cid:107)2 n + (cid:107)f∗ n + (cid:107)f∗ n − (cid:107)f∗ n − f0(cid:107)2 n − f0(cid:107)2 n − f0(cid:107)2 n n − 2(cid:107)f − f0(cid:107)n(cid:107)f∗ n − 2(cid:107)f∗ n − f0(cid:107)2 n n − f0(cid:107)n 1 4 and hence δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn Qn(f∗ n) − Qn(f ) ≤ (cid:107)f−f∗ sup n(cid:107)n> δ 2 ,f∈Frn sup n(cid:107)n> δ 2 ,f∈Frn ≤ (cid:107)f−f∗ (cid:46) −δ2. (cid:107)f∗ (cid:18) n − f0(cid:107)2 n − (cid:107)f − f0(cid:107)2 n (cid:19) (cid:107)f − f∗ n(cid:107)2 n −1 4 Lemma 4.4.2. For every sufficiently large n and δ > 8(cid:107)f∗  E∗ √ δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn n [(Qn − Qn)(f∗ n) − (Qn − Qn)(f )]+ n − f0(cid:107)n, we have (cid:113) log N (η,Frn,(cid:107) · (cid:107)∞)dη  (cid:46) (cid:90) δ 0 109 Proof. Note that (Qn − Qn)(f∗ n) = (Qn − Qn)(f∗ n) = n(cid:88) n(cid:88) i=1 i=1 1 n 1 n i − σ2 − 2 2 n i − σ2 − 2 2 n n(cid:88) n(cid:88) i=1 i=1 i(f∗ n(xi) − f0(xi)) i(f (xi) − f0(xi)), we have (Qn − Qn)(f∗ n) − (Qn − Qn)(f ) = n(cid:88) i=1 2 n i(f (xi) − f∗ n(xi)). 
Hence by following similar arguments as in the proof of Lemma 4.3.2 as well as applying Corollary 2.2.8 in van der Vaart and Wellner (1996)[69], we have  n) − (Qn − Qn)(f )]+ E∗    (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n(xi)) √ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n [(Qn − Qn)(f∗ n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ i(f (xi) − f∗ n(cid:88) i=1 n n i=1 ξii (f (xi) − f∗ n(xi))  (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn ≤E∗ δ  sup n(cid:107)n≤δ,f∈Frn 2 <(cid:107)f−f∗ ≤2EEξ sup (cid:90) δ (cid:113) 2 <(cid:107)f−f∗ n(cid:107)n≤δ,f∈Frn log N (η,Frn,(cid:107) · (cid:107)∞)dη, (cid:46) δ 0 where the last inequality follows since f∗ n ∈ Frn for n large enough. Now we are ready to apply Theorem 3.4.1 in van der Vaart and Wellner (1996)[69] to obtain the rate of convergence for neural network sieve estimators. Theorem 4.4.1. Under the above notations, if ηn = O(cid:16) n, rn(d + 2) log(rnVn(d + 2))/n, rn(d + 2) log n/n}(cid:17) , min{(cid:107)πrnf0 − f0(cid:107)2 110 then (cid:107) ˆfn−f0(cid:107)n = Op (cid:32) max (cid:40) (cid:107)πrnf0 − f0(cid:107)n, (cid:114) rn(d + 2) log[rnVn(d + 2)] n (cid:114) , rn(d + 2) log n n (cid:41)(cid:33) . Proof. Use the same bound from Theorem 14.5 in Anthony and Bartlett (2009)[4] as we did before, we have log N (η,Frn,(cid:107) · (cid:107)n) ≤ log N (η,Frn,(cid:107) · (cid:107)∞) 4e[rn(d + 2) + 1] (cid:16) 1 (cid:16) 1 (cid:17) 4 Vn − 1 η 4Vn (cid:17)2 rn(d+2)+1 ≤ log = [rn(d + 2) + 1] log ˜Crn,d,Vn η , > e. Then it follows from Lemma 3.8 in Mendelson where ˜Crn,d,Vn = e[rn(d+2)+1]V 2 (2003)[48] that for δ < 1, Vn−4 n (cid:90) δ (cid:113) 0 log N (η,Frn,(cid:107) · (cid:107)n)dη ≤ [rn(d + 2) + 1]1/2 (cid:46) [rn(d + 2) + 1]1/2δ log (cid:115) (cid:90) δ (cid:115) 0 ˜Crn,d,Vn dη log η ˜Crn,d,Vn δ := φn(δ). 111 (cid:114) log ˜Crn,d,Vn δ , then since for 0 < δ < 1 log log ˜Crn,d,Vn δ ˜Crn,d,Vn δ − 1 2 − 1 2 log−1/2 ˜Crn,d,Vn δ δ2 ˜Crn,d,Vn ˜Crn,d,Vn δ2 log−1/2 ˜Crn,d,Vn δ  Define h : δ (cid:55)→ φn(δ)/δα = [rn(d + 2) + 1]1/2δ1−α (cid:115) (cid:115) (1 − α)δ−α (1 − α)δ−α h(cid:48)(δ) = [rn(d + 2) + 1]1/2 = [rn(d + 2) + 1]1/2 < 0  for 1 < α < 2, we can know that δ (cid:55)→ φn(δ)/δα is decreasing on (0,∞). Let ρn (cid:46) (cid:107)πrnf0 − f0(cid:107)−1 n , then note that (cid:18) 1 (cid:19) ρn ρ2 nφn = ρn[rn(d + 2) + 1]1/2 log1/2(cid:16) = [rn(d + 2) + 1]1/2ρn log ρn + log ˜Crn,d,Vn (cid:17) ρn ˜Crn,d,Vn (cid:113) and we have log ˜Crn,d,Vn = 1 + log [rn(d + 2) + 1]V 2 n Vn − 4 (cid:46) log [rn(d + 2) + 1]V 2 n Vn − 4 ∼ log[rnVn(d + 2)], (cid:18) 1 (cid:19) ρn (cid:46) √ ρ2 nφn n ⇔ rn(d + 2)ρ2 n (log ρn + log[rnVn(d + 2)]) (cid:46) n. 112 This shows that for n rn(d + 2) log[rnVn(d + 2)] (cid:19)1/2 (cid:18) , n rn(d + 2) log n (cid:19)1/2(cid:41) , we have ρ2 nφn n. Based on these observation, Lemma 4.4.1, Lemma 4.4.2 and Theorem 3.4.1 in van der Vaart and Wellner (1996)[69] imply that ρn (cid:46) min (cid:16) 1 ρn (cid:40)(cid:18) (cid:17) (cid:46) √ (cid:32) max (cid:107) ˆfn−πrnf0(cid:107)n = Op rn(d + 2) log[rnVn(d + 2)] n (cid:40) (cid:107)πrnf0 − f0(cid:107)n, (cid:114) (cid:114) , rn(d + 2) log n n (cid:41)(cid:33) . By triangle inequality, we can further get (cid:114) (cid:107) ˆfn − f0(cid:107)n ≤ (cid:107) ˆfn − πrnf0(cid:107)n + (cid:107)πrnf0 − f0(cid:107)n (cid:40) (cid:107)πrnf0 − f0(cid:107)n, = Op (cid:32) max rn(d + 2) log[rnVn(d + 2)] n (cid:114) , rn(d + 2) log n n (cid:41)(cid:33) . Remark 4.4.1. 
Recall that a sufficient condition to ensure consistency is rn(d+2)V 2 2)] = o(n). In this case, rn(d + 2) log[rnVn(d + 2)] ≤ n and the rate of convergence can be simplified to n log[rnVn(d+ (cid:32) (cid:40) (cid:107)πrnf0 − f0(cid:107)n, (cid:114) (cid:107) ˆfn − f0(cid:107)n = Op max (cid:41)(cid:33) . rn(d + 2) log n n If we assume f0 ∈ F where F is the space of functions, which have finite first absolute 113 moments of the Fourier magnitude distributions, i.e., (cid:26) (cid:90) (cid:110) (cid:111) F = f : Rd → R : f (x) = (cid:90) exp iaT x dµf (a), max((cid:107)a(cid:107)1, 1)d|µf|(a) ≤ C (cid:27) , (4.7) (cid:107)µf(cid:107)1 := sup(cid:80)∞ and (cid:107)a(cid:107)1 =(cid:80)d n=1 |µ(An)| and the supremum is taken over all measurable partitions {An}∞ where µf is a complex measure on Rd; |µf| denotes the total variation of µf , i.e., |µ|(A) = n=1 of A; i=1 |ai| for a = [a1, . . . , ad]T ∈ Rd. Theorem 3 in Markovoz (1996)[45] shows n and Vn ≡ V in that δn := (cid:107)f0 − πrnf0(cid:107)n (cid:46) r the proof of Theorem 4.4.1, δn must also satisfy the following inequality: . So if we let d fixed and ρn = δ−1 −1/2−1/(2d) n (cid:18) 1 (cid:19) ρn ρ2 nφ (cid:46) ρnr n log1/2(cid:16) 1/2 (cid:17) (cid:46) √ n ρn ˜Crn,d,Vn nrn log rn (cid:46) n ⇒ ρ2 nrn log ρn + ρ2 ⇒ δ−2 n (−rn log δn + rn log rn) (cid:46) n 1+ 1 ⇒ r d n rn log rn (cid:46) n One possible choice of rn to satisfy such condition is rn (cid:16) (n/ log n) obtain d 2+d . In such a case, we (cid:107) ˆfn − f0(cid:107)n = Op (cid:18) n log n 4(1+1/(2d)) (cid:19)− 1+1/d  , (n/ log n)−1/3(cid:17) (cid:16) which is the same rate obtained in Chen and Shen (1998)[12]. It is interesting to note that in the case where d = 1, we have (cid:107) ˆfn− f0(cid:107)n = Op . Such rate is close to the Op(n−1/3), which is the convergence rate in non-parametric least square problems when the class of functions considered has bounded variation in R (see Example 9.3.3 in van de Geer 114 (2000)[68]) and as shown in Proposition C.0.3 in the Appendix C, Frn is a class of functions with bounded variation in R so that the convergence rate we obtained makes sense. 4.5 Asymptotic Normality Through out this section, we will assume that E[||2+λ] < ∞ for some λ > 0. To establish the asymptotic normality of sieve estimator based on neural network, we follow the idea in Shen (1997)[64]. We start by calculating the Gâteaux derivative of the empirical criterion function Qn(f ) = n−1(cid:80)n i=1(yi − f (xi))2. (cid:34) n(cid:88) (yi − f0(xi) − t(f (xi) − f0(xi)))2 − 1 (cid:104) n(cid:88) n [f − f0] = lim t→0 Q(cid:48) n,f0 1 n 1 t (cid:35) (yi − f0(xi))2 n(cid:88) i=1 (yi − f0(xi))2 − 2t(yi − f0(xi))(f (xi) − f0(xi)) +t2(f (xi) − f0(xi))2 − (yi − f0(xi))2(cid:105) i=1 1 t = lim t→0 1 n i=1 n(cid:88) i=1 = − 2 n i(f (xi) − f0(xi)). Then the remainder of first-order functional Taylor series expansion is Rn[f − f0] = Qn(f ) − Qn(f0) − Q(cid:48) n,f0 (yi − f (xi))2 − 1 n 1 n = [f − f0] n(cid:88) i=1 n(cid:88) n(cid:88) i=1 2 n i=1 n(cid:88) i=1 (yi − f (xi))2 + i(f (xi) − f0(xi)) 2 i + 2 n i(f (xi) − f0(xi)) i=1 n(cid:88) n(cid:88) n(cid:88) i=1 i=1 = = 1 n 1 n (i + f0(xi) − f (xi))2 − 1 n (f (xi) − f0(xi))2 = (cid:107)f − f0(cid:107)2 n. 115 empirical process {n−1/2(cid:80)n As will be seen in the proof of asymptotic normality, the rate of convergence for the i=1 i(f (xi) − f0(xi)) : f ∈ Frn} plays an important role. Here we establish a lemma, which will be used to find the desired rate of convergence. Lemma 4.5.1. Let X1, . . . 
, Xn be independent random variables with Xi ∼ Pi. Define the empirical process {νn(f )} as n(cid:88) νn(f ) = 1√ n [f (Xi) − Pif ]. Let Fn = {f : (cid:107)f(cid:107)∞ ≤ Vn}. Let  > 0 and α ≥ supf∈Fn n−1(cid:80)n i=1 Define t0 by H(t0,Fn,(cid:107)·(cid:107)∞) =  4 ψ(M, n, α), where ψ(M, n, α) = M 2/ 2α 1 + M Vn nα √ 2 . If i=1 Var[f (Xi)] be arbitrary. (cid:104) (cid:16) (cid:17)(cid:105) H(u,Fn,(cid:107) · (cid:107)∞) ≤ Anu−r, (4.8) for some 0 < r < 2 and u ∈ (0, a], where a is a small positive number, and there exists a positive constant Ki = Ki(r, ), i = 1, 2 such that M ≥ K1A r+2 n 2 2−r r+2 n V n r−2 2(r+2) ∨ K2A 1/2 n α 2−r 4 . Then (cid:32) P∗ (cid:33) |νn(f )| > M sup f∈Fn ≤ 5 exp{−(1 − )ψ(M, n, α)} . Proof. The proof of the lemma is similar to the proof of Corollary 2.2 in Alexander (1984) [2] and the proof of Lemma 1 in Shen and Wong (1994)[65]. Since H(u,Fn,(cid:107)·(cid:107)∞) ≤ Anu−r 116 for some 0 < r < 2, then we have (cid:90) t s Based on our assumption, I(s, t) := H1/2(u,Fn,(cid:107) · (cid:107)∞)du ≤ 2(2 − r)−1A 1 2 n t1− r 2 . Ant−r 0 ≥ H(t0,Fn,(cid:107) · (cid:107)∞) = ψ(M, n, α) ⇒ t0 ≤  4 (cid:20)4An (cid:21)1/r ψ . √ Note that if M ≤ 3 √ nα/Vn, then ψ(M, n, α) ≥ M 2/(4α) and if M ≥ 3 nα/Vn, then √ 2( √ nα + M Vn/3) ≤ 4M Vn/3 and hence ψ(M, n, α) ≥ 3 nM/(4Vn). In summary,  M 2/(4α) √ 3 nM/(4Vn) √ if M < 3 nα/Vn, √ if M ≥ 3 nα/Vn . ψ(M, n, α) ≥ √ Therefore, if M ≥ 3 nα/Vn, (cid:19) (cid:18) M √ 64 , t0 n 28−3/2I ≤ 29−3/2(2 − r)−1A 1− r 1/2 2 n t 0 (cid:18)4 (cid:19) 1 r− 1 2  4− 1 1 2r M 1 2− 1 r , (cid:18) 3 4Vn √ nM 1/r n A (cid:19) 1 2− 1 r ≤ 29−3/2(2 − r)−1 = ˜K1A 1/r n V r− 1 2 1 n n 117 where ˜K1 = 29−3/2(2 − r)−1(cid:16) 4 (cid:18) M  (cid:17) 1 (cid:19) 28−3/2I √ 64 n , t0 2(cid:16) 3 (cid:17) 1 4 2− 1 r− 1 r . Hence < M ⇔ ˜K1A r− 1 2 1 1/r n V n 1 ⇔ ˜K1A 1/r n V n ⇔ M > K1A r− 1 2 2 r+2 n r < M 1 1 2r M 2− 1 4− 1 r−2 4r < M 2−r r−2 r+2 2(r+2) , n 1 r + 1 2 n n n V where K1 = ˜K 2r r+2 1 √ . On the other hand, if M < 3 nα/Vn, (cid:19) (cid:18) M √ 64 , t0 n 28−3/2I ≤ 29−3/2(2 − r)−1A 1− r 1/2 2 n t 0 (cid:18)4 (cid:19) 1 r− 1 2 (cid:19) 1 2− 1 r (cid:18) M 2 4α 1/r n A ≤ 29−3/2(2 − r)−1 = ˜K2A (cid:17) 1 r− 1 1/r n M 1− 2 (cid:17) 1 2(cid:16) 1 (cid:19) 2− 1 4  r− 1 2 , 1 r α r . Hence where ˜K2 = 29−3/2(2 − r)−1(cid:16) 4 (cid:18) M  28−3/2I √ 64 n , t0 < M ⇔ ˜K2A 2 < M 1 r− 1 1/r r α n M 1− 2 2−r 2r < M 2−r 4 , 1/2 n α 2 r ⇔ ˜K2A 1/r n α ⇔ M > K2A where K2 = ˜K 28−3/2I (cid:16) M √ 64 n r/2 2 . (cid:17) , t0 In conclusion, if M ≥ K1A < M and by Theorem 2.1 in Alexander (1984) [2], we get the desired 1/2 n α r+2 n n V 2−r r+2 n r−2 2(r+2) ∨ K2A 2−r 4 , then 2 result. As a Corollary to Lemma 4.5.1, we can get the rate of convergence for the the empirical 118 process {n−1/2(cid:80)n i=1 i(f (xi) − f0(xi)) : f ∈ Frn}. Corollary 4.5.1. Let ρn be such that ρn(cid:107) ˆfn − f0(cid:107)n = Op(1) and Frn be the class of neural network sieves as defined in (4.2). Suppose that E[||2+λ] < ∞ for some λ > 0. Then under the conditions (C1) rn(d + 2)Vn log[rnVn(d + 2)] = o(n1/4); (C2) nρ−2 n /V λ n = o(1), we have   P∗ ≤P∗ sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn n ,f∈Frn sup (cid:107)f−f0(cid:107)n≤ρ−1 + P∗  sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn Proof. To establish the desired result, we apply the truncation device. 
n(cid:88) i=1 n i(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M i(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n(cid:88) n(cid:88) i=1 i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = op(1).   (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M iI{|i|≤Vn}(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n(cid:88) i=1 iI{|i|>Vn}(f − f0)(xi)  (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn :=(I) + (II) For (I), we can apply Lemma 4.5.1 directly. Note that |I{||≤Vn}(f − f0)(x)| ≤ Vn(Vn + 119 (cid:17)2 4 Vn η 4e[rn(d + 2) + 1] (cid:16) 1 (cid:16) 1 (cid:17) (cid:18) 4 Vn − 1 (cid:18) (cid:19) (cid:18) 1 η = [rn(d + 2) + 1] log ˜Crn,d,Vn + log ≤ [rn(d + 2) + 1] log ˜Crn,d,Vn + rn(d+2)+1 (cid:19) (cid:19) 1 η − 1 1 η (cid:107)f0(cid:107)∞) (cid:46) V 2 n since (cid:107)f0(cid:107)∞ < ∞ and for 0 < η < 1, log N (η,Frn,(cid:107) · (cid:107)∞) ≤ log = Crn,d,Vn ≤ 2Crn,d,Vn 1 + 1 η , and where ˜Crn,d,Vn = e[rn(d+2)+1]V 2 Vn−4 n Crn,d,Vn = [rn(d + 2) + 1] log ˜Crn,d,Vn − [rn(d + 2) + 1] [rn(d + 2) + 1]V 2 n = [rn(d + 2) + 1] log ∼ rn(d + 2) log[rnVn(d + 2)]. Vn − 4 Therefore, equation (4.8) is satisfied with r = 1 and An = 2Crn,d,Vn and it follows from Lemma 4.5.1 that for M (cid:38) C α1/4, we have (I) ≤ 5 exp{−(1 − )ψ(M, n, α)} n n−1/6∨C V 2/3 2/3 rn,d,Vn 1/2 rn,d,Vn  . 2/3 n V 2/3 rn,d,Vn n1/6 and hence (cid:107)f−f0(cid:107)≤ρ−1 sup n ,f∈Frn (cid:12)(cid:12)(cid:12)(cid:12) 1√ n iI{|i|≤Vn}(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12) = Op C 120 Then since by (C1), C 2/3 n V 2/3 rn,d,Vn n1/6 (cid:18)rn(d + 2)Vn log[rnVn(d + 2)] (cid:19)2/3 n1/4 ∼ = op(1). For (II), by Cauchy-Schwarz inequality, we have iI{|i|>Vn}(f − f0)(xi) n(cid:88) i=1 1 n 2 i I{|i|>Vn} (cid:33)1/2 (cid:107)f − f0(cid:107)n. Then it follows from Markov inequality that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M n−1/2  iI{|i|>Vn}(f − f0)(xi) sup n ,f∈Frn I{|i|>Vn} (cid:107)f−f0(cid:107)n≤ρ−1 n(cid:88) n(cid:88) 1 n 2 i i=1 I{|i|>Vn} (cid:38) M 2n−1ρ2 n 2 i ρ−1 n (cid:38) M n−1/2 (cid:33)  It then follows from condition (C2) that (II) → 0 and hence sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn iI{|i|>Vn}(f − f0)(xi) Combining the results we obtained above, we get sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn i(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = op(1). (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = op(1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (II) = P∗  (cid:32) (cid:32) 1 n ≤ P = P i=1 (cid:46) M−2nρ−2 (cid:46) M−2nρ−2 n n E[2I||>Vn] E[||2+λ] . V λ n (cid:32) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) (cid:33)1/2 i=1 n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n(cid:88) i=1 121 Remark 4.5.1. Condition (C2) can be further simplified using the results we obtained in Theorem 4.4.1. If ηn = O(cid:0)min{(cid:107)πrnf0 − f0(cid:107)2 n, rn(d + 2) log(rnVn(d + 2))/n, rn(d + 2) log n/n}(cid:1), (cid:110)(cid:107)πrnf0 − f0(cid:107)n,(cid:112)rn(d + 2) log[rnVn(d + 2)]/n,(cid:112)rn(d + 2) log n/n (cid:110)(cid:107)πrnf0 − f0(cid:107)n,(cid:112)rn(d + 2) log n/n (cid:111) (cid:111) n (cid:16)(cid:112)rn(d + 2) log n/n. This holds for functions having n (cid:16) max n (cid:16) max then ρ−1 follows from condition (C1), that ρ−1 simplicity, here we assume that ρ−1 . For . 
It finite first absolute moments of the Fourier magnitude distributions as we have discussed at the end of section 4.4. Then in this case, nρ−2 n /V λ n (cid:16) rn(d + 2) log n/V λ n , so that condition (C2) becomes rn(d + 2) log n/V λ n → 0. Now we are going to establish the asymptotic normality for neural network estimators. For f ∈ {f ∈ Frn : (cid:107)f − f0(cid:107)n ≤ ρ−1 n }, we consider a local alternative ˜fn(f ) = (1 − δn)f + δn(f0 + ι), (4.9) where 0 ≤ δn = η 1/2 n = o(n−1/2) is chosen such that ρnδn = o(1) and ι(x) ≡ 1. Theorem 4.5.1 (Asymptotic Normality). Suppose that 0 ≤ ηn = o(n−1) and conditions (C1) and (C2) in Corollary 4.5.1 hold. Assume further that the following two conditions hold (C3) sup f∈Frn :(cid:107)f−f0(cid:107)n≤ρ−1 n (C4) sup f∈Frn :(cid:107)f−f0(cid:107)n≤ρ−1 n (cid:16) ˜fn(f ) − ˜fn(f )(cid:107)n = Op(ρnδ2 n); (cid:107)πrn (cid:80)n 1 n i=1 i (cid:17) = Op(δ2 n), ˜fn(f )(xi) − ˜fn(f )(xi) πrn 122 then (cid:104) ˆfn(xi) − f0(xi) (cid:105) d−→ N (0, σ2). n(cid:88) i=1 1√ n Before we proceed to the proof of the theorem, let us focus on the conditions given in the theorem. For condition (C1), note that if (C1) holds, then rn(d + 2)V 2 n log[rnVn(d + 2)] ≤ [rn(d + 2)]4V 4 n (log[rnVn(d + 2)])4 = o(n), mator. As in Remark 4.3.2, we consider some simple scenarios here. so it is a sufficient condition to ensure the consistency of the neural network sieve esti- If Vn = O(rn), (cid:1) so that a possible growth rate for rn is then rn(d + 2)Vn log[rnVn(d + 2)] = O(cid:0)r2 O(cid:0)Vn(log Vn)2(cid:1) and a possible growth rate for Vn is Vn = o(n1/4/(log n)2). Thus, we can see . On the other hand, if rn = log Vn, then rn(d+2)Vn log[rnVn(d+2)] = n1/8/(log n)2(cid:17) n log rn rn = o (cid:16) that in both cases, the growth rate required for the asymptotic normality of neural network sieve estimator is slower than the growth rate required for the consistency as given in Remark 4.3.2. One explanation for this is that due to the Universal Approximation Theorem, a neural network with one hidden layer can approximate a continuous function on compact support arbitrarily well if the number of hidden units is sufficiently large. Therefore, if the number of hidden units is too large, the neural network sieve estimator ˆfn may be very close to the best projector of the true function f0 in Frn so that the error(cid:80)n number of hidden units can increase the variations in the quantity(cid:80)n (cid:105) (cid:104) ˆfn(xi) − f0(xi) (cid:104) ˆfn(xi) − f0(xi) to zero and the variations in this quantity may be small. By allowing slower growth rate of the may be close (cid:105) i=1 , i=1 which makes the asymptotic normality more reasonable. On the other hand, condition (C3) and condition (C4) are similar conditions as in Shen (1997)[64], which are known for condi- tions on approximation error and these conditions dictate that the approximation rate of a 123 single layer neural network cannot be too slow, otherwise it may require a huge number of samples to reach the desired approximation error. Therefore, the conditions in the theorem can be considered as a trade-off between bias and variance. Proof of Theorem 4.5.1. The main idea of the proof is to use the functional Taylor series expansion for Qn(f ) and carefully bound each term in the expansion. For any f ∈ {f ∈ Frn : (cid:107)f − f0(cid:107)n ≤ ρ−1 n }, Qn(f ) = Qn(f0) + Q(cid:48) i − 2 2 n n(cid:88) 1 n = n,f0 [f − f0] + Rn[f − f0] n(cid:88) i(f (xi) − f0(xi)) + i=1 i=1 n(cid:88) i=1 1 n (f (xi) − f0(xi))2. 
(4.10) Note that (cid:107) ˜fn(f ) − f0(cid:107)n = (cid:107)(1 − δn) ˆfn + δn(f0 + ι) − f0(cid:107)n = (cid:107)(1 − δn)( ˆfn − f0) + δnι(cid:107)n ≤ (1 − δn)(cid:107) ˆfn − f0(cid:107)n + δn, and since δn = o(n−1/2), we can know that with probability tending to 1, (cid:107) ˜fn(f ) − f0(cid:107)n ≤ ρ−1 n . Then replacing f in (4.10) by ˆfn and πrn ˜fn(f ) as defined in (4.9), we get n(cid:88) n(cid:88) i=1 i=1 124 Qn( ˆfn) = Qn(πrn ˜fn(f )) = 1 n 1 n n(cid:88) n(cid:88) i=1 i=1 i − 2 2 n i − 2 2 n i( ˆfn(xi) − f0(xi)) + (cid:107) ˆfn − f0(cid:107)2 n i(πrn ˜fn(f )(xi) − f0(xi)) + (cid:107)πrn ˜fn(f ) − f0(cid:107)2 n. ˜fn(f )(xi) − ˆfn(xi) πrn i (cid:17) + (cid:107) ˆfn − f0(cid:107)2 n − (cid:107)πrn ˜fn(f ) − f0(cid:107)2 n. (cid:16) n(cid:88) i=1 Subtracting these two equations yields Qn( ˆfn) = Qn(πrn ˜fn(f )) + 2 n Now note that (cid:107)πrn ˜fn(f ) − f0(cid:107)2 n = (cid:107)πrn = (cid:107)πrn (cid:68) = (cid:107)πrn = πrn n ˜fn(f ) − ˜fn(f ) + ˜fn(f ) − f0(cid:107)2 ˜fn(f ) − ˜fn(f ) + (1 − δn) ˆfn + δn(f0 + ι) − f0(cid:107)2 ˜fn(f ) − ˜fn(f ) + (1 − δn)( ˆfn − f0) + δnι(cid:107)2 ˜fn(f ) − ˜fn(f ) + (1 − δn)( ˆfn − f0) + δnι, n n πrn ˜fn(f ) − ˜fn(f )(cid:107)2 = (cid:107)πrn ≤ (1 − δn)2(cid:107) ˆfn − f0(cid:107)2 (cid:68) n + δ2 n (cid:69) ˜fn(f ) − ˜fn(f ) + (1 − δn)( ˆfn − f0) + δnι (cid:69) n + (1 − δn)2(cid:107) ˆfn − f0(cid:107)2 (cid:69) (cid:68) (cid:68) ˆfn − f0, ι (cid:69) (cid:69) (cid:68) ˆfn − f0, δnι ˜fn(f ) − ˜fn(f ), ˆfn − f0 ˜fn(f ) − ˜fn(f ), ι + δ2 n πrn πrn + 2δn + 2(1 − δn)δn + 2(1 − δn) n + 2(1 − δn) + 2(1 − δn)(cid:107)πrn + 2δn(cid:107)πrn ˜fn(f ) − ˜fn(f )(cid:107)n(cid:107) ˆfn − f0(cid:107)n ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn ˜fn(f ) − ˜fn(f )(cid:107)2 n, 125 where the last inequality follows from the Cauchy-Schwarz inequality and since (cid:16) n(cid:88) i=1 2 n ˜fn(f )(xi) − ˆfn(xi) πrn i (cid:17) = = 2 n 2 n n(cid:88) n(cid:88) i=1 i i i=1 − 2 n δn πrn (cid:16) (cid:16) n(cid:88) πrn i=1 (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) + ˜fn(f )(xi) − ˆfn(xi) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) (cid:17) − 2 (cid:16) ˆfn(xi) − f0(xi) i n(cid:88) i=1 i, δn n we have by the definition of ˆfn, −Op(δ2 Qn(f ) − Qn( ˆfn) n) ≤ inf f∈Frn ≤ Qn(πrn = (cid:107)πrn ˜fn(f )) − Qn( ˆfn) ˜fn(f ) − f0(cid:107)2 n − (cid:107) ˆfn − f0(cid:107)2 n(cid:88) (cid:16) ≤ (1 − δn)2(cid:107) ˆfn − f0(cid:107)2 (cid:17) i πrn n − 2 n n + 2(1 − δn)δn i=1 ˜fn(f )(xi) − ˆfn(xi) (cid:68) ˆfn − f0, ι (cid:69) ˜fn(f ) − ˜fn(f )(cid:107)n ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn (cid:16) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) ˜fn(f ) − ˜fn(f )(cid:107)2 n n(cid:88) i=1 + 2 n δn (cid:16) ˆfn(xi) − f0(xi) (cid:17) i n − (cid:107) ˆfn − f0(cid:107)2 + 2(1 − δn)(cid:107) ˆfn − f0(cid:107)n(cid:107)πrn + 2δn(cid:107)πrn n(cid:88) − 2 n(cid:88) n πrn i=1 i + Op(δ2 n) δn i + 2 n i=1 126 = (−2δn + δ2 n + 2(1 − δn)δn (cid:68) ˆfn − f0, ι (cid:69) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) n)(cid:107) ˆfn − f0(cid:107)2 + 2δn(cid:107)πrn ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn (cid:16) n(cid:88) − 2 n(cid:88) n πrn i=1 i + ˜fn(f ) − ˜fn(f )(cid:107)2 + 2(1 − δn)(cid:107) ˆfn − f0(cid:107)n(cid:107)πrn (cid:17) n(cid:88) (cid:16) ˆfn(xi) − f0(xi) δn i n 2 n i=1 ˜fn(f ) − ˜fn(f )(cid:107)n ≤ δ2 + δn i=1 i + Op(δ2 n) 2 n n + 2(1 − δn)δn n(cid:107) ˆfn − f0(cid:107)2 + 2δn(cid:107)πrn n(cid:88) − 2 n(cid:88) n (cid:68) ˆfn − f0, ι (cid:69) ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn (cid:16) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) πrn i=1 i i + Op(δ2 n), + δn 2 n i=1 + 2(1 − δn)(cid:107) ˆfn − 
where the last inequality follows by noting that $(1-\delta_n)^2 - 1 = -2\delta_n + \delta_n^2 \le \delta_n^2$. From condition (C1), we can get
\[
[r_n(d+2)+1]V_n^2\log[r_nV_n(d+2)+1] \le \bigl([r_n(d+2)+1]V_n\log[r_nV_n(d+2)+1]\bigr)^4 = o(n),
\]
and combining this with Theorem 4.3.1, we obtain that $\|\hat{f}_n - f_0\|_n = o_p(1)$, and hence $\delta_n^2\|\hat{f}_n - f_0\|_n^2 = o_p(\delta_n^2)$. From condition (C3), we have
\[
2(1-\delta_n)\|\hat{f}_n - f_0\|_n\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n
 \le 2\|\hat{f}_n - f_0\|_n\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n
 = O_p\bigl(\rho_n^{-1}\cdot\rho_n\delta_n^2\bigr) = O_p(\delta_n^2).
\]
Similarly, since $\rho_n\delta_n = o(1)$, we have
\[
2\delta_n\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n = O_p(\delta_n\cdot\rho_n\delta_n^2) = o_p(\delta_n^2),
\qquad
\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n^2 = O_p(\rho_n^2\delta_n^4) = o_p(\delta_n^2).
\]
Based on condition (C4), we also know that
\[
\frac{2}{n}\sum_{i=1}^n \epsilon_i\bigl(\pi_{r_n}\tilde{f}_n(f)(x_i) - \tilde{f}_n(f)(x_i)\bigr) = O_p(\delta_n^2),
\]
and from Corollary 4.5.1, we also have
\[
\frac{2\delta_n}{n}\sum_{i=1}^n \epsilon_i\bigl(\hat{f}_n(x_i) - f_0(x_i)\bigr) = o_p(\delta_n\cdot n^{-1/2}).
\]
It follows from these observations that
\[
-2(1-\delta_n)\bigl\langle \hat{f}_n - f_0,\, \delta_n\iota\bigr\rangle + \frac{2\delta_n}{n}\sum_{i=1}^n\epsilon_i
 \le O_p(\delta_n^2) + o_p(\delta_n^2) + o_p(\delta_n\cdot n^{-1/2}),
\]
which implies that
\[
-(1-\delta_n)\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle + \frac{1}{n}\sum_{i=1}^n\epsilon_i
 \le O_p(\delta_n) + o_p(n^{-1/2}) = o_p(n^{-1/2}).
\]
By replacing $\iota$ with $-\iota$, we can obtain the same result, and hence
\[
\Biggl|\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle - \frac{1}{n}\sum_{i=1}^n\epsilon_i\Biggr|
 \le \Biggl|(1-\delta_n)\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle - \frac{1}{n}\sum_{i=1}^n\epsilon_i\Biggr|
  + \delta_n\bigl|\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle\bigr|
 \le o_p(n^{-1/2}) + \delta_n\|\hat{f}_n - f_0\|_n = o_p(n^{-1/2}).
\]
Therefore,
\[
\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle = \frac{1}{n}\sum_{i=1}^n\epsilon_i + o_p(n^{-1/2}),
\]
and the desired result follows from the classical Central Limit Theorem. $\Box$

Theorem 4.5.1 can be used directly for hypothesis testing based on the model of a neural network with one hidden layer if we know the variance $\sigma^2$ of the random error. In practice, this is rarely the case. To perform hypothesis testing when $\sigma^2$ is unknown, it is natural to use a good estimator of $\sigma^2$ and a "plug-in" test statistic. A natural estimator for $\sigma^2$ is
\[
\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n\bigl(y_i - \hat{f}_n(x_i)\bigr)^2 = Q_n(\hat{f}_n).
\]
So we also need to establish the asymptotic normality of the statistic $\frac{1}{\hat{\sigma}_n\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr]$.

Theorem 4.5.2 (Asymptotic Normality for the Plug-in Statistic). Suppose that $f_0 \in C(\mathcal{X})$, where $\mathcal{X} \subset \mathbb{R}^d$ is a compact set, and $0 \le \eta_n = o(n^{-1})$. Then under the conditions of Theorem 4.5.1, we have
\[
\frac{1}{\hat{\sigma}_n\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr] \xrightarrow{d} N(0,1).
\]

Proof.
Note that
\begin{align*}
\hat{\sigma}_n^2 = Q_n(\hat{f}_n)
 &= \frac{1}{n}\sum_{i=1}^n\bigl(y_i - \hat{f}_n(x_i)\bigr)^2
  = \frac{1}{n}\sum_{i=1}^n\bigl(f_0(x_i) + \epsilon_i - \hat{f}_n(x_i)\bigr)^2\\
 &= \frac{1}{n}\sum_{i=1}^n\epsilon_i^2 - \frac{2}{n}\sum_{i=1}^n\epsilon_i\bigl(\hat{f}_n(x_i) - f_0(x_i)\bigr) + \|\hat{f}_n - f_0\|_n^2.
\end{align*}
Based on the rate of convergence of $\hat{f}_n$ obtained in Theorem 4.4.1 and condition (C1), we know that
\[
\|\hat{f}_n - f_0\|_n^2 = O_p^*\Bigl(\max\Bigl\{\frac{r_n(d+2)\log n}{n},\ \|\pi_{r_n}f_0 - f_0\|_n^2\Bigr\}\Bigr).
\]
Under (C3), $\|\pi_{r_n}f_0 - f_0\|_n^2 = o(\rho_n^2\delta_n^4) = o(n^{-1/2})$, and under (C1), we have
\[
\frac{r_n(d+2)\log n}{n} \le o\Bigl(\frac{n^{1/4}\log n}{n}\Bigr) = o\Bigl(\frac{\log n}{n^{3/4}}\Bigr) = o(n^{-1/2}),
\]
which implies that $\|\hat{f}_n - f_0\|_n^2 = o_p(n^{-1/2})$. Moreover, by the same arguments as in the proof of Theorem 4.5.1, we can show that
\[
\frac{2}{n}\sum_{i=1}^n\epsilon_i\bigl(\hat{f}_n(x_i) - f_0(x_i)\bigr) = o_p(n^{-1/2}).
\]
Therefore,
\[
Q_n(\hat{f}_n) = \frac{1}{n}\sum_{i=1}^n\epsilon_i^2 + o_p(n^{-1/2}).
\]
By the Weak Law of Large Numbers, $\frac{1}{n}\sum_{i=1}^n\epsilon_i^2 = \sigma^2 + o_p(1)$. Therefore, $\hat{\sigma}_n^2 = Q_n(\hat{f}_n) = \sigma^2 + o_p(1)$, and it follows from Slutsky's Theorem together with Theorem 4.5.1 that
\[
\frac{1}{\hat{\sigma}_n\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr]
 = \frac{\sigma}{\hat{\sigma}_n}\cdot\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr]
 \xrightarrow{d} N(0,1). \qquad\Box
\]

4.6 Simulation Studies

In this section, simulations were conducted to check the validity of the theoretical results obtained in the previous sections. We first used a simple simulation to show that it is hard for the parameter estimators in a neural network with one hidden layer to reach parametric consistency. Then the consistency of the neural network sieve estimators was examined under various simulation setups. Finally, in the last part, we checked the asymptotic normality of the neural network sieve estimators.

4.6.1 Parameter Inconsistency

As we mentioned in the introduction, due to the loss of identifiability of the parameters, the parameter estimators obtained in a neural network model are unlikely to be consistent. In this simulation, we use empirical results to confirm this observation. We simulated the response through the following model:
\[
y_i = f_0(x_i) + \epsilon_i, \qquad i = 1,\ldots,n, \qquad (4.12)
\]
where the total sample size is $n = 500$, $x_1,\ldots,x_n \sim \text{i.i.d. } N(0,1)$, $\epsilon_1,\ldots,\epsilon_n \sim \text{i.i.d. } N(0, 0.1^2)$, and
\[
f_0(x_i) = -1 + \sigma(2x_i + 1) - \sigma(-x_i + 1), \qquad (4.13)
\]
that is, the true model is a single-hidden-layer neural network with 2 hidden units. We also used a single-hidden-layer neural network with 2 hidden units to fit the data. When fitting the neural network, we set the learning rate to 0.1 and performed $3\times10^4$ iterations of back-propagation. The cost after $3\times10^4$ iterations is 0.0106. Table 4.1 summarizes the estimated values of the parameters in this model.

Table 4.1: Comparison of the true parameters and the estimated parameters in a single-layer neural network with 2 hidden units.

                          Weights                          Biases
                  γ1      γ2      α1      α2       γ0,1    γ0,2     α0
True Value       2.00   -1.00    1.00   -1.00      1.00    1.00   -1.00
Estimated Value  0.82    1.30   -0.34   -0.58     -0.03   -0.03   -1.04

Based on the results in Table 4.1, it is clear that the estimators of most of the weights and biases (except $\alpha_0$) are far from reaching consistency.
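To make this setup concrete, the following is a minimal, self-contained sketch of the parameter-inconsistency experiment described above. It is not the code used for the dissertation: the plain full-batch gradient-descent loop, the random initialization and all variable names are illustrative choices; only the model (4.12)-(4.13), the sample size, the learning rate and the number of iterations follow the description in the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)                               # x_i ~ N(0, 1)
f0 = -1.0 + sigmoid(2.0 * x + 1.0) - sigmoid(-x + 1.0)   # true function (4.13)
y = f0 + 0.1 * rng.standard_normal(n)                    # errors ~ N(0, 0.1^2), model (4.12)

# single hidden layer with 2 hidden units; random starting values
gamma = rng.standard_normal(2)     # input weights gamma_1, gamma_2
gamma0 = rng.standard_normal(2)    # input biases gamma_{0,1}, gamma_{0,2}
alpha = rng.standard_normal(2)     # output weights alpha_1, alpha_2
alpha0 = 0.0                       # output bias alpha_0
lr, iters = 0.1, 30000             # learning rate and iteration count as in the text

for _ in range(iters):
    S = sigmoid(np.outer(x, gamma) + gamma0)             # n x 2 hidden activations
    r = y - (alpha0 + S @ alpha)                         # residuals
    dS = S * (1.0 - S) * alpha                           # derivative wrt the linear inputs
    alpha0 += lr * 2.0 * r.mean()
    alpha += lr * 2.0 * (S * r[:, None]).mean(axis=0)
    gamma += lr * 2.0 * ((r[:, None] * dS) * x[:, None]).mean(axis=0)
    gamma0 += lr * 2.0 * (r[:, None] * dS).mean(axis=0)

S = sigmoid(np.outer(x, gamma) + gamma0)
f_hat = alpha0 + S @ alpha
print("estimated parameters:", gamma, gamma0, alpha, alpha0)
print("||f_hat - f0||_n^2  :", np.mean((f_hat - f0) ** 2))
print("cost Q_n(f_hat)     :", np.mean((y - f_hat) ** 2))

Re-running this sketch with different seeds typically reproduces the qualitative behavior of Table 4.1 and Figure 4.1: the fitted function tracks $f_0$ closely while the individual weight and bias estimates vary considerably.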
On the other hand, if we look at the curve of the true function and the curve of the fitted function shown in Figure 4.1, we can see that most parts are fitted extremely well except for the tails. The approximation error $\|\hat{f}_n - f_0\|_n$ is almost zero, as shown in the figure. This suggests that we may study the asymptotic properties of the neural network through the estimated function instead of the estimated parameters.

Figure 4.1: Comparison of the true function and the fitted function under the simulation model (4.12). The black curve is the true function defined in (4.13) and the blue dashed curve is the fitted curve obtained after fitting the neural network model (2 hidden units, sample size 500).

4.6.2 Consistency for Neural Network Sieve Estimators

In this simulation, we check the consistency result obtained in Section 4.3 and the validity of the assumption made in Theorem 4.3.1. Based on our construction of the neural network sieve estimators, in each sieve space $\mathcal{F}_{r_n}$ there is a constraint on the $\ell_1$ norm of $\alpha$: $\sum_{i=0}^{r_n}|\alpha_i| \le V_n$. So finding the nearly optimal function in $\mathcal{F}_{r_n}$ for $Q_n(f)$ is in fact a constrained optimization problem. The classical way to solve such a problem is to introduce a Lagrange multiplier for each constraint, but it is usually hard to find an explicit connection between the Lagrange multiplier and the upper bound in the inequality constraint. Instead, we use the subgradient method as discussed in Section 7 of Boyd and Mutapcic (2008)[9]. The basic idea is to update the parameters $\alpha_0,\ldots,\alpha_{r_n}$ through
\[
\alpha_i^{(k+1)} = \alpha_i^{(k)} - \delta_k g^{(k)}, \qquad i = 0,\ldots,r_n,
\]
where $\delta_k > 0$ is a step size, chosen as $\delta_k = 0.1/\log(e+k)$ throughout the simulation, which is known as a nonsummable diminishing step size rule, and $g^{(k)}$ is a subgradient of the objective $Q_n(f)$ or of the constraint function $\sum_{j=0}^{r_n}|\alpha_j| - V_n$ at $\alpha^{(k)}$. More specifically, we take
\[
g^{(k)} \in
\begin{cases}
\dfrac{\partial Q_n(f)}{\partial\alpha}\Big|_{\alpha^{(k)}}, & \text{if } \sum_{j=0}^{r_n}|\alpha_j| \le V_n,\\[6pt]
\dfrac{\partial\bigl(\sum_{j=0}^{r_n}|\alpha_j|\bigr)}{\partial\alpha}\Big|_{\alpha^{(k)}}, & \text{if } \sum_{j=0}^{r_n}|\alpha_j| > V_n.
\end{cases}
\]
The updating equations for $\gamma_1,\ldots,\gamma_{r_n}, \gamma_{0,1},\ldots,\gamma_{0,r_n}$ remain the same as in the classical gradient descent algorithm.

We still used equation (4.12) as our simulation model, but now we assumed that the random errors $\epsilon_1,\ldots,\epsilon_n$ are i.i.d. $N(0, 0.7^2)$ throughout the simulations. For the true function $f_0(x)$, we considered the following three functions:

(1) A neural network with a single hidden layer and 2 hidden units, which is the same as in equation (4.13).

(2) A trigonometric function:
\[
f_0(x) = \sin\Bigl(\frac{\pi}{3}x\Bigr) + \frac{1}{3}\cos\Bigl(\frac{\pi}{4}x + 1\Bigr). \qquad (4.14)
\]

(3) A continuous function having a non-differentiable point:
\[
f_0(x) =
\begin{cases}
-2x, & \text{if } x \le 0,\\
\sqrt{x} - \frac{1}{4}x, & \text{if } x > 0.
\end{cases} \qquad (4.15)
\]

We then trained a neural network using the subgradient method described at the beginning of this subsection and set the number of iterations used for fitting to 20,000. For the growth rate of the number of hidden units $r_n$ and the upper bound $V_n$ on the $\ell_1$ norm of the weights and bias from the hidden layer to the output layer, we chose $r_n = n^{1/4}$ and $V_n = 10n^{1/4}$. This choice satisfies the condition mentioned in Remark 4.3.2 and hence the condition in Theorem 4.3.1. We compared the errors $\|\hat{f}_n - f_0\|_n^2$ and the least squares errors $Q_n(\hat{f}_n)$ under different sample sizes; the results are summarized in Table 4.2.
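Before turning to the results, here is a minimal sketch of a single update of the $\ell_1$-constrained output-layer coefficients under the subgradient rule described above. The function name and call signature are illustrative choices, not the dissertation's code; the updates for $\gamma_j$ and $\gamma_{0,j}$ are ordinary gradient steps, as stated in the text, and are omitted here.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def alpha_subgradient_step(alpha0, alpha, gamma, gamma0, x, y, V_n, k):
    """One update of the output-layer coefficients under sum_j |alpha_j| <= V_n."""
    S = sigmoid(np.outer(x, gamma) + gamma0)      # hidden-layer activations
    r = y - (alpha0 + S @ alpha)                  # residuals
    step = 0.1 / np.log(np.e + k)                 # nonsummable diminishing step size rule
    if abs(alpha0) + np.abs(alpha).sum() <= V_n:
        # feasible point: follow a (sub)gradient of the least squares objective Q_n
        g0, g = -2.0 * r.mean(), -2.0 * (S * r[:, None]).mean(axis=0)
    else:
        # infeasible point: follow a subgradient of the constraint sum_j |alpha_j|
        g0, g = np.sign(alpha0), np.sign(alpha)
    return alpha0 - step * g0, alpha - step * g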
As we can see from Table 4.2, the errors $\|\hat{f}_n - f_0\|_n^2$ indeed show a decreasing pattern as the sample size increases, although there are some cases where the error becomes a little larger as the sample size grows (e.g., the errors at sample size 500 are larger than those at sample size 200 in all scenarios). One explanation is that the number of hidden units increases from 3 (for sample size 200) to 4 (for sample size 500) under our simulation setup, so there may be some variation in the estimation performance.

Table 4.2: Comparison of the errors $\|\hat{f}_n - f_0\|_n^2$ and the least squares errors $Q_n(\hat{f}_n)$ after 20,000 iterations under different sample sizes.

                  Neural Network            Sine                     Piecewise Continuous
Sample Sizes   ‖f̂n−f0‖²n    Qn(f̂n)     ‖f̂n−f0‖²n    Qn(f̂n)     ‖f̂n−f0‖²n    Qn(f̂n)
50             3.33E-2      0.519       6.04E-2      0.513       6.20E-1      1.124
100            2.79E-2      0.552       3.04E-2      0.587       3.20E-1      0.920
200            6.05E-3      0.500       1.05E-2      0.501       2.51E-1      0.786
500            8.15E-3      0.484       1.19E-2      0.499       3.26E-1      0.769
1000           3.02E-3      0.475       1.54E-2      0.480       2.98E-2      0.489
2000           2.88E-3      0.500       9.72E-3      0.506       1.69E-2      0.515

Overall, the table shows that the estimated function $\hat{f}_n$ is indeed consistent in the sense that $\|\hat{f}_n - f_0\|_n = o_p^*(1)$. Figure 4.2 illustrates the fitted functions and the true function, from which we can visualize the result more directly.

Figure 4.2: Comparison of the true function and the fitted functions used in the simulations. The top panel shows the scenario where the true function is a single-layer neural network; the middle panel shows the scenario where the true function is a sine function; and the bottom panel shows the scenario where the true function is a continuous function having a non-differentiable point. In all cases, the fitted curve becomes closer to the truth as the sample size increases (sample sizes 50, 100, 200, 500, 1000 and 2000 are shown).

4.6.3 Asymptotic Normality for Neural Network Sieve Estimators

The last part of the simulation focuses on the asymptotic normality derived in Theorem 4.5.1. We considered the same three types of true functions as in Section 4.6.2, but the random errors were sampled from the standard normal distribution, and we again used the subgradient method to obtain the fitted model. The number of iterations used for fitting was set to 20,000. What differs from the setup in Section 4.6.2 is the growth rates for $r_n$ and $V_n$. As mentioned in Section 4.5, the growth rates required for asymptotic normality are slower than those required for consistency. Therefore, in the simulation we chose $r_n = n^{1/8}$ and $V_n = 10n^{1/10}$. This choice satisfies condition (C1) in Theorem 4.5.1. In order to obtain the normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$, we repeated the simulation 200 times.

Figure 4.3 to Figure 4.5 are the normal Q-Q plots under each simulation setup and various sample sizes. Based on the figures, we can see that the statistic $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ fits the normal distribution quite well under all simulation models. It is also worth noting that the Q-Q plots look similar under all simulation models; this is what we should expect, since under all scenarios the limiting distribution of the statistic $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ is $N(0,1)$.
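The following sketch outlines how the replicated statistic behind these Q-Q plots (and the normality tests reported below) can be assembled. The routine fit_sieve_nn is a hypothetical placeholder for a sieve fitting procedure, such as the subgradient scheme sketched earlier, returning fitted values at the design points; the growth rates follow the choices $r_n = n^{1/8}$ and $V_n = 10n^{1/10}$ described in the text.

import numpy as np
from scipy import stats

def normality_check(fit_sieve_nn, f0, n=200, reps=200, seed=1):
    """Replicate the studentized statistic from Theorems 4.5.1-4.5.2 and test it against N(0, 1)."""
    rng = np.random.default_rng(seed)
    r_n = max(2, int(round(n ** 0.125)))    # r_n ~ n^{1/8}
    V_n = 10.0 * n ** 0.1                   # V_n = 10 n^{1/10}
    T = np.empty(reps)
    for b in range(reps):
        x = rng.standard_normal(n)
        y = f0(x) + rng.standard_normal(n)  # standard normal errors
        f_hat = fit_sieve_nn(x, y, r_n, V_n)
        sigma_hat = np.sqrt(np.mean((y - f_hat) ** 2))   # plug-in estimator of sigma
        T[b] = np.sum(f_hat - f0(x)) / (np.sqrt(n) * sigma_hat)
    return {"shapiro_p": stats.shapiro(T).pvalue,
            "ks_p": stats.kstest(T, "norm").pvalue}

A Q-Q plot of the replicated values T (for example via scipy.stats.probplot) gives the graphical counterpart, analogous to Figures 4.3-4.5, while the returned p-values can be compared with a nominal level in the same way as in Table 4.3.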
Another implication we can obtain from the Q-Q plots is that the statistic $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ is robust to the choice of $f_0$. Therefore, as long as the true function $f_0$ is continuous, $N(0,1)$ can serve as a good asymptotic distribution for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ and can be used to conduct hypothesis testing.

Besides the Q-Q plots, we also conducted normality tests to check whether $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ follows the standard normal distribution. Specifically, we used the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Table 4.3 summarizes the p-values for both normality tests. From the p-values, in all cases we fail to reject that the distribution of $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ follows the standard normal distribution.

Table 4.3: p-values of the Shapiro-Wilk test and the Kolmogorov-Smirnov test for normality. We use "NN" to denote the true function given by the neural network in (4.13), "TRI" to denote the trigonometric function in (4.14), and "ND" to denote the continuous function with a non-differentiable point in (4.15).

                   Shapiro-Wilk Test             Kolmogorov-Smirnov Test
Sample Sizes      NN      TRI     ND            NN      TRI     ND
50               0.881   0.884   0.878         0.584   0.597   0.595
100              0.095   0.095   0.098         0.472   0.508   0.484
200              0.940   0.944   0.944         0.731   0.719   0.708
300              0.872   0.888   0.884         0.976   0.986   0.973
400              0.513   0.525   0.514         0.670   0.754   0.708
500              0.768   0.778   0.768         0.733   0.769   0.733

Figure 4.3: Normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ under 200 replications and various sample sizes. The true function $f_0$ is a single-layer neural network with 2 hidden units as defined in (4.13).

Figure 4.4: Normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ under 200 replications and various sample sizes. The true function $f_0$ is a trigonometric function as defined in (4.14).

Figure 4.5: Normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ under 200 replications and various sample sizes. The true function $f_0$ is a continuous function having a non-differentiable point as defined in (4.15).
4.7 Discussion

We have investigated the asymptotic properties, including the consistency, rate of convergence and asymptotic normality, of neural network sieve estimators with one hidden layer. While in practice the number of hidden units is often chosen ad hoc, it is important to note that the conditions in the theorems provide theoretical guidelines on choosing the number of hidden units for a neural network with one hidden layer in order to achieve the desired statistical properties. The validity of the conditions made in the theorems has also been checked through simulation results. Theorem 4.5.1 and Theorem 4.5.2 can serve as preliminary work for conducting hypothesis tests of $H_0: f_0 = h_0$ for a fixed function $h_0$ based on neural networks. However, there is currently no simple way to check conditions (C3) and (C4) in the theorem, which requires further research on local entropy numbers for classes of neural networks.

The work conducted in this chapter mainly focuses on sieve estimators based on neural networks with one hidden layer and the standard sigmoid activation function. There are many possible extensions of the results presented in this chapter. In fact, the main theorems in this chapter depend heavily on the covering number, or entropy number, of the function class consisting of neural networks with one hidden layer. Theorem 14.5 in Anthony and Bartlett (2009)[4] provides a general upper bound for the covering number of a function class consisting of deep neural networks with Lipschitz continuous activation functions. Therefore, it is possible to extend the results discussed in this chapter to deep neural networks with Lipschitz continuous activation functions. It is also worthwhile to investigate the asymptotic properties of other commonly used deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

When we train a deep neural network, we usually need to face an overfitting issue, and in practice regularization is frequently used to reduce overfitting. Another natural extension of the results discussed in this chapter is to modify the loss function by adding regularization terms. By taking regularization into account, we believe the theory may have broader applicability in real-world data applications.

Chapter 5

Epilogue

The main goal of this dissertation is to investigate some network-based models with their applications to genetic association studies and risk prediction. We have seen in Chapter 2 that a conditional autoregressive (CAR) model can achieve higher power when individuals have heterogeneous genetic effects and gives robust results under misspecification of weights. The kernel neural network (KNN) model discussed in Chapter 3 provides a novel way to conduct genetic risk prediction, and we have shown theoretically and empirically that KNN can reach lower prediction error than the classical linear mixed model.
Finally, in Chapter 4, we established the asymptotic properties of neural network sieve estimators, which serve as a foundation for conducting hypothesis testing based on neural networks in later research. By using advanced theories from empirical processes, the results were developed based on a minimal number of conditions to be checked and, as a byproduct, the conditions also provide some guidelines on choosing the number of hidden units in order to achieve the desired statistical properties.

Here, we briefly discuss our future work. In the CAR model discussed in Chapter 2, we mainly focused on continuous phenotypes. In the future, we will extend the CAR model so that it can be applied to discrete dependent variables. Specifically, the CAR model can be extended under the generalized linear mixed model (GLMM) framework:
\[
\begin{aligned}
y_i \mid a_i &\overset{\text{ind}}{\sim} \exp\Bigl\{\frac{y_i\eta_i - b(\eta_i)}{\phi} + c(y_i,\phi)\Bigr\},\\
\eta_i = g(\mu_i) &= x_i^T\beta + a_i,\\
a_i \mid a_j, j\ne i &\sim N\Bigl(\frac{\gamma\sum_{j\ne i}s_{ij}a_j}{\sum_{j\ne i}s_{ij}},\ \frac{\tau}{\sum_{j\ne i}s_{ij}}\Bigr), \qquad i = 1,\ldots,n,
\end{aligned} \qquad (5.1)
\]
where $g(\cdot)$ is the canonical link function. For example, $g(x) = x$ is the identity link when $y_i\mid a_i$ is normally distributed, and $g(x) = \log\frac{x}{1-x}$ is the logit link when $y_i\mid a_i$ follows a Bernoulli distribution. To test whether a genotype-phenotype association exists, it is equivalent to test $H_0: \tau = 0$ versus $H_1: \tau > 0$. However, because the number of random effects equals the sample size, the commonly used Laplace approximations will not work, so a new framework of approximate inference needs to be established to conduct the statistical inference.

Some empirical studies reveal that the magnitude difference between the variance component estimates of the random effects and the variance component estimate of the random error is large, which tends to make the prediction error small. In future work on KNN, it is necessary to control the magnitude of the variance component estimates of the random effects. A commonly used technique to control the magnitude of a parameter in statistics and machine learning is regularization. However, directly adding a regularization term to the loss function of MINQUE is difficult. Instead, we will consider another way to estimate the variance components, known as variance least squares[3]. The basic idea of variance least squares is to minimize the matrix distance between the sample covariance matrix and the population covariance matrix. Specifically, let $\hat{\beta}$ be an OLS estimator of $\beta$ and pretend it were true. Then $\hat{E} = (y - Z\hat{\beta})(y - Z\hat{\beta})^T$ is clearly an unbiased estimator of the variance-covariance matrix. We know that the marginal covariance matrix of $y$ in KNN is $\mathrm{Var}[y] = \sum_{j=1}^J\tau_j E[K_j(U)] + \phi I_n$. So we can estimate the variance components $\theta = [\phi, \tau^T, \xi^T]^T$ based on
\[
\hat{\theta} = \operatorname*{argmin}_{\theta}\ \mathrm{tr}\Bigl[\Bigl(\hat{E} - \phi I_n - \sum_{j=1}^J\tau_j E[K_j(U)]\Bigr)^2\Bigr].
\]
Therefore, to control the magnitude of the variance components, we can consider penalized estimators via
\[
\hat{\theta}_p = \operatorname*{argmin}_{\theta}\ \mathrm{tr}\Bigl[\Bigl(\hat{E} - \phi I_n - \sum_{j=1}^J\tau_j E[K_j(U)]\Bigr)^2\Bigr] + \mathrm{pen}_{\lambda}(\theta).
\]
We will also investigate the statistical properties of these estimators, including the bias and consistency, in future work. A minimal numerical sketch of this variance-least-squares idea is given below.
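As a concrete illustration of the variance-least-squares idea, the sketch below solves the unpenalized criterion, and a ridge-penalized variant, by ordinary least squares on the vectorized matrices. This is only one possible implementation under simplifying assumptions: the kernel expectations E[K_j(U)] are assumed to be available (for example by Monte Carlo averaging), the ridge penalty is just a stand-in for the generic pen_lambda(theta), and the xi components are not updated here.

import numpy as np

def variance_least_squares(E_hat, K_list, lam=0.0):
    """Fit E_hat ~ phi*I_n + sum_j tau_j*K_j by (ridge-penalized) least squares.

    Minimizing tr[(E_hat - phi*I - sum_j tau_j*K_j)^2] over (phi, tau) for symmetric
    matrices is an ordinary least squares problem on the vectorized matrices.
    """
    n = E_hat.shape[0]
    X = np.column_stack([np.eye(n).ravel()] + [K.ravel() for K in K_list])
    b = E_hat.ravel()
    A = X.T @ X + lam * np.eye(X.shape[1])   # lam = 0 recovers the unpenalized estimator
    theta = np.linalg.solve(A, X.T @ b)
    return theta[0], theta[1:]               # (phi_hat, tau_hat)

# usage sketch: res = y - Z @ beta_ols; E_hat = np.outer(res, res);
# K_list holds (estimates of) E[K_j(U)]; phi_hat, tau_hat = variance_least_squares(E_hat, K_list, lam=1.0)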
Having established the basic asymptotic properties of neural networks with one hidden layer in Chapter 4, we will extend the results to deep neural networks, which are popular tools in deep learning and artificial intelligence, as well as to convolutional neural networks (CNNs) and recurrent neural networks (RNNs). By extending the results to deep neural networks, we expect to obtain guidelines on selecting the number of hidden units in each layer and the number of layers in the deep network. Considering loss functions with regularization will also be one of the foci of our future research.

APPENDICES

Appendix A

Technical Details and Supplementary Materials for Chapter 2

Proof of Proposition 2.2.1

Proof. Fix $a' = 0$. Based on the first equation in Brook's Lemma, we have
\begin{align*}
\frac{\pi(a)}{\pi(0)}
 &= \prod_{i=1}^n \frac{\pi(a_i \mid a_1,\ldots,a_{i-1},0,\ldots,0)}{\pi(0 \mid a_1,\ldots,a_{i-1},0,\ldots,0)}
  = \prod_{i=1}^n \frac{(2\pi\tau_i^2)^{-\frac12}\exp\Bigl\{-\frac{1}{2\tau_i^2}\bigl(a_i - \sum_{j=1}^{i-1}b_{ij}a_j\bigr)^2\Bigr\}}
                       {(2\pi\tau_i^2)^{-\frac12}\exp\Bigl\{-\frac{1}{2\tau_i^2}\bigl(-\sum_{j=1}^{i-1}b_{ij}a_j\bigr)^2\Bigr\}}\\
 &= \prod_{i=1}^n \exp\Bigl\{-\frac{1}{2\tau_i^2}\Bigl(a_i^2 - 2\sum_{j=1}^{i-1}b_{ij}a_ia_j\Bigr)\Bigr\}
  = \exp\Bigl\{-\frac12\sum_{i=1}^n\frac{a_i^2}{\tau_i^2} + \sum_{i=1}^n\sum_{j=1}^{i-1}\frac{b_{ij}}{\tau_i^2}a_ia_j\Bigr\}.
\end{align*}
Similarly, from the second equation in Brook's Lemma, we have
\[
\frac{\pi(a)}{\pi(0)} = \exp\Bigl\{-\frac12\sum_{i=1}^n\frac{a_i^2}{\tau_i^2} + \sum_{i=1}^n\sum_{j=i+1}^{n}\frac{b_{ij}}{\tau_i^2}a_ia_j\Bigr\}.
\]
By the symmetry of $b_{ij}$, the density of $a$ can then be expressed as
\[
\pi(a) \propto \exp\Bigl\{-\frac12\Bigl[\sum_{i=1}^n\frac{a_i^2}{\tau_i^2} - \sum_{i=1}^n\sum_{j\ne i}\frac{b_{ij}}{\tau_i^2}a_ia_j\Bigr]\Bigr\}
 = \exp\Bigl\{-\frac12\bigl[a^T\Delta^{-1}a - a^T\Delta^{-1}Ba\bigr]\Bigr\}
 = \exp\Bigl\{-\frac12 a^T\Delta^{-1}(I - B)a\Bigr\},
\]
where
\[
B = \begin{pmatrix}
0 & b_{12} & b_{13} & \cdots & b_{1n}\\
b_{21} & 0 & b_{23} & \cdots & b_{2n}\\
b_{31} & b_{32} & 0 & \cdots & b_{3n}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
b_{n1} & b_{n2} & b_{n3} & \cdots & 0
\end{pmatrix},
\qquad
\Delta = \begin{pmatrix}
\tau_1^2 & & &\\
& \tau_2^2 & &\\
& & \ddots &\\
& & & \tau_n^2
\end{pmatrix}.
\]
This shows that $a \sim N_n\bigl(0, (I-B)^{-1}\Delta\bigr)$, and the proof is finished. $\Box$

Proof of Proposition 2.2.2

The proof of this result is based on some facts in matrix analysis. We first provide the definition of a diagonally dominant matrix.

Definition A.0.1 (Diagonally Dominant Matrix). An $n\times n$ real matrix $J$ is diagonally dominant if
\[
\Delta_i(J) := |J_{ii}| - \sum_{j\ne i}|J_{ij}| \ge 0, \qquad i = 1,\ldots,n. \qquad (\mathrm{A.1})
\]
If the inequality in (A.1) is strict, then $J$ is called a strictly diagonally dominant matrix.

The following result is a rephrasing of Corollary 5.6.17 in Horn and Johnson (1990)[31].

Corollary A.0.1. Let $A$ be an $n\times n$ matrix. If $A$ is strictly diagonally dominant, then $A$ is nonsingular.

Based on Corollary A.0.1, it is easy to obtain the desired property. The following is the proof of Proposition 2.2.2.

Proof. Based on Corollary A.0.1, it suffices to check that $D - \gamma S$ is strictly diagonally dominant if $|\gamma| < 1$. Note that
\[
J := D - \gamma S =
\begin{pmatrix}
\sum_{j\ne 1}s_{1j} & -\gamma s_{12} & \cdots & -\gamma s_{1n}\\
-\gamma s_{21} & \sum_{j\ne 2}s_{2j} & \cdots & -\gamma s_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
-\gamma s_{n1} & -\gamma s_{n2} & \cdots & \sum_{j\ne n}s_{nj}
\end{pmatrix},
\]
and since $s_{ij} \ge 0$ for all $i,j$ by the definition of similarity, we have
\[
\Delta_i(J) = |J_{ii}| - \sum_{j\ne i}|J_{ij}|
 = \Bigl|\sum_{j\ne i}s_{ij}\Bigr| - \sum_{j\ne i}|-\gamma s_{ij}|
 = \sum_{j\ne i}s_{ij} - |\gamma|\sum_{j\ne i}s_{ij}
 = (1-|\gamma|)\sum_{j\ne i}s_{ij}.
\]
Hence $\Delta_i(J) > 0 \Leftrightarrow |\gamma| < 1$, which finishes the proof. $\Box$

Top 10 Significant Genes Found by CAR, GGRF and SKAT for Entorhinal, Ventricle and Whole Brain

Table A.1, Table A.2 and Table A.3 summarize the top 10 significant genes selected by the three methods associated with entorhinal, ventricle and whole brain, respectively.
According to the results, we can find that most of the significant genes associated with entorhinal and whole brain found by GGRF and SKAT are the same. This may stem from the fact that both methods are constructed based on the same idea, namely that individuals with similar genotypes tend to have similar phenotypes. Besides, both CAR and SKAT find the gene G1D76 significant for entorhinal. These two methods also find the gene G0A21 significant for the whole brain. For ventricle, all three methods discovered that the gene G14B2 is significant.

Table A.1: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Entorhinal. GGRF SKAT CHR Gene Name P-value CHR Gene Name P-value CHR Gene Name P-value 6.35E-05 1.36E-04 1.41E-04 1.55E-04 1.74E-04 2.24E-04 2.54E-04 3.46E-04 3.95E-04 4.07E-04 2.43E-05 5.11E-05 5.82E-05 7.46E-05 8.99E-05 1.34E-04 OR6F1 1.55E-04 FBXO9 CAMK2D 1.95E-04 2.12E-04 FLJ25758 2.96E-04 MIR3654 MGAM HSD3B1 GPR85 LRRC40 DMGDH UGT2A3 HDLBP PTGER4 C10orf107 FBXO9 GPR85 LRRC40 HSD3B1 OXR1 CSF1 DMGDH UGT2A3 PTGER4 C10orf107 MIR4514 1.96E-04 2.42E-04 2.58E-04 2.86E-04 4.61E-04 4.79E-04 5.22E-04 5.27E-04 5.67E-04 6.33E-04 5 5 11 5 4 1 6 4 19 7 7 1 1 8 1 5 4 5 10 15 CAR MZB1 PROB1 CAT DNAJC18 METTL14 7 1 7 1 5 4 2 5 10 6

Table A.2: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Ventricle. CAR Gene Name LIG3 CNDP1 KIAA0232 PGM5P4-AS1 CTSL ICAM4 RAD51D LAT JCHAIN DNAH17-AS1 CHR 17 18 4 2 9 19 17 17 16 4 SKAT Gene Name MEG8 JCHAIN CCDC61 MIR4428 TREML2 MIR4743 BCL2L2 PHACTR3 LOC105378349 TINAGL1 P-value 1.69E-05 5.47E-05 1.76E-04 2.76E-04 3.41E-04 3.63E-04 4.73E-04 4.74E-04 5.13E-04 5.23E-04 GGRF P-value CHR Gene Name P-value CHR 14 4.57E-05 4 6.21E-05 7.89E-05 19 1 1.28E-04 6 1.58E-04 18 2.15E-04 2.33E-04 14 20 2.42E-04 10 2.54E-04 2.73E-04 1 JCHAIN CYP20A1 CEP120 TINAGL1 CCDC61 NPIPB4 RAPH1 ABCC11 MEG8 3.23E-05 9.99E-05 1.12E-04 1.58E-04 2.29E-04 2.94E-04 3.07E-04 4.27E-04 5.01E-04 5.13E-04 4 2 5 1 19 16 2 16 14 17 LINC00670

Table A.3: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Whole Brain. CAR Gene Name BCL11A TAOK2 TRMT6 MIR559 TMEM219 LOC101559451 LOC101928429 CRNN SOAT1 TRIM24 CHR 2 16 20 2 16 17 6 1 1 7 P-value CHR 17 1.95E-05 20 7.92E-05 1.57E-04 5 7 1.63E-04 8 1.76E-04 1 1.78E-04 1.97E-04 17 11 2.62E-04 2 2.66E-04 2.91E-04 1 GGRF Gene Name ANKRD13B B4GALT5 BRIX1 EEPD1 LOC101929217 C1orf194 C17orf82 ME3 TMEM163 IL6R P-value CHR 1.18E-04 1.23E-04 1.53E-04 1.85E-04 2.18E-04 2.45E-04 3.42E-04 3.45E-04 3.76E-04 4.87E-04 1 5 17 2 8 7 2 1 4 1 SKAT Gene Name C1orf194 BRIX1 ANKRD13B BCL11A LOC101929217 EEPD1 TMEM163 LINC00467 MAD2L1 IL6R P-value 1.13E-04 1.29E-04 1.34E-04 1.63E-04 1.83E-04 2.44E-04 3.40E-04 4.06E-04 4.45E-04 5.03E-04

Appendix B

Technical Details and Supplementary Materials for Chapter 3

In this appendix, we provide the proofs of the results in Chapter 3. Before we head into the proofs, we need some background knowledge on concentration inequalities and matrix analysis.

Some Results from Concentration Inequality and Matrix Analysis

Sub-Gaussian and Sub-Exponential Inequalities

In this part, we present two basic concentration inequalities used in the main text. For more details on concentration inequalities, readers may refer to Buldygin and Kozachenko (2000)[11], Boucheron et al. (2004)[8] and Ledoux (2005)[39].

Definition B.0.1.
(i) A random variable $X$ with mean $\mu = E[X]$ is called sub-Gaussian if there exists a number $\sigma \ge 0$ such that
\[
E\bigl[e^{\lambda(X-\mu)}\bigr] \le \exp\Bigl\{\frac12\lambda^2\sigma^2\Bigr\} \qquad \forall\lambda\in\mathbb{R}.
\]
The constant $\sigma$ is known as the sub-Gaussian parameter.

(ii) A random variable $X$ with mean $\mu = E[X]$ is called sub-exponential (also called pre-Gaussian) if there exist non-negative constants $(\beta,\alpha)$ such that
\[
E\bigl[e^{\lambda(X-\mu)}\bigr] \le \exp\Bigl\{\frac12\lambda^2\beta^2\Bigr\} \qquad \forall|\lambda| < \frac{1}{\alpha}.
\]
The pair of constants $(\beta,\alpha)$ is known as the sub-exponential parameter.

Based on the definition of sub-Gaussian random variables, it is clear that if $X\sim N(\mu,\sigma^2)$, then $X$ is sub-Gaussian with sub-Gaussian parameter $\sigma$. The tail probabilities of both sub-Gaussian and sub-exponential random variables can be bounded exponentially. For sub-Gaussian random variables, the result is the famous Hoeffding inequality.

Theorem B.0.1. [30] Suppose that the random variables $X_1,\ldots,X_n$ are independent and $X_i$ has mean $\mu_i$ and sub-Gaussian parameter $\sigma_i$. Then for all $t>0$,
\[
P\Biggl(\Biggl|\sum_{i=1}^n(X_i-\mu_i)\Biggr| > t\Biggr) \le 2\exp\Biggl\{-\frac{t^2}{2\sum_{i=1}^n\sigma_i^2}\Biggr\}.
\]

Theorem B.0.2. Suppose that $X$ is a sub-exponential random variable with mean $\mu$ and sub-exponential parameters $(\beta,\alpha)$. Then for all $t>0$,
\[
P(|X-\mu| > t) \le
\begin{cases}
2\exp\bigl\{-\frac{t^2}{2\beta^2}\bigr\}, & \text{if } 0 < t \le \frac{\beta^2}{\alpha},\\[4pt]
2\exp\bigl\{-\frac{t}{2\alpha}\bigr\}, & \text{if } t > \frac{\beta^2}{\alpha}.
\end{cases}
\]

Results on Matrix Analysis

Proposition B.0.1 shows that the inverse map of a matrix is a continuous map, which will be used frequently in later parts when we approximate the average prediction errors.

Proposition B.0.1 (Gentle (2007)[23]). Let $\Psi$ be an arbitrary invertible matrix. Then the map $f: \Psi \mapsto \Psi^{-1}$ is continuous.

Proof. Note that $\Psi^{-1} = |\Psi|^{-1}\mathrm{Adj}(\Psi)$, where $\mathrm{Adj}(\Psi)$ is the adjugate matrix of $\Psi$, i.e., $\mathrm{Adj}(\Psi) = C^T$ and $C$ is the cofactor matrix of $\Psi$ with $C_{ij} = (-1)^{i+j}M_{ij}$, where $M_{ij}$ is the $(i,j)$ minor of $\Psi$. Based on the definition of the determinant, it is easy to see that the map $g: \Psi\mapsto|\Psi|$ is continuous and the map $h: \Psi\mapsto\mathrm{Adj}(\Psi)$ is continuous as well. Therefore, the map $f: \Psi\mapsto\Psi^{-1}$ is continuous. $\Box$

Another result that will be used later is that any two matrix norms are equivalent, in the sense that for any given pair of matrix norms $\|\cdot\|_s$ and $\|\cdot\|_t$, there is a finite positive constant $C_{st}$ such that
\[
\|A\|_s \le C_{st}\|A\|_t, \qquad \forall A\in M_n,
\]
where $M_n$ is the collection of all $n\times n$ matrices.

Proof of Lemma 3.2.1

Proof. Note that
\[
E\Bigl[\frac{w_i^Tw_j}{m}\Bigr] = \frac{1}{m}\sum_{k=1}^m E[w_{ik}w_{jk}] = \sigma_{ij}.
\]
Now we consider the Taylor expansion of $K_{ij}(U)$ around $\sigma_{ij}$:
\[
K_{ij}(U) = g(\sigma_{ij}) + g'(\sigma_{ij})\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr) + \frac12 g''(\eta_{ij})\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2,
\]
where $\eta_{ij}$ is between $\sigma_{ij}$ and $w_i^Tw_j/m$. Then the truncation error can be evaluated as follows. For all $\delta>0$,
\begin{align*}
P\bigl(\bigl|K_{ij}(U) - \hat{K}_{ij}(U)\bigr| > \delta\bigr)
 &= P\Biggl(\frac12\bigl|g''(\eta_{ij})\bigr|\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr)
 \le P\Biggl(\frac{M}{2}\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr)\\
 &= P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \sigma_{ij}\Biggr| > \sqrt{\frac{2\delta}{M}}\Biggr).
\end{align*}
So it suffices to evaluate the tail probability $P\bigl(\bigl|w_i^Tw_j/m - \sigma_{ij}\bigr| > \sqrt{2\delta/M}\bigr)$. Note that
\[
w_j \mid w_i \sim N_m\Bigl(\frac{\sigma_{ij}}{\sigma_{ii}}w_i,\ \Bigl(\sigma_{jj} - \frac{\sigma_{ij}^2}{\sigma_{ii}}\Bigr)I_m\Bigr),
\]
so we get
\[
w_i^Tw_j \mid w_i \sim N\Bigl(\frac{\sigma_{ij}}{\sigma_{ii}}w_i^Tw_i,\ \Bigl(\sigma_{jj} - \frac{\sigma_{ij}^2}{\sigma_{ii}}\Bigr)w_i^Tw_i\Bigr).
\]
Therefore, given $w_i$, the random variable $w_i^Tw_j$ is sub-Gaussian with sub-Gaussian parameter $\sigma_{ii}^{-1}s_{ij}w_i^Tw_i$, where $s_{ij} = \sigma_{ii}\sigma_{jj} - \sigma_{ij}^2$. Since
\begin{align*}
P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \sigma_{ij}\Biggr| > \sqrt{\frac{2\delta}{M}}\Biggr)
 &\le P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \frac{\sigma_{ij}}{\sigma_{ii}}\frac{w_i^Tw_i}{m}\Biggr| > \frac12\sqrt{\frac{2\delta}{M}}\Biggr)
  + P\Biggl(\Biggl|\frac{\sigma_{ij}}{\sigma_{ii}}\frac{w_i^Tw_i}{m} - \sigma_{ij}\Biggr| > \frac12\sqrt{\frac{2\delta}{M}}\Biggr)\\
 &:= (\mathrm{I}) + (\mathrm{II}),
\end{align*}
it suffices to provide bounds for (I) and (II).

For (II), note that since $w_i \sim N_m(0,\sigma_{ii}I_m)$, we have $\frac{1}{\sigma_{ii}}w_i^Tw_i \sim \chi^2_m$, which implies that
\[
E\Bigl[e^{\lambda\bigl(\frac{w_i^Tw_i}{\sigma_{ii}} - m\bigr)}\Bigr]
 = e^{-\lambda m}E\Bigl[e^{\lambda\frac{w_i^Tw_i}{\sigma_{ii}}}\Bigr]
 = e^{-\lambda m}(1-2\lambda)^{-\frac{m}{2}} \ \text{ for } \lambda < \frac12,
 \qquad\text{and}\qquad
 e^{-\lambda m}(1-2\lambda)^{-\frac{m}{2}} \le e^{2m\lambda^2} \ \text{ for all } |\lambda| < \frac14,
\]
i.e., $\frac{1}{\sigma_{ii}}w_i^Tw_i$ is sub-exponential with parameters $(2\sqrt{m}, 4)$. Hence, for $\sigma_{ij}\ne 0$, by Theorem B.0.2, we have
\[
(\mathrm{II}) = P\Biggl(\Biggl|\frac{1}{\sigma_{ii}}w_i^Tw_i - m\Biggr| > \frac{m}{2|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\Biggr)
 \le 2\exp\Biggl\{-\frac{\delta m}{\sigma_{ij}^2M}\wedge\frac{m}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\Biggr\}. \qquad (\mathrm{B.1})
\]
If $\sigma_{ij} = 0$, then $(\mathrm{II}) = 0$.

For (I), by the Hoeffding inequality, we have
\[
(\mathrm{I}) = E_{w_i}\Biggl[P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \frac{\sigma_{ij}}{\sigma_{ii}}\frac{w_i^Tw_i}{m}\Biggr| > \frac12\sqrt{\frac{2\delta}{M}}\ \Bigg|\ w_i\Biggr)\Biggr]
 \le E_{w_i}\Biggl[2\exp\Biggl\{-\frac{\sigma_{ii}m^2\delta}{4Ms_{ij}w_i^Tw_i}\Biggr\}\Biggr]. \qquad (\mathrm{B.2})
\]
Theorem A in Inglot (2010)[34] states that for a random variable $\chi\sim\chi^2_m$, the $100(1-\alpha)$th percentile is upper bounded by $m + 2\log(1/\alpha) + 2\sqrt{m\log(1/\alpha)}$, which is of order $O(m)$ as $m\to\infty$. Now, since $\sigma_{ii}^{-1}w_i^Tw_i\sim\chi^2_m$, we get, for any $\alpha\in(0,1)$,
\[
P\Bigl(\sigma_{ii}^{-1}w_i^Tw_i \ge m + 2\log\frac{1}{\alpha} + 2\sqrt{m\log\frac{1}{\alpha}}\Bigr) \le \alpha.
\]
Let $q(\alpha,m) = m + 2\log(1/\alpha) + 2\sqrt{m\log(1/\alpha)}$. Since the function $\exp\{-a/x\}$ is increasing in $x$ for $a>0$, we can further bound (B.2) as follows:
\begin{align*}
(\mathrm{I})
 &\le E_{w_i}\Biggl[2\exp\Biggl\{-\frac{m^2\delta}{4Ms_{ij}\sigma_{ii}^{-1}w_i^Tw_i}\Biggr\}\mathbb{1}_{\{\sigma_{ii}^{-1}w_i^Tw_i\le q(\alpha,m)\}}\Biggr]
  + 2P\bigl(\sigma_{ii}^{-1}w_i^Tw_i \ge q(\alpha,m)\bigr)\\
 &\le 2\exp\Biggl\{-\frac{m^2\delta}{4Ms_{ij}q(\alpha,m)}\Biggr\} + 2\alpha.
\end{align*}
By choosing $\alpha = \exp\{-m\}$, we get $q(\alpha,m) = m + 2m + 2\sqrt{m^2} = 5m$, so that
\[
(\mathrm{I}) \le 2\exp\Biggl\{-\frac{m^2\delta}{20Mms_{ij}}\Biggr\} + 2\exp\{-m\}
 \le 4\exp\Biggl\{-m\Bigl(1\wedge\frac{\delta}{20Ms_{ij}}\Bigr)\Biggr\}. \qquad (\mathrm{B.3})
\]
Combining (B.3) and (B.1), we obtain, for all $\delta>0$,
\[
P\bigl(\bigl|K_{ij}(U) - \hat{K}_{ij}(U)\bigr| > \delta\bigr) \le (\mathrm{I}) + (\mathrm{II})
 \le 4\exp\Biggl\{-m\Biggl(1\wedge\frac{\delta}{20Ms_{ij}}\wedge\frac{\delta}{M\sigma_{ij}^2}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\Biggr)\Biggr\}. \qquad\Box
\]

Remark B.0.1. The condition (3.12) can be weakened as follows:
\[
g''\Bigl(\lambda\sigma_{ij} + (1-\lambda)\frac{w_i^Tw_j}{m}\Bigr) = O_p(1) \qquad\text{for all }\lambda\in[0,1].
\]
In this case, the evaluation of the truncation error can be modified as follows. For all $\delta>0$, there exists $M_\delta>0$ such that
\[
P\bigl(\bigl|g''(\eta_{ij})\bigr| > M_\delta\bigr) < \frac{\delta}{2},
\]
where $\eta_{ij} = \lambda\sigma_{ij} + (1-\lambda)\frac{w_i^Tw_j}{m}$ for some $\lambda\in[0,1]$, and then
\begin{align*}
P\bigl(\bigl|K_{ij}(U) - \hat{K}_{ij}(U)\bigr| > \delta\bigr)
 &= P\Biggl(\frac12\bigl|g''(\eta_{ij})\bigr|\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr)\\
 &\le P\Biggl(\Biggl\{\frac12\bigl|g''(\eta_{ij})\bigr|\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr\}\cap\bigl\{|g''(\eta_{ij})|\le M_\delta\bigr\}\Biggr)
   + P\bigl(|g''(\eta_{ij})| > M_\delta\bigr)\\
 &\le P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \sigma_{ij}\Biggr| > \sqrt{\frac{2\delta}{M_\delta}}\Biggr) + \frac{\delta}{2}\\
 &\le 4\exp\Biggl\{-m\Biggl(1\wedge\frac{\delta}{20M_\delta s_{ij}}\wedge\frac{\delta}{M_\delta\sigma_{ij}^2}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M_\delta}}\Biggr)\Biggr\} + \frac{\delta}{2}.
\end{align*}
Now, we can choose
\[
m > \Biggl(1\wedge\frac{\delta}{20M_\delta s_{ij}}\wedge\frac{\delta}{M_\delta\sigma_{ij}^2}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M_\delta}}\Biggr)^{-1}\log\frac{8}{\delta}
\]
so that the bound above is less than $\frac{\delta}{2} + \frac{\delta}{2} = \delta$, and hence $K_{ij}(U) = \hat{K}_{ij}(U) + o_p(1)$.

Proof of Lemma 3.3.1

Proof. (i) When $f(x) = x$, we have $K(U) = \frac{1}{m}UU^T$, and since the hidden random vectors $u_1,\ldots,u_m$ are i.i.d.,
\[
u_1u_1^T,\ldots,u_mu_m^T \sim \text{i.i.d. } W_n\Bigl(1, \sum_{l=1}^L\xi_lK_l(X)\Bigr),
\]
where $W_n\bigl(1, \sum_{l=1}^L\xi_lK_l(X)\bigr)$ stands for a Wishart distribution with one degree of freedom and covariance matrix $\sum_{l=1}^L\xi_lK_l(X)$. Therefore, the Strong Law of Large Numbers implies that, as $m\to\infty$,
\[
K(U) = \frac{1}{m}UU^T = \frac{1}{m}\sum_{i=1}^m u_iu_i^T \to E\bigl[u_1u_1^T\bigr] = \sum_{l=1}^L\xi_lK_l(X) \quad\text{a.s.}
\]
Since the map $\psi: A\mapsto A^{-1}$ for a non-singular matrix $A$ is continuous, we can obtain the following result by the Continuous Mapping Theorem:
\[
(\tilde{\tau}K(U) + I_n)^{-1} \to \Bigl(\tilde{\tau}\sum_{l=1}^L\xi_lK_l(X) + I_n\Bigr)^{-1} \quad\text{a.s., as } m\to\infty.
\]
Let $\tilde{A} = (\tilde{\tau}K(U) + I_n)^{-1}$. We have
\[
\max_{1\le i,j\le n}|\tilde{A}_{ij}| \le \|\tilde{A}\|_\infty \lesssim \|\tilde{A}\|_{op}
 = \lambda_{\max}\bigl((\tilde{\tau}K(U) + I_n)^{-1}\bigr)
 = \frac{1}{\tilde{\tau}\lambda_{\min}(K(U)) + 1} \le 1 < \infty.
\]
Therefore, by the Bounded Convergence Theorem,
\[
A := E\bigl[(\tilde{\tau}K(U) + I_n)^{-1}\bigr] \to \Bigl(\tilde{\tau}\sum_{l=1}^L\xi_lK_l(X) + I_n\Bigr)^{-1} \quad\text{as } m\to\infty.
\]
Asymptotically, we get, as $m\to\infty$,
\[
R \simeq y^T\Bigl(\tilde{\tau}\sum_{l=1}^L\xi_lK_l(X) + I_n\Bigr)^{-2}y.
\]

(ii) Note that equation (3.13) can be further written as $K(U) = f[\Sigma] + o_P(1)$, or equivalently, $K(U)\xrightarrow{P} f[\Sigma]$ as $m\to\infty$ element-wise. Similarly, under the assumption that $\|K(U)\|_{op} < \infty$ a.s., we have
\[
E[\tilde{\tau}K(U) + I_n] \to \tilde{\tau}f[\Sigma] + I_n.
\]
Hence, by the Bounded Convergence Theorem and the Continuous Mapping Theorem, we have
\[
A = E\bigl[(\tilde{\tau}K(U) + I_n)^{-1}\bigr] \to (\tilde{\tau}f[\Sigma] + I_n)^{-1} \quad\text{as } m\to\infty,
\]
which shows that, as $m\to\infty$,
\[
R \simeq y^T\Bigl(\tilde{\tau}f\Bigl[\sum_{l=1}^L\xi_lK_l(X)\Bigr] + I_n\Bigr)^{-2}y. \qquad\Box
\]

Proof of Proposition 3.3.2

Proof. The result follows by noting that
\begin{align*}
APE_{LMM}
 &= E\bigl[(y - E[a\mid y])^T(y - E[a\mid y])\bigr]\\
 &= E\Bigl[y^T\bigl(I_n - \tilde{\sigma}_R^2\Sigma(\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)^T\bigl(I_n - \tilde{\sigma}_R^2\Sigma(\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)y\Bigr]\\
 &= E\Bigl[y^T\bigl((\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)^2y\Bigr]
  = \mathrm{tr}\Bigl[\bigl((\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)^2\bigl(\sigma_R^2\Sigma + \phi I_n\bigr)\Bigr]\\
 &= \phi\,\mathrm{tr}\bigl[(\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr]
  = \phi\sum_{i=1}^n\frac{1}{\tilde{\sigma}_R^2\lambda_i(\Sigma) + 1}. \qquad\Box
\end{align*}

More Simulation Results

Figure B.1 demonstrates the performance of LMM and KNN in terms of prediction error when the inverse logistic function and the polynomial function of order 2 are used.
Figure B.1: Boxplots of prediction errors for the linear mixed model (LMM) and the kernel neural network (KNN). The left panel shows the results when an inverse logistic function is used and the right panel shows the results when a polynomial function of order 2 is used. On the horizontal axis, "1" corresponds to the LMM; "2" corresponds to the KNN with product input kernel and product output kernel; "3" corresponds to the KNN with product input and polynomial output; "4" corresponds to the KNN with polynomial input and product output; and "5" corresponds to the KNN with polynomial input and polynomial output.

Appendix C

Technical Details and Supplementary Materials for Chapter 4

In this appendix, we explore some basic properties of the parameter space $(\mathcal{F}, \|\cdot\|_n)$ discussed in Chapter 4.

Proposition C.0.1. The space $(\mathcal{F}, \|\cdot\|_n)$ is a pseudo-normed space.

Proof. Note that $\|f\|_n = \bigl(\frac{1}{n}\sum_{i=1}^n f^2(x_i)\bigr)^{1/2}$. Then:

(i) Based on the definition of $\|\cdot\|_n$, it is clear that $\|f\|_n \ge 0$ for any $f\in\mathcal{F}$.

(ii) For any $\lambda\in\mathbb{R}$ and $f\in\mathcal{F}$,
\[
\|\lambda f\|_n = \Bigl(\frac{1}{n}\sum_{i=1}^n\lambda^2f^2(x_i)\Bigr)^{1/2} = |\lambda|\|f\|_n.
\]

(iii) For any $f,g\in\mathcal{F}$,
\[
\|f+g\|_n = \Bigl(\frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)+g(x_i)\bigr)^2\Bigr)^{1/2}
 = \Biggl(\sum_{i=1}^n\Bigl(\frac{f(x_i)}{\sqrt{n}} + \frac{g(x_i)}{\sqrt{n}}\Bigr)^2\Biggr)^{1/2}
 \le \Biggl(\sum_{i=1}^n\Bigl(\frac{f(x_i)}{\sqrt{n}}\Bigr)^2\Biggr)^{1/2} + \Biggl(\sum_{i=1}^n\Bigl(\frac{g(x_i)}{\sqrt{n}}\Bigr)^2\Biggr)^{1/2}
 = \|f\|_n + \|g\|_n,
\]
where we have used the triangle inequality for the classical Euclidean norm.

Therefore, $(\mathcal{F}, \|\cdot\|_n)$ is a pseudo-normed space. $\Box$

Proposition C.0.2. There is a pseudo-inner product on $\mathcal{F}$ such that $\|f\|_n^2 = \langle f,f\rangle$ for any $f\in\mathcal{F}$. Moreover, the pseudo-inner product is given by
\[
\langle f,g\rangle = \frac{1}{n}\sum_{i=1}^n f(x_i)g(x_i), \qquad \forall f,g\in\mathcal{F}.
\]

Proof. Based on the theorem attributed to Fréchet, von Neumann and Jordan (see, for example, Proposition 14.1.2 in Blanchard and Brüning (2015)[7]), to show the existence of the inner product it suffices to check the parallelogram law for the pseudo-norm; the corresponding pseudo-inner product can then be obtained via the polarization identity. To check the validity of the parallelogram law, note that for any $f,g\in\mathcal{F}$,
\begin{align*}
\|f+g\|_n^2 + \|f-g\|_n^2
 &= \frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)+g(x_i)\bigr)^2 + \frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)-g(x_i)\bigr)^2\\
 &= \frac{2}{n}\sum_{i=1}^n f^2(x_i) + \frac{2}{n}\sum_{i=1}^n g^2(x_i)
  = 2\|f\|_n^2 + 2\|g\|_n^2.
\end{align*}
Hence the parallelogram law is satisfied by the pseudo-norm, so the pseudo-inner product does exist, and by the polarization identity we get, for any $f,g\in\mathcal{F}$,
\[
\langle f,g\rangle = \frac14\bigl(\|f+g\|_n^2 - \|f-g\|_n^2\bigr) = \frac{1}{n}\sum_{i=1}^n f(x_i)g(x_i). \qquad\Box
\]
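As a quick numerical illustration of Propositions C.0.1 and C.0.2, the following sketch evaluates the empirical norm and pseudo-inner product on simulated design points and checks the parallelogram law and the polarization identity. The design points, the two test functions and the sample size are arbitrary choices made only for this illustration.

import numpy as np

def empirical_norm(f_vals):
    # ||f||_n = ( (1/n) * sum_i f(x_i)^2 )^(1/2)
    return np.sqrt(np.mean(f_vals ** 2))

def empirical_inner(f_vals, g_vals):
    # <f, g> = (1/n) * sum_i f(x_i) * g(x_i)
    return np.mean(f_vals * g_vals)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
f, g = np.sin(x), np.cos(2.0 * x)   # two arbitrary functions evaluated at the design points

lhs = empirical_norm(f + g) ** 2 + empirical_norm(f - g) ** 2
rhs = 2.0 * empirical_norm(f) ** 2 + 2.0 * empirical_norm(g) ** 2
print(np.isclose(lhs, rhs))         # parallelogram law

polar = 0.25 * (empirical_norm(f + g) ** 2 - empirical_norm(f - g) ** 2)
print(np.isclose(polar, empirical_inner(f, g)))   # polarization identity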
Let
\[
\mathcal{G} = \Bigl\{g: \mathbb{R}\to\mathbb{R},\ \int\bigl|g'(z)\bigr|\,dz \le M\Bigr\}
\]
be the class of functions of bounded variation on $\mathbb{R}$ (see Example 9.3.3 in van de Geer (2000)[68]). The following proposition shows that $\mathcal{F}_{r_n}\subset\mathcal{G}$ for fixed $n$.

Proposition C.0.3. For fixed $n$, $\mathcal{F}_{r_n}\subset\mathcal{G}$.

Proof. For any $f\in\mathcal{F}_{r_n}$, we have
\[
f(x) = \alpha_0 + \sum_{j=1}^{r_n}\alpha_j\sigma(\gamma_jx + \gamma_{0,j}),
\]
so that
\[
f'(x) = \sum_{j=1}^{r_n}\alpha_j\gamma_j\sigma(\gamma_jx + \gamma_{0,j})\bigl[1 - \sigma(\gamma_jx + \gamma_{0,j})\bigr].
\]
Without loss of generality, we assume that $\gamma_j\ne 0$ for $j = 1,\ldots,r_n$. Then note that
\begin{align*}
\int|f'(x)|\,dx
 &= \int\Biggl|\sum_{j=1}^{r_n}\alpha_j\gamma_j\sigma(\gamma_jx+\gamma_{0,j})\bigl[1-\sigma(\gamma_jx+\gamma_{0,j})\bigr]\Biggr|\,dx
 \le \sum_{j=1}^{r_n}|\alpha_j||\gamma_j|\int\sigma(\gamma_jx+\gamma_{0,j})\bigl[1-\sigma(\gamma_jx+\gamma_{0,j})\bigr]\,dx\\
 &= \sum_{j=1}^{r_n}|\alpha_j|\frac{|\gamma_j|}{\gamma_j}\int\sigma(u_j)\bigl(1-\sigma(u_j)\bigr)\,du_j,
\end{align*}
where in the last equality we let $u_j = \gamma_jx + \gamma_{0,j}$. Clearly, $|\gamma_j|/\gamma_j = \mathrm{sign}(\gamma_j)$. Moreover, since
\[
\int\sigma(x)\bigl(1-\sigma(x)\bigr)\,dx = \int\frac{e^x}{(1+e^x)^2}\,dx
 = \int_0^\infty\frac{du}{(1+u)^2}\quad(\text{letting } u = e^x)
 = -\frac{1}{1+u}\Bigl|_0^\infty = 1,
\]
we get, for fixed $n$,
\[
\int|f'(x)|\,dx \le \sum_{j=1}^{r_n}|\alpha_j|\,\mathrm{sign}(\gamma_j) \le \sum_{j=1}^{r_n}|\alpha_j| \le V_n.
\]
Therefore, $f\in\mathcal{G}$ and the desired result follows. $\Box$

BIBLIOGRAPHY

[1] Gad Abraham and Michael Inouye. Genomic risk prediction of complex human disease and its clinical application. Current Opinion in Genetics & Development, 33:10–16, 2015.
[2] Kenneth S Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. The Annals of Probability, pages 1041–1067, 1984.
[3] Takeshi Amemiya. A note on a heteroscedastic model. Journal of Econometrics, 6(3):365–370, 1977.
[4] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
[5] Sudipto Banerjee, Bradley P Carlin, and Alan E Gelfand. Hierarchical modeling and analysis for spatial data. CRC Press, 2014.
[6] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
[7] Philippe Blanchard and Erwin Brüning. Mathematical methods in physics: Distributions, Hilbert space operators, variational methods, and applications in quantum physics, volume 69. Birkhäuser, 2015.
[8] Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240. Springer, 2004.
[9] Stephen Boyd and Almir Mutapcic. Subgradient methods (notes for EE364B, Winter 2006-07, Stanford University), 2008.
[10] D Brook. On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika, 51(3/4):481–483, 1964.
[11] Valeriĭ Vladimirovich Buldygin and IU V Kozachenko. Metric characterization of random variables and random processes, volume 188. American Mathematical Soc., 2000.
[12] Xiaohong Chen and Xiaotong Shen. Sieve extremum estimates for weakly dependent data. Econometrica, pages 289–314, 1998.
[13] 1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–1073, 2010.
[14] Wellcome Trust Case Control Consortium et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661, 2007.
[15] Robert R Corbeil and Shayle R Searle. Restricted maximum likelihood (REML) estimation of variance components in the mixed model. Technometrics, 18(1):31–38, 1976.
[16] Robert B Davies. Algorithm AS 155: The distribution of a linear combination of χ² random variables. Journal of the Royal Statistical Society, Series C (Applied Statistics), 29(3):323–333, 1980.
[17] Robert B Davies. Hypothesis testing when a nuisance parameter is present only under the alternatives. Biometrika, 74:33–43, 1987.
[18] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.
[19] Luca Ferrarini, Walter M Palm, Hans Olofsen, Mark A van Buchem, Johan HC Reiber, and Faiza Admiraal-Behloul. Shape differences of the brain ventricles in Alzheimer's disease. NeuroImage, 32(3):1060–1069, 2006.
[20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2001.
[21] Kenji Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871–879, 1996.
[22] Kenji Fukumizu et al. Likelihood ratio of unidentifiable models and multilayer neural networks. The Annals of Statistics, 31(3):833–851, 2003.
[23] James E Gentle. Matrix algebra: theory, computations, and applications in statistics. Springer Science & Business Media, 2007.
[24] David B Goldstein, Andrew Allen, Jonathan Keebler, Elliott H Margulies, Steven Petrou, Slavé Petrovski, and Shamil Sunyaev. Sequencing studies in human genetics: design and interpretation. Nature Reviews Genetics, 14(7):460, 2013.
[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.
[26] Ulf Grenander. Abstract Inference. Wiley, New York, 1981.
[27] Zihuai He, Min Zhang, Xiaowei Zhan, and Qing Lu. Modeling and testing for joint association using a genetic random field model. Biometrics, 70(3):471–479, 2014.
[28] Charles R Henderson. Sire evaluation and genetic trends. Journal of Animal Science, 1973(Symposium):10–41, 1973.
[29] Fumio Hiai. Monotonicity for entrywise functions of matrices. Linear Algebra and its Applications, 431(8):1125–1146, 2009.
[30] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[31] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge University Press, 1990.
[32] Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
[33] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[34] Tadeusz Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30(2):339–351, 2010.
[35] Luke Jostins and Jeffrey C Barrett. Genetic risk prediction in complex disease. Human Molecular Genetics, 20(R2):R182–R188, 2011.
[36] K Juottonen, MP Laakso, R Insausti, M Lehtovirta, A Pitkänen, K Partanen, and H Soininen. Volumes of the entorhinal and perirhinal cortices in Alzheimer's disease. Neurobiology of Aging, 19(1):15–22, 1998.
[37] Raghu N Kackar and David A Harville. Unbiasedness of two-stage estimation and prediction procedures for mixed linear models. Communications in Statistics - Theory and Methods, 10(13):1249–1261, 1981.
[38] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[39] Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2005.
[40] Bingshan Li and Suzanne M Leal. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3):311–321, 2008.
[41] Ming Li, Zihuai He, Min Zhang, Xiaowei Zhan, Changshuai Wei, Robert C Elston, and Qing Lu. A generalized genetic random field method for the genetic association analysis of sequencing data. Genetic Epidemiology, 38(3):242–253, 2014.
[42] Dawei Liu, Xihong Lin, and Debashis Ghosh. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics, 63(4):1079–1088, 2007.
[43] Xin Liu and Yongzhao Shao. Asymptotics for likelihood ratio tests under loss of identifiability. The Annals of Statistics, 31(3):807–832, 2003.
[44] Bo Eskerod Madsen and Sharon R Browning. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics, 5(2):e1000384, 2009.
[45] Yuly Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109, 1996.
[46] Jon McClellan and Mary-Claire King. Genetic heterogeneity in human disease. Cell, 141(2):210–217, 2010.
[47] Charles E McCulloch, S R Searle, and John M Neuhaus. Generalized, linear, and mixed models. Wiley, Hoboken, NJ, 2nd edition, 2008.
[48] Shahar Mendelson. A few notes on statistical learning theory. In Advanced Lectures on Machine Learning, pages 1–40. Springer, 2003.
[49] Stephan Morgenthaler and William G Thilly. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615(1):28–56, 2007.
[50] NE Morton and CJ MacLean. Analysis of family resemblance. 3. Complex segregation of quantitative traits. American Journal of Human Genetics, 26(4):489, 1974.
[51] Yangling Mu and Fred H Gage. Adult hippocampal neurogenesis and its role in Alzheimer's disease. Molecular Neurodegeneration, 6(1):1, 2011.
[52] Wei Pan, Junghi Kim, Yiwei Zhang, Xiaotong Shen, and Peng Wei. A powerful and adaptive association test for rare variants. Genetics, 197(4):1081–1095, 2014.
[53] Yongjin Park and Manolis Kellis. Deep learning for regulatory genomics. Nature Biotechnology, 33(8):825, 2015.
[54] Long Qu, Tobias Guennel, and Scott L Marshall. Linear score tests for variance components in linear mixed models and applications to genetic association studies. Biometrics, 69(4):883–892, 2013.
[55] C Radhakrishna Rao. Estimation of heteroscedastic variances in linear models. Journal of the American Statistical Association, 65(329):161–172, 1970.
[56] C Radhakrishna Rao. Estimation of variance and covariance components—MINQUE theory. Journal of Multivariate Analysis, 1(3):257–275, 1971.
[57] C Radhakrishna Rao. Estimation of variance and covariance components in linear models. Journal of the American Statistical Association, 67(337):112–115, 1972.
[58] Havard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. CRC Press, 2005.
[59] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
[60] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[61] N Schuff, N Woerner, L Boreta, T Kornfield, LM Shaw, JQ Trojanowski, PM Thompson, CR Jack, MW Weiner, the Alzheimer's Disease Neuroimaging Initiative, et al. MRI of hippocampal volume loss in early Alzheimer's disease in relation to APOE genotype and biomarkers. Brain, 132(4):1067–1077, 2009.
[62] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, 316(5829):1341–1345, 2007.
[63] John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[64] Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591, 1997.
[65] Xiaotong Shen and Wing Hung Wong. Convergence rate of sieve estimates. The Annals of Statistics, pages 580–615, 1994.
[66] Steven E Stern and Alan H Welsh. Likelihood inference for small variance components. Canadian Journal of Statistics, 28(3):517–532, 2000.
[67] Madhav Thambisetty, Andrew Simmons, Abdul Hye, James Campbell, Eric Westman, Yi Zhang, Lars-Olof Wahlund, Anna Kinsey, Mirsada Causevic, Richard Killick, et al. Plasma biomarkers of brain atrophy in Alzheimer's disease. PLoS ONE, 6(12):e28527, 2011.
[68] Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.
[69] Aad W van der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996.
[70] Vladimir Vapnik. Statistical learning theory, volume 3. Wiley, New York, 1998.
[71] Sheng Wang, Jian Peng, Jianzhu Ma, and Jinbo Xu. Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports, 6:18962, 2016.
[72] Yalu Wen, Xiaoxi Shen, and Qing Lu. Genetic risk prediction using a spatial autoregressive model with adaptive lasso. Statistics in Medicine, 37(26):3764–3775, 2018.
[73] Halbert White. Learning in artificial neural networks: A statistical perspective. Neural Computation, 1(4):425–464, 1989.
[74] Halbert White. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3(5):535–549, 1990.
[75] Halbert White and Jeffrey Racine. Statistical inference, the bootstrap, and neural-network modeling with application to foreign exchange rates. IEEE Transactions on Neural Networks, 12(4):657–673, 2001.
[76] Halbert White and J Wooldridge. Some results on sieve estimation with dependent observations. In W.A. Barnett, J. Powell, and G. Tauchen, editors, Nonparametric and Semiparametric Methods in Economics, pages 459–493. Cambridge University Press, New York, 1991.
[77] Chien-Fu Wu. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, pages 501–513, 1981.
[78] Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93, 2011.
[79] Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.
[80] Hongtu Zhu and Heping Zhang.
Asymptotics for estimation and testing procedures under loss of identifiability. Journal of Multivariate Analysis, 97(1):19–45, 2006.