STATISTICAL ANALYSIS FOR NETWORK-BASED MODELS WITH APPLICATIONS TO GENETIC ASSOCIATION AND PREDICTION

By

Xiaoxi Shen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics — Doctor of Philosophy

2019

ABSTRACT

STATISTICAL ANALYSIS FOR NETWORK-BASED MODELS WITH APPLICATIONS TO GENETIC ASSOCIATION AND PREDICTION

By

Xiaoxi Shen

Network-based models are popular in statistical applications. The main advantage of using a network-based model is that its semantics and properties can be understood and evaluated rather straightforwardly. In this dissertation, we study the statistical properties of several network-based models and apply these models to genetic association studies and genetic risk prediction.

In Chapter 2, we propose a conditional autoregressive (CAR) model to account for possible heterogeneous genetic effects among individuals. In the proposed method, the genetic effect is considered as a random effect and a score test is developed to test the variance component of the genetic random effect. Through simulations, we compare the type I error and power performance of the proposed method with those of the generalized genetic random field (GGRF) and the sequence kernel association test (SKAT) methods under different disease scenarios. We find that our method outperforms the other two methods when (i) the rare variants have the major contribution to the disease, or (ii) the genetic effects vary in different individuals or subgroups of individuals. Finally, we illustrate the new method by applying it to the whole genome sequencing data from the Alzheimer's Disease Neuroimaging Initiative.

A kernel-based neural network (KNN) method is proposed in Chapter 3 for genetic risk prediction. KNN inherits the high-dimensional feature from classical kernel methods and the non-linear and non-additive features from neural networks. KNN summarizes a large number of genetic variants into kernel matrices and uses the kernel matrices as input matrices. Based on these kernel matrices, KNN builds a feedforward neural network to model the complex relationship between genetic variants and a disease outcome. Minimum norm quadratic unbiased estimation (MINQUE) is implemented in KNN to make parameter estimation feasible. Through theoretical proof and simulations, we demonstrate that KNN can attain lower average prediction error than linear mixed models (LMM). Finally, we illustrate KNN by an application to the sequencing data from the Alzheimer's Disease Neuroimaging Initiative.

Nowadays, neural networks are widely applied in machine learning and artificial intelligence. However, viewed as statistical models, neural networks have received relatively little research attention regarding their statistical properties, and these properties are studied in Chapter 4. A neural network can be viewed as a nonlinear regression model. However, if we consider it parametrically, the unidentifiability of the parameters makes it difficult to derive its asymptotic properties. Instead, we consider the estimation problem as a nonparametric regression problem and use results from sieve estimation to establish the consistency, rates of convergence and asymptotic normality of the neural network estimators. We also illustrate the validity of the theories via simulations.

I dedicate this dissertation to my parents, Xinmei Dai and Xiong Shen, and my friends, Chang, Carlos, Fengkan, Julia, Shunjie and many others.

ACKNOWLEDGMENTS

Here, I would like to express my deepest gratitude to my advisors Dr. Qing Lu and Dr.
Yuehua Cui for their support and guidance throughout my studies and research. Dr. Lu and Dr. Cui are extremely kind and knowledgeable. They always provide valuable insights and suggestions on how to improve my work. I would also like to extend my sincere thanks to my dissertation committee members, Dr. Lyudmila Sakhanenko and Dr. Yimin Xiao. Their comments and suggestions have been beneficial to my studies.

I am also grateful for the help I received from all the professors in the Department of Statistics and Probability. During my Ph.D. studies, I took most of the courses offered by our department. From each course, I learned various techniques to solve different research problems. For example, I learned how to consider a problem in a Bayesian way and how to interpret Bayesian results from the course I took with Dr. Tapabrata Maiti.

During my seven years at Michigan State University, I made lots of friends, and because of them I never felt lonely in these years. Many thanks to my senpais: Dr. Honglang Wang, Dr. Tao He, Dr. Bin Gao, Dr. Shunjie Guan and Dr. Jingyi Zhang. They have become successful faculty members and statisticians in big companies. My thanks also go to my friends: Chang Jiang, Shan Zhang, Xiaoran Tong, Kaixu Yang, Yuning Hao, Jinghang Lin, Yuan Zhou and Di Wu. Without your help, I could not be who I am now, and I sincerely wish all of you a wonderful future.

Last but not least, I would like to express my sincere thanks to my parents for their support and confidence in me. I would like to treat this dissertation as my unique gift to both of you.

TABLE OF CONTENTS

LIST OF TABLES  viii
LIST OF FIGURES  x
Chapter 1  Introduction  1
  1.1 Overview  1
  1.2 Linear Mixed Models for Genetic Association Studies  3
  1.3 Statistical Learning  5
    1.3.1 Gaussian Graphical Models  5
    1.3.2 Kernel Methods  6
    1.3.3 Neural Networks  10
  1.4 Objective and Organization  13
Chapter 2  A Conditional Autoregressive Model for Genetic Association Analysis Accounting for Genetic Heterogeneity  15
  2.1 Introduction  15
  2.2 Methods  17
    2.2.1 Motivation  17
    2.2.2 Model Setup  22
    2.2.3 Score Test for Variance Component  25
      2.2.3.1 γ As a Fixed Constant  26
      2.2.3.2 γ As an Unknown Nuisance Parameter  28
  2.3 Simulation Results  30
    2.3.1 Simulation I: Heterogeneous Genetic Effects Among Individuals or Subgroups  33
    2.3.2 Simulation II: Various Causal SNV Rates  37
    2.3.3 Simulation III: Misspecification of Weights  38
  2.4 Real Data Applications  44
  2.5 Discussion  48
Chapter 3  A Kernel-Based Neural Network for High-dimensional Genetic Risk Prediction Analysis  50
  3.1 Introduction  50
  3.2 Methodologies  52
    3.2.1 Quadratic Estimators for Variance Components  58
    3.2.2 MINQUE in KNN  61
  3.3 Predictions  66
  3.4 Including Fixed Effects  72
  3.5 Simulations  76
    3.5.1 Nonlinear Random Effect  76
    3.5.2 Nonadditive Effects  78
    3.5.3 Non-normal Error Distributions  80
  3.6 Real Data Application  82
  3.7 Discussion  85
Chapter 4  Asymptotic Properties of Neural Network Sieve Estimators  87
  4.1 Introduction  87
  4.2 Existence  92
  4.3 Consistency  96
  4.4 Rate of Convergence  108
  4.5 Asymptotic Normality  115
  4.6 Simulation Studies  131
    4.6.1 Parameter Inconsistency  132
    4.6.2 Consistency for Neural Network Sieve Estimators  134
    4.6.3 Asymptotic Normality for Neural Network Sieve Estimators  136
  4.7 Discussion  142
Chapter 5  Epilogue  144
APPENDICES  147
  Appendix A Technical Details and Supplementary Materials for Chapter 2  148
  Appendix B Technical Details and Supplementary Materials for Chapter 3  155
  Appendix C Technical Details and Supplementary Materials for Chapter 4  167
BIBLIOGRAPHY  171

LIST OF TABLES

Table 2.1: Empirical type I error rates under different weights at level α = 0.05 and α = 0.01 based on 1000 replicates. Each cell in the table contains the empirical type I error rate. GGRF, SKAT, CAR.FIX and CAR.SUP are the generalized genetic random field model of Li et al. (2014) [41], the sequence kernel association test of Wu et al. (2011) [78], the conditional autoregressive model with fixed nuisance parameter γ, and the conditional autoregressive model with maximum score test statistic, respectively.

Table 2.2: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates.
In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0005, 0.002, 0.001, 0.003, 0.0001, 0.0005, 0.002 for BETA and WSS; with σZ = 0.01, 0.05, 0.2, 0.1, 0.3, 0.01, 0.05, 0.2 for UW and LOG. The number in the parenthesis is the standard deviation. . . . Table 2.3: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates. In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0001, 0.0002, 0.005, 0.0003, 0.0005, 0.0001, 0.005 for BETA and WSS; with σZ = 0.01, 0.01, 0.02, 0.5, 0.03, 0.05, 0.01, 0.5 for UW and LOG. The number in the parenthesis is the standard deviation. . 33 36 37 Table 2.4: Top 10 hippocampus-associated genes detected by GGRF, SKAT and CAR. 47 Table 3.1: Average mean squared prediction error of KNN with product output kernel matrix, KNN with output kernel matrix as polynomial of order 2 and the BLUP based on LMM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Table 4.1: Comparison of the true parameters and the estimated parameters in a single-layer neural network with 2 hidden units. . . . . . . . . . . . . . . . 132 Table 4.2: Comparison of errors (cid:107) ˆfn − f0(cid:107)2 n and the least square errors Qn( ˆfn) after 20,000 iterations under different sample sizes. . . . . . . . . . . . . . . . . 136 Table 4.3: p-values for Shapiro-Wilks test and Kolmogorov-Smirnov test for nomality test. We use “NN" to denote the true function as a neural network described in (4.13); “TRI" to denote the true function as a trigonometric function described in (4.14) and “ND" to denote the true function as a continuous function having a non-differential point described in (4.15) . . . . . . . . . 138 Table A.1: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with En- torhinal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 viii Table A.2: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Ventricle.153 Table A.3: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Whole Brain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 ix LIST OF FIGURES Figure 1.1: Diagram of a neuron with main components (Figure from Wiki). . . . . . 10 Figure 1.2: A perceptron that mimics the structure of a neuron in a human brain. x1, . . . , xp are input features and x0 is a bias unit or intercept. Σ is known as a computation unit, where a linear combination of the features, including the bias unit and the parameters θ to be learned is calculated and then an activation function is applied. In this example, a standard sigmoid activation function σ(x) = (1 + e−x)−1 is applied. . . . . . . . . Figure 1.3: A neural network with one hidden layer. x1, . . . , xp are input features and x0 is a bias unit or intercept. Units in the dashed rectangle are known as hidden computation units, where a linear combination of features from previous layer and associated parameters is calculated and an activation function is applied to obtain outputs for these hidden units. Σ at the end is an output computation unit, in which a linear combination of the output features from the last hidden layers and associated weights is calcuated and an acitivation function is applied. . . . . . . . . . . . . . . . . . . . Figure 2.1: An illustration of a Gaussian graphical model. For the CAR model, the graph represents a network of the genetic effects among individuals. 
In the graph, each node stands for the gentic effect of an indiviudal and the ex- istence of an edge between two individuals indicates dependence between genetic effects between these two individuals. This is characterized by the precision matrix in the joint normal distribution of these genetic effects. . Figure 2.2: The Distribution of minor allele frequency of in sequencing variants on . . . . . . . . . . . . . . chromosome 17 from the 1,000 Genome project. Figure 2.3: Empirical Power Comparison of CAR, SKAT and GGRF by varying the levels of genetic heterogeneity among individuals under four different weights. The x-axis is the effect size σZ, used in the simulation. The effect sizes are chosen as 0.01, 0.05, 0.075, 0.1, 0.3, 0.5 for the UW and LOG weights, and the effect sizes are set as 0.0005, 0.00075, 0.001, 0.0015, 0.002 for the BETA and WSS weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 2.4: Comparison of empirical power of CAR, SKAT and GGRF with different causal SNV rates under the BETA weight and the WSS weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.0005, 0.001, 0.002, 0.005, respectively in each column from left to right. 11 12 20 32 35 39 x Figure 2.5: Comparison of empirical power of CAR, SKAT and GGRF with dif- ferent causal SNV rates under the UW weight and the LOG weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.05, 0.1, 0.2, 0.5 respectively in each column from left to right. . . . 40 Figure 2.6: Comparison of empirical power of CAR, SKAT and GGRF with misspec- ified weights when the true weight is BETA and WSS. In each column from left to right, we used BETA, LOG, UW and WSS weight, respectively. 42 Figure 2.7: Comparison of empirical power of CAR, SKAT and GGRF with misspec- ified weights when the true weight is UW and LOG. In each column from . left to right, we used BETA, LOG, UW and WSS weight, respectively. Figure 2.8: Histograms of the Phenotypes used for Real Data Analyses . . . . . . . . Figure 3.1: An illustration of the hierarchical structure of the kernel neural network model with L input kernel matrices K1(G), . . . , KL(G) and J hidden kernel matrices K1(U ), . . . , KJ (U ). The output layer is the prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . for the random effect f. Figure 3.2: The intuition under the assumption ˜σ2 R ≤ ˜τ ξ in Proposition 3.3.2. . . . . 43 45 59 70 Figure 3.3: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors. The left panel shows the results when a linear function is used and the right panel shows the results when a sine function is used. In the horizontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. 77 Figure 3.4: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model fo- cusing on the interaction effect. The vertical axis is scaled to 0-5 by removing some outliers to make the comparison visually clear. 
In the hor- izontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. . . . . . . . . . . . . . . 79 xi Figure 3.5: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model us- ing dominant coding (left figure) and recessive coding (right figure) for SNPs. In the horizontal axis, “1" corresponds to the LMM; “2" corre- sponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. . . 81 Figure 3.6: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model using t-distribution for error (left figure) and centered χ2 1-distribution for error (right figure). In the horizontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. 82 Figure 4.1: Comparison of the true function and fitted function under the simulation model (4.12). The black curve is the true function defined in (4.13) and the blue dashed curve is the fitted curve obtained after fitting the neural network model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Figure 4.2: Figures on comparison of the true function and the fitted function used in simulations. The top panel shows the scenario when the true function is a single layer neural network; the middle panel shows the scenario when the true function is a sine function and the bottom panel show the scenario when the true function is a continuous function having a non-differentiable point. As we can see from all the cases, the fitted curve becomes closer to the truth as the sample size increases. . . . . . . . . . . . . . . . . . . . 137 under 200 iterations and various sample sizes. The true function f0 is a single-layer neural network with 2 hidden units as defined in (4.13). . . . . . . . . . . . . . 139 i=1 Figure 4.3: Normal Q-Q plot for n−1/2(cid:80)n Figure 4.4: Normal Q-Q plot for n−1/2(cid:80)n Figure 4.5: Normal Q-Q plot for n−1/2(cid:80)n (cid:105) (cid:104) ˆfn(xi) − f0(xi) (cid:104) ˆfn(xi) − f0(xi) (cid:105) (cid:104) ˆfn(xi) − f0(xi) (cid:105) under 200 iterations and various sample sizes. The true function f0 is a trigonometric function as defined in (4.14). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 i=1 under 200 iterations and various sample sizes. The true function f0 is a continuous function having a non-differential point as defined in (4.15). . . . . . . . . . . . . 141 i=1 xii Figure B.1: The boxplots for linear mixed models (LMM) and kernel neural network (KNN) in terms of prediction errors. The left panel shows the results when an inverse logistic function is used and the right panel shows the results when a polynomial function of order 2 is used. 
In the horizontal axis, “1" corresponds to the LMM; “2" corresponds to the KNN with product input kernel and product output kernel; “3" corresponds to the KNN with product input and polynomial output; “4" corresponds to the KNN with polynomial input and product output and “5" corresponds to the polynomial input and polynomial output. . . . . . . . . . . . . . . . 166 xiii Chapter 1 Introduction 1.1 Overview During the past decades, genome-wide association studies (GWAS) become a powerful tool for investigating the associations between human genome and diseases. There are many successful examples. For example, Scott et al. (2007)[62] identified type 2 diabetes (T2D)- associated variants in an intergenic region of chromosome 11p12, near the genes IGF2BP2 and CDKAL1 and the region of CDKN2A and CDKN2B. They also confirmed that variants near TCF7L2, SLC30A8, HHEX, FTO, PPARG and KCNJ11 are associated with T2D risk. Through case-control comparisons, the Wellcome Trust Case Control Consortium [14] identified 24 independent association signals at p < 5 × 10−7 for seven common diseases including bipolar disorder, type 1 and type 2 diabetes, etc. With the development of next- generation sequencing (NGS) technology, it is now possible to sequence the whole human genome with little cost. As pointed out in Goldstein et al. (2013)[24], GWAS primarily make use of markers that are intended to represent causal variation indirectly, whereas, NGS can directly identify the cause variants. Many novel genetic association analyses nowadays are based on sequencing data (see for example, Wu et al. (2011)[78] and Li et al. (2014)[41]) and have achieved various levels of success. As precision medicine will be one of the focus on healthcare in the future, another im- 1 portant topic in statistical genetics is risk prediction. Accurate risk prediction can enable targeted preventative treatments, including fitness regimens for patients at risk of cardio- vascular disease, or increased mammogram frequency for patients with high breast cancer risk [35]. Abraham and Inouye (2015) [1] mentioned that the main aim of risk prediction is maximization of predictive power, including the validity and robustness of model predic- tions, while a causal interpretation is not strictly necessary for a good predictive model. Therefore, a model that is capable of jointly estimating the predictive effects of all single- nucleotide variant (SNV) sets and achieves high and robust prediction accuracy is critically important for risk prediction research [72]. With the breakthrough in computational technologies, we are in the era of heading to artificial intelligence. Many learning methods based on network structure have been proposed for classification and regression. Among which, neural networks have been receiving more and more popularity due to its ability to capture nonlinear structures in the data. Due to the complex relationship between genetic variants and human diseases, statistical models that are capable for taking nonlinear relationship into account may perform better (e.g., reaching higher prediction accuracy) than current methods. As an example, Wang et al. (2016)[71] used a Deep Convolutional Neural Fields (DeepCNF) to predict protein secondary structure and achieves higher prediction accuracy than that of the best predictors’ accuracy used before. In this chapter, we will first review linear mixed models along with their popular ap- plications in genetic association studies in section 1.2. 
Then in section 1.3, we will briefly introduce some methods in statistical learning, including Gaussian graphical models, kernel methods and neural networks. 2 1.2 Linear Mixed Models for Genetic Association Studies Linear mixed models are powerful tools in statistics to model correlated continuous outcomes. The general formulation of a linear mixed model is y = Xβ + Zb + , (1.1) where y ∈ Rn is a vector of continuous response; X ∈ Rn×p is the design matrix for fixed effects and β ∈ Rp is the fixed effect coefficients; Z ∈ Rn×q is the design matrix for random effects and b ∈ Rq is the random effect and  ∈ Rn is a vector of random error. Typical assumptions on the distribution of random vectors b and  are normality assumptions: b ∼ Nq(0, σ2 b D),  ∼ Nn(0, σ2R), where D ∈ Rq×q and R ∈ Rn×n are the covariance matrices for b and  respectively. In most applications, R is chosen to be the identity matrix In. In terms of applications in genetic association studies, the fixed effect design matrix X is usually composed of clinical covariates such as age, gender, ethnicity groups etc. The random effect design matrix Z consists of genetic variants such as single-neucleotide polymorphisms (SNPs). Sequence Kernel Association Test (SKAT)[78], which is a commonly used method in genetic association analysis, is based on such model setup. Under this situation, to test whether there is an association between genetic variants and the response, it suffices to test whether the variance component with respect to the genetic variants, which is σ2 b in model 3 (1.1), is zero or not. That is, we need to test H0 : σ2 b = 0 vs H1 : σ2 b (cid:54)= 0. (1.2) In SKAT, this is accomplished based on a score-type test. One of the significant advantage of modeling genetic variants for testing using random effects rather than fixed effects is to reduce the dimensionality of the parameters to be tested. For example, to test whether a specific gene is associated with the response, if we model the genetic variants as fixed effect, multiple testing corrections need to be conducted and since a gene may contain thousands of SNPs, it may require an extremely small p-value to conclude statistical significance, which will lead to power loss. On the other hand, by regarding genetic effect as a random effect, we only need to test a few variance components. Note that we may write model (1.1) as follows: y = Xβ + a + , (1.3) where a ∈ Rn is a random vector with a ∼ Nn(0, σ2 b ZDZT ). As pointed out in Yang et al. (2011)[79], we may consider a as the total genetic effects of the individuals and assume a ∼ Nn(0, σ2 relationship matrix between individuals. Again, testing the association between genetic aA) for some positive definite matrix A and A is interpreted as the genetic variants and response is equivalent to test H0 : σ2 a = 0. We will also see later in this chapter that such a linear mixed model is equivalent to a semi-parametric model when the function in the non-parametric part belongs to a certain function space. 4 1.3 Statistical Learning Statistical learning has received more and more attention nowadays. Due to a wide range of topics in statistical learning, it is impossible to list all of them in several pages. We refer interested readers to Friedman et al. (2001)[20] for more details. Here, we will focus on Gaussian graphical models, kernel methods and neural networks. 1.3.1 Gaussian Graphical Models Graphical models are powerful tools in statistical learning. 
In short, graphical models use a graph-based representation as the basis for compactly encoding a complex distribution over a high-dimensional space. In short, a graphical model is a combination of multivariate statistics and graphical structure and it is indeed a probability distribution. The data is the nodal information in the graph with each node being treated as a random variable, while the edges represent a set of independencies that hold in the distribution. One of the benefits we can get from graphical models is that the number of parameters in the model could be reduced dramatically. Directed graphs and undirected graphs are two types of commonly used graphs. A directed graphical model based on a directed acyclic graph (DAG) is known as a Bayesian network and directed edges give causal relationships as well as the characterization of the conditional distributions among the random variables. An undirected graphical model is known as a Markov random field and the undirected edges give correlations between variables. A Gaussian graphical model is a particular case of a graphical model where the joint distribution of the random variables are Gaussian. Here we will focus on undirected Gaussian graphical models. Suppose X = [X1, . . . , Xn]T ∼ Nn(µ, Σ) and G = (X ,E) be a labeled 5 graph with X = {1, . . . , n} and E be the edge set such that there is no edge between Xi and Xj if and only if Xi ⊥ Xj|X−ij. Such property is known as the pairwise Markov property [58]. Due to the Gaussianity, such conditional independencies can be characterized by the zero elements in the precision matrix Q = Σ−1. Theorem 1.3.1 (Theorem 2.2 in [58]). Let X be normally distributed with mean µ and precision matrix Q being symmetric positive definite. Then for i (cid:54)= j, Xi ⊥ Xj|X−ij ⇔ Qij = 0. Theorem 1.3.1 implies that the nonzero pattern of Q determines G, so we can read off from Q whether Xi and Xj are conditionally independent given others. If Q is a completely dense matrix, then G is fully connected. 1.3.2 Kernel Methods A kernel arises as a similarity measure that can be though of as an inner product in a so- called feature space [60]. More specifically, let X be the input domain and F be the feature space and it connects the input domain X via a map Φ, i.e., Φ : X → F x (cid:55)→ x := Φ(x). Then a kernel K is defined to be the inner product in the feature space, that is, 6 K : X × X → C (x, x(cid:48)) (cid:55)→ K(x, x(cid:48)) :=(cid:10)Φ(x), Φ(x(cid:48))(cid:11) Therefore, a kernel function can be thought of as a generalized inner product. Since Φ can be a nonlinear function, a kernel function is able to characterized nonlinear features among the input variables. A particular kernel function that plays an important role in statistical learning is known as the reproducing kernel. Definition 1.3.1 (Reproducing Kernel [6]). Let E be a nonempty abstract set. A function K : E × E → C (s, t) (cid:55)→ K(s, t) is a reproducing kernel of a Hilbert space H if and only if (i) ∀t ∈ E, K(·, t) ∈ H; (ii) ∀t ∈ E, ∀φ ∈ H, (cid:104)φ, K(·, t)(cid:105) = φ(t) A Hilbert space H possessing a reproducing kernel K is called a Reproducing Kernel Hilbert Space (RKHS) and is denoted by HK. The second property in the previous definition is known as the reproducing property: the value of the function φ at the point t is reproduced by the inner product of φ with K(·, t). 7 From the reproducing property, we can easily see that (cid:104)K(·, t), K(·, t(cid:48))(cid:105) = K(t, t(cid:48)). 
The reproducing identity above forms the basis for the kernel trick, which is frequently used in statistical learning: "Given an algorithm which is formulated in terms of a positive definite kernel K, one can construct an alternative algorithm by replacing K by another positive definite kernel K̃." [60]

It has been shown in Liu et al. (2007) [42] that there is a close connection between kernel methods and linear mixed models. In fact, such a connection is due to the following Nonparametric Representer Theorem [59]:

Theorem 1.3.2 (Nonparametric Representer Theorem [59]). Suppose we are given a nonempty set X, a positive definite real-valued kernel K on X × X, a training sample (x1, y1), . . . , (xn, yn) ∈ X × R, a strictly monotonically increasing real-valued function g on [0, ∞), an arbitrary cost function c : (X × R²)ⁿ → R̄, and a class of functions

F = { f ∈ R^X : f(·) = Σ_{i=1}^∞ βi K(·, zi), βi ∈ R, zi ∈ X, ‖f‖_{H_K} < ∞ },

where ‖·‖_{H_K} is the norm in the reproducing kernel Hilbert space (RKHS) H_K associated with K, i.e., for any zi ∈ X, βi ∈ R (i ∈ N),

‖ Σ_{i=1}^∞ βi K(·, zi) ‖²_{H_K} = Σ_{i=1}^∞ Σ_{j=1}^∞ βi βj K(zi, zj).

Then any f ∈ F minimizing the regularized risk functional

c((x1, y1, f(x1)), . . . , (xn, yn, f(xn))) + g(‖f‖_{H_K})

admits a representation of the form

f(·) = Σ_{i=1}^n αi K(·, xi).

Here are some details from Liu et al. (2007) [42]. Consider a partial linear model

yi = xiᵀβ + h(zi) + εi,  εi ∼ i.i.d. N(0, σ²),

where xi ∈ R^q is the vector of clinical covariates for subject i, zi ∈ R^p is a vector of gene expressions in a pathway, and h lies in some RKHS H_K. The estimator for β and h obtained by minimizing the penalized quadratic loss function

J(h, β) = (1/2) Σ_{i=1}^n ( yi − xiᵀβ − h(zi) )² + (λ/2) ‖h‖²_{H_K}

is the same as the best linear unbiased predictor (BLUP) of the linear mixed effect model

y = Xβ + h + ε,  h ∼ N_n(0, τK),  ε ∼ N_n(0, σ²I_n),

where the (i, j)th element of K is K(zi, zj) and τ = λ⁻¹σ². Hence, to conduct hypothesis testing on H0 : h = 0, it is equivalent to test H0 : τ = 0. In fact, the normality assumptions here are not necessary since, as we shall see below, the equivalence comes from Henderson's mixed model equation [28], which does not depend on the normality assumptions.

1.3.3 Neural Networks

A neural network is a learning algorithm used to solve nonlinear classification and regression problems. The origins of neural networks lie in algorithms that try to mimic the function of a human brain. Figure 1.1 shows the structure of a neuron in a human brain.

Figure 1.1: Diagram of a neuron with main components (Figure from Wiki).

In a neuron, dendrites are "input wires" receiving signals from other locations, and the axon is an "output wire" which sends signals to other neurons. So basically, we have some inputs, some calculations are processed within the neuron, and the results are then sent out through the axon. Hence a neuron model (logistic unit) can be constructed as shown in Figure 1.2. Such a model is known as a perceptron. In this simplified structure, x1, . . . , xp are input units. The node represented by Σ is called a computation unit, which plays a role analogous to the body of the neuron. The arrows from the input units to the computation unit are "input wires", mimicking the dendrites of a neuron, while the arrow originating from the computation unit is an "output wire", analogous to the axon of a neuron.
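As a small illustration of the perceptron just described — only a sketch with made-up numbers, not code from the dissertation — the computation unit below forms the linear combination θᵀ(1, x), with the leading 1 playing the role of the bias unit, and passes it through the standard sigmoid activation.

```python
import numpy as np

def sigmoid(z):
    """Standard sigmoid activation sigma(z) = (1 + exp(-z))^{-1}."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, theta):
    """A single computation unit: prepend the bias unit x0 = 1, take the linear
    combination with the weights theta, then apply the sigmoid activation."""
    x_with_bias = np.concatenate(([1.0], x))
    return sigmoid(np.dot(theta, x_with_bias))

x = np.array([0.5, -1.2, 2.0])           # p = 3 input features (illustrative)
theta = np.array([0.1, 0.4, -0.3, 0.2])  # weights, the first entry being the bias weight
print(perceptron(x, theta))
```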
Figure 1.2: A perceptron that mimics the structure of a neuron in a human brain. x1, . . . , xp are input features and x0 is a bias unit or intercept. Σ is known as a computation unit, where a linear combination of the features, including the bias unit, and the parameters θ to be learned is calculated and then an activation function is applied. In this example, a standard sigmoid activation function σ(x) = (1 + e^{−x})⁻¹ is applied.

In this perceptron model, the computation unit first calculates the linear combination of the inputs (including the bias unit) x and the parameters (also known as weights) θ and then applies a sigmoid activation function σ(x) = (1 + e^{−x})⁻¹. Other activation functions typically used in neural networks are

• Hyperbolic Tangent (Tanh): a(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x});
• Rectified Linear Unit (ReLU): a(x) = x₊ = max{x, 0};
• Leaky ReLU: a(x) = max{εx, x} for some small positive number ε.

A neural network is simply a group of perceptrons combined together, as shown in Figure 1.3. This is an example of a neural network with one hidden layer, in that the computation units in the dashed rectangle are known as hidden units. Modern deep learning methods are based on deep neural networks, which are simply neural networks with multiple hidden layers. But in this dissertation, we will only focus on neural networks with one hidden layer.
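The following sketch (illustrative only; the random weights and function names are not from the dissertation) implements the activation functions listed above and the forward pass of a neural network with one hidden layer, i.e., a group of perceptrons whose outputs are linearly combined by an output unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, eps=0.01):
    return np.maximum(eps * z, z)

def one_hidden_layer_nn(x, W1, b1, w2, b2, activation=sigmoid):
    """Forward pass of a neural network with one hidden layer:
    hidden units h = a(W1 x + b1), output = w2^T h + b2."""
    h = activation(W1 @ x + b1)
    return w2 @ h + b2

rng = np.random.default_rng(1)
p, r = 4, 3                                   # p input features, r hidden units
x = rng.normal(size=p)
W1, b1 = rng.normal(size=(r, p)), rng.normal(size=r)
w2, b2 = rng.normal(size=r), 0.0
for a in (sigmoid, tanh, relu, leaky_relu):
    print(a.__name__, one_hidden_layer_nn(x, W1, b1, w2, b2, activation=a))
```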
By adding another hierarchy to a linear mixed model, we can show that it is equivalent to a nonlinear mixed effect model so that it may have even better prediction performance in genetic risk prediction than classical linear mixed model. This is the basic idea and objective of our proposed kernel neural network. Most researches on neural networks focus on improving the prediction accuracy of neural networks on testing data set. From a statistical point of view, a neural network is simply a nonlinear regression problem. Therefore, it is worthwhile to establish the asymptotic properties, including the consistency and asymptotic normality of neural network estimators since once we have establish these properties, we may conduct statistical inference and may 13 apply to genetic association analysis later on. However, as we will see, classical results on nonlinear regression cannot be applied directly to neural networks. Instead, we used some techniques in empirical processes and nonparametric regression to establish consistency and asymptotic normality of neural network sieve estimators. The remaining dissertation is organized as follows. In Chapter 2, we focus on a conditional autoregressive model with its application to genetic association analysis on sequencing data. In Chapter 3, we focus on a kernel-based neural network and discuss its performance in genetic risk prediction. In Chapter 4, we derive the asymptotic properties for neural network sieve estimators and in Chapter 5, we discuss possible extensions of these methods and future work. 14 Chapter 2 A Conditional Autoregressive Model for Genetic Association Analysis Accounting for Genetic Heterogeneity 2.1 Introduction Substantial evidence from a wide range of diseases (e.g., breast cancer and hearing loss) indicates that complex diseases are characterized by remarkable genetic heterogeneity [46]. Evolutionary studies also suggest that individually rare mutations generated from each gen- eration create vast genetic heterogeneity in human diseases and could collectively play a substantial role in causing diseases. The recently developed whole-genome sequencing tech- nology generates a deep catalog of genetic variants, especially those rare variants, and allows researchers to comprehensively investigate their role in human diseases. Although new tech- nology holds promise for uncovering novel disease-associated variants, the massive amount of sequencing data and low frequent of rare variants bring tremendous analytical challenges to sequencing data analysis. Further challenge comes from sequencing variants, especially those rare variants, which could be highly heterogeneous: (1) the same gene may harbor many (hundreds of even thousands) different rare mutations; and (2) the same variant may 15 have heterogeneous effects in different individuals or subgroups of individuals [46]. These degree of genetic heterogeneity often have been neglected in existing statistical frameworks, adding another layer of difficulty to the discovery process. Many new statistical methods have been proposed to deal with the joint association analysis of single nucleotide variants (SNVs), including rare variants. The burden test de- veloped by Morgenthaler and Thilly (2007), Li and Leal (2008), Madsen and Browning (2009)[40, 44, 49], as a pioneer in testing the genetic association on sequencing data, col- lapses all the genetic information through a weighted sum. The burden test performs well if the effects of SNVs are in the same direction and same magnitude. 
Nevertheless, it is subject to power loss if this assumption fails. Similarity-based methods have been proposed to address this issue. One of the most popular methods is the sequence kernel association test (SKAT) [78], which is a semi-parametric method. SKAT is closely related to the sum of squared score (SSU) test [52], and can detect both uni-directional and bi-directional genetic effects. More recently, a genetic random field model (GenRF) [27] was proposed for analyzing sequencing data with continuous phenotypes, and a generalized genetic random field model (GGRF) [41] was proposed to generalize the random field model to other types of phenotypes. Most of these methods are based on the idea that individuals with similar genotypes tend to have similar phenotypes. Compared with other similarity-based methods (e.g., SKAT), GenRF and GGRF have nice asymptotic properties and can be applied to small-scale sequencing studies without small-sample adjustment.

Most of the existing methods treat the disease under investigation as one unified phenotype with a homogeneous genetic cause. When genetic heterogeneity is present, the existing methods will likely yield attenuated estimates for genetic variants with heterogeneous effects, leading to low testing power. To account for genetic heterogeneity in sequencing studies, we propose a conditional autoregressive (CAR) model. Different from the previous GGRF model, which applies the conditional autoregressive model to the phenotypes directly, we use a linear mixed model with the genetic effect considered as a random effect to account for heterogeneous genetic effects. By using a score test for variance components, the method has the advantage of computational efficiency, since we only need to obtain estimators under the null hypothesis. On the other hand, it shares a nice asymptotic feature with GenRF and GGRF, which makes it appealing for small-sample studies. Simulation studies also showed that our proposed method can have high power when rare variants play an important role or when variants have different genetic effects among different individuals or subgroups of individuals. Therefore, CAR provides a powerful alternative approach to search for disease-associated variants, especially those that are rare or have heterogeneous effects.

The remainder of this chapter is arranged as follows. In section 2.2, we propose a conditional autoregressive model for genetic association analysis of sequencing data and a score test for statistical inference. In section 2.3, we conduct simulation studies to compare the performance of our method with two existing methods (i.e., SKAT and GGRF) under different scenarios. Finally, in section 2.4, we apply our model to the whole-genome sequencing data from the Alzheimer's Disease Neuroimaging Initiative.

2.2 Methods

2.2.1 Motivation

The linear mixed model has been commonly used to assess the association of a set of SNVs with a continuous phenotype. It models the effect of each SNV as a random effect and assumes that the genetic effects of all SNVs in the set (e.g., a gene) follow an arbitrary distribution. By testing the variance component of the random effect, it evaluates the joint effects of the SNVs in the set on the phenotype of interest. One of the most popular linear mixed models for sequencing studies is the sequence kernel association test (SKAT) developed by Wu et al. (2011) [78].
It is a semi-parametric method that uses a kernel function to deal with high-dimensional genetic data and uses a score-based variance component test to assess the association. While SKAT has many advantages, such as being robust to the direction and magnitude of genetic effects, it does not consider the heterogeneous effects of genetic variants among individuals or subgroups of individuals (e.g., gender and race groups). If the disease of interest undergoes heterogeneous genetic etiological processes (i.e., the genetic causes differ among individuals), the traditional linear mixed model (e.g., SKAT), which typically assumes the genetic effects are similar across all the samples, can suffer from power loss. To account for genetic heterogeneity in association analysis, we propose a conditional autoregressive model. Similar to SKAT, the genetic effect of an individual is considered as a random effect. In fact, by construction the CAR model is based on the assumption that similar genotypes lead to similar genetic effects, and it therefore accounts for the genetic heterogeneity among individuals.

SKAT [78], genome-wide complex trait analysis (GCTA) [79] and CAR are all based on the following linear mixed model:

y = Xβ + a + ε,

where a is the total genetic random effect of the individuals, X is the design matrix containing clinical covariates such as age, gender, etc., and ε is the random error. It is natural to assume that the total genetic effect a follows a multivariate normal distribution with a ∼ N_n(0, σ_a² Σ). A natural question is how to choose the covariance matrix Σ. In SKAT, Σ = GWGᵀ, where G is the matrix of all SNVs and W = diag{w1, . . . , wp} is a diagonal matrix containing the weights of the p genetic variants. In GCTA [79], the (i, j)-th element of Σ is defined to be

Σij = (1/p) Σ_{k=1}^p (gik − 2pk)(gjk − 2pk) / ( 2pk(1 − pk) ),

where pk is the frequency of the reference allele for the kth SNV. Both methods use a direct way to define the marginal covariance of the genetic effects between two subjects.

For the CAR model, we consider an indirect way to model the covariance of the genetic effects. Specifically, suppose we use a graph to represent the connections of the genetic effects among the individuals, as shown in Figure 2.1, where a circle represents a person's genetic effect and an edge represents a link between the genetic effects of two subjects. Due to the Gaussian assumption on the random effect a, the edge between ai and aj is absent if and only if ai ⊥ aj | a−(i,j), or equivalently p(ai | a−i) = p(ai | a−(i,j)). So it is worth investigating the conditional distribution of ai | a−i. A reasonable conditional model can be assumed as follows:

ai | aj, j ≠ i ∼ N( Σ_{j≠i} bij aj, τi² ).

To find the joint distribution of a, we first quote Brook's Lemma.

Lemma 2.2.1 ([10]). Let π(x) be the density for x ∈ Rⁿ and define Ω = {x ∈ Rⁿ : π(x) > 0}. Let x, x′ ∈ Ω. Then

π(x)/π(x′) = Π_{i=1}^n π(xi | x1, . . . , x_{i−1}, x′_{i+1}, . . . , x′_n) / π(x′_i | x1, . . . , x_{i−1}, x′_{i+1}, . . . , x′_n)
           = Π_{i=1}^n π(xi | x′_1, . . . , x′_{i−1}, x_{i+1}, . . . , x_n) / π(x′_i | x′_1, . . . , x′_{i−1}, x_{i+1}, . . . , x_n).

Figure 2.1: An illustration of a Gaussian graphical model. For the CAR model, the graph represents a network of the genetic effects among individuals.
In the graph, each node stands for the genetic effect of an individual, and the existence of an edge between two individuals indicates dependence between the genetic effects of these two individuals. This is characterized by the precision matrix in the joint normal distribution of these genetic effects.

Now we are able to transfer a conditional distribution to a joint distribution, which is established in Proposition 2.2.1; its proof can be found in Appendix A.

Proposition 2.2.1. Let a ∈ Rⁿ be a random vector with

ai | aj, j ≠ i ∼ N( Σ_{j≠i} bij aj, τi² ).

Then the joint density function of a is given by

π(a) ∝ exp{ −(1/2) aᵀ Δ⁻¹(I − B) a },

where B = [bij] is a symmetric matrix with bii = 0 and Δ = diag{τ1², . . . , τn²}. This shows that a ∼ N_n(0, (I − B)⁻¹Δ).

Proposition 2.2.1 shows that a ∼ N_n(0, (I − B)⁻¹Δ). The first thing we need to ensure is that Δ⁻¹(I − B) is symmetric; a simple sufficient condition is that Δ⁻¹B is symmetric, which becomes

bij / τi² = bji / τj²  for all i, j.

Given a similarity matrix S, this can be accomplished by setting bij = sij / Σ_{j≠i} sij and τi² = σ_a² / Σ_{j≠i} sij, and the CAR model becomes

ai | aj, j ≠ i ∼ N( (1 / Σ_{j≠i} sij) Σ_{j≠i} sij aj,  σ_a² / Σ_{j≠i} sij ).  (2.1)

In such a case, the joint distribution of a is

π(a) ∝ exp{ −(1/(2σ_a²)) aᵀ(D − S)a },

where D = diag{ Σ_{j≠1} s1j, . . . , Σ_{j≠n} snj }. However, one issue here is that D − S is singular, since (D − S)1_n = 0. Such impropriety can be remedied by redefining the precision matrix of a as D − γS, where γ is chosen to make D − γS nonsingular. In fact, as will be shown in Proposition 2.2.2, the nonsingularity of D − γS is guaranteed by choosing |γ| < 1. The proof of Proposition 2.2.2 can be found in Appendix A.

Proposition 2.2.2. Let D − γS be as defined in the main text. Then it is nonsingular if |γ| < 1.

In such a case, (2.1) becomes

ai | aj, j ≠ i ∼ N( (γ / Σ_{j≠i} sij) Σ_{j≠i} sij aj,  σ_a² / Σ_{j≠i} sij ),  (2.2)

which will be used in the model defined in section 2.2.2.

The main difference between the CAR model and the other methods is that the precision matrix of the random effect a, which is the inverse of the covariance matrix, is specified first. It is well known that the (i, j)-th element of the precision matrix is in fact the partial covariance between ai and aj given all the others. The generalized genetic random field (GGRF) model proposed by Li et al. (2014) [41] has a similar structure. A GGRF model is based on the idea that similar genotypes lead to similar phenotypes, while a CAR model can be interpreted as saying that similar genotypes lead to similar genetic effects, and the variation among phenotypes is related to the variation among the genetic effects of individuals. In section 2.2.2, a CAR model will be introduced to account for the genetic heterogeneity. In section 2.2.3, a linear score test for testing the variance component will be derived.

2.2.2 Model Setup

Suppose that there are n subjects. Let yi be a quantitative trait for the ith subject. To relate the phenotype with genetic variants, we consider the following linear mixed model, also known as the conditional autoregressive (CAR) model:

yi = xiᵀβ + ai + εi,  i = 1, . . . , n,
ai | aj, j ≠ i ∼ N( (γ / Σ_{j≠i} sij) Σ_{j≠i} sij aj,  σ_a² / Σ_{j≠i} sij ),
ε1, . . . , εn ∼ i.i.d. N(0, σ²),  (2.3)
where xi = [xi1, . . . , xip]ᵀ is the vector of covariates for the ith subject; β is the fixed effect of the covariates; ai is the genetic random effect and εi is the random error of the ith subject; γ measures the overall genetic correlation among all subjects, with a larger value of γ implying stronger overall genetic correlation among all samples; sij is the genetic similarity between the ith and the jth subjects, which is measured on a set of SNVs. sij does not need to be estimated; it is calculated based on some given similarity metric (e.g., the IBS kernel). σ_a² measures the variation of the genetic effects.

As we have seen in section 2.2.1, the joint distribution of a = [a1, . . . , an]ᵀ can be written as

a ∼ N_n(0, σ_a²(D − γS)⁻¹),

where N_n denotes an n-dimensional multivariate normal distribution, S is the genetic similarity matrix with zero diagonal and off-diagonal entries sij, and D is a diagonal matrix with diagonal elements being the row sums of S, i.e., D = diag{ Σ_{j≠1} s1j, Σ_{j≠2} s2j, . . . , Σ_{j≠n} snj }. Therefore, model (2.3) can be written in the following matrix form:

y = Xβ + a + ε,
a ∼ N_n(0, σ_a²(D − γS)⁻¹),
ε ∼ N_n(0, σ²I_n),  (2.4)

where y = [y1, . . . , yn]ᵀ is the phenotype and X = [x1, . . . , xn]ᵀ is the covariate matrix. Since all the genetic information is contained in the total genetic random effect a, in order to test the association of a SNV set with the response, it is sufficient to test H0 : σ_a² = 0.

Let Gi = [gi1, . . . , giK]ᵀ be the genotypes of K SNVs in a genetic region (e.g., a gene) for the ith subject, where gij, j = 1, . . . , K, is coded additively, i.e., taking values in {0, 1, 2}. Given the sets of SNVs for subjects i and j, a commonly used kernel function measuring genetic similarity is the weighted identity-by-state (IBS) function [78], which is defined by

s(Gi, Gj) = Σ_{k=1}^K ωk { 2 − |gik − gjk| },

where ωk is the prespecified weight for the kth variant, which is usually a function of the minor allele frequency (MAF). In our model, we consider the scaled version of the weighted IBS similarity defined by

sij = Σ_{k=1}^K ωk { 2 − |gik − gjk| } / ( 2 Σ_{k=1}^K ωk ).

For the weights ωk, we considered four different weight functions based on the minor allele frequencies (MAF), as considered in Li et al. (2014) [41]. Among these four weight functions, the unweighted scheme (UW) assigns the same weight to both common and rare variants; the weighted sum statistic type of weight (WSS), on the other hand, puts almost all the weight on rare variants and nearly no weight on common variants. The Beta distribution type of weights (BETA) and the logarithm of MAFs (LOG) lie between UW and WSS, with LOG putting more weight on common variants than BETA. A small computational sketch of the scaled IBS similarity and the corresponding precision matrix D − γS is given at the end of this subsection.

1. Unweighted (UW): ωk = 1, 1 ≤ k ≤ K.

2. Beta distribution type of weights (BETA): ωk = dbeta(MAFk, 1, 25)², 1 ≤ k ≤ K, i.e., the weight is the square of the probability density of the Beta distribution with parameters 1 and 25.

3. Weighted sum statistic type of weights (WSS): ωk = 1 / ( MAFk(1 − MAFk) ), 1 ≤ k ≤ K.

4. Logarithm of MAFs as weights (LOG): ωk = − log10(MAFk), 1 ≤ k ≤ K.
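Here is the computational sketch referred to above: it constructs the scaled weighted IBS similarity matrix S, the diagonal matrix D of its row sums, and the matrix D − γS that defines the precision of the genetic random effect, using BETA-type weights. The genotypes, sample sizes and the value of γ are illustrative assumptions, not settings taken from the chapter.

```python
import numpy as np
import scipy.stats as st

def scaled_ibs_similarity(G, weights):
    """Scaled weighted IBS similarity: s_ij = sum_k w_k (2 - |g_ik - g_jk|) / (2 sum_k w_k),
    with the diagonal set to zero as in the CAR specification."""
    n = G.shape[0]
    denom = 2.0 * weights.sum()
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = np.sum(weights * (2.0 - np.abs(G[i] - G[j]))) / denom
    return S

def car_precision(S, gamma):
    """D - gamma * S, which equals the precision matrix of the genetic random
    effect a up to the factor 1 / sigma_a^2."""
    D = np.diag(S.sum(axis=1))
    return D - gamma * S

rng = np.random.default_rng(3)
n, K = 50, 30
maf = rng.uniform(0.01, 0.5, size=K)
G = rng.binomial(2, maf, size=(n, K)).astype(float)   # additive coding 0/1/2
w_beta = st.beta.pdf(maf, 1, 25) ** 2                 # BETA-type weights
S = scaled_ibs_similarity(G, w_beta)
Q = car_precision(S, gamma=0.5)
print(np.all(np.linalg.eigvalsh(Q) > 0))   # nonsingular (positive definite) when |gamma| < 1
```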
2.2.3 Score Test for Variance Component

In order to perform the hypothesis testing, we also need to consider the nuisance parameter γ. In the following discussion, we consider two cases. In the first case, we treat γ as a fixed constant and use the linear score test proposed by Qu et al. (2013) [54]. In the second case, γ is treated as an unknown nuisance parameter. Based on the idea of Davies (1987) [17], a maximum statistic is proposed and the corresponding p-value is obtained via a simulation-based method.

To evaluate whether there is a genetic association between a SNV set and the trait of interest, we test H0 : σ_a² = 0. By introducing the ratio between the two variance components, λ = σ_a²/σ², this is equivalent to testing H0 : λ = 0 vs H1 : λ > 0.

2.2.3.1 γ As a Fixed Constant

When γ is fixed as a constant (e.g., the overall mean of SNV correlations), a linear score test procedure [54] can be used to test H0 : λ = 0 vs H1 : λ > 0. Based on model (2.4), we have the marginal distribution of y:

y ∼ N_n(Xβ, Ṽ),

where Ṽ = σ_a²(D − γS)⁻¹ + σ²I_n = σ²[ I_n + λ(D − γS)⁻¹ ] := σ²V(λ). Let θ = [λ, σ², βᵀ]ᵀ be the vector of unknown parameters, where σ² and β are nuisance parameters. The nuisance parameter β does not need to be estimated when we use the restricted log-likelihood method. We can find a q × n matrix K such that KX = 0 and KKᵀ = I_q, where q = n − rank(X). Such a matrix can be found by using the QR decomposition of X [47]. Since y ∼ N_n(Xβ, σ²V(λ)), we obtain

y* := Ky ∼ N_q(0, σ²KV(λ)Kᵀ),  (2.5)

where y* is known as the error contrast, since E[y*] = 0 does not depend on X. The restricted log-likelihood function, which provides an unbiased estimator of σ², can be formed as

ℓ(λ, σ² | y*) ∝ −(q/2) log σ² − (1/2) log|KV(λ)Kᵀ| − (1/(2σ²)) y*ᵀ(KV(λ)Kᵀ)⁻¹y*.  (2.6)

Moreover, by considering the profiled version of the restricted log-likelihood function in (2.6), we can get the profiled REML estimator of σ² for a given λ:

σ̃²(λ) = argmax_{σ²} ℓ(σ² | y*, λ) = (1/q) y*ᵀ(KV(λ)Kᵀ)⁻¹y*.

Back-substituting σ̃²(λ) into the restricted log-likelihood function (2.6), we get the profiled restricted log-likelihood function

ℓ_p(λ | y*) ∝ −(q/2) log[ y*ᵀ(KV(λ)Kᵀ)⁻¹y* ] − (1/2) log|KV(λ)Kᵀ|.  (2.7)

As pointed out in Stern and Welsh (2000) [66], the profiled restricted log-likelihood function is score and information unbiased, and hence it can be used for inference. The score statistic S(λ) can be calculated as follows:

S(λ) = ∂ℓ_p/∂λ
     = (q/2) · y*ᵀ(KV(λ)Kᵀ)⁻¹K (∂V(λ)/∂λ) Kᵀ(KV(λ)Kᵀ)⁻¹y* / [ y*ᵀ(KV(λ)Kᵀ)⁻¹y* ] − (1/2) tr[ (KV(λ)Kᵀ)⁻¹K (∂V(λ)/∂λ) Kᵀ ]
     = (q/2) · y*ᵀ(KV(λ)Kᵀ)⁻¹K(D − γS)⁻¹Kᵀ(KV(λ)Kᵀ)⁻¹y* / [ y*ᵀ(KV(λ)Kᵀ)⁻¹y* ] − (1/2) tr[ (KV(λ)Kᵀ)⁻¹K(D − γS)⁻¹Kᵀ ].

Under the null hypothesis H0 : λ = 0, V(λ) = V(0) = I_n, so that the score statistic under H0 can be expressed as

S(0) = (q/2) · y*ᵀK(D − γS)⁻¹Kᵀy* / (y*ᵀy*) − (1/2) tr[ (D − γS)⁻¹ ].

To test H0 : λ = 0 vs H1 : λ > 0, it is equivalent to test H0 : S(0) = 0 vs H1 : S(0) ≠ 0. The exact way to calculate the p-value of the test is given in Qu et al. (2013) [54]:

P(S(0) > s) = P( Σ_{j=1}^q λj Zj² > 0 ),

where s is the observed value of S(0) and λ1, . . . , λq are the eigenvalues of the matrix

B = K(D − γS)⁻¹Kᵀ − [ 2s/q + (1/q) tr((D − γS)⁻¹) ] I_q.

Z1², . . . , Zq² are independent chi-square-distributed random variables, and the p-value can be calculated by using the Davies method [16].
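A compact sketch of the fixed-γ score test described above follows. It computes the error contrast matrix K from a QR decomposition, the score statistic S(0), and the eigenvalues of B; for simplicity the tail probability of the chi-square mixture is approximated by Monte Carlo rather than by the exact Davies method used in the chapter, and the unweighted scaled IBS similarity and all simulated inputs are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def error_contrast(X):
    """A q x n matrix K with K X = 0 and K K^T = I_q, q = n - rank(X),
    obtained from the complete QR decomposition of X."""
    Q, _ = np.linalg.qr(X, mode="complete")
    r = np.linalg.matrix_rank(X)
    return Q[:, r:].T

def car_score_test_fixed_gamma(y, X, S, gamma, n_mc=20_000, seed=0):
    """Score statistic S(0) for H0: lambda = 0 with gamma held fixed, plus a
    Monte Carlo approximation of the mixture-of-chi-squares p-value."""
    K = error_contrast(X)                  # error contrast (q x n); unrelated to the kernel K(., .)
    q = K.shape[0]
    D = np.diag(S.sum(axis=1))
    M = np.linalg.inv(D - gamma * S)       # (D - gamma S)^{-1}
    ystar = K @ y
    s_obs = 0.5 * q * (ystar @ K @ M @ K.T @ ystar) / (ystar @ ystar) - 0.5 * np.trace(M)
    B = K @ M @ K.T - (2.0 * s_obs / q + np.trace(M) / q) * np.eye(q)
    lam = np.linalg.eigvalsh(B)            # eigenvalues defining the chi-square mixture
    chi2 = np.random.default_rng(seed).chisquare(1.0, size=(n_mc, q))
    return s_obs, float(np.mean(chi2 @ lam > 0.0))

# Toy data under the null: an intercept plus one covariate, and a pure-noise phenotype.
rng = np.random.default_rng(5)
n, p_snv = 100, 50
G = rng.binomial(2, rng.uniform(0.01, 0.5, size=p_snv), size=(n, p_snv)).astype(float)
S = (2.0 * p_snv - cdist(G, G, metric="cityblock")) / (2.0 * p_snv)   # unweighted scaled IBS
np.fill_diagonal(S, 0.0)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)
print(car_score_test_fixed_gamma(y, X, S, gamma=0.5))
```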
Moreover, γ only appears in the alternative model, therefore it is not even possible to evaluate the MLE of γ under H0. Instead, we follow the idea of Davies (1987) [17] and construct a maximum score statistic for the association test. More specifically, following the notations we used in section 2.2.3.1, we have Sγ(0) = y∗T q 2 K(D − γS)−1KT y∗ y∗T y∗ − 1 2 tr (D − γS)−1(cid:105) (cid:104) , where y∗ ∼ Nq(0, σ2KV (λ)KT ) with V (λ) = In + λ(D − γS)−1. Now, note that under σ y∗ ∼ Nq(0, Iq). We the null hypothesis H0 : λ = 0, we have y∗ ∼ Nq(0, σ2Iq) so that 1 28 form the maximum score test statistic as follow: T = sup γ∈Γ Sγ(0) qy∗T K(D − γS)−1KT y∗ − y∗T tr(cid:2)(D − γS)−1(cid:3) Iqy∗ σ y∗(cid:17) sup y∗(cid:19)T y∗(cid:19) 2y∗T y∗ (cid:18) 1 (cid:18) 1 γ∈Γ Σ(γ) σ σ = sup γ∈Γ (cid:16) 1 σ y∗(cid:17)T(cid:16) 1 1 = 2 := (2(cid:107)Z(cid:107)2 2)−1 sup γ∈Γ ZT Σ(γ)Z, (cid:16) where Z = 1 σ y∗ ∼ Nq(0, Iq), Σ(γ) = qK(D − γS)−1KT − tr(cid:2)(D − γS)−1(cid:3) Iq. γ ∈ (cid:17) (cid:17)∩(−1, 1). 1/µ(1), 1/µ(n) ensuring that the matrix D − γS is nonsingular (see [5]). Since the parameter γ can be , where µ(1) < µ(2) < ··· < µ(n) are the ordered eigenvalues of D interpreted as a measurement of correlation coefficient, we set Γ = 2 SD (cid:16) − 1 2 − 1 1/µ(1), 1/µ(n) To calculate the p-value, let t be the observed value of T , then (cid:32) (cid:32) (cid:32) ZT Σ(γ)Z > t (cid:33) 2)−1 sup (cid:33) ZT(cid:0)Σ(γ) − 2tIq (cid:1) Z > 0 γ∈Γ (cid:33) q(cid:88) λj(γ)Z2 j > 0 , P(T > t) = P = P = P (2(cid:107)Z(cid:107)2 sup γ∈Γ sup γ∈Γ i=1 where λ1(γ), . . . , λq(γ) are the eigenvalues of (Σ(γ) − 2tIq) = K(D − γS)−1KT − 1 q (cid:18)2t q + tr 1 q (D − γS)−1(cid:105)(cid:19) (cid:104) Iq 29 and Z2 1 , . . . , Z2 q are independent chi-square-distributed random variables with degrees of freedom 1. We use the following procedures to approximate this probability: 1. Partition the index set Γ as γ1 < γ2 < ··· < γM . Γ is an open interval, denoted by (L, U ). When we implement this step, we choose a small  > 0 and let γ1 = L +  and γM = U − ; 2. Generate an N × q matrix Υ with each element being a χ2 1 random variable; 3. For each γi, i = 1, . . . , M, Calculate λ1(γi), . . . , λq(γi), which are the eigenvalues of Σ(γi) − 2tIq; j=1 λj(γi)Υkj, k = 1, . . . , N, where Υkj is the (k, j)-th element of Calculate (cid:80)q Υ; P sup γ∈Γ q(cid:88) j=1 4. Approximate the p-value as (cid:110)  ≈ # λj(γ)Z2 j > 0 k : maxγi,i=1,...,M N j=1 λj(γi)Υkj > 0 (cid:80)q (cid:111) . 2.3 Simulation Results Simulation studies were conducted to compare the CAR model with the original GGRF model and the commonly used SKAT method. For the CAR model, we evaluated both scenarios where γ is fixed (denoted by CAR.FIX) and γ is treated as an unknown nuisance parameter (denoted by CAR.SUP). In the simulation, we compared the empirical type I error rates and empirical power of the three methods under various disease models with heterogeneous genetic effects. In addition, we compared the empirical power of the three 30 methods under different percentage of causal SNVs and under the situation when the weights were misspecified. In order to mimic the real structure of sequencing data, the genetic data used for simulation was simulated based on the real sequencing data of Chromosome 17: 7344328 - 8344327 from the 1,000 Genome project [13]. The minor allele frequencies (MAF) of SNVs in this region range from 0.046% to 49.954% with a distribution highly skewed to the right. Figure 2.2 summarizes the distribution of the MAF with MAF<0.05. 
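The simulation-based p-value approximation for the maximum score statistic described in Section 2.2.3.2 can be sketched as follows. This is illustrative code under our own naming and defaults (the grid size, the number of Monte Carlo draws, and the small offset ε are assumptions), not the code used for the reported results; it assumes a full-column-rank covariate matrix X and precomputed D and S.

```python
import numpy as np

def error_contrast_matrix(X):
    """q x n matrix Kc with Kc @ X = 0 and Kc @ Kc.T = I_q, from the full QR of X."""
    n, r = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")    # n x n orthogonal
    return Q[:, r:].T                          # columns orthogonal to the column space of X

def score_stat(ystar, Kc, D, S, gamma):
    """S_gamma(0) as in Sections 2.2.3.1-2.2.3.2 (scale-invariant in y*)."""
    q = ystar.shape[0]
    M = np.linalg.inv(D - gamma * S)           # (D - gamma S)^{-1}
    quad = ystar @ Kc @ M @ Kc.T @ ystar
    return 0.5 * q * quad / (ystar @ ystar) - 0.5 * np.trace(M)

def max_score_pvalue(y, X, D, S, n_grid=50, n_mc=10_000, eps=1e-3, rng=None):
    """Monte Carlo approximation of P(T > t) for T = sup_gamma S_gamma(0)."""
    rng = np.random.default_rng(rng)
    Kc = error_contrast_matrix(X)
    q = Kc.shape[0]
    ystar = Kc @ y
    # admissible range (1/mu_(1), 1/mu_(n)) of D^{-1/2} S D^{-1/2}, intersected with (-1, 1)
    d_isqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    mu = np.linalg.eigvalsh(d_isqrt @ S @ d_isqrt)
    lo = max(1.0 / mu[0], -1.0) + eps
    hi = min(1.0 / mu[-1], 1.0) - eps
    grid = np.linspace(lo, hi, n_grid)
    t = max(score_stat(ystar, Kc, D, S, g) for g in grid)      # observed sup statistic
    Upsilon = rng.chisquare(1, size=(n_mc, q))                 # chi^2_1 draws
    exceed = np.zeros(n_mc, dtype=bool)
    for g in grid:
        M = np.linalg.inv(D - g * S)
        Sigma = q * Kc @ M @ Kc.T - np.trace(M) * np.eye(q)
        lam = np.linalg.eigvalsh(Sigma - 2.0 * t * np.eye(q))  # eigenvalues of Sigma(gamma) - 2t I_q
        exceed |= (Upsilon @ lam) > 0                          # sup over the grid exceeds 0
    return exceed.mean()
```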
For each setting, we simulated 1,000 Monte Carlo replicates to calculate the empirical type I error rates and the empirical power of the three methods. In each replicate, we randomly selected a 30Kb segment from the region and used all the genetic variants in that segment for the association analysis. We first examined the type I error performances of the three methods. In the simulation, we considered two significance levels, α = 0.05 and α = 0.01, under the following null model, yi = εi ∼ N (0, 1). Table 2.1 summarizes the empirical type I error rates of the three methods under four different weights (i.e., UW, BETA, WSS and LOG). As we observe from the results, the empirical type I error rates of GGRF, CAR.FIX and CAR.SUP are well controlled at the level of 0.05 or 0.01, while the type I error rate for SKAT is conservative under WSS (for α = 0.05) and LOG (for α = 0.01). Because the score test of CAR and the test procedure of GGRF are exact test procedures without any asymptotic approximation, the type I error rates of the two methods are well controlled. 31 Figure 2.2: The Distribution of minor allele frequency of in sequencing variants on chromo- some 17 from the 1,000 Genome project. 32 Minor Allele FrequencyFrequency0.000.010.020.030.040.050200040006000 Table 2.1: Empirical type I error rates under different weights at level α = 0.05 and α = 0.01 based on 1000 replicates. Each cell in the table contains the empirical type I error rate. GGRF, SKAT, CAR.FIX and CAR.SUP are the generalized genetic random field model of Li et al. (2014) [41], the sequence kernel association test of Wu et al. (2011) [78], the conditional autoregressive model with fixed nuisance parameter γ, the conditional autoregressive model with maximum score test statistic, respectively α = 0.05 Methods GGRF SKAT UW BETA WSS LOG 0.036 0.047 0.041 0.042 0.052 CAR.FIX 0.051 CAR.SUP 0.041 0.048 0.054 0.038 0.057 0.055 0.041 0.040 0.052 0.053 α = 0.01 UW BETA WSS LOG 0.009 0.006 0.006 0.004 0.012 0.010 0.010 0.012 0.012 0.009 0.010 0.011 0.015 0.007 0.010 0.009 2.3.1 Simulation I: Heterogeneous Genetic Effects Among Individ- uals or Subgroups We first considered a complex disease scenario in which each individual has a different genetic effect, which we refer as the heterogeneous genetic effect. The following model was used to simulate the phenotype: K(cid:88) k=1 yi = w∗ kgi,kZi,k + εi, 1 ≤ i ≤ n, (2.8) where K is the number of genetic variants in a 30 Kb segment; gi,k is the genotype of the kth SNV for individual i, coded as additive (i.e., gi,k = 0 for genotype AA, gi,k = 1 for genotype Aa and gi,k = 2 for genotype aa). We set the percentage of causal SNVs as 50%. w∗ k = 0 if the kth genetic variant is not a causal variant and w∗ k = ωk if the kth genetic variant is a causal variant, where ωk is the weight defined in section 2. Zi,1, . . . , Zi,K , 1 ≤ i ≤ n are genetic effects for the ith individual, which follow a normal distribution with mean 0 and standard deviation σZ. ε1, . . . , εn are iid random variables distributed as N (0, 1). Note that 33 in (2.8), the coefficients of the genetic variants are unique for each individual, and therefore different individuals could have different genetic effects. Figure 2.3 summarizes the empirical power of three methods under different degrees of ge- netic heterogeneity. As we observe from the simulation results, the CAR model outperforms the other two methods with the increasing level of genetic heterogeneity (i.e., increasing σZ). 
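For reference, the heterogeneous-effect generating model (2.8) used in this simulation can be sketched in a few lines; this is illustrative code under our own naming, assuming a genotype matrix coded 0/1/2 and a weight vector constructed as in Section 2.2.2.

```python
import numpy as np

def simulate_heterogeneous_phenotype(G, weights, sigma_z, causal_rate=0.5, rng=None):
    """Simulate y_i = sum_k w*_k g_ik Z_ik + eps_i as in model (2.8) (illustrative)."""
    rng = np.random.default_rng(rng)
    n, K = G.shape
    weights = np.asarray(weights, dtype=float)
    causal = rng.choice(K, size=int(round(causal_rate * K)), replace=False)
    w_star = np.zeros(K)
    w_star[causal] = weights[causal]              # w*_k = w_k for causal SNVs, 0 otherwise
    Z = rng.normal(0.0, sigma_z, size=(n, K))     # individual-specific effects Z_{i,k}
    eps = rng.normal(0.0, 1.0, size=n)
    return (G * Z) @ w_star + eps
```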
For instance, when common variants (i.e., UW and LOG) have more contributions to the phenotype, our model has substantial power increase with the increased level of genetic heterogeneity (i.e., σZ), while the power of SKAT and GGRF remain low even with the increased genetic effect. When rare variants play an important role (i.e., BETA and WSS), GGRF and SKAT have some power increase with the increased genetic effect, but still have lower power than CAR. As also shown in Figure 2.3, CAR.FIX and CAR.SUP have very similar performance. Therefore, for simplicity, the further analysis is performed based on CAR.FIX, in which we fix the nuisance parameter γ as the overall mean of the correlation matrix of SNVs. Next, we considered a scenario where genetic heterogeneity exist among subgroups (e.g., ethnic groups). Under such case, individuals within a group had homogeneous genetic effects, but the genetic effects among groups were different. In this simulation, we partitioned the total number of individuals of 500 into 8 groups, with the number of individuals in each groups being 200, 100, 50, 25, 40, 35, 45 and 5, respectively. Within each group, we used the following model to simulate the phenotypes, so that the effect sizes were different for different groups. p(cid:88) k=1 yi = w∗ kgi,kZk + εi, 1 ≤ i ≤ n. (2.9) The only difference in model (2.9) from model (2.8) in the main text is that in model (2.9), 34 Figure 2.3: Empirical Power Comparison of CAR, SKAT and GGRF by varying the levels of genetic heterogeneity among individuals under four different weights. The x-axis is the effect size σZ, used in the simulation. The effect sizes are chosen as 0.01, 0.05, 0.075, 0.1, 0.3, 0.5 for the UW and LOG weights, and the effect sizes are set as 0.0005, 0.00075, 0.001, 0.0015, 0.002 for the BETA and WSS weights. 35 BETAWSSUWLOG0.0e+005.0e-041.0e-031.5e-032.0e-030.0e+005.0e-041.0e-031.5e-032.0e-030e+001e-012e-013e-014e-015e-010e+001e-012e-013e-014e-015e-010.000.250.500.751.000.000.250.500.751.00Standard DeviationEmpirical PowerMethodGGRFSKATCAR.FIXCAR.SUP Zk does not depend on the individual i and we also assume that Z1, . . . , Zn ∼ N (0, σ2 Z ) so that all the individual have the same genetic effect. We set the effect sizes σZ as 0.0001, 0.0005, 0.002, 0.001, 0.003, 0.0001, 0.0005 and 0.002 for the eight groups for the BETA and WSS weights, and 0.01, 0.05, 0.2, 0.1, 0.3, 0.01, 0.05 and 0.2 for the UW and LOG weights. Table 2.2 summarizes the results of the simulation. Consistent with previous findings, CAR has higher power than the other two methods for all four scenarios. SKAT has good power under the WSS weight, but has reduced power under the other weights. While all three methods have low power when common variants play an important role (i.e., UW and LOG), CAR still performs better than SKAT and GGRF. Table 2.2: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates. In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0005, 0.002, 0.001, 0.003, 0.0001, 0.0005, 0.002 for BETA and WSS; with σZ = 0.01, 0.05, 0.2, 0.1, 0.3, 0.01, 0.05, 0.2 for UW and LOG. The number in the parenthesis is the standard deviation. 
Methods    GGRF                SKAT                CAR
UW         0.298 (1.505E-02)   0.328 (1.497E-02)   0.378 (1.548E-02)
BETA       0.180 (1.243E-02)   0.454 (1.622E-02)   0.817 (1.248E-02)
WSS        0.189 (1.226E-02)   0.805 (1.265E-02)   0.952 (0.670E-02)
LOG        0.148 (1.140E-02)   0.210 (1.287E-02)   0.445 (1.611E-02)

Since rare mutations occur in a small number of individuals and could have larger effects than common variants, we also modified the effect sizes of the eight groups so that groups with a small number of individuals tended to have higher effect sizes. Specifically, we set σZ to 0.0001, 0.0001, 0.0002, 0.005, 0.0003, 0.0005, 0.0001 and 0.005 for the eight groups under the BETA and WSS weights, and to 0.01, 0.01, 0.02, 0.5, 0.03, 0.05, 0.01 and 0.5 under the UW and LOG weights. Under this setting, the groups of sizes 25 and 5 had the highest effect sizes. From Table 2.3, we find that all three methods lose some power compared with the previous case, but the overall conclusion is the same. CAR still outperforms the other two methods under all four scenarios. SKAT performs well when WSS is used, while GGRF has a good performance under the UW weight.

Table 2.3: Empirical power comparison of GGRF [41], SKAT [78] and CAR based on 1,000 Monte Carlo replicates. In the simulation, we simulated 8 different subgroups with σZ = 0.0001, 0.0001, 0.0002, 0.005, 0.0003, 0.0005, 0.0001, 0.005 for BETA and WSS, and with σZ = 0.01, 0.01, 0.02, 0.5, 0.03, 0.05, 0.01, 0.5 for UW and LOG. The number in parentheses is the standard deviation.

Methods    GGRF                SKAT                CAR
UW         0.201 (1.263E-02)   0.217 (1.288E-02)   0.382 (1.576E-02)
BETA       0.149 (1.093E-02)   0.374 (1.534E-02)   0.672 (1.472E-02)
WSS        0.154 (1.161E-02)   0.618 (1.545E-02)   0.769 (1.342E-02)
LOG        0.115 (1.024E-02)   0.149 (1.096E-02)   0.428 (1.544E-02)

2.3.2 Simulation II: Various Causal SNV Rates

Since the percentage of causal SNVs in a genetic region (e.g., a gene) also has an impact on the power of the association test, in Simulation II we examined the performance of GGRF, SKAT and CAR by varying the percentage of causal SNVs. As in Simulation I, the phenotypes were simulated using model (2.8). In this simulation, we varied the percentage of causal SNPs and investigated the effect of 50%, 40%, 30%, 20%, 10%, 5% and 1% causal SNV rates on the power performance of the three methods.

Figures 2.4 and 2.5 illustrate the empirical power of GGRF, SKAT and CAR under different causal SNV rates and the four weights. Figure 2.4 summarizes the results under the WSS and BETA weights, i.e., the weights focusing more on rare variants. Overall, the CAR model outperforms the other two methods. As the causal SNV rate increases, the empirical power of CAR increases substantially. A similar trend can also be found for SKAT, especially when the WSS weight is used. For the BETA weight, SKAT also attains high empirical power, but not as high as that of CAR. GGRF has lower empirical power than CAR and SKAT, and its empirical power increases slowly with the causal SNV rate. We also find that under the WSS weight, both SKAT and CAR attain decent performance when the genetic causes are heterogeneous (i.e., increasing σZ) and the causal SNV rates are moderately high.

Figure 2.5 summarizes the results under the LOG and UW weights, i.e., the weights focusing more on the effects of common variants. As in the rare-variant case, the CAR model performs the best in all scenarios.
Even the empirical power of CAR under the LOG and UW weights is not high as those under the BETA and WSS weights, it can still reach high power when the causal SNV rate is high. The CAR model could also gain more substantial power than the other two methods with the increased levels of genetic heterogeneity (i.e., increased σZ). Neither SKAT nor GGRF performs as well as they do under the WSS or BETA weights, even with high causal SNV rate and high genetic effects. As we can see from Figure 2.5, if the common variants play an important role in disease phenotypes and have heterogeneous effects, both SKAT and GGRF suffer from extreme power loss, while the CAR model could still obtain moderate or high power. 2.3.3 Simulation III: Misspecification of Weights In practice, we do not know whether common variants or rare variants play a more important role in the disease process. Since the performance of GGRF, SKAT and CAR depend on the prespecified weights, it is important to investigate whether these models could still obtain reasonable power when the weights are misspecified. We used similar simulation settings as before except that we randomly selected 50 SNV as the causal variants in this particular simulation. We first focused on rare variants, and simulated phenotypes using either the WSS weight or the BETA weight. When we applied GGRF, SKAT and CAR to the simulation data, all 38 Figure 2.4: Comparison of empirical power of CAR, SKAT and GGRF with different causal SNV rates under the BETA weight and the WSS weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.0005, 0.001, 0.002, 0.005, respectively in each column from left to right. 39 5e-040.0010.0020.005BETAWSS010203040500102030405001020304050010203040500.000.250.500.751.000.000.250.500.751.00Causal SNV Rate (%)Empirical PowerMethodGGRFSKATCAR Figure 2.5: Comparison of empirical power of CAR, SKAT and GGRF with different causal SNV rates under the UW weight and the LOG weight. The standard deviation σZ used in the simulation is gradually increased. σZ = 0.05, 0.1, 0.2, 0.5 respectively in each column from left to right. 40 0.050.10.20.5UWLOG010203040500102030405001020304050010203040500.000.250.500.751.000.000.250.500.751.00Causal SNV Rate (%)Empirical PowerMethodGGRFSKATCAR the weights (UW, BETA, WSS and LOG) were used so that we were able to evaluate their performances under both misspecified and correctly specified weights. Figure 2.6 summarizes the results when the underlying weights in the simulation are WSS and BETA. From Figure 2.6, since WSS put extremely high weights to the rare variants, all the three methods attain high empirical power when the weights are specified as WSS or BETA. As expected, all the methods have the highest empirical power when the weight func- tion is correctly specified (i.e., WSS). We also find that neither SKAT nor GGRF performs well when the weight is misspecified as UW or LOG. On the other hand, as we can see from the figures, as long as there is genetic heterogeneity, CAR has a good power performance even though the weight is misspecified. Same conclusions hold for BETA. Both GGRF and CAR will obtain highest power when the BETA weight is applied, while SKAT reaches highest power when WSS is used. CAR outperforms SKAT and GGRF in all cases. With the increased genetic heterogeneity, CAR remains high power even with misspecified weights. 
To conclude, in the case when the disease is mainly caused by rare variants, both SKAT and CAR can have high power, but for SKAT, we need to use the right weights (BETA or WSS). Next, we focused on common variants and simulated phenotypes using UW and LOG. Figure 2.7 summarizes the results based on UW and LOG. In the first row, the true weight used in the simulation is UW. As expected, CAR attains high power when the specified weight functions focus on common variants (i.e., UW and LOG). With the increased heterogeneity, CAR can obtain high power with the LOG or UW weights. However, it could suffer from power loss when the weight is misspecified (i.e., WSS an BETA). Overall, CAR outperforms SKAT and GGRF, even when the weight is misspecified. The same conclusion holds for the LOG weight. Since LOG also puts some weight on 41 Figure 2.6: Comparison of empirical power of CAR, SKAT and GGRF with misspecified weights when the true weight is BETA and WSS. In each column from left to right, we used BETA, LOG, UW and WSS weight, respectively. 42 BETALOGUWWSSBETAWSS0.0010.0020.0030.0040.0050.0010.0020.0030.0040.0050.0010.0020.0030.0040.0050.0010.0020.0030.0040.0050.000.250.500.751.000.000.250.500.751.00Standard DeviationEmpirical PowerMethodCARGGRFSKAT Figure 2.7: Comparison of empirical power of CAR, SKAT and GGRF with misspecified weights when the true weight is UW and LOG. In each column from left to right, we used BETA, LOG, UW and WSS weight, respectively. 43 BETALOGUWWSSUWLOG0.10.20.30.40.50.10.20.30.40.50.10.20.30.40.50.10.20.30.40.50.000.250.500.751.000.000.250.500.751.00Standard DeviationEmpirical PowerMethodCARGGRFSKAT rare variants, we find that SKAT has a good performance as compared with its performance under the UW weight. However, SKAT attains highest power by using the WSS weight. This can be explained by the fact that WSS can help to catch the heterogeneous genetic effect. As expected, CAR attains the highest power when the true weight (i.e., LOG) is specified. When LOG is the underlying weight, we can see from Figure 2.7 that there is no significant difference of the four weights used in the CAR model. Overall, GGRF has lower power than CAR and SKAT. 2.4 Real Data Applications We applied our method to the whole genome sequencing data from Alzheimer’s Disease Neu- roimaging Initiative (ADNI) and performed a genome-wide gene based association analysis. A total of 808 samples at the screening and baseline of the ADNI1 and ADNI2 studies have the whole genome sequencing data, from which we extracted 21069 genes based on the GRCh 37 assembly. ADNI also provides pre-calculated volumes of cortical regions. We chose four of them as the phenotypes of interests, which are hippocampus, entorhinal, whole brain and ventricles. The motivation of choosing these four volumes as the phenotypes is from previous biological findings. More specifically, the hippocampus, a brain area playing an important role in learning and memory, is especially vulnerable to damage at early stages of Alzheimer’s disease (AD) [51]. The volume of hippocampus changes over time and could have a large impact on Alzheimer’s disease [61]. The entorhinal cortex is also crucial in declarative mem- ories. In mild AD patients, the loss in the entorhinal volume is evident. The entorhinal cortex is also highly correlated with the severity of the disease [36]. Similarly, the whole brain volume decreases significantly in patients with AD [67]. 
The brain ventricles also play 44 Figure 2.8: Histograms of the Phenotypes used for Real Data Analyses an important role in AD and it is well known that ventricular volume is significantly higher in AD patients [19]. Figure 2.8 plots the histograms of the four phenotypes used in the analysis. As we can see from the histograms, hippocampus, entorhinal and whole brain are nearly normally dis- tributed. For Ventricles, we first applied a normal quantile transformation to the phenotype and then applied our method. We restricted our analyses to individuals with caucasian ancestry due to the small sample 45 size of non-caucasion samples and the issue of population stratification. In the analysis, age, gender, years of education, marriage status, and APoEε4 were used as covariates. After removing individuals with missing phenotypes or covariates, a total of 588 subjects remained for the analysis. In this analysis, we also compared our method with SKAT and GGRF. The BETA weight was chosen for all the three methods, and the linear kernel was used in SKAT. Table 2.4 summarizes the top 10 genes (based on the p-values) associated with hippocam- pus found by the three methods (CAR, GGRF and SKAT). As we can see from the table, the results of GGRF and SKAT are similar and are different from that of CAR. This could be due to the fact that both methods are similar in terms of assuming genetic homogeneity, while CAR is developed based on a different model, which accounts for genetic heterogeneity. Nonetheless, all the three methods found gene RPL27 on chromosome 17 associated with hippocampus. In addition to hippocampus, Appendix A also provides results of the top 10 genes related to entorhinal, ventricle and whole brain. Given the limited sample size of the study, none of the association reached statistical significance after multiple-testing ad- justment. Nevertheless, few genes (e.g., RPL27 ) have been identified by all three methods, which might worth further investigation. In addition, some of the genes (e.g. LINC01449 ) was detected by the CAR model only, which could also be further evaluated for potential genetic heterogeneity. 46 Table 2.4: Top 10 hippocampus-associated genes detected by GGRF, SKAT and CAR. CAR CHR Gene Name LINC01449 7 17 10 17 2 18 1 14 15 1 GGRF LOC400997 P -value CHR Gene Name 4.98E-05 CDKN1C 8.09E-05 RPL27 8.79E-05 8.81E-05 1.21E-04 1.60E-04 1.71E-04 1.81E-04 SECISBP2L 2.88E-04 2.93E-04 RGS9 MAPK8 RPL27 WTH3DI PQLC1 NTNG1 PSMC1 IFI35 AJAP1 GRIN2B NLRP11 EFCAB1 11 17 2 17 1 12 19 8 1 1 AK4 MROH9 F5 P -value CHR 17 6.37E-05 17 7.87E-05 1.35E-04 2 4 1.44E-04 17 1.98E-04 12 2.83E-04 3.03E-04 1 19 4.10E-04 10 4.69E-04 5.43E-04 2 SKAT Gene Name RPL27 IFI35 LOC400997 MMRN1 RUNDC1 CCDC59 AK4 ISYNA1 C10orf107 LOC101929260 P -value 4.04E-05 6.15E-05 1.57E-04 2.03E-04 2.53E-04 3.56E-04 3.61E-04 3.99E-04 4.48E-04 5.29E-04 47 2.5 Discussion We have proposed a conditional autoregressive model for genetic association analysis of se- quencing data. Our simulations show that the CAR model can obtain high power under the scenario (i) when rare variants are related to the phenotypes and (ii) when genetic variants have different genetic effects among individuals or subgroups of individuals. Moreover, we derive the exact form of the test statistic, which makes the method computationally efficient for large-scale sequencing data analysis. 
Unlike SKAT, which relies on the asymptotic distribution of its test statistic, the CAR test is based on the exact distribution of its test statistic under the null hypothesis and is therefore not conservative. Consequently, no additional small-sample adjustment is needed for the CAR model.

In our simulations, γ is chosen as the average of the genetic correlation coefficients among all individuals. Alternatively, we could also use the average pairwise linkage disequilibrium (LD) correlation coefficient. Simulation results show no substantial difference between the two choices (results not shown); however, calculating the LD correlation coefficient is more time consuming than estimating the conventional correlation coefficient. It is also interesting to consider the model with γ = 1, i.e., assuming the genetic random effects are highly correlated. As mentioned in Rue and Held (2005) [58], when γ = 1 the genetic random effect is modeled by an intrinsic random field. In this case, the joint distribution of the genetic random effects is not a proper distribution, so a traditional frequentist approach may not work well. One potential solution is the Bayesian method proposed in Rue and Held (2005) [58].

The work introduced in this chapter focuses on continuous phenotypes with a normal distribution. Nevertheless, the model can be extended to phenotypes following distributions in the exponential family by using a generalized linear mixed model (GLMM). One potential challenge of extending the CAR model to a GLMM is that the test statistic and the likelihood function may no longer have closed forms, which requires alternative approaches (e.g., Monte Carlo methods) to evaluate both numerically. This is future work worth further study.

Chapter 3

A Kernel-Based Neural Network for High-dimensional Genetic Risk Prediction Analysis

3.1 Introduction

Neural-network-based (NN) methods, such as deep learning, have made impressive advances in areas such as image recognition and natural language processing [38]. There is increasing interest in using NN methods in genomic studies, especially in regulatory genomics, where NN methods and software have been developed to improve the capacity of modeling DNA and RNA targets of regulatory proteins [53]. Even though NN methods also hold great promise in revealing the complex relationship between human diseases and genetic variants, little research has been done in this field. One of the key challenges is the massive size of genomic data, which often include millions of genetic variants. Directly applying classic NN methods to such a large number of genetic variants requires the estimation of millions of parameters, which is computationally difficult and likely causes serious over-fitting. Instead of modeling the individual effects of millions of genetic variants, we can model the genetic effect of these variants as a random effect by adopting a linear mixed model (LMM) framework. LMMs have been widely used in genetic data analysis, from complex segregation analysis [50] to recent genome-wide complex trait analysis [79] and set-based association analysis [78]. Based on the LMM, we propose a method, referred to as the kernel neural network (KNN), for high-dimensional genetic data analysis. As we show later in this chapter, under certain conditions KNN is equivalent to a nonlinear mixed effect model, so that a linear mixed model is a special case of KNN.
In addition to its ability to handle high-dimensional data, KNN inherits properties from the neural network, which allows the method to consider nonlinear and nonadditive effects. Due to the complexity of such method, it is difficult to estimate the parameters from KNN using the conventional approach. Moreover, the pa- rameters in KNN are unidentifiable. To address such issue, instead of using the restricted maximum likelihood estimator (REML) [15], we use the minimum quadratic unbiased esti- mator (MINQUE) proposed by Rao (1970, 1971, 1972)[55, 56, 57] to estimate the “variance components". We show both theoretically and empirically that MINQUE has nice properties. The remaining chapter is arranged as follows: Section 2 provides the description of the KNN model and the MINQUE estimation procedure; Section 3 provides theoretical compari- son of prediction performance between KNN and LMM and Section 4 discusses the inclusion of fixed effects; results from simulation study and a real data application are presented in Sections 5 and 6, respectively. Before we proceeding to the main text, we first summarize notations that will be frequently used in this paper. Notations: Throughout the chapter, capital bold italic letters A, B, . . . , Γ, Θ, . . . will be used to denote matrices; small bold italic letters a, b, . . . , α, β, . . . will be used to denote vectors and other small letters will be used to denote scalars. In will be used to denote an n × n identity matrix and the symbol “(cid:45)" will be used to denote asymptotically less than. 51 3.2 Methodologies Kernel methods are widely used in recent machine learning area due to its capability of capturing nonlinear features from the data so that the prediction error can be diminished. As has been mentioned in Shawe-Taylor et al. (2004)[63], given a kernel and a training set, the kernel matrix acts as an information bottleneck, as all the information available to a kernel algorithm must be extracted from this matrix. On the other hand, linear mixed effect models are also widely used in the area of genetic risk prediction. Therefore, it seems natural to combine these two methods together. A naive way is to change the covariance matrix of the random effect in the linear mixed model to a kernel matrix as was did in [78] and [79]. As we have seen in section 1.3.2 that for a linear mixed model with a kernel matrix as the covariance matrix of the random effect is equivalent to a semiparametric regression model. Now, we move a little bit further. By extending the Representer Theorem, we will see that a nonlinear mixed effect model is equivalent to a hierarchical linear mixed model. Specifically, consider m i.i.d. random features u1, . . . , um and the following nonlinear mixed effect model: yi = xT i β + f (ui1, . . . , uim) + i, i = 1, . . . , n, (3.1) where 1, . . . , n are i.i.d. N (0, φ) random variables and f ∈ HK for some RKHS HK. We consider the following loss function: (cid:104) (cid:105) (y − Xβ − f (u1, . . . , um))T (y − Xβ − f (u1, . . . , um)) L(β, f ) = 1 2φ EU + λ 2 . (cid:107)f(cid:107)2HK (3.2) The first thing we are going to do is to extend Theorem 1.3.2 to meet with our need. Theorem 3.2.1 (Extended Representer Theorem). For a fixed kernel K, let HK be the 52 corresponding RKHS and let X1, . . . , Xn be random variables. 
Then for a convex function L : Rn → R and non-decreasing Ω : R → R, if the optimization problem can be expressed as (cid:17)(cid:111) (cid:16)(cid:107)f(cid:107)2HK , J(f∗) = min f∈HK J(f ) = min f∈HK then the solution can be expressed as (cid:110)E [L(f (X1), . . . , f (Xn))] + Ω n(cid:88) i=1 αiE [K(·, Xi)] f∗ = Proof. Consider the subspace S ⊂ HK given by S = span{E [K(·, Xi)] : i = 1, . . . , n} . Since S is a finite dimensional space, it is therefore closed. The projection theorem then implies HK = S ⊕ S⊥, i.e., for every f ∈ HK, we can uniquely write f = fS + f⊥, where fS ∈ S and f⊥ ∈ S⊥. Noting that (cid:104)f⊥, E [K(·, Xi)](cid:105) = 0 for each i, the reproducing property implies E [f (Xi)] = E [(cid:104)f, K(·, Xi)(cid:105)] = (cid:104)f, E[K(·, Xi)](cid:105) = (cid:104)fS , E [K(·, Xi)](cid:105) = E [fS (Xi)] = fS (Xi) On the other hand, Pythagoras Theorem implies that (cid:107)f(cid:107)2HK = (cid:107)fS(cid:107)2HK + (cid:107)f⊥(cid:107)HK ≥ (cid:107)fS(cid:107)2HK . 53 Since Ω is non-decreasing, we can know that Ω from Jensen’s inequality that (cid:16)(cid:107)f(cid:107)2HK (cid:17) . Then it follows (cid:16)(cid:107)fS(cid:107)2HK (cid:17) ≥ Ω (cid:17) (cid:16)(cid:107)f(cid:107)2HK (cid:17) (cid:16)(cid:107)fS(cid:107)2HK (cid:17) (cid:16)(cid:107)fS(cid:107)2HK (cid:17) (cid:16)(cid:107)fS(cid:107)2HK . J(f ) = E [L(f (X1), . . . , f (Xn))] + λΩ ≥ L (E[f (X1)], . . . , E [f (Xn)]) + λΩ = L (fS (X1), . . . , fS (Xn)) + λΩ = E [L (fS (X1), . . . , fS (Xn))] + λΩ Hence, J(f ) is minimized if f ∈ S and we can express the minimizer as n(cid:88) f∗(·) = αiE[K(·, Xi)]. i=1 Let W i be the ith row of the random matrix U = [u1 ··· um], based on Theorem 3.2.1, the general solution for f (·) in (3.2) can be expressed as n(cid:88) f (·) = αiE [K(·, W i)] , i=1 where α = [α1, . . . , αn]T are unknown parameters. Substituting this back into (3.2), we 54 have J(α, β) = 1 2φ EU (cid:104) (cid:105) (y − Xβ − EU [K(U )]α)T (y − Xβ − EU [K(U )]α) (y − Xβ − EU [K(U )]α)T (y − Xβ − EU [K(U )]α) (cid:42) n(cid:88) i=1 + λ 2 (cid:88) n(cid:88) i=1 j=1 + λ 2 αiE[K(·, W i)], n(cid:88) αjE(cid:2)K(·, W j)(cid:3)(cid:43) (cid:105) αiαjE(cid:104)(cid:10)K(·, W i), K(·, W j)(cid:11) j=1 HK HK = 1 2φ = 1 2φ (y − Xβ − EU [K(U )]α)T (y − Xβ − EU [K(U )]α) + αT EU [K(U )]α, λ 2 where K(U ) is an n×n matrix whose (i, j)th element if K(W i, W j). Differentiating J(α, β) with respect to α and β, we get = −φ−1XT (y − Xβ − EU [K(U )] α) = −φ−1EU [K(U )]T (y − Xβ − EU [K(U )]α) + λEU [K(U )] α ∂J ∂β ∂J ∂α By setting the second equation to 0, we get by the symmetry and positive definiteness of EU [K(U )], λEU [K(U )]α + φ−1EU [K(U )]T EU [K(U )]α = φ−1EU [K(U )]T (y − Xβ) ⇒α = (λφ)−1(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] (y − Xβ). 55 Similarly, for the first equation, we have XT Xβ = XT (y − EU [K(U )]α) = XT(cid:16) (cid:17) y − (λφ)−1EU [K(U )](In + (λφ)−1EU [K(U )])−1(y − Xβ) = XT (In + (λφ)−1EU [K(U )])−1y + (λφ)−1XT EU [K(U )](In + (λφ)−1EU [K(U )])−1Xβ, which implies that XT(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] Xβ = XT(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] y. 
Hence, we get (cid:20) XT(cid:16) ˆα = (λφ)−1(cid:16) ˆβ = (cid:17)−1 In + (λφ)−1EU [K(U )] (cid:17)−1 In + (λφ)−1EU [K(U )] X (y − X ˆβ), (cid:21)−1 XT(cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] y so that ˆf = EU [K(U )] ˆα = (λφ)−1EU [K(U )] (cid:16) (cid:17)−1 In + (λφ)−1EU [K(U )] (y − X ˆβ) Through some simple calculations, it can be seen that ˆβ and ˆα is the solution to the following equation: XT R−1X R−1X R−1 + τ−1 (EU [K(U )])−1 XT R−1 56   = β f XT R−1y R−1y  , where τ = λφ, R = φIn and this equation corresponds exactly to the Henderson’s mixed model equation of the linear mixed model y = Xβ + f + , (3.3) where E[f ] = 0, Var[f ] = τEU [K(U )] and E[] = 0, Var[] = φIn. This is equivalent to the following hierarchical model. In the model, we consider a more general setup that the matrix K(U ) is a positive linear combination of different kernel matrices. y|f ∼ Nn (Xβ + f , φIn) 0, J(cid:88) 0, j=1 τjKj(U ) L(cid:88)   , f|u1, . . . , um ∼ Nn (3.4) u1, . . . , um ∼ i.i.d. Nn ξlKl(G) l=1 where U = [u1,··· , um] ∈ Rn×m; Kj(U ), j = 1, . . . , J are n×n kernel matrices constructed based on the latent variables u1, . . . , um and Kl(G), l = 1, . . . , L are kernel matrices con- structed based on the genetic variables. For instance, if we have p genetic variants (p can be greater than n), we can define K(G) = p−1GGT , which is known as the product kernel or linear kernel. To see the connection between model (3.4) and model (3.3), we note that E[f ] = EU (E[f|u1, . . . , um]) = 0 Var[f ] = EU [Var (f|u1, . . . , um)] + Var [E (f|u1, . . . , um)] J(cid:88) (cid:2)Kj(U )(cid:3) .  J(cid:88)  = = EU τjKj(U ) τjEU j=1 j=1 57 This is the same as we obtained from model (3.3). As an illustration, the basic hierarchical structure of the model can be seen from Figure 3.1. Due to the similarity in the network structure as in the case of popular neural networks, we thus call our model a kernel neural network (KNN). KNN has several nice features. First, by considering genetic effects as random effects, the method can simultaneously deal with millions of variants. This addresses the limitation of the fixed-effect conventional neural network, which is computationally prohibitive on such a large number of variables. Second, by using random genetic effects, the model complexity is also greatly reduced, as we no longer require to estimate a large number of fixed genetic effects. Third, KNN allows for a large number of hidden units without increasing model complexity. Finally, using hidden units, KNN can capture non-linear and non-additive effects, and therefore is able to model complex functions beyond linear models. In the remaining part of this section and section 3.3, we will focus on the scenario where there is no fixed effects, that is β = 0. In section 3.4, we will consider the general estimation procedure when β (cid:54)= 0 and we will also calculate the prediction error. 3.2.1 Quadratic Estimators for Variance Components Popular estimation strategies for variance components in linear models are the maximum likelihood estimator (MLE) and the restricted maximum likelihood estimator (REML) [15]. However, both methods depend on the marginal distribution of y. In our KNN model, it is generally difficult to obtain the marginal distribution of y, which involves high dimensional integration with respect to u1, . . . , um. Moreover, the ui’s are embedded in the kernel matrices Kj(U ), j = 1, . . . , J, which makes the integration even more complicated. 
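To make the hierarchical structure in (3.4) concrete, a single draw from the model with J = 1 and one output kernel can be generated as follows. This is a minimal illustrative sketch (the function name and the order-2 polynomial choice are ours), not part of the estimation procedure.

```python
import numpy as np

def simulate_knn(X, beta, K_G, xi, tau, phi, m, output_kernel="product", rng=None):
    """Draw one sample (y, f, U) from the hierarchical KNN model (3.4) with J = 1."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    Sigma = sum(x * K for x, K in zip(xi, K_G))                  # sum_l xi_l K_l(G)
    U = rng.multivariate_normal(np.zeros(n), Sigma, size=m).T   # n x m latent features u_1..u_m
    if output_kernel == "product":
        K_U = U @ U.T / m                                        # product (linear) output kernel
    else:
        K_U = (1.0 + U @ U.T / m) ** 2                           # polynomial output kernel, order 2
    f = rng.multivariate_normal(np.zeros(n), tau * K_U)          # f | U ~ N(0, tau K(U))
    y = X @ beta + f + rng.normal(0.0, np.sqrt(phi), size=n)     # y | f ~ N(X beta + f, phi I_n)
    return y, f, U
```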
On the other hand, given the KNN model describes above, we can easily obtain the 58 Figure 3.1: An illustration of the hierarchical structure of the kernel neural network model with L input kernel matrices K1(G), . . . , KL(G) and J hidden kernel matrices K1(U ), . . . , KJ (U ). The output layer is the prediction for the random effect f. 59 conditional distribution of y|u1, . . . , um is given by J(cid:88) y|u1, . . . , um ∼ Nn 0, τjKj(U ) + φIn  . Then the marginal mean and variance of y can be obtained via conditioning arguments: j=1 E [y] = E (E [y|u1, . . . , um]) = 0. Var[y] = E [Var(y|u1, . . . , um)] + Var [E(y|u1, . . . , um)]   J(cid:88) J(cid:88) J(cid:88) j=1 j=1 = E = := τjKj(U ) + φIn τjE[Kj(U )] + φIn τjE[Kj(U )], (3.5) (3.6) j=0 where τ0 = φ and K0(U ) = In. Given the marginal mean and covariance matrix, the MInimum Quadratic Unbiased Estimator (MINQUE) proposed by Rao (1970, 1971, 1972)[55, 56, 57] can be used to estimate the variance components. The basic idea of MINQUE is to use a quadratic form yT Θy to estimate a linear combination of variance components. The MINQUE matrix Θ is obtained by minimizing a suitable matrix norm, which is typically chosen to be the Frobenius norm, of the difference between Θ and the matrix in the quadratic estimator by assuming that we know the random components in the linear mixed models. The constraint in the optimization is to guarantee the unbiasedness of the estimators. One advantage of MINQUE is that it has a closed form solution provided by Lemma 3.4 in Rao (1971)[56] so that it can be computed efficiently. However, MINQUE can also provide a negative estimator for a single variance component. When this occurs, we simply set the 60 negative estimators to be zero, as we usually do for MLE or REML of variance components, except for the error variance component. If the MINQUE estimator for the error variance component becomes negative, we project the MINQUE matrix Θ onto the positive semi- definite cone S+ n . Specifically, the modified estimator for error variance component becomes max{λ1, 0} ˆτ0 = yT O ...   OT y, max{λn, 0} where Odiag{λ1, . . . , λn}OT is the eigen-decomposition of Θ. 3.2.2 MINQUE in KNN For the ease of theoretical justifications, throughout the remaining of the chapter, we illus- (cid:104) 1 g (cid:105) trate the method with one hidden kernel matrix (i.e. J = 1) with the form of Kij(U ) = nwT i wj , where w1, . . . , wn are the rows of the matrix U and g[A] means that the function g is applied to each element in the matrix A. We start by considering the simplest case g(x) = x, in which case, the hidden kernel matrix becomes K(U ) = 1 variance for y. Since U U T ∼ Wn(m,(cid:80)L m U U T . In this case, we have an explicit form of the marginal l=1 ξlKl(G)), where Wn(m, A) is a Wishart distri- bution with m degrees of freedom and covariance matrix A, we have L(cid:88) V := Var[y] = τEU [K(U )] + φI = τ ξlKl(G) + φIn. (3.7) l=1 From equation (3.7), we can see that there is an identifiability issue if we directly estimate τ and ξl’s. To solve this issue, we reparameterize the covariance matrix by letting θl = τ ξl, 61 θ0 = φ, K0(G) = In and rewrite Var[y] as L(cid:88) l=0 L(cid:88) l=0 θlSl(G)ST l (G), (3.8) V = θlKl(G) = where S0(G), . . . , SL(G) are the Cholesky lower triangles for the kernel matrices K0(G), . . . , KL(G), respectively. Then the parameters θ0, . . . , θL can be estimated via MINQUE. 
Specifically, ˆθ0 = yT ˆA0y ˆθi = yT Aiy ∨ 0, where  L(cid:88) l=0 L(cid:88) l=0 ηl −1 Kl(G) Ai = Kl(G) i = 1, . . . , L, −1 Kl(G)  L(cid:88) l=0 , i = 0, . . . , L, and η0, . . . , ηL are the solutions to Γη = ei, where ei is a vector of zero except that the (i + 1)th elements is 1 and   L(cid:88) l=0 −1 Kl(G) Ki(G) −1 Kl(G)  L(cid:88) l=0  . Kj(G) Γij = tr Moreover, ˆA0 = PS+ K(U ) = g (cid:104) 1 m U U T(cid:105) A0 as mentioned above. For a general kernel matrix of the form n , we focus on the case where g satisfies the following property, which we called the Generalized Linear Separable Condition:  k(cid:88) α=1  = g cαxα k(cid:48)(cid:88) α=1 πα(c1, . . . , ck)γα(x1, . . . , xk), (3.9) where c1, . . . , ck ∈ R are coefficients and π1, . . . , πk(cid:48), γ1, . . . , γk(cid:48) are some functions. Exam- 62 ples of kernel functions satisfying the condition are polynomial kernels. We know that , wT s wt m (cid:33) (cid:32)  ∼ i.i.d. N2 Kst(U ) = f s, t = 1, . . . , n (3.10) and wi1  , . . . , 0  , σii σij where σii, σjj, σij are the corresponding elements in Σ =(cid:80)L wim  0 σij σjj wj1 wjm   , (3.11) l=1 ξlKl(G). We can then use Taylor expansion to obtain an approximation of E[Kij(U )], which is given in Lemma 3.2.1. The proof of Lemma 3.2.1 can be found in Appendix B. Lemma 3.2.1. Let the random vectors wi, wj ∈ Rm be as in (3.11) and the Kij(U ) as defined in (3.10). Then if(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)g(cid:48)(cid:48) P(cid:16)(cid:12)(cid:12)(cid:12)Kij(U ) − ˆKij(U ) (cid:12)(cid:12)(cid:12) > δ for some M > 0 and all λ ∈ [0, 1], we have where ˆKij(U ) = g(σij) + g(cid:48)(σij) wT m − σij i wj (cid:32) λσij + (1 − λ) (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ M, wT i wj m a.s., (3.12) (cid:40) −m (cid:17) ≤ 4 exp (cid:32) (cid:33)(cid:41) , (cid:114) 2δ M δ 20M sij ∧ δ M σ2 ij ∧ 1 4|σij| 1 ∧ (cid:32) (cid:33) . Lemma 3.2.1 and Remark B.0.1 in Appendix B show that when we use ˆK(U ) = [ ˆKij(U )] to approximate K(U ) and when the number of hidden features u1, . . . , um is large enough, the approximation will be sufficiently small. Hence, we can write K(U ) as follow: 63 K(U ) = ˆK(U ) + oP (1) = g[Σ] + g(cid:48)[Σ] (cid:12) (cid:19) U U T − Σ (cid:18) 1 m + oP (1), (3.13) where (cid:12) means the Hadamard product of two matrices. Moreover, the Strong Law of Large Numbers implies 1 m U U T → Σ a.s. Therefore, equation (3.13) can be further written as K(U ) = g[Σ] + oP (1), i.e., K(U ) P−→ f [Σ] as m → ∞ element-wisely. Following from the version of Dominated Convergence Theorem based on convergence in probability, the following lemma can be easily shown. Lemma 3.2.2. Under the assumptions of Lemma 3.2.1, if g(cid:48)(cid:48)(cid:0)ηij (cid:33)2 = o(1), g(cid:48)(cid:48)(cid:0)ηij L1(P), then 1 (cid:1)(cid:32) − σij wT i wj m E 2 (cid:1)(cid:32) (cid:33)2 ∈ wT m − σij i wj where ηij = λσij + (1 − λ) wT m for some λ ∈ [0, 1]. i wj Based on Lemma 3.2.2, the marginal variance-covariance matrix V can be written as V = τE [K(U )] + φIn + φIn πl(ξ1, . . . , ξL)γl [K1(G), . . . , KL(G)] + φIn (cid:39) τ g[Σ] + φIn (cid:39) τE(cid:104) ˆK(U ) (cid:105) L(cid:48)(cid:88) L(cid:48)(cid:88) = τ l=1 = θlSl(G)ST l (G), l=0 64 where θ0 = φ, θl = τ πl(ξ1, . . . , ξL), l = 1, . . . , L(cid:48) and S0(G) = In, Sl(G) is the Cholesky lower triangle for the matrix γl [K1(G), . . . , KL(G)], l = 1, . . . , L. As an example, we may consider g(x) = (1 + x)2, which corresponds to the output polynomial kernel. 
g[Σ] = p GGT )∧(cid:13)2, where the symbol ∧(cid:13)2 means the elementwise square. In this case, L = 1 1 (J n +ξ1 and K1(G) = p−1GGT . Then note that g[Σ] = J n + 2ξ1 p J (cid:12) GGT + = J n + 2ξ1 1 p GGT + ξ2 1 ξ2 1 p2 (GGT )∧(cid:13)2 p2 (GGT )∧(cid:13)2. 1 This shows that for polynomial output kernel and one input product kernel, L(cid:48) = 3 with 1 and γ1[K1(G)] = J n, γ2[K1(G)] = K1(G) = π1(ξ1) = 1, π2(ξ1) = 2ξ1, π3(ξ1) = ξ2 p−1GGT , γ3[K1(G)] = p−2(GGT )∧(cid:13)2. The parameters θ0, . . . , θL can be estimated via MINQUE as well. Based on the above discussion, we can see that the estimation of the variance components in KNN through MINQUE is an approximation. What we basically do here is to use a complex mixed model to approximate the KNN. 65 3.3 Predictions In this section, we make a theoretical comparison of prediction performance between KNN and LMM. Based on our model, the best predictor for f is given by ˆy = E [f|y] = EU [E (f|y, u1, . . . , um)] j=1   J(cid:88)   J(cid:88)   J(cid:88) j=1 j=1 = EU = EU := EU j=1  J(cid:88)  J(cid:88)  J(cid:88) j=1 j=1 τjKj(U ) τjKj(U ) + φIn φ−1τjKj(U ) φ−1τjKj(U ) + In ˜τjKj(U ) ˜τjKj(U ) + In −1 y −1 y −1 y, where ˜τj = τjφ−1, j = 1, . . . , m. The prediction error based on ˆy is given by R = (y − ˆy)T (y − ˆy) In − EU = yT   J(cid:88) In − EU j=1  J(cid:88)   J(cid:88) j=1 j=1  J(cid:88) j=1 ˜τjKj(U ) ˜τjKj(U ) + In −1 T −1  y. ˜τjKj(U ) ˜τjKj(U ) + In 66 In − EU ˜τjKj(U ) + In Note that =EU =EU we have j=1 j=1 ˜τjKj(U )   J(cid:88)  J(cid:88)   J(cid:88) ˜τjKj(U ) + In − J(cid:88) −1 ,   J(cid:88)   J(cid:88) EU ˜τjKj(U ) + In R = yT j=1 j=1 j=1 −1  J(cid:88) j=1 −1 ˜τjKj(U ) ˜τjKj(U ) + In −1 2 y. ˜τjKj(U ) + In j=1 Direct evaluation of the prediction error R is complicated. Instead, we approximate it based on asymptotic results. Same as the above, we focus on the case where J = 1 and K(U ) = g . The proof of the following Lemma can be found in Appendix B. (cid:104) 1 m U U T(cid:105) Lemma 3.3.1 (Approximation of Prediction Error). (i) When g(x) = x, then as m → ∞, R (cid:39) yT  L(cid:88) l=1 −2 ˜τ ξKl(G) + In y. (ii) When g is continuous and g[Σ] ∈ Sn +, then as m → ∞, ˜τ g  L(cid:88) R (cid:39) yT ξKl(G)  + In −2 y. l=1 Now we compare the average prediction error between kernel neural network and linear mixed model. For a linear mixed model, the prediction error using best predictor can be 67 obtained as follow. The proof can be found in Appendix B. Proposition 3.3.1 (Prediction Error for a Linear Mixed Model). Consider the linear mixed effect model (cid:17) ; (cid:16) y = f + ; f ∼ Nn RΣ  ∼ Nn (0, φIn) . 0, σ2 RΣ(cid:0)˜σ2 RΣ + In (cid:1)−1 y The prediction error based on quadratic loss and the best predictor ˆy = E[f|y] = ˜σ2 (˜σ2 R = σ2 Rφ−1) is given by (cid:17)−1 , ˜σ2 Rλi(Σ) + 1 n(cid:88) (cid:16) i=1 P ELM M = φ where P ELM M is the average prediction error for the linear mixed model and λ1(Σ), . . . , λn(Σ) are the eigenvalues of Σ. Proposition 3.3.2. Assuming that σ2 = φ and ˜σ2 R ≤ ˜τ min1≤l≤L ξl, we have P EKN N (cid:45) P ELM M , where P EKN N stands for average prediction error for kernel neural network. (cid:2)(˜τ K(U ) + In)−1(cid:3). Then for the kernel neural network, the average Proof. Let A = EU 68 prediction error is given by P EKN N = E(cid:104) (cid:17)(cid:105) yT A2y = EU (cid:104)E(cid:16) tr A2 (˜τ K(U ) + In) yT A2y|u1, . . . 
, um (cid:105) (cid:16) (cid:104) (cid:26)(cid:16)EU (˜τ K(U ) + In)−1(cid:105)(cid:17)2 EU [˜τ K(U ) + In] (cid:104)  ˜τ L(cid:88)  ˜τ L(cid:88) ˜τ min n(cid:88) (cid:17)(cid:105) −2˜τ −1  + 1  L(cid:88) −1 L(cid:88) l=1 ξlKl(G) + In ξlKl(G) + In l=1 l=1 1≤l≤L ξlλi Kl(G) = φEU = φtr (cid:39) φtr = φtr ≤ φ ξlKl(G) + In (cid:27)   Under the assumptions in this proposition and the linear mixed model with Σ =(cid:80)L l=1 i=1 l=1 Kl(G), we have φ(cid:0)˜τ min1≤l≤L λi(Σ) + 1(cid:1)−1 σ2(cid:0)˜σ2 Rλi(Σ) + 1(cid:1)−1 = φ σ2 ˜σ2 Rλi(Σ) + 1 ˜τ min1≤l≤L ξlλi(Σ) + 1 ≤ 1, which implies that P EKN N (cid:45) P ELM M . Remark 3.3.1. The result for Proposition 3.3.2 can be illustrated by using Figure 3.2. As shown in the Figure 3.2 for the case of L = 1, there are two paths from the kernel matrix based on G to the response y. One is the kernel neural network path (solid line) and the other is the linear mixed model path (dash-dotted line). The intuition behind the assumption R ≤ ˜τ ξ is that the kernel neural network should explain more variations than the linear ˜σ2 mixed model as it has two portions. We then extend the result to K(U ) = g , where g is as described in Lemma (cid:104) 1 m U U T(cid:105) 69 Figure 3.2: The intuition under the assumption ˜σ2 R ≤ ˜τ ξ in Proposition 3.3.2. 3.3.1(ii). Proposition 3.3.3. Under the above notations, assuming that σ2 = φ, ˜σ2 (cid:105)(cid:17) ≥ 0 with |g(cid:48)(cid:48)(x)| ≤ M for some M > 0 and all x between R ≤ ˜τ min1≤l≤L ξl (cid:16) (cid:104)(cid:80)L and λ1 (g − ι) l=1 ξlKl(G) mini,j σij and maxi,j vT i vj m , we have P EKN N (cid:45) P ELM M , where λ1(Σ) is the smallest eigenvalue of the matrix Σ. 70 Proof. Note that P EKN N = φtr (cid:26)(cid:16)E(cid:104) (cid:27) (˜τ K(U ) + In)−1(cid:105)(cid:17)2 E [˜τ K(U ) + In]  ˜τ g  L(cid:88) n(cid:88) (cid:104)(cid:80)L (cid:16) n(cid:88) (cid:16) −1 (cid:105)  + In (cid:105)(cid:17) l=1 ξlKl(G) ξlKl(G) + 1 i=1 ˜τ λi (cid:104)(cid:80)L 1 + min1≤l≤L ξl l=1 ξlKl(G) i=1 ˜τ λi (g − ι) l=1 1 g (cid:39) φtr = φ ≤ φ (cid:80)L l=1 Kl(G) (cid:17) , + 1 where ι : Σ (cid:55)→ Σ is the identity map. Corollary 4.3.15 in Horn and Johnson (2012)[32] implies that P EKN N (cid:45) φ n(cid:88) i=1 1 ˜τ min1≤l≤L ξlλi(Kl(G)) + ˜τ λ1 (cid:16) (g − ι) (cid:104)(cid:80)L (cid:105)(cid:17) l=1 ξlKl(G) + 1 Corollary 3.3.1. If g (cid:104)(cid:80)L l=1 ξlKl(G) (cid:105) −(cid:80)L l=1 ξlKl(G) is positive semidefinite, then P EKN N (cid:45) P ELM M . Example 3.3.1 (Polynomial Kernels). For a polynomial kernel of degree d, i.e., Kij(U ) = (cid:32) wT i wj m c + (cid:33)d , we have g(x) = (c + x)d =(cid:80)d k=0 (g − ι)(x) = cd + (dcd−1 − 1)x + (cid:0)d k (cid:1)cd−kxk so that d(cid:88) (cid:18)d (cid:19) cd−kxk. k k=2 71 Theorem 4.1 in Hiai (2009)[29] states that for a real function on (−α, α), 0 < α ≤ ∞, it is Schur positive1 if and only if it is analytic and g(k)(0) ≥ 0 for all k ≥ 0. Since g − ι is a polynomial function, it is clearly analytic. We can then expand g(x) using Taylor expansion around 0 and obtain (cid:18)d (cid:19) k Hence, we have cd−k = g(k)(0) k! ⇒ g(k)(0) = d! (d − k)! cd−k, k = 0, . . . , d.  0 (g − ι)(k)(x) = dcd−1 − 1 (d−k)!cd−k d! if k = 1 if k ∈ {0, 1, . . . , d} \ {1} k ≥ d + 1 . To make g − ι Schur positive, we only need to require c ≥ d−1(cid:113) 1 d ≥ 0 so that the minimum eigenvalue condition of Proposition 3.3.3 holds. 
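Before turning to models with fixed effects, the variance-component estimation of Sections 3.2.1-3.2.2 can be summarized in code. The sketch below implements MINQUE with unit prior weights for a covariance of the form V = Σ_l θ_l K_l(G) with K_0(G) = I_n, including the truncation at zero and the positive semidefinite projection described in Section 3.2.1; the function name and interface are ours, and the code is illustrative rather than the implementation used for the reported results.

```python
import numpy as np

def minque_unit_weights(y, kernels):
    """MINQUE with unit prior weights for V = sum_l theta_l K_l, K_0 = I_n (illustrative).

    Negative estimates of theta_1..theta_L are truncated at 0; if the error-variance
    estimate is negative, the MINQUE matrix A_0 is projected onto the PSD cone S_n^+.
    """
    n = y.shape[0]
    Ks = [np.eye(n)] + list(kernels)                  # K_0 = I_n, then K_1(G), ..., K_L(G)
    L1 = len(Ks)
    Vw_inv = np.linalg.inv(sum(Ks))                   # (sum_l K_l)^{-1}
    W = [Vw_inv @ K @ Vw_inv for K in Ks]
    Gamma = np.array([[np.trace(W[i] @ Ks[j]) for j in range(L1)] for i in range(L1)])
    theta = np.empty(L1)
    for i in range(L1):
        eta = np.linalg.solve(Gamma, np.eye(L1)[i])   # solve Gamma eta = e_i
        A_i = Vw_inv @ sum(e * K for e, K in zip(eta, Ks)) @ Vw_inv
        theta[i] = y @ A_i @ y
        if i == 0 and theta[0] < 0:                   # project A_0 onto the PSD cone
            lam, O = np.linalg.eigh(A_i)
            theta[0] = y @ (O * np.maximum(lam, 0.0)) @ O.T @ y
        elif i > 0:
            theta[i] = max(theta[i], 0.0)
    return theta
```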
3.4 Including Fixed Effects In the previous discussions, we focus on the case where the marginal distribution of the response variable y has mean 0. In many applications, as the one we see in the ADNI real data application, there are many important covariates which may have large effects to the response variable. In this part, we are going to extend the proposed KNN model to take the 1For a real function f on (−α, α) and for n ∈ N, it is Schur-positive of order n if f [A] is positive semidefinite for all positive semidefinite A ∈ Mn(R) with entries in (−α, α). 72 covariates into account. As we have mentioned in (3.4), the general structure of KNN is f|u1, . . . , um ∼ Nn τjKj(U ) (3.14) y|f ∼ Nn (Xβ + f , φIn)  0, J(cid:88) 0, j=1 L(cid:88) l=1 u1, . . . , um ∼ i.i.d. Nn ξlKl(G) Similar to the situation considered before, given the latent variables u1, . . . , um, we have Xβ, J(cid:88) y|u1, . . . , um ∼ Nn τjKj(U ) + φIn and then the marginal mean and covariance matrix of y can be obtained as follow: j=1  .   E [y] = EU [E [y|u1, . . . , um]] = EU [Xβ] = Xβ Var [y] = VarU [E [y|u1, . . . , um]] + EU [Var [E [y|u1, . . . , um]]] = VarU [Xβ] + EU τjKj(U ) + φIn  J(cid:88) (cid:2)Kj(U )(cid:3) , j=1 J(cid:88) j=0 = τjEU where τ0 = φ and K0(U ) = In. Again, we focus on the case where J = 1 and K(U ) is of the form g with g satisfying the generalized linear separable condition so that mU U T(cid:105) (cid:104) 1 after suitable reparameterization, the variance components become estimable. A natural way to obtain the estimators for β and the variance components θ is to first obtain a “good" estimates for the variance components ˆθ and then plug it into the Aitken equation to obtain the estimates for fixed-effect parameters ˆβ. Kackar and Harville (1981)[37] showed that as 73 long as ˆθ is even and translation-invariant, pT ˆβ + qT ˆf is an unbiased prediction for the quantity pT β + qT f. Let R be an r × n matrix with r = n − rank(X) such that RX = O and RRT = Ir. Such a matrix can be found using the QR decomposition of X [47]. The estimators for the variance components can then be estimated based on the transformed model: ˜f|u1, . . . , um ∼ Nr τjRKj(U )RT u1, . . . , um ∼ i.i.d. Nn ξlKl(G)   , ˜y|˜f ∼ Nr(˜f , φIr) 0, J(cid:88) 0, j=1 L(cid:88) l=1 where ˜y = Ry and ˜f = Rf, which is the same model framework we mainly discussed in section 3.2. As we have seen, the estimators for the variance components after reparame- terization is of the form ˜yT Θ˜y = yT RT ΘRy. Since it is a quadratic form, clearly it is an even function in y. To see that it is translation invariant, we note that for any c ∈ C(X), we can know that c = Xb for some vector b ∈ Rr and we have (y − c)T RT ΘR(y − c) = (y − Xb)T RT ΘR(y − Xb) = yT RT ΘRy − 2yT RT ΘRXb + bT XT RT ΘRXb = yT RT ΘRy, where the last equality follows since RX = O as defined. Therefore, we can know that the obtained estimators for variance components are also translation-invariant and based on the results in Kackar and Harville (1981)[37], the obtained estimator for β by plugging in the 74 variance component estimators is unbiased. For the prediction error, we note that when the covariates present, the predictor for y is given by ˆy = X ˆβ + ˆf = X ˆβ + E[f|y]. 
Based on the result from linear mixed models, we know that   J(cid:88)   J(cid:88) j=1 j=1 ˆf = E = E  J(cid:88)  J(cid:88) j=1 j=1 −1 (y − Xβ) −1(cid:20) (cid:16) In − X ˜τjKj(U ) + In ˜τjKj(U ) + In (cid:17)− (cid:21) y XT V −1 XT V −1X ˜τjKj(U ) ˜τjKj(U ) XT V −1X j=1 τjE[Kj(U )] + φIn and then (cid:21) y XT V −1 (cid:17)−  J(cid:88)  J(cid:88) −1(cid:20) j=1 j=1 (cid:16) ˜τjKj(U ) ˜τjKj(U ) + In ˜τjKj(U ) ˜τjKj(U ) + In (cid:17)− (cid:21) y XT V −1X (cid:16) XT V −1 (cid:17)− In − X XT V −1X XT V −1 (cid:21) y (cid:16) In − X −1(cid:20) −1 (cid:20) (cid:17)− (cid:21) j=1 (cid:20) − E y − ˆy = In − X where V = Var[y] =(cid:80)J (cid:16)   J(cid:88)  In − E  J(cid:88)   J(cid:88)   J(cid:88) In − E = E j=1 j=1 = j=1 ˜τjKj(U ) + In In − X XT V −1X XT V −1 y, where the last equality follows by noting that  J(cid:88) j=1 ˜τjKj(U ) ˜τjKj(U ) + In as shown above. Therefore, by letting A = E 75 −1 = E (cid:20)(cid:16)(cid:80)J   J(cid:88) j=1 −1 ˜τjKj(U ) + In (cid:17)−1(cid:21) and P V = j=1 ˜τjKj(U ) + In XT V −1, the prediction error is obtained as follow: (y − ˆy)T (y − ˆy) = yT(cid:16) (cid:105) When J = 1, as we have shown in the proof of Lemma 3.3.1, E(cid:104) (cid:104)(cid:80)L (cid:16) (cid:17)−1 ˜τ f l=1 ξlKl(G) + In (cid:17) ˆA In − P T V 2 (In − P V ) y. (˜τ K(U ) + In)−1(cid:105) (cid:39) so that we can estimate the prediction error by using simple plug-in estimators. (y − ˆy)T (y − ˆy) (cid:39) yT(cid:16) (cid:92) (cid:17)ˆ˜τ f  L(cid:88) l=1  + In ˆξlKl(G) −2(cid:16) (cid:17) y. In − P ˆV In − P T ˆV 3.5 Simulations In this section, we conducted some simulations to compare the prediction performance of KNN with MINQUE estimation to the prediction performance of the Best Linear Unbiased Estimator (BLUP) in linear mixed models. All the simulations are based on 100 individuals with 500 Monte Carlo iterations. 3.5.1 Nonlinear Random Effect As we have seen in section 3.2 that a KNN is equivalent to a nonlinear fixed effect model and we have also shown that the prediction error for KNN will be smaller than that of linear mixed models. Here, we use a simulation to validate previous observations through some simulation studies. We used the following model to simulate the response: (cid:16) X XT V −1X (cid:17)− (cid:18) (cid:19) 0, GGT 1 p , (3.15) y = 1n + 2ζ + f (u) + , u ∼ Nn 76 where G is an n×p matrix containing the genetic information (SNP) and ζ,  ∼ i.i.d. Nn(0, In). Here ζ is served as the fixed effect in the formulation of a KNN. In this simulation, four types of functions f are considered, which are linear (f (x) = x), sine (f (x) = sin(2πx)), inverse logistic (f (x) = 1/(1 + e−x)) and polynomial function of order 2 (f (x) = x2). When applying the kernel neural network, we set L = J = 1 and choose K(G) and K(U ) as either product kernel or polynomial kernel of order 2. Figure 3.3 shows the results when f is a linear or a sine function. The cases where f is an inverse logistic function or a polynomial function of order 2 are summarized in Appendix B. Figure 3.3: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors. The left panel shows the results when a linear function is used and the right panel shows the results when a sine function is used. 
Figure 3.3: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors. The left panel shows the results when a linear function is used and the right panel shows the results when a sine function is used. On the horizontal axis, "1" corresponds to the LMM; "2" corresponds to the KNN with product input kernel and product output kernel; "3" corresponds to the KNN with product input and polynomial output; "4" corresponds to the KNN with polynomial input and product output; and "5" corresponds to the KNN with polynomial input and polynomial output.

As we can observe from the boxplots, when the output kernel is chosen to be the product kernel, the performance of KNN is similar to that of LMM, although when the input kernel is chosen to be the polynomial kernel, KNN attains a slightly better prediction error, which we think is not significantly different from that of LMM. However, KNN performs significantly better than LMM when the output kernel is chosen to be the polynomial kernel. As one can tell from the boxplots, when both the input and output kernels are chosen to be polynomial, the KNN has the best performance in terms of the prediction error, and this is consistent across all nonlinear functions simulated in this section.

3.5.2 Nonadditive Effects

In this simulation, we explore the performance of both methods under non-additive effects. We conducted two simulations in terms of two different types of non-additive effects. In the first simulation, we mainly focus on the interaction effect and generate the response using the following model:

  y = f(G) + ε,

where G = [g_1, ..., g_p] ∈ R^{n×p} is the SNP data and ε ∼ N_n(0, I_n). When applying both methods, the mean is adjusted so that the response has marginal mean 0. In the simulation, we randomly pick 10 causal SNPs, denoted by g_{i_1}, ..., g_{i_10}, and consider the following function,

  f(G) = Σ_{1 ≤ j_1 < ···}

... 0.9, and then the covariance kernel matrix and the normalized identity-by-state (IBS) kernel matrix were constructed for analysis. Specifically, the (i, j)th element of each of the two kernel matrices is defined as follows:

  K_cov(g_i, g_j) = (1/(p − 1)) Σ_{k=1}^p ( g_{ik} − (1/p) Σ_k g_{ik} )( g_{jk} − (1/p) Σ_k g_{jk} ),
  K_ibs(g_i, g_j) = (1/(2p)) Σ_{k=1}^p [ 2 − |g_{ik} − g_{jk}| ],

where p is the number of SNPs in all expressions.

Four volume measures of cortical regions, namely the hippocampus, ventricles, entorhinal and whole brain volumes, were used as phenotypes of interest. We chose these four cortical regions since they play important roles in Alzheimer's disease (AD). The loss in the volumes of the whole brain, hippocampus and entorhinal cortex and the increase in the ventricular volume can be detected among AD patients. When we applied both the KNN method and the LMM method, we only included the subjects having both genetic information and phenotypic information, which results in 513 individuals for hippocampus, 564 individuals for ventricles, 516 individuals for entorhinal and 570 individuals for whole brain volume. The response variable was chosen to be the natural logarithm of the volumes of the four cortical regions and the covariates were chosen as age, gender, education status and APOE4. The KNN is then based on the model

  y | u_1, ..., u_m ∼ N_n(Xβ, τ K(U) + φI_n),
  u_1, ..., u_m ∼ N_n(0, ξ_1 K_cov + ξ_2 K_ibs),

and the LMM is based on the following model assumption:

  y ∼ N_n(Xβ, τ_1 K_cov + τ_2 K_ibs + τ_3 I_n).        (3.16)

The restricted maximum likelihood estimates of τ_i, i = 1, 2, 3, were then calculated based on the Fisher scoring method and the BLUP for the LMM was computed based on these estimates.
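The two kernel matrices defined above can be computed directly from a 0/1/2-coded genotype matrix; the sketch below is a straightforward numpy implementation of the two displayed formulas (suitable for moderate n and p, since the IBS computation forms an n × n × p array).

```python
import numpy as np

def covariance_kernel(G):
    """K_cov(g_i, g_j) = (1/(p-1)) * sum_k (g_ik - mean_i)(g_jk - mean_j)."""
    p = G.shape[1]
    Gc = G - G.mean(axis=1, keepdims=True)       # center each individual's genotypes
    return Gc @ Gc.T / (p - 1)

def ibs_kernel(G):
    """K_ibs(g_i, g_j) = (1/(2p)) * sum_k [2 - |g_ik - g_jk|]."""
    p = G.shape[1]
    abs_diff = np.abs(G[:, None, :] - G[None, :, :]).sum(axis=2)
    return (2.0 * p - abs_diff) / (2.0 * p)

# G: n x p genotype matrix coded 0/1/2 (simulated here for illustration)
G = np.random.default_rng(1).binomial(2, 0.3, size=(50, 200)).astype(float)
K_cov, K_ibs = covariance_kernel(G), ibs_kernel(G)
print(K_cov.shape, K_ibs.shape, K_ibs.max() <= 1.0)
```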
Similarly, when we applied the KNN method, the two kernel matrices K_cov and K_ibs were used as the input kernel matrices and the output kernel matrix was chosen to be either the product kernel or the polynomial kernel of order 2. The average mean squared prediction errors of both methods are summarized in Table 3.1.

Table 3.1: Average mean squared prediction error of KNN with product output kernel matrix, KNN with output kernel matrix as polynomial of order 2, and the BLUP based on LMM.

              Hippocampus   Ventricles   Entorhinal   Whole Brain Volume
  KNN(prod)   2.01e-06      2.44e-03     1.01e-05     1.25e-07
  KNN(poly)   4.17e-05      1.98e-03     1.06e-04     3.04e-08
  LMM         2.11e-02      2.24e-01     4.05e-02     3.68e-11

As we can see from the table, both KNN and LMM achieve small prediction errors. However, we would argue that the prediction errors from KNN are more realistic. It is interesting to note that the average prediction errors of LMM for entorhinal and whole brain volume are extremely close to zero. We checked the estimates of the variance components in this case and noticed that the variance component associated with the identity matrix, τ_3 in (3.16), is very close to zero. Since the BLUP of the random effect in this case is given by

  BLUP = Xβ̂ + (τ_1 K_cov + τ_2 K_ibs)(τ_1 K_cov + τ_2 K_ibs + τ_3 I_n)^{-1} (y − Xβ̂),

if τ_3 is very close to 0, the BLUP is very close to the response y, which leads to the extremely low prediction error. For KNN, on the other hand, when the estimate of the error variance becomes negative, we project the MINQUE matrix onto the positive semidefinite cone so that we always obtain a nonnegative estimate of the error variance component, which makes the calculation of the prediction error more reasonable.
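The projection step mentioned above can be carried out with a single eigendecomposition. The sketch below shows the generic Frobenius-norm projection of a symmetric matrix onto the positive semidefinite cone (negative eigenvalues truncated at zero); it is an illustration of the device rather than the exact MINQUE implementation used here.

```python
import numpy as np

def project_to_psd(A):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone:
    keep the eigenvectors and clip negative eigenvalues at zero."""
    A_sym = (A + A.T) / 2.0                      # symmetrize against round-off
    eigval, eigvec = np.linalg.eigh(A_sym)
    return (eigvec * np.clip(eigval, 0.0, None)) @ eigvec.T

# example: a symmetric matrix with eigenvalues 5 and -1
A = np.array([[2.0, 3.0],
              [3.0, 2.0]])
print(np.linalg.eigvalsh(project_to_psd(A)))     # all eigenvalues are now >= 0
```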
3.7 Discussion

In this chapter, a kernel-based neural network model was proposed for prediction analyses. The kernel-based neural network can be thought of as an extension of the linear mixed model, since it reduces to a linear mixed model when the product kernel matrix is chosen as the output kernel matrix and after reparameterization. A modified MINQUE strategy is used to obtain the estimators of the variance components in the KNN model when the output kernel satisfies the generalized linear separable condition. Empirical simulation studies and the real data application show that the KNN model can achieve better performance in terms of the mean squared prediction error when the output kernel matrix is chosen as the polynomial kernel matrix. This is analogous to the popular neural network model, where better prediction accuracy can be achieved when nonlinear activation functions are applied.

Many extensions can be made to make the KNN model more flexible and give it broader applications. First, it may be possible to conduct base kernel matrix selection. Although we do not consider how to choose the number of base kernel matrices L in this paper, too many kernel matrices will certainly increase the amount of redundant information, so it may be beneficial to propose a criterion for choosing the base kernel matrices. Another possible extension of the KNN model is more challenging: the theoretical properties discussed in this paper mainly focus on the case where only one output kernel matrix is used and the kernel function has a specific form. It is natural to consider more general kernel functions, but the estimation procedure for the variance components would then be more complex. Moreover, it is also advisable to consider the performance of KNN when deep network structures are applied.

Chapter 4

Asymptotic Properties of Neural Network Sieve Estimators

4.1 Introduction

With the widespread usage of machine learning and artificial intelligence, neural networks have regained their popularity. More and more learning machines are based on deep neural networks and have achieved great classification and prediction accuracy. We refer interested readers to Goodfellow et al. (2016)[25] for more background and details. In classical statistical learning theory, the consistency and the rate of convergence of the empirical risk minimization principle are of great interest. Many upper bounds have been obtained for the empirical risk and the sample complexity based on the growth function and the Vapnik-Chervonenkis dimension (see, for example, [70, 4, 18]). However, not many studies focus on the asymptotic properties of neural networks. As Thomas J. Sargent said, "artificial intelligence is actually statistics, but in a very gorgeous phrase, it is statistics." So it is natural and worthwhile to explore whether a neural network model possesses nice asymptotic properties, since if it does, it may be possible to conduct statistical inference based on this model. Throughout this chapter, we will focus on the asymptotic properties of neural networks with one hidden layer.

In the language of statistics, fitting a neural network with one hidden layer is simply a parametric nonlinear regression problem:

  y_i = α_0 + Σ_{j=1}^r α_j σ(γ_j^T x_i + γ_{0,j}) + ε_i,

where ε_1, ..., ε_n are i.i.d. random errors with E[ε] = 0 and E[ε^2] = σ^2 < ∞, and σ(·) is an activation function, for example σ(z) = 1/(1 + e^{−z}), which will be the main focus in this paper. White and Racine (2001)[75] obtained the asymptotic distribution of the resulting estimators under the assumption that the true parameters are unique. In fact, the authors implicitly assumed that the number of hidden units r is known. However, even if we assume that we know the number of hidden units, it is difficult to establish the asymptotic properties of the parameter estimators. In section 4.6.1, we conduct a simulation based on a single-layer neural network with 2 hidden units. Even for such a simple model, the simulation results suggest that reaching consistency is highly unlikely. Moreover, since the number of hidden units is usually unknown in practice, such an assumption can easily be violated. For example, as pointed out in Fukumizu (1996)[21] and Fukumizu et al. (2003)[22], if the true function is f_0(x) = ασ(γx), that is, the true number of hidden units is 1, and we fit the model using a neural network with two hidden units, then any parameter θ = [α_0, α_1, ..., α_r, γ_{0,1}, ..., γ_{0,r}, γ_1^T, ..., γ_r^T]^T in the high-dimensional set

  {θ : γ_1 = γ, α_1 = α, γ_{0,1} = γ_{0,2} = α_2 = α_0 = 0} ∪ {θ : γ_1 = γ_2 = γ, γ_{0,1} = γ_{0,2} = α_0 = 0, α_1 + α_2 = α}

realizes the true function f_0(x). Therefore, when the number of hidden units is unknown, the parameters in this parametric nonlinear regression problem are unidentifiable. Theorem 1 in Wu (1981)[77] showed that a necessary condition for the weak consistency of nonlinear least squares estimators is that

  Σ_{i=1}^n [f(x_i, θ) − f(x_i, θ')]^2 → ∞, as n → ∞,

for all θ ≠ θ' in the parameter space, as long as the distribution of the errors has finite Fisher information.
Such a condition implies that when the parameters are not identifiable, the resulting nonlinear least squares estimators will be inconsistent, which hinders further exploration of the asymptotic properties of the neural network estimators. Liu and Shao (2003)[43] and Zhu and Zhang (2006)[80] proposed techniques to conduct hypothesis testing under loss of identifiability. However, their theoretical results are hard to implement in real-world applications.

Even though a function can have different neural network parametrizations, the function itself can be considered as unique. Moreover, due to the Universal Approximation Theorem [33], any continuous function on a compact support can be approximated arbitrarily well by a neural network with one hidden layer. So it seems natural to consider a nonparametric regression setting and approximate the function class through a class of neural networks with one hidden layer. Specifically, suppose that the true nonparametric regression model is

  y_i = f_0(x_i) + ε_i,

where ε_1, ..., ε_n are i.i.d. random variables defined on a complete probability space (Ω, A, P) with E[ε] = 0, Var[ε] = σ^2 < ∞; x_1, ..., x_n ∈ X ⊂ R^d are vectors of covariates with X being a compact set in R^d; and f_0 is an unknown function to be estimated. We assume that f_0 ∈ F, where F is the class of continuous functions with compact support. Clearly, f_0 minimizes the population criterion function

  Q̄_n(f) = E[ (1/n) Σ_{i=1}^n (y_i − f(x_i))^2 ] = (1/n) Σ_{i=1}^n (f(x_i) − f_0(x_i))^2 + σ^2.

A least squares estimator of the regression function can be obtained by minimizing the empirical squared error loss Q_n(f):

  f̂_n = argmin_{f∈F} Q_n(f) = argmin_{f∈F} (1/n) Σ_{i=1}^n (y_i − f(x_i))^2.

However, if the class of functions F is too rich, the resulting least squares estimator may have undesirable properties such as inconsistency [68, 65, 64]. Instead, we may optimize the squared error loss over some less complex function space F_n, which approximates F while the approximation error tends to 0 as the sample size increases. In the language of Grenander (1981)[26], such a sequence of function classes is known as a sieve. More precisely, we consider a sequence of function classes

  F_1 ⊆ F_2 ⊆ ··· ⊆ F_n ⊆ F_{n+1} ⊆ ··· ⊆ F

approximating F in the sense that ⋃_{n=1}^∞ F_n is dense in F, that is, for each f ∈ F there exists π_n f ∈ F_n such that d(f, π_n f) → 0 as n → ∞, where d(·,·) is some pseudo-metric defined on F. With some abuse of notation, an approximate sieve estimator f̂_n is defined to be any f̂_n ∈ F_n satisfying

  Q_n(f̂_n) ≤ inf_{f∈F_n} Q_n(f) + O_p(η_n),        (4.1)

where η_n → 0 as n → ∞.

Throughout the rest of the chapter, we focus on the sieve of neural networks with one hidden layer and sigmoid activation function. Specifically, we let

  F_{r_n} = { α_0 + Σ_{j=1}^{r_n} α_j σ(γ_j^T x + γ_{0,j}) : γ_j ∈ R^d, α_j, γ_{0,j} ∈ R,
              Σ_{j=0}^{r_n} |α_j| ≤ V_n for some V_n > 4, and max_{1≤j≤r_n} Σ_{i=0}^d |γ_{i,j}| ≤ M_n for some M_n > 0 },        (4.2)

where r_n, V_n, M_n ↑ ∞ as n → ∞. Such a method has been discussed in many works (see, for example, [73] and [74]). In those papers, consistency of the neural network sieve estimators has been established under a random design. However, there are few results on the asymptotic distribution of the neural network sieve estimators, which will be established in this paper. Moreover, throughout this paper, we focus on the fixed design.
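To make the sieve (4.2) concrete, the sketch below evaluates a single-hidden-layer sigmoid network on a fixed design and checks the two constraints that define F_{r_n}; the particular values of r_n, V_n and M_n, and the design itself, are placeholders. The quantity n^{-1} Σ_i f(x_i)^2 computed at the end is the squared pseudo-norm ‖f‖_n^2 used throughout the rest of this chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sieve_network(X, alpha0, alpha, Gamma, gamma0):
    """f(x) = alpha0 + sum_j alpha_j * sigmoid(gamma_j^T x + gamma_{0,j}).
    X: n x d design, alpha: (r,), Gamma: r x d, gamma0: (r,)."""
    return alpha0 + sigmoid(X @ Gamma.T + gamma0) @ alpha

def in_sieve(alpha0, alpha, Gamma, gamma0, V_n, M_n):
    """Membership in F_{r_n}: sum_{j=0}^{r_n} |alpha_j| <= V_n and
    max_j (|gamma_{0,j}| + sum_i |gamma_{i,j}|) <= M_n."""
    outer_l1 = np.abs(alpha0) + np.abs(alpha).sum()
    inner_l1 = (np.abs(gamma0) + np.abs(Gamma).sum(axis=1)).max()
    return outer_l1 <= V_n and inner_l1 <= M_n

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))        # fixed design on a compact set
alpha0, alpha = 0.5, np.array([1.0, -0.8])       # r_n = 2 hidden units (illustrative)
Gamma, gamma0 = rng.normal(size=(2, 3)), np.zeros(2)
fx = sieve_network(X, alpha0, alpha, Gamma, gamma0)
print(in_sieve(alpha0, alpha, Gamma, gamma0, V_n=10.0, M_n=5.0), np.mean(fx ** 2))
```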
Hornik et al. (1989)[33] showed that ⋃_n F_{r_n} is dense in F under the sup-norm. When considering the asymptotic properties of the sieve estimators, however, we use the pseudo-norm ‖f‖_n^2 = n^{-1} Σ_{i=1}^n f^2(x_i) (see Proposition C.0.1 in Appendix C) defined on F and F_{r_n}.

In section 4.2, we discuss the existence of neural network sieve estimators. The weak consistency and the rate of convergence of the neural network sieve estimators are established in section 4.3 and section 4.4, respectively. Section 4.5 focuses on the asymptotic distribution of the neural network sieve estimators. Several simulation results are presented in section 4.6.

Notation: Throughout the rest of the chapter, bold alphabetic letters and Greek letters denote vectors. C(X) is the set of continuous functions defined on X. The symbol ≲ means "is bounded above up to a universal constant" and a_n ∼ b_n means a_n/b_n → 1 as n → ∞. For a pseudo-metric space (T, d), N(ε, T, d) is its covering number, that is, the minimum number of ε-balls needed to cover T. Its natural logarithm is the entropy number and is denoted by H(ε, T, d).

4.2 Existence

A natural question to ask is whether or not the sieve estimator based on neural networks exists. Before we enter the main discussion, we first look at some simple properties of F_{r_n}. Proposition 4.2.1 shows that the sigmoid function is a Lipschitz function with Lipschitz constant L = 1/4.

Proposition 4.2.1. A sigmoid function σ(z) = e^z/(1 + e^z) is a Lipschitz function on R with Lipschitz constant 1/4.

Proof. For all z_1, z_2 ∈ R, we know that σ(z) is continuous on [z_1, z_2] and differentiable on (z_1, z_2). Note that

  σ'(z) = σ(z)(1 − σ(z)) ≤ 1/4  for all z ∈ R.

By the Mean Value Theorem, we know that

  σ(z_1) − σ(z_2) = σ'(λz_1 + (1 − λ)z_2)(z_1 − z_2)

for some λ ∈ [0, 1]. Hence

  |σ(z_1) − σ(z_2)| = |σ'(λz_1 + (1 − λ)z_2)| |z_1 − z_2| ≤ (1/4)|z_1 − z_2|,

which means that σ(z) is a Lipschitz function on R with Lipschitz constant 1/4.

The second proposition provides an upper bound for the envelope function sup_{f∈F_{r_n}} |f|.

Proposition 4.2.2. For each fixed n,

  sup_{f∈F_{r_n}} ‖f‖_∞ ≤ V_n.

Proof. For any f ∈ F_{r_n} with n fixed, note that for all x ∈ X we have

  |f(x)| = | α_0 + Σ_{j=1}^{r_n} α_j σ(γ_j^T x + γ_{0,j}) | ≤ |α_0| + Σ_{j=1}^{r_n} |α_j| σ(γ_j^T x + γ_{0,j}) ≤ Σ_{j=0}^{r_n} |α_j| ≤ V_n.

Since the right-hand side does not depend on x and f, we get

  sup_{f∈F_{r_n}} ‖f‖_∞ = sup_{f∈F_{r_n}} sup_{x∈X} |f(x)| ≤ V_n.

Now we quote a general result from White and Wooldridge (1991)[76]. The theorem tells us that under some mild conditions, there exists a sieve approximate estimator and that such an estimator is also measurable.

Theorem 4.2.1 (Theorem 2.2 in White and Wooldridge (1991)[76]). Let (Ω, A, P) be a complete probability space and let (Θ, ρ) be a pseudo-metric space. Let {Θ_n} be a sequence of compact subsets of Θ. Let Q_n : Ω × Θ_n → R̄ be A ⊗ B(Θ_n)/B(R̄)-measurable, and suppose that for each ω ∈ Ω, Q_n(ω, ·) is lower semicontinuous on Θ_n, n = 1, 2, .... Then for each n = 1, 2, ..., there exists θ̂_n : Ω → Θ_n, A/B(Θ_n)-measurable, such that for each ω ∈ Ω, Q_n(ω, θ̂_n(ω)) = inf_{θ∈Θ_n} Q_n(ω, θ).

Note that

  Q_n(f) = (1/n) Σ_{i=1}^n (y_i − f(x_i))^2 = (1/n) Σ_{i=1}^n (f_0(x_i) + ε_i − f(x_i))^2
         = (1/n) Σ_{i=1}^n (f(x_i) − f_0(x_i))^2 − (2/n) Σ_{i=1}^n ε_i (f(x_i) − f_0(x_i)) + (1/n) Σ_{i=1}^n ε_i^2.
Since the randomness only comes from i’s, it is clear that Qn is a measurable function and for a fixed ω, Qn is continuous in f. Therefore, to show the existence of the sieve estimator, it suffices to show that Frn is compact in C(X ), which is proved in the following lemma. Lemma 4.2.1. Let X be a compact subset of Rd. Then for each fixed n, Frn is a compact set. Proof. For each fixed n, let θn = [α0, . . . , αrn, γ0,1, . . . , γ0,rn, γT [−Mn, Mn]rn(d+1) := Θn. For n fixed, Θn is a bounded closed set and hence it is a compact 1 , . . . , γT rn]T belong to [−Vn, Vn]rn+1× 94 set in Rrn(d+2)+1. Consider a map H : (Θn,(cid:107) · (cid:107)2) → (Frn,(cid:107) · (cid:107)n) θn (cid:55)→ H(θn) = α0 + (cid:16) αjσ rn(cid:88) j=1 (cid:17) γT j x + γ0,j Note that Frn = H(Θn). Therefore, to show that Frn is a compact set, it suffices to show that H is a continuous map due to the compactness of Θn. Let θ1,n, θ2,n ∈ Θn, then α (cid:12)(cid:12)(cid:12)α (cid:12)(cid:12)(cid:12)α  rn(cid:88) (1) j σ (1)T j γ = n α i=1 1 n (1) 0 + rn(cid:88) (cid:107)H(θ1,n) − H(θ2,n)(cid:107)2 n(cid:88) n(cid:88) n(cid:88) 0 − α (2) 0 (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) + 0 − α (2) 0 j=1 i=1 (1) (1) 1 n ≤ 1 n = (cid:18) rn(cid:88) rn(cid:88) j=1 i=1 i=1 n(cid:88)  rn(cid:88) (cid:18) Vn j=0 4 ≤ 1 n ≤ ≤ |α (1) j − α (2) j j=0 |α (1) j − α (2) j | + (cid:19)2 (cid:12)(cid:12)(cid:12)(cid:12)α |α (1) j (1) j σ (1) 0,j | γ γ xi + γ (1)T j (1)T j (cid:18) (cid:12)(cid:12)(cid:12)(cid:12)σ (cid:18) (cid:12)(cid:12)(cid:12)(cid:12)(cid:16) rn(cid:88) j − γ rn(cid:88) j=1 (1) γ (cid:13)(cid:13)(cid:13)γ j=1 j=1 | + Vn 4 (1 ∨ (cid:107)x(cid:107)∞) [rn(d + 1)](cid:107)θ1,n − θ2,n(cid:107)2 2. α (2) j σ (2)T j γ xi + γ (cid:19) − α rn(cid:88) j=1 (2) 0 − (cid:19) (cid:19) (1) 0,j xi + γ − α (2) j σ (2)T j γ xi + γ xi + γ (1) 0,j − σ (2)T j xi + γ (2) 0,j (cid:18) (cid:18) (cid:18) (cid:18) γ (2) 0,j (cid:19)2 2 (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:19)(cid:21)2 (2) 0,j (1) j − α |α (cid:17)T xi (2) j (2) j (cid:12)(cid:12)(cid:12)(cid:12) + (cid:13)(cid:13)(cid:13)1 |σ (cid:12)(cid:12)(cid:12)γ (cid:12)(cid:12)(cid:12)γ + (2)T j γ xi + γ (2) 0,j (1) 0,j − γ (2) 0,j (cid:12)(cid:12)(cid:12) 2 2 (cid:12)(cid:12)(cid:12) (1 ∨ (cid:107)x(cid:107)∞) Vn 4 (1) j − γ (2) j (1) 0,j − γ (2) 0,j Hence, for any  > 0, by choosing δ = / (cid:16) Vn 4 (1 ∨ (cid:107)x(cid:107)∞)(cid:112)rn(d + 1) (cid:17) , we observe that when 95 (cid:107)θ1,n − θ2,n(cid:107)2 < δ, we have (cid:107)H(θ1,n) − H(θ2,n)(cid:107)n < , which implies that H is a continuous map and hence Frn is a compact set for each fixed n. As a simple corollary of Lemma 4.2.1 and Theorem 4.2.1, we can easily obtain the exis- tence of sieve estimator. Corollary 4.2.1. Under the notations above, for each n = 1, 2, . . ., there exists ˆfn : Ω → Frn, A/B(Frn)-measurable such that Qn( ˆfn(ω)) = inff∈Frn Qn(f ). 4.3 Consistency In this section, we are going to show the result on the consistency of the neural network (cid:83) sieve estimator. The consistency result leans heavily on the following Uniform Law of Large Numbers. We start by considering a simple case with Vn ≡ V for all n. In such a case, n Frn is not dense in F but instead in a subset of F containing functions satisfying a certain smoothness condition. Lemma 4.3.1. Let 1, . . . , n be i.i.d. sub-Gaussian random variables with sub-Gaussian parameter σ0. Then if [rn(d + 2) + 1] log[rn(d + 2) + 1] = o(n), we have |Qn(f ) − Qn(f )| p∗−→ 0. 
sup f∈Frn 96 Proof. For any δ > 0, we have (cid:32) (cid:32) (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n sup f∈Frn sup f∈Frn n(cid:88) i=1 P∗ =P∗ ≤P :=(I) + (II). (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 i − σ2 2 (cid:33) n(cid:88) (cid:32) i=1 |Qn(f ) − Qn(f )| > δ (cid:33) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > δ i (f (xi) − f0(xi)) i − σ2 − 2 (cid:33) 2 1 n + P∗ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > δ 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn i(f (xi) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 For (I), by Weak Law of Large Numbers, we know that there exists N1 > 0 such that for all n ≥ N1 we have (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (cid:33) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (I) = P i − σ2 2 δ 2 < δ 2 . Now, we are going to evaluate (II). From the sub-Gaussianity of 1, . . . , n, we know that i(f (xi)− f0(xi)) is also sub-Gaussian with mean 0 and sub-Gaussian parameter σ0|f (xi)− f0(xi)|. Hence, by Hoeffding inequality (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 P i(f (xi) − f0(xi)) = P i(f (xi) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n(cid:88) (cid:40) i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) nδ 4 (cid:41) . ≤ 2 exp − (cid:80)n i=1(f (xi) − f0(xi))2 n2δ2 32σ2 0 (cid:107)f(cid:107)n ≤ V . Hence, based on Corollary 8.3 From Proposition 4.2.2, we know that supf∈Frn in van de Geer (2000)[68], (II) will have an exponential bound if there exists some constant C and δ > 0, σ > 0 satisfying V > δ/σ and (cid:32)(cid:90) V δ/(8σ) √ nδ ≥ 2C H1/2(u,Frn,(cid:107) · (cid:107)n)du ∨ V 97 (cid:33) . (4.3) Now, we are going to show that (4.3) holds in our case. It follows from Theorem 14.5 in Anthony and Barglett (2009)[4] that an upper bound of the covering number for Frn is given by (cid:17)2 N (,Frn,(cid:107) · (cid:107)∞) ≤ 4e[rn(d + 2) + 1] (cid:16) 1 (cid:16) 1 (cid:17) 4 V − 1 rn(d+2)+1 where ˜Arn,d,V =(cid:0)e[rn(d + 2) + 1]V 2/(V − 4)(cid:1)rn(d+2)+1. By letting 4 V  := ˜Arn,d,V −[rn(d+2)+1], Arn,d,V = log ˜Arn,d,V − [rn(d + 2) + 1] = [rn(d + 2) + 1] log e[rn(d + 2) + 1]V 2 V − 4 (cid:18) (cid:19) − 1 = [rn(d + 2) + 1] log [rn(d + 2) + 1]V 2 V − 4 , and note that V 2−eV +4e ≥ 0 for all V , we get log [rn(d+2)+1]V 2 V −4 ≥ log V 2 V −4 ≥ log e(V −4) V −4 = 1 and then H(,Frn,(cid:107) · (cid:107)∞) = log N (,Frn,(cid:107) · (cid:107)∞) 1  = log ˜Arn,d,V + [rn(d + 2) + 1] log ≤ Arn,d,V + [rn(d + 2) + 1] ≤ Arn,d,V (cid:19) (cid:18) 1 + 1  . 1  (since log x ≤ x − 1 for all x > 0) Note that n(cid:88) i=1 (cid:107)f(cid:107)2 n = 1 n f 2(xi) ≤ (cid:18) sup x (cid:19)2 |f (x)| = (cid:107)f(cid:107)2∞, 98 δ/(8σ) (cid:90) V H1/2(,Frn,(cid:107) · (cid:107)n)d ≤ A we have H(,Frn,(cid:107) · (cid:107)n) ≤ H(,Frn,(cid:107) · (cid:107)∞). Then (cid:18) (cid:90) V (cid:34)(cid:90) 1 (cid:18) (cid:20)√ (cid:90) 1 (cid:104) 1/2 rn,d,V 1/2 rn,d,V 1/2 rn,d,V ≤ A = A 0 0 0 (cid:19)1/2 (cid:19)1/2 d 1 + 1  1 + 1  − 1 2 d +  √ d + √ (cid:105) 2 + 2 2(V − 1) 2 √ 2 ≤ A 1/2 rn,d,V √ = 2 2A 1/2 rn,d,V V. (cid:19)1/2 (cid:35) d 1  (cid:90) V 1 (cid:18) (cid:21) 1 + 2(V − 1) √ Clearly 2 o(n), we get for any δ > 0 there exists N2 > 0 such that for all n ≥ N2, rn,d,V V ≥ V and under the assumption that [rn(d + 2) + 1] log[rn(d + 2) + 1] = 2A 1/2 (cid:18) 1 n √ 4 2V (cid:19)1/2 Arn,d,V < δ 4 , i.e. (4.3) holds with C = 1 and n ≥ N2. 
Hence, based on Corollary 8.3 in van de Geer (2000)[68], we get for n ≥ N2, i(f (xi) − f0(xi)) δ 4 ∧ 1 n i ≤ σ2 2 ≤ exp − nδ2 64V 2 . (4.4) (cid:33) (cid:26) (cid:27) i=1 99 P∗ (cid:32) Since(cid:82) V sup f∈Frn n(cid:88) i=1 n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:32) P∗ n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > n(cid:88) i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 0 H1/2(,Frn,(cid:107) · (cid:107)n)d < ∞, we may take σ → ∞ in (4.4) to get i(f (xi) − f0(xi)) ≤ exp (cid:26) − nδ2 64V 2 (cid:27) . Let N3 = 64V 2 δ2 log 2 δ , then for n ≥ max{N2, N3}, we have (cid:32) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (II) = P∗ sup f∈Frn i(f (xi) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:33) δ 4 ≤ δ 2 . Thus we conclude that for any δ > 0, by taking n ≥ max{N1, N2, N3}, we have (cid:32) P∗ sup f∈Frn (cid:33) |Qn(f ) − Qn(f )| > δ < δ, which proves the desired result. Remark 4.3.1. Lemma 4.3.1 shows that if we have a fixed number of features, the desired Uniform Law of Large Numbers holds when the number of hidden units in the neural network sieve does not grow too fast. Now, we are going to extend the result to a more general case. In Lemma 4.3.1, we assumed that the errors 1, . . . , n are i.i.d. sub-Gaussian and Vn ≡ V . lemma, we are going to relax both restrictions. In the following Lemma 4.3.2. Under the assumption that [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1] = o(n), as n → ∞, we have |Qn(f ) − Qn(f )| p∗−→ 0, as n → ∞. sup f∈Frn 100 Proof. As in the proof of Lemma 4.3.1, it suffices to show that i(f (xi) − f0(xi)) → 0, as n → ∞. (4.5) (cid:32) P∗ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn (cid:34) E∗ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn (cid:33) δ 4 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) i(f (xi) − f0(xi)) → 0, as n → ∞. By Markov’s inequality, (4.5) holds if we can show Now, since E[] = 0 and note that for each f ∈ Frn, it has its corresponding parametrization θn. Since θn is in a compact set, we know that there exists a sequence θn,k → θn as k → ∞ with θn,k ∈ Qrn(d+2)+1 ∩ ([−Vn, Vn]rn+1 × [−Mn, Mn]rn(d+1)). Each θn,k corresponds to a function fk ∈ Frn. By continuity, we have fk(x) → f (x) for each x ∈ X . By Example 2.3.4 in van der Vaart and Wellner (1996)[69], we know that Frn is P -measurable and by symmetrization inequality we have (cid:34) E∗ sup f∈Frn (cid:34) i(f (xi) − f0(xi)) n(cid:88) ξii (f (xi) − f0(xi)) (cid:35) , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤2EEξ sup f∈Frn i=1 (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 i=1 n n(cid:88) i=1 1 n 101 where ξ1, . . . , ξn are i.i.d. Rademacher random variables independent of 1, . . . , n. Based on the Strong Law of Large Numbers, there exists N1 > 0, such that for all n ≥ N1, 2 i < σ2 + 1, a.s. For fixed 1, . . . , n, (cid:80)n i=1 ξii(f (xi) − f0(xi)) is a sub-Gaussian process indexed by f ∈ Frn. Suppose that (Ξ,C, µ) is the probability space on which ξ1, . . . , ξn are defined and let Y (f, ω) =(cid:80)n i=1 ξi(ω)i(f (xi) − f0(xi)) with f ∈ Frn and ω ∈ Ξ. As we have shown above, we have fk → f and by continuity, Y (fk, ω) → Y (f, ω) for any ω ∈ Ξ. This shows that {Y (f, ω), f ∈ Frn} is a separable sub-Gaussian process. 
Hence Corollary 2.2.8 in van der Vaart and Wellner (1996)[69] implies that there exists a universal constant K and for any f∗ n ∈ Frn with n ≥ N1, (cid:34) n(cid:88) (cid:35) ξii(f (xi)) − f0(xi)) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) n 1√ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) i=1 i=1 n n n i=1 ≤ Eξ = Eξ ≤ Eξ Eξ sup f∈Frn = Eξ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n(cid:88) i=1 i=1 sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:35) ξii(f (xi) − f0(xi)) (cid:118)(cid:117)(cid:117)(cid:116)log N (cid:16) 1 (cid:17) (cid:90) ∞ 2η,Frn, d (cid:118)(cid:117)(cid:117)(cid:116)log N (cid:16) 1 (cid:17) (cid:90) 2Vn 2 η,Frn, d (cid:118)(cid:117)(cid:117)(cid:117)(cid:116)log N (cid:18) (cid:90) 2Vn + K + K n n 0 0 2 dη dη ξii(f∗ n(xi) − f0(xi)) ξii(f∗ n(xi) − f0(xi)) ξii(f∗ n(xi) − f0(xi)) + K 0 (cid:19) dη, √ η,Frn,(cid:107) · (cid:107)∞ 1 σ2+1 n where the second equality follows by Proposition 4.2.2 and for f, g ∈ Frn, (cid:34) n(cid:88) (cid:18) 1√ (cid:32) n(cid:88) i=1 1 n i=1 d(f, g) = = n i(f (xi) − f0(xi)) − 1√ (cid:33)1/2 n i (f (xi) − g(xi))2 2 (cid:19)2(cid:35)1/2 i(g(xi) − f0(xi)) so that the last inequality follows by noting that (cid:32) (cid:33)1/2 2 i . n(cid:88) i=1 1 n d(f, g) ≤ (cid:107)f − g(cid:107)∞ 102 Now we are going to evaluate these two terms. For the first term, for n ≥ N1, by Cauchy- Schwarz inequality, we have (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 Eξ ξii(f∗ n(xi) − f0(xi)) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ 1 n (cid:32) ≤(cid:112) ≤ 1 n n(cid:88) |i||f∗ n(xi) − f0(xi)| (cid:33)1/2(cid:32) n(cid:88) n(cid:88) (f∗ n(xi) − f0(xi))2 i=1 2 i 1 n |f∗ n(x) − f0(x)|, a.s. i=1 i=1 σ2 + 1 sup x∈X (cid:33)1/2 By choosing f∗ by Hornik et al. (1989)[33], we know that supx∈X |f∗ for any ζ > 0, there exists N2 > 0, such that for all n ≥ N2, n = πrnf0 and from the universal approximation theorem of neural network n(xi) − f0(xi)| → 0 as n → ∞ so that |f∗ n(xi) − f0(xi)| < ζ√ σ2 + 1 . sup x∈X Therefore, by choosing n ≥ N1 ∨ N2, we get (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 Eξ ξii(f∗ n(xi) − f0(xi)) (cid:35) < ζ a.s. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) For the second term, we use the same bound from Theorem 14.5 in Anthony and Bartlett (2009)[4] as we did in the proof of Lemma 4.3.1: (cid:18) N (cid:19) √ 1 σ2 + 1 2 η,Frn,(cid:107) · (cid:107)∞ 8 √ ≤ σ2 + 1e[rn(d + 2) + 1] (cid:16) 1 (cid:17) 4 Vn − 1 := ˜Brn,d,Vnη−[rn(d+2)+1], η (cid:16) 1 4 Vn (cid:17)2 rn(d+2)+1 103 (cid:16) √ 2 where ˜Brn,d,Vn = σ2 + 1e[rn(d + 2) + 1]V 2 n /(Vn − 4) (cid:17)rn(d+2)+1 . By letting ≤ 2[rn(d + 2) + 1] log Vn − 4 √ where by choosing N3 so that rn(d + 2) + 1 ≥ 2 noting that V 2 √ σ2 + 1). 
We also have for log(2 n − Vn + 4 ≥ 0 for all Vn so that log [rn(d+1)+1]V 2 (cid:18) Vn−4 (cid:19) (cid:18) n H √ 2 1 σ2 + 1 η,Frn,(cid:107) · (cid:107)∞ = log N √ 2 1 σ2 + 1 η,Frn,(cid:107) · (cid:107)∞ σ2 + 1, the last inequality follows by √ ≥ log 2 (cid:19) σ2+1(Vn−4) Vn−4 = 1 η Brn,d,Vn = log ˜Brn,d,Vn − [rn(d + 2) + 1] = [rn(d + 2) + 1] log (cid:32) (cid:18) (cid:33) − 1 (cid:19) (cid:112) √ σ2 + 1e[rn(d + 2) + 1]V 2 2 n Vn − 4 [rn(d + 2) + 1]V 2 n Vn − 4 [rn(d + 2) + 1]V 2 n , for all n ≥ N1 ∨ N3, = [rn(d + 2) + 1] log + log(2 σ2 + 1) = log ˜Brn,d,Vn + [rn(d + 2) + 1] log ≤ Brn,d,Vn + [rn(d + 2) + 1] ≤ Brn,d,Vn (cid:18) (cid:19) 1 + 1 η 1 η 104 and hence for all n ≥ N1 ∨ N3, (cid:18) (cid:90) 2Vn H1/2 √ 2 0 (cid:19) dη dη dη + √ 1 σ2 + 1 1 + η,Frn,(cid:107) · (cid:107)∞ (cid:19)1/2 (cid:18) (cid:90) 2Vn (cid:34)(cid:90) 1 (cid:19)1/2 (cid:18) (cid:20)√ (cid:90) 1 1 + 1 η 1 η 0 0 η−1/2dη + ≤ B 1/2 rn,d,Vn = B 1/2 rn,d,Vn ≤ B 1/2 rn,d,Vn √ ≤ 4 2B 2 0 Vn, 1/2 rn,d,Vn (cid:19)1/2 (cid:35) dη 1 η (cid:90) 2Vn (cid:18) 1 1 + (cid:21) 2(2Vn − 1) which implies that (cid:18) (cid:118)(cid:117)(cid:117)(cid:117)(cid:116)H (cid:90) 2Vn 0 √ 1 σ2+1 2 η,Frn,(cid:107) · (cid:107)∞ n (cid:19) √ (cid:114) 2n−1/2B dη ≤ 4 ∼ 8 1/2 rn,d,Vn Vn [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) , n where the last part follows by noting that log V 2 in the Lemma, there exists N4 > 0, such that for all n ≥ N4, we have Vn−4 ∼ log Vn. Under the assumption given n (cid:114) [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) n < ζ 8 . Therefore, by choosing n ≥ N1 ∨ N2 ∨ N3 ∨ N4, we get (cid:104) i.e. Eξ supf∈Frn (cid:12)(cid:12)(cid:12) 1 n (cid:12)(cid:12)(cid:12)(cid:105) → 0 a.s.. Moreover, based on what we Eξ n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) (cid:80)n i=1 ξii(f (xi) − f0(xi)) sup f∈Frn i=1 n ξii(f (xi) − f0(xi)) (cid:35) < 2ζ a.s., (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 105 have shown, we know that for n sufficiently large, (cid:34) Eξ sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 ξii(f (xi) − f0(xi)) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤(cid:112) σ2 + 1(cid:107)πrnf0 − f0(cid:107)∞+ √ 2KB 4 1/2 rn,d,Vn n−1/2Vn → 0, a s. (cid:104)√ σ2 + 1(cid:107)πrnf0−f0(cid:107)∞+ σ2 + 1(cid:107)πrnf0 − f0(cid:107)∞ + 4 n−1/2Vn → 0 < ∞, by Generalized Dominated Convergence Theorem, we n−1/2Vn 1/2 rn,d,Vn 2KB (cid:105) √ √ = and since E √ 1/2 4 rn,d,Vn 2KB know that E∗ (cid:34) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) n n(cid:88) i=1 sup f∈Frn (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 sup f∈Frn ≤ 2EEξ i(f (xi) − f0(xi)) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ξii (f (xi) − f0(xi)) (cid:35) → 0, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) which completes the proof. Based on the above lemmas, we are ready to state the theorem on the consistency of neural network sieve estimators. Theorem 4.3.1. Under the notation given above, if [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1] = o(n), as n → ∞, (4.6) then (cid:107) ˆfn − f0(cid:107)n p−→ 0. 106 Proof. Since Q is continuous at f0 ∈ F and Q(f0) = σ2 < ∞, we note that for any  > 0, inf f :(cid:107)f−f0(cid:107)n≥ Qn(f ) − Qn(f0) = inf f :(cid:107)f−f0(cid:107)n≥ 1 n i=1 n(cid:88) (f (xi) − f0(xi))2 ≥ 2 > 0. Hence, it follows from Lemma 4.2.1, Lemma 4.3.2 and Corollary 2.6 in White and Wooldridge (1991)[76] that (cid:107) ˆfn − f0(cid:107)n p−→ 0. Remark 4.3.2. We discuss the condition (4.6) in Theorem 4.3.1 via some simple examples here. If αj = O(1) for j = 1, . . . 
, rn, then Vn = O(rn) and [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) = O(r3 n log rn). (cid:16) (n/ log n)1/3(cid:17) rn = o Therefore, a possible growth rate for the number of hidden units in a neural network is . On the other hand, if we have a slow growth rate for the number of hidden units in the neural network, such as rn = log Vn, then we have [rn(d + 2) + 1]V 2 n log(Vn[rn(d + 2) + 1]) = O((Vn log Vn)2). (cid:17) (cid:16) Hence, a possible growth rate for the upper bound of the weights from the hidden layer to the output layer is Vn = o n1/2/ log n . 107 4.4 Rate of Convergence To obtain the rate of convergence for neural network sieves, we are going to apply Theorem 3.4.1 in van der Vaart and Wellner (1996)[69]. Lemma 4.4.1. Let f∗ and δ > 8(cid:107)f∗ n − f0(cid:107)n, we have n = πrnf0 ∈ Frn. Then under the notations given above, for every n δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn Qn(f∗ n) − Qn(f ) (cid:46) −δ2. Proof. First, we note that Qn(f∗ n) − Qn(f ) = 1 n = (cid:107)f∗ i=1 n(cid:88) (f∗ n(xi) − f0(xi))2 + σ2 − 1 n n(cid:88) i=1 (f (xi) − f0(xi))2 − σ2 n − f0(cid:107)2 n − (cid:107)f − f0(cid:107)2 n. So to show the result, we need to provide an upper bound for Qn(f∗ (cid:107)f − f∗ n(cid:107)n. Due to the fact that (cid:107) · (cid:107)n is a pseudo-norm, the triangle inequality gives n) − Qn(f ) in terms of (cid:107)f − f∗ n(cid:107)n ≤ (cid:107)f − f0(cid:107)n + (cid:107)f∗ = (cid:107)f − f0(cid:107)n − (cid:107)f∗ n − f0(cid:107)n n − f0(cid:107)n + 2(cid:107)f∗ n − f0(cid:107)n. Therefore, we have (cid:107)f − f0(cid:107)n − (cid:107)f∗ n − f0(cid:107)n ≥ (cid:107)f − f∗ n(cid:107) − 2(cid:107)f∗ n − f0(cid:107)n, 108 so that for every f such that (cid:107)f − f∗ n(cid:107)2 n ≥ 16(cid:107)f∗ n − f0(cid:107)2 n, i.e., (cid:107)f − f∗ n(cid:107)n ≥ 4(cid:107)f∗ n − f0(cid:107)n, (cid:107)f − f0(cid:107)n − (cid:107)f∗ n − f0(cid:107)n ≥ (cid:107)f − f∗ n(cid:107)n − 1 2 (cid:107)f − f∗ n(cid:107)n = (cid:107)f − f∗ n(cid:107)n ≥ 0. 1 2 By squaring both sides, we obtain (cid:107)f − f∗ n(cid:107)2 n ≤ (cid:107)f − f0(cid:107)2 ≤ (cid:107)f − f0(cid:107)2 = (cid:107)f − f0(cid:107)2 n + (cid:107)f∗ n + (cid:107)f∗ n − (cid:107)f∗ n − f0(cid:107)2 n − f0(cid:107)2 n − f0(cid:107)2 n n − 2(cid:107)f − f0(cid:107)n(cid:107)f∗ n − 2(cid:107)f∗ n − f0(cid:107)2 n n − f0(cid:107)n 1 4 and hence δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn Qn(f∗ n) − Qn(f ) ≤ (cid:107)f−f∗ sup n(cid:107)n> δ 2 ,f∈Frn sup n(cid:107)n> δ 2 ,f∈Frn ≤ (cid:107)f−f∗ (cid:46) −δ2. (cid:107)f∗ (cid:18) n − f0(cid:107)2 n − (cid:107)f − f0(cid:107)2 n (cid:19) (cid:107)f − f∗ n(cid:107)2 n −1 4 Lemma 4.4.2. For every sufficiently large n and δ > 8(cid:107)f∗  E∗ √ δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn n [(Qn − Qn)(f∗ n) − (Qn − Qn)(f )]+ n − f0(cid:107)n, we have (cid:113) log N (η,Frn,(cid:107) · (cid:107)∞)dη  (cid:46) (cid:90) δ 0 109 Proof. Note that (Qn − Qn)(f∗ n) = (Qn − Qn)(f∗ n) = n(cid:88) n(cid:88) i=1 i=1 1 n 1 n i − σ2 − 2 2 n i − σ2 − 2 2 n n(cid:88) n(cid:88) i=1 i=1 i(f∗ n(xi) − f0(xi)) i(f (xi) − f0(xi)), we have (Qn − Qn)(f∗ n) − (Qn − Qn)(f ) = n(cid:88) i=1 2 n i(f (xi) − f∗ n(xi)). 
Hence by following similar arguments as in the proof of Lemma 4.3.2 as well as applying Corollary 2.2.8 in van der Vaart and Wellner (1996)[69], we have  n) − (Qn − Qn)(f )]+ E∗    (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n(xi)) √ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n [(Qn − Qn)(f∗ n(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ i(f (xi) − f∗ n(cid:88) i=1 n n i=1 ξii (f (xi) − f∗ n(xi))  (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ 2 <(cid:107)f−f∗ sup n(cid:107)n≤δ,f∈Frn ≤E∗ δ  sup n(cid:107)n≤δ,f∈Frn 2 <(cid:107)f−f∗ ≤2EEξ sup (cid:90) δ (cid:113) 2 <(cid:107)f−f∗ n(cid:107)n≤δ,f∈Frn log N (η,Frn,(cid:107) · (cid:107)∞)dη, (cid:46) δ 0 where the last inequality follows since f∗ n ∈ Frn for n large enough. Now we are ready to apply Theorem 3.4.1 in van der Vaart and Wellner (1996)[69] to obtain the rate of convergence for neural network sieve estimators. Theorem 4.4.1. Under the above notations, if ηn = O(cid:16) n, rn(d + 2) log(rnVn(d + 2))/n, rn(d + 2) log n/n}(cid:17) , min{(cid:107)πrnf0 − f0(cid:107)2 110 then (cid:107) ˆfn−f0(cid:107)n = Op (cid:32) max (cid:40) (cid:107)πrnf0 − f0(cid:107)n, (cid:114) rn(d + 2) log[rnVn(d + 2)] n (cid:114) , rn(d + 2) log n n (cid:41)(cid:33) . Proof. Use the same bound from Theorem 14.5 in Anthony and Bartlett (2009)[4] as we did before, we have log N (η,Frn,(cid:107) · (cid:107)n) ≤ log N (η,Frn,(cid:107) · (cid:107)∞) 4e[rn(d + 2) + 1] (cid:16) 1 (cid:16) 1 (cid:17) 4 Vn − 1 η 4Vn (cid:17)2 rn(d+2)+1 ≤ log = [rn(d + 2) + 1] log ˜Crn,d,Vn η , > e. Then it follows from Lemma 3.8 in Mendelson where ˜Crn,d,Vn = e[rn(d+2)+1]V 2 (2003)[48] that for δ < 1, Vn−4 n (cid:90) δ (cid:113) 0 log N (η,Frn,(cid:107) · (cid:107)n)dη ≤ [rn(d + 2) + 1]1/2 (cid:46) [rn(d + 2) + 1]1/2δ log (cid:115) (cid:90) δ (cid:115) 0 ˜Crn,d,Vn dη log η ˜Crn,d,Vn δ := φn(δ). 111 (cid:114) log ˜Crn,d,Vn δ , then since for 0 < δ < 1 log log ˜Crn,d,Vn δ ˜Crn,d,Vn δ − 1 2 − 1 2 log−1/2 ˜Crn,d,Vn δ δ2 ˜Crn,d,Vn ˜Crn,d,Vn δ2 log−1/2 ˜Crn,d,Vn δ  Define h : δ (cid:55)→ φn(δ)/δα = [rn(d + 2) + 1]1/2δ1−α (cid:115) (cid:115) (1 − α)δ−α (1 − α)δ−α h(cid:48)(δ) = [rn(d + 2) + 1]1/2 = [rn(d + 2) + 1]1/2 < 0  for 1 < α < 2, we can know that δ (cid:55)→ φn(δ)/δα is decreasing on (0,∞). Let ρn (cid:46) (cid:107)πrnf0 − f0(cid:107)−1 n , then note that (cid:18) 1 (cid:19) ρn ρ2 nφn = ρn[rn(d + 2) + 1]1/2 log1/2(cid:16) = [rn(d + 2) + 1]1/2ρn log ρn + log ˜Crn,d,Vn (cid:17) ρn ˜Crn,d,Vn (cid:113) and we have log ˜Crn,d,Vn = 1 + log [rn(d + 2) + 1]V 2 n Vn − 4 (cid:46) log [rn(d + 2) + 1]V 2 n Vn − 4 ∼ log[rnVn(d + 2)], (cid:18) 1 (cid:19) ρn (cid:46) √ ρ2 nφn n ⇔ rn(d + 2)ρ2 n (log ρn + log[rnVn(d + 2)]) (cid:46) n. 112 This shows that for n rn(d + 2) log[rnVn(d + 2)] (cid:19)1/2 (cid:18) , n rn(d + 2) log n (cid:19)1/2(cid:41) , we have ρ2 nφn n. Based on these observation, Lemma 4.4.1, Lemma 4.4.2 and Theorem 3.4.1 in van der Vaart and Wellner (1996)[69] imply that ρn (cid:46) min (cid:16) 1 ρn (cid:40)(cid:18) (cid:17) (cid:46) √ (cid:32) max (cid:107) ˆfn−πrnf0(cid:107)n = Op rn(d + 2) log[rnVn(d + 2)] n (cid:40) (cid:107)πrnf0 − f0(cid:107)n, (cid:114) (cid:114) , rn(d + 2) log n n (cid:41)(cid:33) . By triangle inequality, we can further get (cid:114) (cid:107) ˆfn − f0(cid:107)n ≤ (cid:107) ˆfn − πrnf0(cid:107)n + (cid:107)πrnf0 − f0(cid:107)n (cid:40) (cid:107)πrnf0 − f0(cid:107)n, = Op (cid:32) max rn(d + 2) log[rnVn(d + 2)] n (cid:114) , rn(d + 2) log n n (cid:41)(cid:33) . Remark 4.4.1. 
Recall that a sufficient condition to ensure consistency is rn(d+2)V 2 2)] = o(n). In this case, rn(d + 2) log[rnVn(d + 2)] ≤ n and the rate of convergence can be simplified to n log[rnVn(d+ (cid:32) (cid:40) (cid:107)πrnf0 − f0(cid:107)n, (cid:114) (cid:107) ˆfn − f0(cid:107)n = Op max (cid:41)(cid:33) . rn(d + 2) log n n If we assume f0 ∈ F where F is the space of functions, which have finite first absolute 113 moments of the Fourier magnitude distributions, i.e., (cid:26) (cid:90) (cid:110) (cid:111) F = f : Rd → R : f (x) = (cid:90) exp iaT x dµf (a), max((cid:107)a(cid:107)1, 1)d|µf|(a) ≤ C (cid:27) , (4.7) (cid:107)µf(cid:107)1 := sup(cid:80)∞ and (cid:107)a(cid:107)1 =(cid:80)d n=1 |µ(An)| and the supremum is taken over all measurable partitions {An}∞ where µf is a complex measure on Rd; |µf| denotes the total variation of µf , i.e., |µ|(A) = n=1 of A; i=1 |ai| for a = [a1, . . . , ad]T ∈ Rd. Theorem 3 in Markovoz (1996)[45] shows n and Vn ≡ V in that δn := (cid:107)f0 − πrnf0(cid:107)n (cid:46) r the proof of Theorem 4.4.1, δn must also satisfy the following inequality: . So if we let d fixed and ρn = δ−1 −1/2−1/(2d) n (cid:18) 1 (cid:19) ρn ρ2 nφ (cid:46) ρnr n log1/2(cid:16) 1/2 (cid:17) (cid:46) √ n ρn ˜Crn,d,Vn nrn log rn (cid:46) n ⇒ ρ2 nrn log ρn + ρ2 ⇒ δ−2 n (−rn log δn + rn log rn) (cid:46) n 1+ 1 ⇒ r d n rn log rn (cid:46) n One possible choice of rn to satisfy such condition is rn (cid:16) (n/ log n) obtain d 2+d . In such a case, we (cid:107) ˆfn − f0(cid:107)n = Op (cid:18) n log n 4(1+1/(2d)) (cid:19)− 1+1/d  , (n/ log n)−1/3(cid:17) (cid:16) which is the same rate obtained in Chen and Shen (1998)[12]. It is interesting to note that in the case where d = 1, we have (cid:107) ˆfn− f0(cid:107)n = Op . Such rate is close to the Op(n−1/3), which is the convergence rate in non-parametric least square problems when the class of functions considered has bounded variation in R (see Example 9.3.3 in van de Geer 114 (2000)[68]) and as shown in Proposition C.0.3 in the Appendix C, Frn is a class of functions with bounded variation in R so that the convergence rate we obtained makes sense. 4.5 Asymptotic Normality Through out this section, we will assume that E[||2+λ] < ∞ for some λ > 0. To establish the asymptotic normality of sieve estimator based on neural network, we follow the idea in Shen (1997)[64]. We start by calculating the Gâteaux derivative of the empirical criterion function Qn(f ) = n−1(cid:80)n i=1(yi − f (xi))2. (cid:34) n(cid:88) (yi − f0(xi) − t(f (xi) − f0(xi)))2 − 1 (cid:104) n(cid:88) n [f − f0] = lim t→0 Q(cid:48) n,f0 1 n 1 t (cid:35) (yi − f0(xi))2 n(cid:88) i=1 (yi − f0(xi))2 − 2t(yi − f0(xi))(f (xi) − f0(xi)) +t2(f (xi) − f0(xi))2 − (yi − f0(xi))2(cid:105) i=1 1 t = lim t→0 1 n i=1 n(cid:88) i=1 = − 2 n i(f (xi) − f0(xi)). Then the remainder of first-order functional Taylor series expansion is Rn[f − f0] = Qn(f ) − Qn(f0) − Q(cid:48) n,f0 (yi − f (xi))2 − 1 n 1 n = [f − f0] n(cid:88) i=1 n(cid:88) n(cid:88) i=1 2 n i=1 n(cid:88) i=1 (yi − f (xi))2 + i(f (xi) − f0(xi)) 2 i + 2 n i(f (xi) − f0(xi)) i=1 n(cid:88) n(cid:88) n(cid:88) i=1 i=1 = = 1 n 1 n (i + f0(xi) − f (xi))2 − 1 n (f (xi) − f0(xi))2 = (cid:107)f − f0(cid:107)2 n. 115 empirical process {n−1/2(cid:80)n As will be seen in the proof of asymptotic normality, the rate of convergence for the i=1 i(f (xi) − f0(xi)) : f ∈ Frn} plays an important role. Here we establish a lemma, which will be used to find the desired rate of convergence. Lemma 4.5.1. Let X1, . . . 
, Xn be independent random variables with Xi ∼ Pi. Define the empirical process {νn(f )} as n(cid:88) νn(f ) = 1√ n [f (Xi) − Pif ]. Let Fn = {f : (cid:107)f(cid:107)∞ ≤ Vn}. Let  > 0 and α ≥ supf∈Fn n−1(cid:80)n i=1 Define t0 by H(t0,Fn,(cid:107)·(cid:107)∞) =  4 ψ(M, n, α), where ψ(M, n, α) = M 2/ 2α 1 + M Vn nα √ 2 . If i=1 Var[f (Xi)] be arbitrary. (cid:104) (cid:16) (cid:17)(cid:105) H(u,Fn,(cid:107) · (cid:107)∞) ≤ Anu−r, (4.8) for some 0 < r < 2 and u ∈ (0, a], where a is a small positive number, and there exists a positive constant Ki = Ki(r, ), i = 1, 2 such that M ≥ K1A r+2 n 2 2−r r+2 n V n r−2 2(r+2) ∨ K2A 1/2 n α 2−r 4 . Then (cid:32) P∗ (cid:33) |νn(f )| > M sup f∈Fn ≤ 5 exp{−(1 − )ψ(M, n, α)} . Proof. The proof of the lemma is similar to the proof of Corollary 2.2 in Alexander (1984) [2] and the proof of Lemma 1 in Shen and Wong (1994)[65]. Since H(u,Fn,(cid:107)·(cid:107)∞) ≤ Anu−r 116 for some 0 < r < 2, then we have (cid:90) t s Based on our assumption, I(s, t) := H1/2(u,Fn,(cid:107) · (cid:107)∞)du ≤ 2(2 − r)−1A 1 2 n t1− r 2 . Ant−r 0 ≥ H(t0,Fn,(cid:107) · (cid:107)∞) = ψ(M, n, α) ⇒ t0 ≤  4 (cid:20)4An (cid:21)1/r ψ . √ Note that if M ≤ 3 √ nα/Vn, then ψ(M, n, α) ≥ M 2/(4α) and if M ≥ 3 nα/Vn, then √ 2( √ nα + M Vn/3) ≤ 4M Vn/3 and hence ψ(M, n, α) ≥ 3 nM/(4Vn). In summary,  M 2/(4α) √ 3 nM/(4Vn) √ if M < 3 nα/Vn, √ if M ≥ 3 nα/Vn . ψ(M, n, α) ≥ √ Therefore, if M ≥ 3 nα/Vn, (cid:19) (cid:18) M √ 64 , t0 n 28−3/2I ≤ 29−3/2(2 − r)−1A 1− r 1/2 2 n t 0 (cid:18)4 (cid:19) 1 r− 1 2  4− 1 1 2r M 1 2− 1 r , (cid:18) 3 4Vn √ nM 1/r n A (cid:19) 1 2− 1 r ≤ 29−3/2(2 − r)−1 = ˜K1A 1/r n V r− 1 2 1 n n 117 where ˜K1 = 29−3/2(2 − r)−1(cid:16) 4 (cid:18) M  (cid:17) 1 (cid:19) 28−3/2I √ 64 n , t0 2(cid:16) 3 (cid:17) 1 4 2− 1 r− 1 r . Hence < M ⇔ ˜K1A r− 1 2 1 1/r n V n 1 ⇔ ˜K1A 1/r n V n ⇔ M > K1A r− 1 2 2 r+2 n r < M 1 1 2r M 2− 1 4− 1 r−2 4r < M 2−r r−2 r+2 2(r+2) , n 1 r + 1 2 n n n V where K1 = ˜K 2r r+2 1 √ . On the other hand, if M < 3 nα/Vn, (cid:19) (cid:18) M √ 64 , t0 n 28−3/2I ≤ 29−3/2(2 − r)−1A 1− r 1/2 2 n t 0 (cid:18)4 (cid:19) 1 r− 1 2 (cid:19) 1 2− 1 r (cid:18) M 2 4α 1/r n A ≤ 29−3/2(2 − r)−1 = ˜K2A (cid:17) 1 r− 1 1/r n M 1− 2 (cid:17) 1 2(cid:16) 1 (cid:19) 2− 1 4  r− 1 2 , 1 r α r . Hence where ˜K2 = 29−3/2(2 − r)−1(cid:16) 4 (cid:18) M  28−3/2I √ 64 n , t0 < M ⇔ ˜K2A 2 < M 1 r− 1 1/r r α n M 1− 2 2−r 2r < M 2−r 4 , 1/2 n α 2 r ⇔ ˜K2A 1/r n α ⇔ M > K2A where K2 = ˜K 28−3/2I (cid:16) M √ 64 n r/2 2 . (cid:17) , t0 In conclusion, if M ≥ K1A < M and by Theorem 2.1 in Alexander (1984) [2], we get the desired 1/2 n α r+2 n n V 2−r r+2 n r−2 2(r+2) ∨ K2A 2−r 4 , then 2 result. As a Corollary to Lemma 4.5.1, we can get the rate of convergence for the the empirical 118 process {n−1/2(cid:80)n i=1 i(f (xi) − f0(xi)) : f ∈ Frn}. Corollary 4.5.1. Let ρn be such that ρn(cid:107) ˆfn − f0(cid:107)n = Op(1) and Frn be the class of neural network sieves as defined in (4.2). Suppose that E[||2+λ] < ∞ for some λ > 0. Then under the conditions (C1) rn(d + 2)Vn log[rnVn(d + 2)] = o(n1/4); (C2) nρ−2 n /V λ n = o(1), we have   P∗ ≤P∗ sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn n ,f∈Frn sup (cid:107)f−f0(cid:107)n≤ρ−1 + P∗  sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn Proof. To establish the desired result, we apply the truncation device. 
n(cid:88) i=1 n i(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M i(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n(cid:88) n(cid:88) i=1 i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = op(1).   (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M iI{|i|≤Vn}(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n(cid:88) i=1 iI{|i|>Vn}(f − f0)(xi)  (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn :=(I) + (II) For (I), we can apply Lemma 4.5.1 directly. Note that |I{||≤Vn}(f − f0)(x)| ≤ Vn(Vn + 119 (cid:17)2 4 Vn η 4e[rn(d + 2) + 1] (cid:16) 1 (cid:16) 1 (cid:17) (cid:18) 4 Vn − 1 (cid:18) (cid:19) (cid:18) 1 η = [rn(d + 2) + 1] log ˜Crn,d,Vn + log ≤ [rn(d + 2) + 1] log ˜Crn,d,Vn + rn(d+2)+1 (cid:19) (cid:19) 1 η − 1 1 η (cid:107)f0(cid:107)∞) (cid:46) V 2 n since (cid:107)f0(cid:107)∞ < ∞ and for 0 < η < 1, log N (η,Frn,(cid:107) · (cid:107)∞) ≤ log = Crn,d,Vn ≤ 2Crn,d,Vn 1 + 1 η , and where ˜Crn,d,Vn = e[rn(d+2)+1]V 2 Vn−4 n Crn,d,Vn = [rn(d + 2) + 1] log ˜Crn,d,Vn − [rn(d + 2) + 1] [rn(d + 2) + 1]V 2 n = [rn(d + 2) + 1] log ∼ rn(d + 2) log[rnVn(d + 2)]. Vn − 4 Therefore, equation (4.8) is satisfied with r = 1 and An = 2Crn,d,Vn and it follows from Lemma 4.5.1 that for M (cid:38) C α1/4, we have (I) ≤ 5 exp{−(1 − )ψ(M, n, α)} n n−1/6∨C V 2/3 2/3 rn,d,Vn 1/2 rn,d,Vn  . 2/3 n V 2/3 rn,d,Vn n1/6 and hence (cid:107)f−f0(cid:107)≤ρ−1 sup n ,f∈Frn (cid:12)(cid:12)(cid:12)(cid:12) 1√ n iI{|i|≤Vn}(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12) = Op C 120 Then since by (C1), C 2/3 n V 2/3 rn,d,Vn n1/6 (cid:18)rn(d + 2)Vn log[rnVn(d + 2)] (cid:19)2/3 n1/4 ∼ = op(1). For (II), by Cauchy-Schwarz inequality, we have iI{|i|>Vn}(f − f0)(xi) n(cid:88) i=1 1 n 2 i I{|i|>Vn} (cid:33)1/2 (cid:107)f − f0(cid:107)n. Then it follows from Markov inequality that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:38) M n−1/2  iI{|i|>Vn}(f − f0)(xi) sup n ,f∈Frn I{|i|>Vn} (cid:107)f−f0(cid:107)n≤ρ−1 n(cid:88) n(cid:88) 1 n 2 i i=1 I{|i|>Vn} (cid:38) M 2n−1ρ2 n 2 i ρ−1 n (cid:38) M n−1/2 (cid:33)  It then follows from condition (C2) that (II) → 0 and hence sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn iI{|i|>Vn}(f − f0)(xi) Combining the results we obtained above, we get sup (cid:107)f−f0(cid:107)n≤ρ−1 n ,f∈Frn i(f − f0)(xi) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = op(1). (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = op(1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (II) = P∗  (cid:32) (cid:32) 1 n ≤ P = P i=1 (cid:46) M−2nρ−2 (cid:46) M−2nρ−2 n n E[2I||>Vn] E[||2+λ] . V λ n (cid:32) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n(cid:88) (cid:33)1/2 i=1 n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 n n(cid:88) i=1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ n n(cid:88) i=1 121 Remark 4.5.1. Condition (C2) can be further simplified using the results we obtained in Theorem 4.4.1. If ηn = O(cid:0)min{(cid:107)πrnf0 − f0(cid:107)2 n, rn(d + 2) log(rnVn(d + 2))/n, rn(d + 2) log n/n}(cid:1), (cid:110)(cid:107)πrnf0 − f0(cid:107)n,(cid:112)rn(d + 2) log[rnVn(d + 2)]/n,(cid:112)rn(d + 2) log n/n (cid:110)(cid:107)πrnf0 − f0(cid:107)n,(cid:112)rn(d + 2) log n/n (cid:111) (cid:111) n (cid:16)(cid:112)rn(d + 2) log n/n. This holds for functions having n (cid:16) max n (cid:16) max then ρ−1 follows from condition (C1), that ρ−1 simplicity, here we assume that ρ−1 . For . 
It finite first absolute moments of the Fourier magnitude distributions as we have discussed at the end of section 4.4. Then in this case, nρ−2 n /V λ n (cid:16) rn(d + 2) log n/V λ n , so that condition (C2) becomes rn(d + 2) log n/V λ n → 0. Now we are going to establish the asymptotic normality for neural network estimators. For f ∈ {f ∈ Frn : (cid:107)f − f0(cid:107)n ≤ ρ−1 n }, we consider a local alternative ˜fn(f ) = (1 − δn)f + δn(f0 + ι), (4.9) where 0 ≤ δn = η 1/2 n = o(n−1/2) is chosen such that ρnδn = o(1) and ι(x) ≡ 1. Theorem 4.5.1 (Asymptotic Normality). Suppose that 0 ≤ ηn = o(n−1) and conditions (C1) and (C2) in Corollary 4.5.1 hold. Assume further that the following two conditions hold (C3) sup f∈Frn :(cid:107)f−f0(cid:107)n≤ρ−1 n (C4) sup f∈Frn :(cid:107)f−f0(cid:107)n≤ρ−1 n (cid:16) ˜fn(f ) − ˜fn(f )(cid:107)n = Op(ρnδ2 n); (cid:107)πrn (cid:80)n 1 n i=1 i (cid:17) = Op(δ2 n), ˜fn(f )(xi) − ˜fn(f )(xi) πrn 122 then (cid:104) ˆfn(xi) − f0(xi) (cid:105) d−→ N (0, σ2). n(cid:88) i=1 1√ n Before we proceed to the proof of the theorem, let us focus on the conditions given in the theorem. For condition (C1), note that if (C1) holds, then rn(d + 2)V 2 n log[rnVn(d + 2)] ≤ [rn(d + 2)]4V 4 n (log[rnVn(d + 2)])4 = o(n), mator. As in Remark 4.3.2, we consider some simple scenarios here. so it is a sufficient condition to ensure the consistency of the neural network sieve esti- If Vn = O(rn), (cid:1) so that a possible growth rate for rn is then rn(d + 2)Vn log[rnVn(d + 2)] = O(cid:0)r2 O(cid:0)Vn(log Vn)2(cid:1) and a possible growth rate for Vn is Vn = o(n1/4/(log n)2). Thus, we can see . On the other hand, if rn = log Vn, then rn(d+2)Vn log[rnVn(d+2)] = n1/8/(log n)2(cid:17) n log rn rn = o (cid:16) that in both cases, the growth rate required for the asymptotic normality of neural network sieve estimator is slower than the growth rate required for the consistency as given in Remark 4.3.2. One explanation for this is that due to the Universal Approximation Theorem, a neural network with one hidden layer can approximate a continuous function on compact support arbitrarily well if the number of hidden units is sufficiently large. Therefore, if the number of hidden units is too large, the neural network sieve estimator ˆfn may be very close to the best projector of the true function f0 in Frn so that the error(cid:80)n number of hidden units can increase the variations in the quantity(cid:80)n (cid:105) (cid:104) ˆfn(xi) − f0(xi) (cid:104) ˆfn(xi) − f0(xi) to zero and the variations in this quantity may be small. By allowing slower growth rate of the may be close (cid:105) i=1 , i=1 which makes the asymptotic normality more reasonable. On the other hand, condition (C3) and condition (C4) are similar conditions as in Shen (1997)[64], which are known for condi- tions on approximation error and these conditions dictate that the approximation rate of a 123 single layer neural network cannot be too slow, otherwise it may require a huge number of samples to reach the desired approximation error. Therefore, the conditions in the theorem can be considered as a trade-off between bias and variance. Proof of Theorem 4.5.1. The main idea of the proof is to use the functional Taylor series expansion for Qn(f ) and carefully bound each term in the expansion. For any f ∈ {f ∈ Frn : (cid:107)f − f0(cid:107)n ≤ ρ−1 n }, Qn(f ) = Qn(f0) + Q(cid:48) i − 2 2 n n(cid:88) 1 n = n,f0 [f − f0] + Rn[f − f0] n(cid:88) i(f (xi) − f0(xi)) + i=1 i=1 n(cid:88) i=1 1 n (f (xi) − f0(xi))2. 
(4.10) Note that (cid:107) ˜fn(f ) − f0(cid:107)n = (cid:107)(1 − δn) ˆfn + δn(f0 + ι) − f0(cid:107)n = (cid:107)(1 − δn)( ˆfn − f0) + δnι(cid:107)n ≤ (1 − δn)(cid:107) ˆfn − f0(cid:107)n + δn, and since δn = o(n−1/2), we can know that with probability tending to 1, (cid:107) ˜fn(f ) − f0(cid:107)n ≤ ρ−1 n . Then replacing f in (4.10) by ˆfn and πrn ˜fn(f ) as defined in (4.9), we get n(cid:88) n(cid:88) i=1 i=1 124 Qn( ˆfn) = Qn(πrn ˜fn(f )) = 1 n 1 n n(cid:88) n(cid:88) i=1 i=1 i − 2 2 n i − 2 2 n i( ˆfn(xi) − f0(xi)) + (cid:107) ˆfn − f0(cid:107)2 n i(πrn ˜fn(f )(xi) − f0(xi)) + (cid:107)πrn ˜fn(f ) − f0(cid:107)2 n. ˜fn(f )(xi) − ˆfn(xi) πrn i (cid:17) + (cid:107) ˆfn − f0(cid:107)2 n − (cid:107)πrn ˜fn(f ) − f0(cid:107)2 n. (cid:16) n(cid:88) i=1 Subtracting these two equations yields Qn( ˆfn) = Qn(πrn ˜fn(f )) + 2 n Now note that (cid:107)πrn ˜fn(f ) − f0(cid:107)2 n = (cid:107)πrn = (cid:107)πrn (cid:68) = (cid:107)πrn = πrn n ˜fn(f ) − ˜fn(f ) + ˜fn(f ) − f0(cid:107)2 ˜fn(f ) − ˜fn(f ) + (1 − δn) ˆfn + δn(f0 + ι) − f0(cid:107)2 ˜fn(f ) − ˜fn(f ) + (1 − δn)( ˆfn − f0) + δnι(cid:107)2 ˜fn(f ) − ˜fn(f ) + (1 − δn)( ˆfn − f0) + δnι, n n πrn ˜fn(f ) − ˜fn(f )(cid:107)2 = (cid:107)πrn ≤ (1 − δn)2(cid:107) ˆfn − f0(cid:107)2 (cid:68) n + δ2 n (cid:69) ˜fn(f ) − ˜fn(f ) + (1 − δn)( ˆfn − f0) + δnι (cid:69) n + (1 − δn)2(cid:107) ˆfn − f0(cid:107)2 (cid:69) (cid:68) (cid:68) ˆfn − f0, ι (cid:69) (cid:69) (cid:68) ˆfn − f0, δnι ˜fn(f ) − ˜fn(f ), ˆfn − f0 ˜fn(f ) − ˜fn(f ), ι + δ2 n πrn πrn + 2δn + 2(1 − δn)δn + 2(1 − δn) n + 2(1 − δn) + 2(1 − δn)(cid:107)πrn + 2δn(cid:107)πrn ˜fn(f ) − ˜fn(f )(cid:107)n(cid:107) ˆfn − f0(cid:107)n ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn ˜fn(f ) − ˜fn(f )(cid:107)2 n, 125 where the last inequality follows from the Cauchy-Schwarz inequality and since (cid:16) n(cid:88) i=1 2 n ˜fn(f )(xi) − ˆfn(xi) πrn i (cid:17) = = 2 n 2 n n(cid:88) n(cid:88) i=1 i i i=1 − 2 n δn πrn (cid:16) (cid:16) n(cid:88) πrn i=1 (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) + ˜fn(f )(xi) − ˆfn(xi) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) (cid:17) − 2 (cid:16) ˆfn(xi) − f0(xi) i n(cid:88) i=1 i, δn n we have by the definition of ˆfn, −Op(δ2 Qn(f ) − Qn( ˆfn) n) ≤ inf f∈Frn ≤ Qn(πrn = (cid:107)πrn ˜fn(f )) − Qn( ˆfn) ˜fn(f ) − f0(cid:107)2 n − (cid:107) ˆfn − f0(cid:107)2 n(cid:88) (cid:16) ≤ (1 − δn)2(cid:107) ˆfn − f0(cid:107)2 (cid:17) i πrn n − 2 n n + 2(1 − δn)δn i=1 ˜fn(f )(xi) − ˆfn(xi) (cid:68) ˆfn − f0, ι (cid:69) ˜fn(f ) − ˜fn(f )(cid:107)n ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn (cid:16) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) ˜fn(f ) − ˜fn(f )(cid:107)2 n n(cid:88) i=1 + 2 n δn (cid:16) ˆfn(xi) − f0(xi) (cid:17) i n − (cid:107) ˆfn − f0(cid:107)2 + 2(1 − δn)(cid:107) ˆfn − f0(cid:107)n(cid:107)πrn + 2δn(cid:107)πrn n(cid:88) − 2 n(cid:88) n πrn i=1 i + Op(δ2 n) δn i + 2 n i=1 126 = (−2δn + δ2 n + 2(1 − δn)δn (cid:68) ˆfn − f0, ι (cid:69) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) n)(cid:107) ˆfn − f0(cid:107)2 + 2δn(cid:107)πrn ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn (cid:16) n(cid:88) − 2 n(cid:88) n πrn i=1 i + ˜fn(f ) − ˜fn(f )(cid:107)2 + 2(1 − δn)(cid:107) ˆfn − f0(cid:107)n(cid:107)πrn (cid:17) n(cid:88) (cid:16) ˆfn(xi) − f0(xi) δn i n 2 n i=1 ˜fn(f ) − ˜fn(f )(cid:107)n ≤ δ2 + δn i=1 i + Op(δ2 n) 2 n n + 2(1 − δn)δn n(cid:107) ˆfn − f0(cid:107)2 + 2δn(cid:107)πrn n(cid:88) − 2 n(cid:88) n (cid:68) ˆfn − f0, ι (cid:69) ˜fn(f ) − ˜fn(f )(cid:107)n + (cid:107)πrn (cid:16) (cid:17) ˜fn(f )(xi) − ˜fn(f )(xi) πrn i=1 i i + Op(δ2 n), + δn 2 n i=1 + 2(1 − δn)(cid:107) ˆfn − 
where the last inequality follows by noting that $(1-\delta_n)^2 - 1 = -2\delta_n + \delta_n^2 \le \delta_n^2$. From condition (C1), we can get
\[
[r_n(d+2)+1]V_n^2\log[r_nV_n(d+2)+1] \le \bigl([r_n(d+2)+1]V_n\log[r_nV_n(d+2)+1]\bigr)^4 = o(n),
\]
and combining this with Theorem 4.3.1, we obtain that $\|\hat{f}_n - f_0\|_n = o_p(1)$, and hence $\delta_n^2\|\hat{f}_n - f_0\|_n^2 = o_p(\delta_n^2)$. From condition (C3), we have
\[
2(1-\delta_n)\|\hat{f}_n - f_0\|_n\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n
 \le 2\|\hat{f}_n - f_0\|_n\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n
 = O_p\bigl(\rho_n^{-1}\cdot\rho_n\delta_n^2\bigr) = O_p(\delta_n^2).
\]
Similarly, since $\rho_n\delta_n = o(1)$, we have
\[
2\delta_n\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n = O_p(\delta_n\cdot\rho_n\delta_n^2) = o_p(\delta_n^2),
\qquad
\|\pi_{r_n}\tilde{f}_n(f) - \tilde{f}_n(f)\|_n^2 = O_p(\rho_n^2\delta_n^4) = o_p(\delta_n^2).
\]
Based on condition (C4), we also know that
\[
\frac{2}{n}\sum_{i=1}^n \epsilon_i\bigl(\pi_{r_n}\tilde{f}_n(f)(x_i) - \tilde{f}_n(f)(x_i)\bigr) = O_p(\delta_n^2),
\]
and from Corollary 4.5.1, we also have
\[
\frac{2\delta_n}{n}\sum_{i=1}^n \epsilon_i\bigl(\hat{f}_n(x_i) - f_0(x_i)\bigr) = o_p(\delta_n\cdot n^{-1/2}).
\]
It follows from these observations that
\[
-2(1-\delta_n)\bigl\langle \hat{f}_n - f_0,\, \delta_n\iota\bigr\rangle + \frac{2\delta_n}{n}\sum_{i=1}^n\epsilon_i
 \le O_p(\delta_n^2) + o_p(\delta_n^2) + o_p(\delta_n\cdot n^{-1/2}),
\]
which implies that
\[
-(1-\delta_n)\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle + \frac{1}{n}\sum_{i=1}^n\epsilon_i
 \le O_p(\delta_n) + o_p(n^{-1/2}) = o_p(n^{-1/2}).
\]
By replacing $\iota$ with $-\iota$, we can obtain the same result, and hence
\[
\Biggl|\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle - \frac{1}{n}\sum_{i=1}^n\epsilon_i\Biggr|
 \le \Biggl|(1-\delta_n)\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle - \frac{1}{n}\sum_{i=1}^n\epsilon_i\Biggr|
  + \delta_n\bigl|\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle\bigr|
 \le o_p(n^{-1/2}) + \delta_n\|\hat{f}_n - f_0\|_n = o_p(n^{-1/2}).
\]
Therefore,
\[
\bigl\langle \hat{f}_n - f_0, \iota\bigr\rangle = \frac{1}{n}\sum_{i=1}^n\epsilon_i + o_p(n^{-1/2}),
\]
and the desired result follows from the classical Central Limit Theorem. $\Box$

Theorem 4.5.1 can be used directly for hypothesis testing based on the model of a neural network with one hidden layer if we know the variance $\sigma^2$ of the random error. In practice, this is rarely the case. To perform hypothesis testing when $\sigma^2$ is unknown, it is natural to use a good estimator of $\sigma^2$ and a "plug-in" test statistic. A natural estimator for $\sigma^2$ is
\[
\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n\bigl(y_i - \hat{f}_n(x_i)\bigr)^2 = Q_n(\hat{f}_n).
\]
So we also need to establish the asymptotic normality of the statistic $\frac{1}{\hat{\sigma}_n\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr]$.

Theorem 4.5.2 (Asymptotic Normality for the Plug-in Statistic). Suppose that $f_0 \in C(\mathcal{X})$, where $\mathcal{X} \subset \mathbb{R}^d$ is a compact set, and $0 \le \eta_n = o(n^{-1})$. Then under the conditions of Theorem 4.5.1, we have
\[
\frac{1}{\hat{\sigma}_n\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr] \xrightarrow{d} N(0,1).
\]

Proof.
Note that
\begin{align*}
\hat{\sigma}_n^2 = Q_n(\hat{f}_n)
 &= \frac{1}{n}\sum_{i=1}^n\bigl(y_i - \hat{f}_n(x_i)\bigr)^2
  = \frac{1}{n}\sum_{i=1}^n\bigl(f_0(x_i) + \epsilon_i - \hat{f}_n(x_i)\bigr)^2\\
 &= \frac{1}{n}\sum_{i=1}^n\epsilon_i^2 - \frac{2}{n}\sum_{i=1}^n\epsilon_i\bigl(\hat{f}_n(x_i) - f_0(x_i)\bigr) + \|\hat{f}_n - f_0\|_n^2.
\end{align*}
Based on the rate of convergence of $\hat{f}_n$ obtained in Theorem 4.4.1 and condition (C1), we know that
\[
\|\hat{f}_n - f_0\|_n^2 = O_p^*\Bigl(\max\Bigl\{\frac{r_n(d+2)\log n}{n},\ \|\pi_{r_n}f_0 - f_0\|_n^2\Bigr\}\Bigr).
\]
Under (C3), $\|\pi_{r_n}f_0 - f_0\|_n^2 = o(\rho_n^2\delta_n^4) = o(n^{-1/2})$, and under (C1), we have
\[
\frac{r_n(d+2)\log n}{n} \le o\Bigl(\frac{n^{1/4}\log n}{n}\Bigr) = o\Bigl(\frac{\log n}{n^{3/4}}\Bigr) = o(n^{-1/2}),
\]
which implies that $\|\hat{f}_n - f_0\|_n^2 = o_p(n^{-1/2})$. Moreover, by the same arguments as in the proof of Theorem 4.5.1, we can show that
\[
\frac{2}{n}\sum_{i=1}^n\epsilon_i\bigl(\hat{f}_n(x_i) - f_0(x_i)\bigr) = o_p(n^{-1/2}).
\]
Therefore,
\[
Q_n(\hat{f}_n) = \frac{1}{n}\sum_{i=1}^n\epsilon_i^2 + o_p(n^{-1/2}).
\]
By the Weak Law of Large Numbers, $\frac{1}{n}\sum_{i=1}^n\epsilon_i^2 = \sigma^2 + o_p(1)$. Therefore, $\hat{\sigma}_n^2 = Q_n(\hat{f}_n) = \sigma^2 + o_p(1)$, and it follows from Slutsky's Theorem together with Theorem 4.5.1 that
\[
\frac{1}{\hat{\sigma}_n\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr]
 = \frac{\sigma}{\hat{\sigma}_n}\cdot\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n\bigl[\hat{f}_n(x_i) - f_0(x_i)\bigr]
 \xrightarrow{d} N(0,1). \qquad\Box
\]

4.6 Simulation Studies

In this section, simulations were conducted to check the validity of the theoretical results obtained in the previous sections. We first used a simple simulation to show that it is hard for the parameter estimators in a neural network with one hidden layer to reach parametric consistency. Then the consistency of the neural network sieve estimators was examined under various simulation setups. Finally, in the last part, we checked the asymptotic normality of the neural network sieve estimators.

4.6.1 Parameter Inconsistency

As we mentioned in the introduction, due to the loss of identifiability of the parameters, the parameter estimators obtained in a neural network model are unlikely to be consistent. In this simulation, we use empirical results to confirm this observation. We simulated the response through the following model:
\[
y_i = f_0(x_i) + \epsilon_i, \qquad i = 1,\ldots,n, \qquad (4.12)
\]
where the total sample size is $n = 500$, $x_1,\ldots,x_n \sim \text{i.i.d. } N(0,1)$, $\epsilon_1,\ldots,\epsilon_n \sim \text{i.i.d. } N(0, 0.1^2)$, and
\[
f_0(x_i) = -1 + \sigma(2x_i + 1) - \sigma(-x_i + 1), \qquad (4.13)
\]
that is, the true model is a single-hidden-layer neural network with 2 hidden units. We also used a single-hidden-layer neural network with 2 hidden units to fit the data. When fitting the neural network, we set the learning rate to 0.1 and performed $3\times10^4$ iterations of back-propagation. The cost after $3\times10^4$ iterations is 0.0106. Table 4.1 summarizes the estimated values of the parameters in this model.

Table 4.1: Comparison of the true parameters and the estimated parameters in a single-layer neural network with 2 hidden units.

                          Weights                          Biases
                  γ1      γ2      α1      α2       γ0,1    γ0,2     α0
True Value       2.00   -1.00    1.00   -1.00      1.00    1.00   -1.00
Estimated Value  0.82    1.30   -0.34   -0.58     -0.03   -0.03   -1.04

Based on the results in Table 4.1, it is clear that the estimators of most of the weights and biases (except $\alpha_0$) are far from reaching consistency.
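To make this setup concrete, the following is a minimal, self-contained sketch of the parameter-inconsistency experiment described above. It is not the code used for the dissertation: the plain full-batch gradient-descent loop, the random initialization and all variable names are illustrative choices; only the model (4.12)-(4.13), the sample size, the learning rate and the number of iterations follow the description in the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)                               # x_i ~ N(0, 1)
f0 = -1.0 + sigmoid(2.0 * x + 1.0) - sigmoid(-x + 1.0)   # true function (4.13)
y = f0 + 0.1 * rng.standard_normal(n)                    # errors ~ N(0, 0.1^2), model (4.12)

# single hidden layer with 2 hidden units; random starting values
gamma = rng.standard_normal(2)     # input weights gamma_1, gamma_2
gamma0 = rng.standard_normal(2)    # input biases gamma_{0,1}, gamma_{0,2}
alpha = rng.standard_normal(2)     # output weights alpha_1, alpha_2
alpha0 = 0.0                       # output bias alpha_0
lr, iters = 0.1, 30000             # learning rate and iteration count as in the text

for _ in range(iters):
    S = sigmoid(np.outer(x, gamma) + gamma0)             # n x 2 hidden activations
    r = y - (alpha0 + S @ alpha)                         # residuals
    dS = S * (1.0 - S) * alpha                           # derivative wrt the linear inputs
    alpha0 += lr * 2.0 * r.mean()
    alpha += lr * 2.0 * (S * r[:, None]).mean(axis=0)
    gamma += lr * 2.0 * ((r[:, None] * dS) * x[:, None]).mean(axis=0)
    gamma0 += lr * 2.0 * (r[:, None] * dS).mean(axis=0)

S = sigmoid(np.outer(x, gamma) + gamma0)
f_hat = alpha0 + S @ alpha
print("estimated parameters:", gamma, gamma0, alpha, alpha0)
print("||f_hat - f0||_n^2  :", np.mean((f_hat - f0) ** 2))
print("cost Q_n(f_hat)     :", np.mean((y - f_hat) ** 2))

Re-running this sketch with different seeds typically reproduces the qualitative behavior of Table 4.1 and Figure 4.1: the fitted function tracks $f_0$ closely while the individual weight and bias estimates vary considerably.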
On the other hand, if we look at the curve of the true function and the curve of the fitted function shown in Figure 4.1, we can see that most parts are fitted extremely well except for the tails. The approximation error $\|\hat{f}_n - f_0\|_n$ is almost zero, as shown in the figure. This suggests that we may study the asymptotic properties of the neural network through the estimated function instead of the estimated parameters.

Figure 4.1: Comparison of the true function and the fitted function under the simulation model (4.12). The black curve is the true function defined in (4.13) and the blue dashed curve is the fitted curve obtained after fitting the neural network model (2 hidden units, sample size 500).

4.6.2 Consistency for Neural Network Sieve Estimators

In this simulation, we check the consistency result obtained in Section 4.3 and the validity of the assumption made in Theorem 4.3.1. Based on our construction of the neural network sieve estimators, in each sieve space $\mathcal{F}_{r_n}$ there is a constraint on the $\ell_1$ norm of $\alpha$: $\sum_{i=0}^{r_n}|\alpha_i| \le V_n$. So finding the nearly optimal function in $\mathcal{F}_{r_n}$ for $Q_n(f)$ is in fact a constrained optimization problem. The classical way to solve such a problem is to introduce a Lagrange multiplier for each constraint, but it is usually hard to find an explicit connection between the Lagrange multiplier and the upper bound in the inequality constraint. Instead, we use the subgradient method as discussed in Section 7 of Boyd and Mutapcic (2008)[9]. The basic idea is to update the parameters $\alpha_0,\ldots,\alpha_{r_n}$ through
\[
\alpha_i^{(k+1)} = \alpha_i^{(k)} - \delta_k g^{(k)}, \qquad i = 0,\ldots,r_n,
\]
where $\delta_k > 0$ is a step size, chosen as $\delta_k = 0.1/\log(e+k)$ throughout the simulation, which is known as a nonsummable diminishing step size rule, and $g^{(k)}$ is a subgradient of the objective $Q_n(f)$ or of the constraint function $\sum_{j=0}^{r_n}|\alpha_j| - V_n$ at $\alpha^{(k)}$. More specifically, we take
\[
g^{(k)} \in
\begin{cases}
\dfrac{\partial Q_n(f)}{\partial\alpha}\Big|_{\alpha^{(k)}}, & \text{if } \sum_{j=0}^{r_n}|\alpha_j| \le V_n,\\[6pt]
\dfrac{\partial\bigl(\sum_{j=0}^{r_n}|\alpha_j|\bigr)}{\partial\alpha}\Big|_{\alpha^{(k)}}, & \text{if } \sum_{j=0}^{r_n}|\alpha_j| > V_n.
\end{cases}
\]
The updating equations for $\gamma_1,\ldots,\gamma_{r_n}, \gamma_{0,1},\ldots,\gamma_{0,r_n}$ remain the same as in the classical gradient descent algorithm.

We still used equation (4.12) as our simulation model, but now we assumed that the random errors $\epsilon_1,\ldots,\epsilon_n$ are i.i.d. $N(0, 0.7^2)$ throughout the simulations. For the true function $f_0(x)$, we considered the following three functions:

(1) A neural network with a single hidden layer and 2 hidden units, which is the same as in equation (4.13).

(2) A trigonometric function:
\[
f_0(x) = \sin\Bigl(\frac{\pi}{3}x\Bigr) + \frac{1}{3}\cos\Bigl(\frac{\pi}{4}x + 1\Bigr). \qquad (4.14)
\]

(3) A continuous function having a non-differentiable point:
\[
f_0(x) =
\begin{cases}
-2x, & \text{if } x \le 0,\\
\sqrt{x} - \frac{1}{4}x, & \text{if } x > 0.
\end{cases} \qquad (4.15)
\]

We then trained a neural network using the subgradient method described at the beginning of this subsection and set the number of iterations used for fitting to 20,000. For the growth rate of the number of hidden units $r_n$ and the upper bound $V_n$ on the $\ell_1$ norm of the weights and bias from the hidden layer to the output layer, we chose $r_n = n^{1/4}$ and $V_n = 10n^{1/4}$. This choice satisfies the condition mentioned in Remark 4.3.2 and hence the condition in Theorem 4.3.1. We compared the errors $\|\hat{f}_n - f_0\|_n^2$ and the least squares errors $Q_n(\hat{f}_n)$ under different sample sizes; the results are summarized in Table 4.2.
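Before turning to the results, here is a minimal sketch of a single update of the $\ell_1$-constrained output-layer coefficients under the subgradient rule described above. The function name and call signature are illustrative choices, not the dissertation's code; the updates for $\gamma_j$ and $\gamma_{0,j}$ are ordinary gradient steps, as stated in the text, and are omitted here.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def alpha_subgradient_step(alpha0, alpha, gamma, gamma0, x, y, V_n, k):
    """One update of the output-layer coefficients under sum_j |alpha_j| <= V_n."""
    S = sigmoid(np.outer(x, gamma) + gamma0)      # hidden-layer activations
    r = y - (alpha0 + S @ alpha)                  # residuals
    step = 0.1 / np.log(np.e + k)                 # nonsummable diminishing step size rule
    if abs(alpha0) + np.abs(alpha).sum() <= V_n:
        # feasible point: follow a (sub)gradient of the least squares objective Q_n
        g0, g = -2.0 * r.mean(), -2.0 * (S * r[:, None]).mean(axis=0)
    else:
        # infeasible point: follow a subgradient of the constraint sum_j |alpha_j|
        g0, g = np.sign(alpha0), np.sign(alpha)
    return alpha0 - step * g0, alpha - step * g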
As we can see from Table 4.2, the errors $\|\hat{f}_n - f_0\|_n^2$ indeed show a decreasing pattern as the sample size increases, although there are some cases where the error becomes a little larger as the sample size grows (e.g., the errors at sample size 500 are larger than those at sample size 200 in all scenarios). One explanation is that the number of hidden units increases from 3 (for sample size 200) to 4 (for sample size 500) under our simulation setup, so there may be some variation in the estimation performance.

Table 4.2: Comparison of the errors $\|\hat{f}_n - f_0\|_n^2$ and the least squares errors $Q_n(\hat{f}_n)$ after 20,000 iterations under different sample sizes.

                  Neural Network            Sine                     Piecewise Continuous
Sample Sizes   ‖f̂n−f0‖²n    Qn(f̂n)     ‖f̂n−f0‖²n    Qn(f̂n)     ‖f̂n−f0‖²n    Qn(f̂n)
50             3.33E-2      0.519       6.04E-2      0.513       6.20E-1      1.124
100            2.79E-2      0.552       3.04E-2      0.587       3.20E-1      0.920
200            6.05E-3      0.500       1.05E-2      0.501       2.51E-1      0.786
500            8.15E-3      0.484       1.19E-2      0.499       3.26E-1      0.769
1000           3.02E-3      0.475       1.54E-2      0.480       2.98E-2      0.489
2000           2.88E-3      0.500       9.72E-3      0.506       1.69E-2      0.515

Overall, the table shows that the estimated function $\hat{f}_n$ is indeed consistent in the sense that $\|\hat{f}_n - f_0\|_n = o_p^*(1)$. Figure 4.2 illustrates the fitted functions and the true function, from which we can visualize the result more directly.

Figure 4.2: Comparison of the true function and the fitted functions used in the simulations. The top panel shows the scenario where the true function is a single-layer neural network; the middle panel shows the scenario where the true function is a sine function; and the bottom panel shows the scenario where the true function is a continuous function having a non-differentiable point. In all cases, the fitted curve becomes closer to the truth as the sample size increases (sample sizes 50, 100, 200, 500, 1000 and 2000 are shown).

4.6.3 Asymptotic Normality for Neural Network Sieve Estimators

The last part of the simulation focuses on the asymptotic normality derived in Theorem 4.5.1. We considered the same three types of true functions as in Section 4.6.2, but the random errors were sampled from the standard normal distribution, and we again used the subgradient method to obtain the fitted model. The number of iterations used for fitting was set to 20,000. What differs from the setup in Section 4.6.2 is the growth rates for $r_n$ and $V_n$. As mentioned in Section 4.5, the growth rates required for asymptotic normality are slower than those required for consistency. Therefore, in the simulation we chose $r_n = n^{1/8}$ and $V_n = 10n^{1/10}$. This choice satisfies condition (C1) in Theorem 4.5.1. In order to obtain the normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$, we repeated the simulation 200 times.

Figure 4.3 to Figure 4.5 are the normal Q-Q plots under each simulation setup and various sample sizes. Based on the figures, we can see that the statistic $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ fits the normal distribution quite well under all simulation models. It is also worth noting that the Q-Q plots look similar under all simulation models; this is what we should expect, since under all scenarios the limiting distribution of the statistic $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ is $N(0,1)$.
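The following sketch outlines how the replicated statistic behind these Q-Q plots (and the normality tests reported below) can be assembled. The routine fit_sieve_nn is a hypothetical placeholder for a sieve fitting procedure, such as the subgradient scheme sketched earlier, returning fitted values at the design points; the growth rates follow the choices $r_n = n^{1/8}$ and $V_n = 10n^{1/10}$ described in the text.

import numpy as np
from scipy import stats

def normality_check(fit_sieve_nn, f0, n=200, reps=200, seed=1):
    """Replicate the studentized statistic from Theorems 4.5.1-4.5.2 and test it against N(0, 1)."""
    rng = np.random.default_rng(seed)
    r_n = max(2, int(round(n ** 0.125)))    # r_n ~ n^{1/8}
    V_n = 10.0 * n ** 0.1                   # V_n = 10 n^{1/10}
    T = np.empty(reps)
    for b in range(reps):
        x = rng.standard_normal(n)
        y = f0(x) + rng.standard_normal(n)  # standard normal errors
        f_hat = fit_sieve_nn(x, y, r_n, V_n)
        sigma_hat = np.sqrt(np.mean((y - f_hat) ** 2))   # plug-in estimator of sigma
        T[b] = np.sum(f_hat - f0(x)) / (np.sqrt(n) * sigma_hat)
    return {"shapiro_p": stats.shapiro(T).pvalue,
            "ks_p": stats.kstest(T, "norm").pvalue}

A Q-Q plot of the replicated values T (for example via scipy.stats.probplot) gives the graphical counterpart, analogous to Figures 4.3-4.5, while the returned p-values can be compared with a nominal level in the same way as in Table 4.3.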
Another implication we can obtain from the Q-Q plots is that the statistic $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ is robust to the choice of $f_0$. Therefore, as long as the true function $f_0$ is continuous, $N(0,1)$ can serve as a good asymptotic distribution for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ and can be used to conduct hypothesis testing.

Besides the Q-Q plots, we also conducted normality tests to check whether $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ follows the standard normal distribution. Specifically, we used the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Table 4.3 summarizes the p-values for both normality tests. From the p-values, in all cases we fail to reject that the distribution of $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ follows the standard normal distribution.

Table 4.3: p-values of the Shapiro-Wilk test and the Kolmogorov-Smirnov test for normality. We use "NN" to denote the true function given by the neural network in (4.13), "TRI" to denote the trigonometric function in (4.14), and "ND" to denote the continuous function with a non-differentiable point in (4.15).

                   Shapiro-Wilk Test             Kolmogorov-Smirnov Test
Sample Sizes      NN      TRI     ND            NN      TRI     ND
50               0.881   0.884   0.878         0.584   0.597   0.595
100              0.095   0.095   0.098         0.472   0.508   0.484
200              0.940   0.944   0.944         0.731   0.719   0.708
300              0.872   0.888   0.884         0.976   0.986   0.973
400              0.513   0.525   0.514         0.670   0.754   0.708
500              0.768   0.778   0.768         0.733   0.769   0.733

Figure 4.3: Normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ under 200 replications and various sample sizes. The true function $f_0$ is a single-layer neural network with 2 hidden units as defined in (4.13).

Figure 4.4: Normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ under 200 replications and various sample sizes. The true function $f_0$ is a trigonometric function as defined in (4.14).

Figure 4.5: Normal Q-Q plot for $n^{-1/2}\sum_{i=1}^n[\hat{f}_n(x_i) - f_0(x_i)]$ under 200 replications and various sample sizes. The true function $f_0$ is a continuous function having a non-differentiable point as defined in (4.15).
4.7 Discussion

We have investigated the asymptotic properties, including the consistency, rate of convergence and asymptotic normality, of neural network sieve estimators with one hidden layer. While in practice the number of hidden units is often chosen ad hoc, it is important to note that the conditions in the theorems provide theoretical guidelines on choosing the number of hidden units for a neural network with one hidden layer in order to achieve the desired statistical properties. The validity of the conditions made in the theorems has also been checked through simulation results. Theorem 4.5.1 and Theorem 4.5.2 can serve as preliminary work for conducting hypothesis tests of $H_0: f_0 = h_0$ for a fixed function $h_0$ based on neural networks. However, there is currently no simple way to check conditions (C3) and (C4) in the theorem, which requires further research on local entropy numbers for classes of neural networks.

The work conducted in this chapter mainly focuses on sieve estimators based on neural networks with one hidden layer and the standard sigmoid activation function. There are many possible extensions of the results presented in this chapter. In fact, the main theorems in this chapter depend heavily on the covering number, or entropy number, of the function class consisting of neural networks with one hidden layer. Theorem 14.5 in Anthony and Bartlett (2009)[4] provides a general upper bound for the covering number of a function class consisting of deep neural networks with Lipschitz continuous activation functions. Therefore, it is possible to extend the results discussed in this chapter to deep neural networks with Lipschitz continuous activation functions. It is also worthwhile to investigate the asymptotic properties of other commonly used deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

When we train a deep neural network, we usually need to face an overfitting issue, and in practice regularization is frequently used to reduce overfitting. Another natural extension of the results discussed in this chapter is to modify the loss function by adding regularization terms. By taking regularization into account, we believe the theory may have broader applicability in real-world data applications.

Chapter 5

Epilogue

The main goal of this dissertation is to investigate some network-based models with their applications to genetic association studies and risk prediction. We have seen in Chapter 2 that a conditional autoregressive (CAR) model can achieve higher power when individuals have heterogeneous genetic effects and gives robust results under misspecification of weights. The kernel neural network (KNN) model discussed in Chapter 3 provides a novel way to conduct genetic risk prediction, and we have shown theoretically and empirically that KNN can reach lower prediction error than the classical linear mixed model.
Finally, in Chapter 4, we established the asymptotic properties of neural network sieve estimators, which serve as a foundation for conducting hypothesis testing based on neural networks in later research. By using advanced theories from empirical processes, the results were developed based on a minimal number of conditions to be checked and, as a byproduct, the conditions also provide some guidelines on choosing the number of hidden units in order to achieve the desired statistical properties.

Here, we briefly discuss our future work. In the CAR model discussed in Chapter 2, we mainly focused on continuous phenotypes. In the future, we will extend the CAR model so that it can be applied to discrete dependent variables. Specifically, the CAR model can be extended under the generalized linear mixed model (GLMM) framework:
\[
\begin{aligned}
y_i \mid a_i &\overset{\text{ind}}{\sim} \exp\Bigl\{\frac{y_i\eta_i - b(\eta_i)}{\phi} + c(y_i,\phi)\Bigr\},\\
\eta_i = g(\mu_i) &= x_i^T\beta + a_i,\\
a_i \mid a_j, j\ne i &\sim N\Bigl(\frac{\gamma\sum_{j\ne i}s_{ij}a_j}{\sum_{j\ne i}s_{ij}},\ \frac{\tau}{\sum_{j\ne i}s_{ij}}\Bigr), \qquad i = 1,\ldots,n,
\end{aligned} \qquad (5.1)
\]
where $g(\cdot)$ is the canonical link function. For example, $g(x) = x$ is the identity link when $y_i\mid a_i$ is normally distributed, and $g(x) = \log\frac{x}{1-x}$ is the logit link when $y_i\mid a_i$ follows a Bernoulli distribution. To test whether a genotype-phenotype association exists, it is equivalent to test $H_0: \tau = 0$ versus $H_1: \tau > 0$. However, because the number of random effects equals the sample size, the commonly used Laplace approximations will not work, so a new framework of approximate inference needs to be established to conduct the statistical inference.

Some empirical studies reveal that the magnitude difference between the variance component estimates of the random effects and the variance component estimate of the random error is large, which tends to make the prediction error small. In future work on KNN, it is necessary to control the magnitude of the variance component estimates of the random effects. A commonly used technique to control the magnitude of a parameter in statistics and machine learning is regularization. However, directly adding a regularization term to the loss function of MINQUE is difficult. Instead, we will consider another way to estimate the variance components, known as variance least squares[3]. The basic idea of variance least squares is to minimize the matrix distance between the sample covariance matrix and the population covariance matrix. Specifically, let $\hat{\beta}$ be an OLS estimator of $\beta$ and pretend it were true. Then $\hat{E} = (y - Z\hat{\beta})(y - Z\hat{\beta})^T$ is clearly an unbiased estimator of the variance-covariance matrix. We know that the marginal covariance matrix of $y$ in KNN is $\mathrm{Var}[y] = \sum_{j=1}^J\tau_j E[K_j(U)] + \phi I_n$. So we can estimate the variance components $\theta = [\phi, \tau^T, \xi^T]^T$ based on
\[
\hat{\theta} = \operatorname*{argmin}_{\theta}\ \mathrm{tr}\Bigl[\Bigl(\hat{E} - \phi I_n - \sum_{j=1}^J\tau_j E[K_j(U)]\Bigr)^2\Bigr].
\]
Therefore, to control the magnitude of the variance components, we can consider penalized estimators via
\[
\hat{\theta}_p = \operatorname*{argmin}_{\theta}\ \mathrm{tr}\Bigl[\Bigl(\hat{E} - \phi I_n - \sum_{j=1}^J\tau_j E[K_j(U)]\Bigr)^2\Bigr] + \mathrm{pen}_{\lambda}(\theta).
\]
We will also investigate the statistical properties of these estimators, including the bias and consistency, in future work. A minimal numerical sketch of this variance-least-squares idea is given below.
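As a concrete illustration of the variance-least-squares idea, the sketch below solves the unpenalized criterion, and a ridge-penalized variant, by ordinary least squares on the vectorized matrices. This is only one possible implementation under simplifying assumptions: the kernel expectations E[K_j(U)] are assumed to be available (for example by Monte Carlo averaging), the ridge penalty is just a stand-in for the generic pen_lambda(theta), and the xi components are not updated here.

import numpy as np

def variance_least_squares(E_hat, K_list, lam=0.0):
    """Fit E_hat ~ phi*I_n + sum_j tau_j*K_j by (ridge-penalized) least squares.

    Minimizing tr[(E_hat - phi*I - sum_j tau_j*K_j)^2] over (phi, tau) for symmetric
    matrices is an ordinary least squares problem on the vectorized matrices.
    """
    n = E_hat.shape[0]
    X = np.column_stack([np.eye(n).ravel()] + [K.ravel() for K in K_list])
    b = E_hat.ravel()
    A = X.T @ X + lam * np.eye(X.shape[1])   # lam = 0 recovers the unpenalized estimator
    theta = np.linalg.solve(A, X.T @ b)
    return theta[0], theta[1:]               # (phi_hat, tau_hat)

# usage sketch: res = y - Z @ beta_ols; E_hat = np.outer(res, res);
# K_list holds (estimates of) E[K_j(U)]; phi_hat, tau_hat = variance_least_squares(E_hat, K_list, lam=1.0)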
Having established the basic asymptotic properties of neural networks with one hidden layer in Chapter 4, we will extend the results to deep neural networks, which are popular tools in deep learning and artificial intelligence, as well as to convolutional neural networks (CNNs) and recurrent neural networks (RNNs). By extending the results to deep neural networks, we expect to obtain guidelines on selecting the number of hidden units in each layer and the number of layers in the deep network. Considering loss functions with regularization will also be one of the foci of our future research.

APPENDICES

Appendix A

Technical Details and Supplementary Materials for Chapter 2

Proof of Proposition 2.2.1

Proof. Fix $a' = 0$. Based on the first equation in Brook's Lemma, we have
\begin{align*}
\frac{\pi(a)}{\pi(0)}
 &= \prod_{i=1}^n \frac{\pi(a_i \mid a_1,\ldots,a_{i-1},0,\ldots,0)}{\pi(0 \mid a_1,\ldots,a_{i-1},0,\ldots,0)}
  = \prod_{i=1}^n \frac{(2\pi\tau_i^2)^{-\frac12}\exp\Bigl\{-\frac{1}{2\tau_i^2}\bigl(a_i - \sum_{j=1}^{i-1}b_{ij}a_j\bigr)^2\Bigr\}}
                       {(2\pi\tau_i^2)^{-\frac12}\exp\Bigl\{-\frac{1}{2\tau_i^2}\bigl(-\sum_{j=1}^{i-1}b_{ij}a_j\bigr)^2\Bigr\}}\\
 &= \prod_{i=1}^n \exp\Bigl\{-\frac{1}{2\tau_i^2}\Bigl(a_i^2 - 2\sum_{j=1}^{i-1}b_{ij}a_ia_j\Bigr)\Bigr\}
  = \exp\Bigl\{-\frac12\sum_{i=1}^n\frac{a_i^2}{\tau_i^2} + \sum_{i=1}^n\sum_{j=1}^{i-1}\frac{b_{ij}}{\tau_i^2}a_ia_j\Bigr\}.
\end{align*}
Similarly, from the second equation in Brook's Lemma, we have
\[
\frac{\pi(a)}{\pi(0)} = \exp\Bigl\{-\frac12\sum_{i=1}^n\frac{a_i^2}{\tau_i^2} + \sum_{i=1}^n\sum_{j=i+1}^{n}\frac{b_{ij}}{\tau_i^2}a_ia_j\Bigr\}.
\]
By the symmetry of $b_{ij}$, the density of $a$ can then be expressed as
\[
\pi(a) \propto \exp\Bigl\{-\frac12\Bigl[\sum_{i=1}^n\frac{a_i^2}{\tau_i^2} - \sum_{i=1}^n\sum_{j\ne i}\frac{b_{ij}}{\tau_i^2}a_ia_j\Bigr]\Bigr\}
 = \exp\Bigl\{-\frac12\bigl[a^T\Delta^{-1}a - a^T\Delta^{-1}Ba\bigr]\Bigr\}
 = \exp\Bigl\{-\frac12 a^T\Delta^{-1}(I - B)a\Bigr\},
\]
where
\[
B = \begin{pmatrix}
0 & b_{12} & b_{13} & \cdots & b_{1n}\\
b_{21} & 0 & b_{23} & \cdots & b_{2n}\\
b_{31} & b_{32} & 0 & \cdots & b_{3n}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
b_{n1} & b_{n2} & b_{n3} & \cdots & 0
\end{pmatrix},
\qquad
\Delta = \begin{pmatrix}
\tau_1^2 & & &\\
& \tau_2^2 & &\\
& & \ddots &\\
& & & \tau_n^2
\end{pmatrix}.
\]
This shows that $a \sim N_n\bigl(0, (I-B)^{-1}\Delta\bigr)$, and the proof is finished. $\Box$

Proof of Proposition 2.2.2

The proof of this result is based on some facts in matrix analysis. We first provide the definition of a diagonally dominant matrix.

Definition A.0.1 (Diagonally Dominant Matrix). An $n\times n$ real matrix $J$ is diagonally dominant if
\[
\Delta_i(J) := |J_{ii}| - \sum_{j\ne i}|J_{ij}| \ge 0, \qquad i = 1,\ldots,n. \qquad (\mathrm{A.1})
\]
If the inequality in (A.1) is strict, then $J$ is called a strictly diagonally dominant matrix.

The following result is a rephrasing of Corollary 5.6.17 in Horn and Johnson (1990)[31].

Corollary A.0.1. Let $A$ be an $n\times n$ matrix. If $A$ is strictly diagonally dominant, then $A$ is nonsingular.

Based on Corollary A.0.1, it is easy to obtain the desired property. The following is the proof of Proposition 2.2.2.

Proof. Based on Corollary A.0.1, it suffices to check that $D - \gamma S$ is strictly diagonally dominant if $|\gamma| < 1$. Note that
\[
J := D - \gamma S =
\begin{pmatrix}
\sum_{j\ne 1}s_{1j} & -\gamma s_{12} & \cdots & -\gamma s_{1n}\\
-\gamma s_{21} & \sum_{j\ne 2}s_{2j} & \cdots & -\gamma s_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
-\gamma s_{n1} & -\gamma s_{n2} & \cdots & \sum_{j\ne n}s_{nj}
\end{pmatrix},
\]
and since $s_{ij} \ge 0$ for all $i,j$ by the definition of similarity, we have
\[
\Delta_i(J) = |J_{ii}| - \sum_{j\ne i}|J_{ij}|
 = \Bigl|\sum_{j\ne i}s_{ij}\Bigr| - \sum_{j\ne i}|-\gamma s_{ij}|
 = \sum_{j\ne i}s_{ij} - |\gamma|\sum_{j\ne i}s_{ij}
 = (1-|\gamma|)\sum_{j\ne i}s_{ij}.
\]
Hence $\Delta_i(J) > 0 \Leftrightarrow |\gamma| < 1$, which finishes the proof. $\Box$

Top 10 Significant Genes Found by CAR, GGRF and SKAT for Entorhinal, Ventricle and Whole Brain

Table A.1, Table A.2 and Table A.3 summarize the top 10 significant genes selected by the three methods associated with entorhinal, ventricle and whole brain, respectively.
According to the results, we can find that most of the significant genes associated with entorhinal and whole brain found by GGRF and SKAT are the same. This may stem from the fact that both methods are constructed based on the same idea, namely that individuals with similar genotypes tend to have similar phenotypes. Besides, both CAR and SKAT find the gene G1D76 significant for entorhinal. These two methods also find the gene G0A21 significant for the whole brain. For ventricle, all three methods discovered that the gene G14B2 is significant.

Table A.1: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Entorhinal. GGRF SKAT CHR Gene Name P-value CHR Gene Name P-value CHR Gene Name P-value 6.35E-05 1.36E-04 1.41E-04 1.55E-04 1.74E-04 2.24E-04 2.54E-04 3.46E-04 3.95E-04 4.07E-04 2.43E-05 5.11E-05 5.82E-05 7.46E-05 8.99E-05 1.34E-04 OR6F1 1.55E-04 FBXO9 CAMK2D 1.95E-04 2.12E-04 FLJ25758 2.96E-04 MIR3654 MGAM HSD3B1 GPR85 LRRC40 DMGDH UGT2A3 HDLBP PTGER4 C10orf107 FBXO9 GPR85 LRRC40 HSD3B1 OXR1 CSF1 DMGDH UGT2A3 PTGER4 C10orf107 MIR4514 1.96E-04 2.42E-04 2.58E-04 2.86E-04 4.61E-04 4.79E-04 5.22E-04 5.27E-04 5.67E-04 6.33E-04 5 5 11 5 4 1 6 4 19 7 7 1 1 8 1 5 4 5 10 15 CAR MZB1 PROB1 CAT DNAJC18 METTL14 7 1 7 1 5 4 2 5 10 6

Table A.2: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Ventricle. CAR Gene Name LIG3 CNDP1 KIAA0232 PGM5P4-AS1 CTSL ICAM4 RAD51D LAT JCHAIN DNAH17-AS1 CHR 17 18 4 2 9 19 17 17 16 4 SKAT Gene Name MEG8 JCHAIN CCDC61 MIR4428 TREML2 MIR4743 BCL2L2 PHACTR3 LOC105378349 TINAGL1 P-value 1.69E-05 5.47E-05 1.76E-04 2.76E-04 3.41E-04 3.63E-04 4.73E-04 4.74E-04 5.13E-04 5.23E-04 GGRF P-value CHR Gene Name P-value CHR 14 4.57E-05 4 6.21E-05 7.89E-05 19 1 1.28E-04 6 1.58E-04 18 2.15E-04 2.33E-04 14 20 2.42E-04 10 2.54E-04 2.73E-04 1 JCHAIN CYP20A1 CEP120 TINAGL1 CCDC61 NPIPB4 RAPH1 ABCC11 MEG8 3.23E-05 9.99E-05 1.12E-04 1.58E-04 2.29E-04 2.94E-04 3.07E-04 4.27E-04 5.01E-04 5.13E-04 4 2 5 1 19 16 2 16 14 17 LINC00670

Table A.3: Top 10 Genes Detected by GGRF, SKAT and CAR Associated with Whole Brain. CAR Gene Name BCL11A TAOK2 TRMT6 MIR559 TMEM219 LOC101559451 LOC101928429 CRNN SOAT1 TRIM24 CHR 2 16 20 2 16 17 6 1 1 7 P-value CHR 17 1.95E-05 20 7.92E-05 1.57E-04 5 7 1.63E-04 8 1.76E-04 1 1.78E-04 1.97E-04 17 11 2.62E-04 2 2.66E-04 2.91E-04 1 GGRF Gene Name ANKRD13B B4GALT5 BRIX1 EEPD1 LOC101929217 C1orf194 C17orf82 ME3 TMEM163 IL6R P-value CHR 1.18E-04 1.23E-04 1.53E-04 1.85E-04 2.18E-04 2.45E-04 3.42E-04 3.45E-04 3.76E-04 4.87E-04 1 5 17 2 8 7 2 1 4 1 SKAT Gene Name C1orf194 BRIX1 ANKRD13B BCL11A LOC101929217 EEPD1 TMEM163 LINC00467 MAD2L1 IL6R P-value 1.13E-04 1.29E-04 1.34E-04 1.63E-04 1.83E-04 2.44E-04 3.40E-04 4.06E-04 4.45E-04 5.03E-04

Appendix B

Technical Details and Supplementary Materials for Chapter 3

In this appendix, we provide the proofs of the results in Chapter 3. Before we head into the proofs, we need some background knowledge on concentration inequalities and matrix analysis.

Some Results from Concentration Inequality and Matrix Analysis

Sub-Gaussian and Sub-Exponential Inequalities

In this part, we present two basic concentration inequalities used in the main text. For more details on concentration inequalities, readers may refer to Buldygin and Kozachenko (2000)[11], Boucheron et al. (2004)[8] and Ledoux (2005)[39].

Definition B.0.1.
(i) A random variable $X$ with mean $\mu = E[X]$ is called sub-Gaussian if there exists a number $\sigma \ge 0$ such that
\[
E\bigl[e^{\lambda(X-\mu)}\bigr] \le \exp\Bigl\{\frac12\lambda^2\sigma^2\Bigr\} \qquad \forall\lambda\in\mathbb{R}.
\]
The constant $\sigma$ is known as the sub-Gaussian parameter.

(ii) A random variable $X$ with mean $\mu = E[X]$ is called sub-exponential (also called pre-Gaussian) if there exist non-negative constants $(\beta,\alpha)$ such that
\[
E\bigl[e^{\lambda(X-\mu)}\bigr] \le \exp\Bigl\{\frac12\lambda^2\beta^2\Bigr\} \qquad \forall|\lambda| < \frac{1}{\alpha}.
\]
The pair of constants $(\beta,\alpha)$ is known as the sub-exponential parameter.

Based on the definition of sub-Gaussian random variables, it is clear that if $X\sim N(\mu,\sigma^2)$, then $X$ is sub-Gaussian with sub-Gaussian parameter $\sigma$. The tail probabilities of both sub-Gaussian and sub-exponential random variables can be bounded exponentially. For sub-Gaussian random variables, the result is the famous Hoeffding inequality.

Theorem B.0.1. [30] Suppose that the random variables $X_1,\ldots,X_n$ are independent and $X_i$ has mean $\mu_i$ and sub-Gaussian parameter $\sigma_i$. Then for all $t>0$,
\[
P\Biggl(\Biggl|\sum_{i=1}^n(X_i-\mu_i)\Biggr| > t\Biggr) \le 2\exp\Biggl\{-\frac{t^2}{2\sum_{i=1}^n\sigma_i^2}\Biggr\}.
\]

Theorem B.0.2. Suppose that $X$ is a sub-exponential random variable with mean $\mu$ and sub-exponential parameters $(\beta,\alpha)$. Then for all $t>0$,
\[
P(|X-\mu| > t) \le
\begin{cases}
2\exp\bigl\{-\frac{t^2}{2\beta^2}\bigr\}, & \text{if } 0 < t \le \frac{\beta^2}{\alpha},\\[4pt]
2\exp\bigl\{-\frac{t}{2\alpha}\bigr\}, & \text{if } t > \frac{\beta^2}{\alpha}.
\end{cases}
\]

Results on Matrix Analysis

Proposition B.0.1 shows that the inverse map of a matrix is a continuous map, which will be used frequently in later parts when we approximate the average prediction errors.

Proposition B.0.1 (Gentle (2007)[23]). Let $\Psi$ be an arbitrary invertible matrix. Then the map $f: \Psi \mapsto \Psi^{-1}$ is continuous.

Proof. Note that $\Psi^{-1} = |\Psi|^{-1}\mathrm{Adj}(\Psi)$, where $\mathrm{Adj}(\Psi)$ is the adjugate matrix of $\Psi$, i.e., $\mathrm{Adj}(\Psi) = C^T$ and $C$ is the cofactor matrix of $\Psi$ with $C_{ij} = (-1)^{i+j}M_{ij}$, where $M_{ij}$ is the $(i,j)$ minor of $\Psi$. Based on the definition of the determinant, it is easy to see that the map $g: \Psi\mapsto|\Psi|$ is continuous and the map $h: \Psi\mapsto\mathrm{Adj}(\Psi)$ is continuous as well. Therefore, the map $f: \Psi\mapsto\Psi^{-1}$ is continuous. $\Box$

Another result that will be used later is that any two matrix norms are equivalent, in the sense that for any given pair of matrix norms $\|\cdot\|_s$ and $\|\cdot\|_t$, there is a finite positive constant $C_{st}$ such that
\[
\|A\|_s \le C_{st}\|A\|_t, \qquad \forall A\in M_n,
\]
where $M_n$ is the collection of all $n\times n$ matrices.

Proof of Lemma 3.2.1

Proof. Note that
\[
E\Bigl[\frac{w_i^Tw_j}{m}\Bigr] = \frac{1}{m}\sum_{k=1}^m E[w_{ik}w_{jk}] = \sigma_{ij}.
\]
Now we consider the Taylor expansion of $K_{ij}(U)$ around $\sigma_{ij}$:
\[
K_{ij}(U) = g(\sigma_{ij}) + g'(\sigma_{ij})\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr) + \frac12 g''(\eta_{ij})\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2,
\]
where $\eta_{ij}$ is between $\sigma_{ij}$ and $w_i^Tw_j/m$. Then the truncation error can be evaluated as follows. For all $\delta>0$,
\begin{align*}
P\bigl(\bigl|K_{ij}(U) - \hat{K}_{ij}(U)\bigr| > \delta\bigr)
 &= P\Biggl(\frac12\bigl|g''(\eta_{ij})\bigr|\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr)
 \le P\Biggl(\frac{M}{2}\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr)\\
 &= P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \sigma_{ij}\Biggr| > \sqrt{\frac{2\delta}{M}}\Biggr).
\end{align*}
So it suffices to evaluate the tail probability $P\bigl(\bigl|w_i^Tw_j/m - \sigma_{ij}\bigr| > \sqrt{2\delta/M}\bigr)$. Note that
\[
w_j \mid w_i \sim N_m\Bigl(\frac{\sigma_{ij}}{\sigma_{ii}}w_i,\ \Bigl(\sigma_{jj} - \frac{\sigma_{ij}^2}{\sigma_{ii}}\Bigr)I_m\Bigr),
\]
so we get
\[
w_i^Tw_j \mid w_i \sim N\Bigl(\frac{\sigma_{ij}}{\sigma_{ii}}w_i^Tw_i,\ \Bigl(\sigma_{jj} - \frac{\sigma_{ij}^2}{\sigma_{ii}}\Bigr)w_i^Tw_i\Bigr).
\]
Therefore, given $w_i$, the random variable $w_i^Tw_j$ is sub-Gaussian with sub-Gaussian parameter $\sigma_{ii}^{-1}s_{ij}w_i^Tw_i$, where $s_{ij} = \sigma_{ii}\sigma_{jj} - \sigma_{ij}^2$. Since
\begin{align*}
P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \sigma_{ij}\Biggr| > \sqrt{\frac{2\delta}{M}}\Biggr)
 &\le P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \frac{\sigma_{ij}}{\sigma_{ii}}\frac{w_i^Tw_i}{m}\Biggr| > \frac12\sqrt{\frac{2\delta}{M}}\Biggr)
  + P\Biggl(\Biggl|\frac{\sigma_{ij}}{\sigma_{ii}}\frac{w_i^Tw_i}{m} - \sigma_{ij}\Biggr| > \frac12\sqrt{\frac{2\delta}{M}}\Biggr)\\
 &:= (\mathrm{I}) + (\mathrm{II}),
\end{align*}
it suffices to provide bounds for (I) and (II).

For (II), note that since $w_i \sim N_m(0,\sigma_{ii}I_m)$, we have $\frac{1}{\sigma_{ii}}w_i^Tw_i \sim \chi^2_m$, which implies that
\[
E\Bigl[e^{\lambda\bigl(\frac{w_i^Tw_i}{\sigma_{ii}} - m\bigr)}\Bigr]
 = e^{-\lambda m}E\Bigl[e^{\lambda\frac{w_i^Tw_i}{\sigma_{ii}}}\Bigr]
 = e^{-\lambda m}(1-2\lambda)^{-\frac{m}{2}} \ \text{ for } \lambda < \frac12,
 \qquad\text{and}\qquad
 e^{-\lambda m}(1-2\lambda)^{-\frac{m}{2}} \le e^{2m\lambda^2} \ \text{ for all } |\lambda| < \frac14,
\]
i.e., $\frac{1}{\sigma_{ii}}w_i^Tw_i$ is sub-exponential with parameters $(2\sqrt{m}, 4)$. Hence, for $\sigma_{ij}\ne 0$, by Theorem B.0.2, we have
\[
(\mathrm{II}) = P\Biggl(\Biggl|\frac{1}{\sigma_{ii}}w_i^Tw_i - m\Biggr| > \frac{m}{2|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\Biggr)
 \le 2\exp\Biggl\{-\frac{\delta m}{\sigma_{ij}^2M}\wedge\frac{m}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\Biggr\}. \qquad (\mathrm{B.1})
\]
If $\sigma_{ij} = 0$, then $(\mathrm{II}) = 0$.

For (I), by the Hoeffding inequality, we have
\[
(\mathrm{I}) = E_{w_i}\Biggl[P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \frac{\sigma_{ij}}{\sigma_{ii}}\frac{w_i^Tw_i}{m}\Biggr| > \frac12\sqrt{\frac{2\delta}{M}}\ \Bigg|\ w_i\Biggr)\Biggr]
 \le E_{w_i}\Biggl[2\exp\Biggl\{-\frac{\sigma_{ii}m^2\delta}{4Ms_{ij}w_i^Tw_i}\Biggr\}\Biggr]. \qquad (\mathrm{B.2})
\]
Theorem A in Inglot (2010)[34] states that for a random variable $\chi\sim\chi^2_m$, the $100(1-\alpha)$th percentile is upper bounded by $m + 2\log(1/\alpha) + 2\sqrt{m\log(1/\alpha)}$, which is of order $O(m)$ as $m\to\infty$. Now, since $\sigma_{ii}^{-1}w_i^Tw_i\sim\chi^2_m$, we get, for any $\alpha\in(0,1)$,
\[
P\Bigl(\sigma_{ii}^{-1}w_i^Tw_i \ge m + 2\log\frac{1}{\alpha} + 2\sqrt{m\log\frac{1}{\alpha}}\Bigr) \le \alpha.
\]
Let $q(\alpha,m) = m + 2\log(1/\alpha) + 2\sqrt{m\log(1/\alpha)}$. Since the function $\exp\{-a/x\}$ is increasing in $x$ for $a>0$, we can further bound (B.2) as follows:
\begin{align*}
(\mathrm{I})
 &\le E_{w_i}\Biggl[2\exp\Biggl\{-\frac{m^2\delta}{4Ms_{ij}\sigma_{ii}^{-1}w_i^Tw_i}\Biggr\}\mathbb{1}_{\{\sigma_{ii}^{-1}w_i^Tw_i\le q(\alpha,m)\}}\Biggr]
  + 2P\bigl(\sigma_{ii}^{-1}w_i^Tw_i \ge q(\alpha,m)\bigr)\\
 &\le 2\exp\Biggl\{-\frac{m^2\delta}{4Ms_{ij}q(\alpha,m)}\Biggr\} + 2\alpha.
\end{align*}
By choosing $\alpha = \exp\{-m\}$, we get $q(\alpha,m) = m + 2m + 2\sqrt{m^2} = 5m$, so that
\[
(\mathrm{I}) \le 2\exp\Biggl\{-\frac{m^2\delta}{20Mms_{ij}}\Biggr\} + 2\exp\{-m\}
 \le 4\exp\Biggl\{-m\Bigl(1\wedge\frac{\delta}{20Ms_{ij}}\Bigr)\Biggr\}. \qquad (\mathrm{B.3})
\]
Combining (B.3) and (B.1), we obtain, for all $\delta>0$,
\[
P\bigl(\bigl|K_{ij}(U) - \hat{K}_{ij}(U)\bigr| > \delta\bigr) \le (\mathrm{I}) + (\mathrm{II})
 \le 4\exp\Biggl\{-m\Biggl(1\wedge\frac{\delta}{20Ms_{ij}}\wedge\frac{\delta}{M\sigma_{ij}^2}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\Biggr)\Biggr\}. \qquad\Box
\]

Remark B.0.1. The condition (3.12) can be weakened as follows:
\[
g''\Bigl(\lambda\sigma_{ij} + (1-\lambda)\frac{w_i^Tw_j}{m}\Bigr) = O_p(1) \qquad\text{for all }\lambda\in[0,1].
\]
In this case, the evaluation of the truncation error can be modified as follows. For all $\delta>0$, there exists $M_\delta>0$ such that
\[
P\bigl(\bigl|g''(\eta_{ij})\bigr| > M_\delta\bigr) < \frac{\delta}{2},
\]
where $\eta_{ij} = \lambda\sigma_{ij} + (1-\lambda)\frac{w_i^Tw_j}{m}$ for some $\lambda\in[0,1]$, and then
\begin{align*}
P\bigl(\bigl|K_{ij}(U) - \hat{K}_{ij}(U)\bigr| > \delta\bigr)
 &= P\Biggl(\frac12\bigl|g''(\eta_{ij})\bigr|\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr)\\
 &\le P\Biggl(\Biggl\{\frac12\bigl|g''(\eta_{ij})\bigr|\Bigl(\frac{w_i^Tw_j}{m} - \sigma_{ij}\Bigr)^2 > \delta\Biggr\}\cap\bigl\{|g''(\eta_{ij})|\le M_\delta\bigr\}\Biggr)
   + P\bigl(|g''(\eta_{ij})| > M_\delta\bigr)\\
 &\le P\Biggl(\Biggl|\frac{w_i^Tw_j}{m} - \sigma_{ij}\Biggr| > \sqrt{\frac{2\delta}{M_\delta}}\Biggr) + \frac{\delta}{2}\\
 &\le 4\exp\Biggl\{-m\Biggl(1\wedge\frac{\delta}{20M_\delta s_{ij}}\wedge\frac{\delta}{M_\delta\sigma_{ij}^2}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M_\delta}}\Biggr)\Biggr\} + \frac{\delta}{2}.
\end{align*}
Now, we can choose
\[
m > \Biggl(1\wedge\frac{\delta}{20M_\delta s_{ij}}\wedge\frac{\delta}{M_\delta\sigma_{ij}^2}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M_\delta}}\Biggr)^{-1}\log\frac{8}{\delta}
\]
so that the bound above is less than $\frac{\delta}{2} + \frac{\delta}{2} = \delta$, and hence $K_{ij}(U) = \hat{K}_{ij}(U) + o_p(1)$.

Proof of Lemma 3.3.1

Proof. (i) When $f(x) = x$, we have $K(U) = \frac{1}{m}UU^T$, and since the hidden random vectors $u_1,\ldots,u_m$ are i.i.d.,
\[
u_1u_1^T,\ldots,u_mu_m^T \sim \text{i.i.d. } W_n\Bigl(1, \sum_{l=1}^L\xi_lK_l(X)\Bigr),
\]
where $W_n\bigl(1, \sum_{l=1}^L\xi_lK_l(X)\bigr)$ stands for a Wishart distribution with one degree of freedom and covariance matrix $\sum_{l=1}^L\xi_lK_l(X)$. Therefore, the Strong Law of Large Numbers implies that, as $m\to\infty$,
\[
K(U) = \frac{1}{m}UU^T = \frac{1}{m}\sum_{i=1}^m u_iu_i^T \to E\bigl[u_1u_1^T\bigr] = \sum_{l=1}^L\xi_lK_l(X) \quad\text{a.s.}
\]
Since the map $\psi: A\mapsto A^{-1}$ for a non-singular matrix $A$ is continuous, we can obtain the following result by the Continuous Mapping Theorem:
\[
(\tilde{\tau}K(U) + I_n)^{-1} \to \Bigl(\tilde{\tau}\sum_{l=1}^L\xi_lK_l(X) + I_n\Bigr)^{-1} \quad\text{a.s., as } m\to\infty.
\]
Let $\tilde{A} = (\tilde{\tau}K(U) + I_n)^{-1}$. We have
\[
\max_{1\le i,j\le n}|\tilde{A}_{ij}| \le \|\tilde{A}\|_\infty \lesssim \|\tilde{A}\|_{op}
 = \lambda_{\max}\bigl((\tilde{\tau}K(U) + I_n)^{-1}\bigr)
 = \frac{1}{\tilde{\tau}\lambda_{\min}(K(U)) + 1} \le 1 < \infty.
\]
Therefore, by the Bounded Convergence Theorem,
\[
A := E\bigl[(\tilde{\tau}K(U) + I_n)^{-1}\bigr] \to \Bigl(\tilde{\tau}\sum_{l=1}^L\xi_lK_l(X) + I_n\Bigr)^{-1} \quad\text{as } m\to\infty.
\]
Asymptotically, we get, as $m\to\infty$,
\[
R \simeq y^T\Bigl(\tilde{\tau}\sum_{l=1}^L\xi_lK_l(X) + I_n\Bigr)^{-2}y.
\]

(ii) Note that equation (3.13) can be further written as $K(U) = f[\Sigma] + o_P(1)$, or equivalently, $K(U)\xrightarrow{P} f[\Sigma]$ as $m\to\infty$ element-wise. Similarly, under the assumption that $\|K(U)\|_{op} < \infty$ a.s., we have
\[
E[\tilde{\tau}K(U) + I_n] \to \tilde{\tau}f[\Sigma] + I_n.
\]
Hence, by the Bounded Convergence Theorem and the Continuous Mapping Theorem, we have
\[
A = E\bigl[(\tilde{\tau}K(U) + I_n)^{-1}\bigr] \to (\tilde{\tau}f[\Sigma] + I_n)^{-1} \quad\text{as } m\to\infty,
\]
which shows that, as $m\to\infty$,
\[
R \simeq y^T\Bigl(\tilde{\tau}f\Bigl[\sum_{l=1}^L\xi_lK_l(X)\Bigr] + I_n\Bigr)^{-2}y. \qquad\Box
\]

Proof of Proposition 3.3.2

Proof. The result follows by noting that
\begin{align*}
APE_{LMM}
 &= E\bigl[(y - E[a\mid y])^T(y - E[a\mid y])\bigr]\\
 &= E\Bigl[y^T\bigl(I_n - \tilde{\sigma}_R^2\Sigma(\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)^T\bigl(I_n - \tilde{\sigma}_R^2\Sigma(\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)y\Bigr]\\
 &= E\Bigl[y^T\bigl((\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)^2y\Bigr]
  = \mathrm{tr}\Bigl[\bigl((\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr)^2\bigl(\sigma_R^2\Sigma + \phi I_n\bigr)\Bigr]\\
 &= \phi\,\mathrm{tr}\bigl[(\tilde{\sigma}_R^2\Sigma + I_n)^{-1}\bigr]
  = \phi\sum_{i=1}^n\frac{1}{\tilde{\sigma}_R^2\lambda_i(\Sigma) + 1}. \qquad\Box
\end{align*}

More Simulation Results

Figure B.1 demonstrates the performance of LMM and KNN in terms of prediction error when the inverse logistic function and the polynomial function of order 2 are used.
Figure B.1: Boxplots of prediction errors for the linear mixed model (LMM) and the kernel neural network (KNN). The left panel shows the results when an inverse logistic function is used and the right panel shows the results when a polynomial function of order 2 is used. On the horizontal axis, "1" corresponds to the LMM; "2" corresponds to the KNN with product input kernel and product output kernel; "3" corresponds to the KNN with product input and polynomial output; "4" corresponds to the KNN with polynomial input and product output; and "5" corresponds to the KNN with polynomial input and polynomial output.

Appendix C

Technical Details and Supplementary Materials for Chapter 4

In this appendix, we explore some basic properties of the parameter space $(\mathcal{F}, \|\cdot\|_n)$ discussed in Chapter 4.

Proposition C.0.1. The space $(\mathcal{F}, \|\cdot\|_n)$ is a pseudo-normed space.

Proof. Note that $\|f\|_n = \bigl(\frac{1}{n}\sum_{i=1}^n f^2(x_i)\bigr)^{1/2}$. Then:

(i) Based on the definition of $\|\cdot\|_n$, it is clear that $\|f\|_n \ge 0$ for any $f\in\mathcal{F}$.

(ii) For any $\lambda\in\mathbb{R}$ and $f\in\mathcal{F}$,
\[
\|\lambda f\|_n = \Bigl(\frac{1}{n}\sum_{i=1}^n\lambda^2f^2(x_i)\Bigr)^{1/2} = |\lambda|\|f\|_n.
\]

(iii) For any $f,g\in\mathcal{F}$,
\[
\|f+g\|_n = \Bigl(\frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)+g(x_i)\bigr)^2\Bigr)^{1/2}
 = \Biggl(\sum_{i=1}^n\Bigl(\frac{f(x_i)}{\sqrt{n}} + \frac{g(x_i)}{\sqrt{n}}\Bigr)^2\Biggr)^{1/2}
 \le \Biggl(\sum_{i=1}^n\Bigl(\frac{f(x_i)}{\sqrt{n}}\Bigr)^2\Biggr)^{1/2} + \Biggl(\sum_{i=1}^n\Bigl(\frac{g(x_i)}{\sqrt{n}}\Bigr)^2\Biggr)^{1/2}
 = \|f\|_n + \|g\|_n,
\]
where we have used the triangle inequality for the classical Euclidean norm.

Therefore, $(\mathcal{F}, \|\cdot\|_n)$ is a pseudo-normed space. $\Box$

Proposition C.0.2. There is a pseudo-inner product on $\mathcal{F}$ such that $\|f\|_n^2 = \langle f,f\rangle$ for any $f\in\mathcal{F}$. Moreover, the pseudo-inner product is given by
\[
\langle f,g\rangle = \frac{1}{n}\sum_{i=1}^n f(x_i)g(x_i), \qquad \forall f,g\in\mathcal{F}.
\]

Proof. Based on the theorem attributed to Fréchet, von Neumann and Jordan (see, for example, Proposition 14.1.2 in Blanchard and Brüning (2015)[7]), to show the existence of the inner product it suffices to check the parallelogram law for the pseudo-norm; the corresponding pseudo-inner product can then be obtained via the polarization identity. To check the validity of the parallelogram law, note that for any $f,g\in\mathcal{F}$,
\begin{align*}
\|f+g\|_n^2 + \|f-g\|_n^2
 &= \frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)+g(x_i)\bigr)^2 + \frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)-g(x_i)\bigr)^2\\
 &= \frac{2}{n}\sum_{i=1}^n f^2(x_i) + \frac{2}{n}\sum_{i=1}^n g^2(x_i)
  = 2\|f\|_n^2 + 2\|g\|_n^2.
\end{align*}
Hence the parallelogram law is satisfied by the pseudo-norm, so the pseudo-inner product does exist, and by the polarization identity we get, for any $f,g\in\mathcal{F}$,
\[
\langle f,g\rangle = \frac14\bigl(\|f+g\|_n^2 - \|f-g\|_n^2\bigr) = \frac{1}{n}\sum_{i=1}^n f(x_i)g(x_i). \qquad\Box
\]
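As a quick numerical illustration of Propositions C.0.1 and C.0.2, the following sketch evaluates the empirical norm and pseudo-inner product on simulated design points and checks the parallelogram law and the polarization identity. The design points, the two test functions and the sample size are arbitrary choices made only for this illustration.

import numpy as np

def empirical_norm(f_vals):
    # ||f||_n = ( (1/n) * sum_i f(x_i)^2 )^(1/2)
    return np.sqrt(np.mean(f_vals ** 2))

def empirical_inner(f_vals, g_vals):
    # <f, g> = (1/n) * sum_i f(x_i) * g(x_i)
    return np.mean(f_vals * g_vals)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
f, g = np.sin(x), np.cos(2.0 * x)   # two arbitrary functions evaluated at the design points

lhs = empirical_norm(f + g) ** 2 + empirical_norm(f - g) ** 2
rhs = 2.0 * empirical_norm(f) ** 2 + 2.0 * empirical_norm(g) ** 2
print(np.isclose(lhs, rhs))         # parallelogram law

polar = 0.25 * (empirical_norm(f + g) ** 2 - empirical_norm(f - g) ** 2)
print(np.isclose(polar, empirical_inner(f, g)))   # polarization identity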
Let
\[
\mathcal{G} = \Bigl\{g: \mathbb{R}\to\mathbb{R},\ \int\bigl|g'(z)\bigr|\,dz \le M\Bigr\}
\]
be the class of functions of bounded variation on $\mathbb{R}$ (see Example 9.3.3 in van de Geer (2000)[68]). The following proposition shows that $\mathcal{F}_{r_n}\subset\mathcal{G}$ for fixed $n$.

Proposition C.0.3. For fixed $n$, $\mathcal{F}_{r_n}\subset\mathcal{G}$.

Proof. For any $f\in\mathcal{F}_{r_n}$, we have
\[
f(x) = \alpha_0 + \sum_{j=1}^{r_n}\alpha_j\sigma(\gamma_jx + \gamma_{0,j}),
\]
so that
\[
f'(x) = \sum_{j=1}^{r_n}\alpha_j\gamma_j\sigma(\gamma_jx + \gamma_{0,j})\bigl[1 - \sigma(\gamma_jx + \gamma_{0,j})\bigr].
\]
Without loss of generality, we assume that $\gamma_j\ne 0$ for $j = 1,\ldots,r_n$. Then note that
\begin{align*}
\int|f'(x)|\,dx
 &= \int\Biggl|\sum_{j=1}^{r_n}\alpha_j\gamma_j\sigma(\gamma_jx+\gamma_{0,j})\bigl[1-\sigma(\gamma_jx+\gamma_{0,j})\bigr]\Biggr|\,dx
 \le \sum_{j=1}^{r_n}|\alpha_j||\gamma_j|\int\sigma(\gamma_jx+\gamma_{0,j})\bigl[1-\sigma(\gamma_jx+\gamma_{0,j})\bigr]\,dx\\
 &= \sum_{j=1}^{r_n}|\alpha_j|\frac{|\gamma_j|}{\gamma_j}\int\sigma(u_j)\bigl(1-\sigma(u_j)\bigr)\,du_j,
\end{align*}
where in the last equality we let $u_j = \gamma_jx + \gamma_{0,j}$. Clearly, $|\gamma_j|/\gamma_j = \mathrm{sign}(\gamma_j)$. Moreover, since
\[
\int\sigma(x)\bigl(1-\sigma(x)\bigr)\,dx = \int\frac{e^x}{(1+e^x)^2}\,dx
 = \int_0^\infty\frac{du}{(1+u)^2}\quad(\text{letting } u = e^x)
 = -\frac{1}{1+u}\Bigl|_0^\infty = 1,
\]
we get, for fixed $n$,
\[
\int|f'(x)|\,dx \le \sum_{j=1}^{r_n}|\alpha_j|\,\mathrm{sign}(\gamma_j) \le \sum_{j=1}^{r_n}|\alpha_j| \le V_n.
\]
Therefore, $f\in\mathcal{G}$ and the desired result follows. $\Box$

BIBLIOGRAPHY

[1] Gad Abraham and Michael Inouye. Genomic risk prediction of complex human disease and its clinical application. Current Opinion in Genetics & Development, 33:10–16, 2015.
[2] Kenneth S Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. The Annals of Probability, pages 1041–1067, 1984.
[3] Takeshi Amemiya. A note on a heteroscedastic model. Journal of Econometrics, 6(3):365–370, 1977.
[4] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
[5] Sudipto Banerjee, Bradley P Carlin, and Alan E Gelfand. Hierarchical modeling and analysis for spatial data. CRC Press, 2014.
[6] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
[7] Philippe Blanchard and Erwin Brüning. Mathematical methods in physics: Distributions, Hilbert space operators, variational methods, and applications in quantum physics, volume 69. Birkhäuser, 2015.
[8] Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240. Springer, 2004.
[9] Stephen Boyd and Almir Mutapcic. Subgradient methods (notes for EE364B, Winter 2006-07, Stanford University), 2008.
[10] D Brook. On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika, 51(3/4):481–483, 1964.
[11] Valeriĭ Vladimirovich Buldygin and IU V Kozachenko. Metric characterization of random variables and random processes, volume 188. American Mathematical Soc., 2000.
[12] Xiaohong Chen and Xiaotong Shen. Sieve extremum estimates for weakly dependent data. Econometrica, pages 289–314, 1998.
[13] 1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–1073, 2010.
[14] Wellcome Trust Case Control Consortium et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661, 2007.
[15] Robert R Corbeil and Shayle R Searle. Restricted maximum likelihood (REML) estimation of variance components in the mixed model. Technometrics, 18(1):31–38, 1976.
[16] Robert B Davies. Algorithm AS 155: The distribution of a linear combination of χ² random variables. Journal of the Royal Statistical Society, Series C (Applied Statistics), 29(3):323–333, 1980.
[17] Robert B Davies. Hypothesis testing when a nuisance parameter is present only under the alternatives. Biometrika, 74:33–43, 1987.
[18] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.
[19] Luca Ferrarini, Walter M Palm, Hans Olofsen, Mark A van Buchem, Johan HC Reiber, and Faiza Admiraal-Behloul. Shape differences of the brain ventricles in Alzheimer's disease. NeuroImage, 32(3):1060–1069, 2006.
[20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2001.
[21] Kenji Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871–879, 1996.
[22] Kenji Fukumizu et al. Likelihood ratio of unidentifiable models and multilayer neural networks. The Annals of Statistics, 31(3):833–851, 2003.
[23] James E Gentle. Matrix algebra: theory, computations, and applications in statistics. Springer Science & Business Media, 2007.
[24] David B Goldstein, Andrew Allen, Jonathan Keebler, Elliott H Margulies, Steven Petrou, Slavé Petrovski, and Shamil Sunyaev. Sequencing studies in human genetics: design and interpretation. Nature Reviews Genetics, 14(7):460, 2013.
[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.
[26] Ulf Grenander. Abstract Inference. Wiley, New York, 1981.
[27] Zihuai He, Min Zhang, Xiaowei Zhan, and Qing Lu. Modeling and testing for joint association using a genetic random field model. Biometrics, 70(3):471–479, 2014.
[28] Charles R Henderson. Sire evaluation and genetic trends. Journal of Animal Science, 1973(Symposium):10–41, 1973.
[29] Fumio Hiai. Monotonicity for entrywise functions of matrices. Linear Algebra and its Applications, 431(8):1125–1146, 2009.
[30] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[31] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge University Press, 1990.
[32] Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
[33] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[34] Tadeusz Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30(2):339–351, 2010.
[35] Luke Jostins and Jeffrey C Barrett. Genetic risk prediction in complex disease. Human Molecular Genetics, 20(R2):R182–R188, 2011.
[36] K Juottonen, MP Laakso, R Insausti, M Lehtovirta, A Pitkänen, K Partanen, and H Soininen. Volumes of the entorhinal and perirhinal cortices in Alzheimer's disease. Neurobiology of Aging, 19(1):15–22, 1998.
[37] Raghu N Kackar and David A Harville. Unbiasedness of two-stage estimation and prediction procedures for mixed linear models. Communications in Statistics - Theory and Methods, 10(13):1249–1261, 1981.
[38] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[39] Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2005.
[40] Bingshan Li and Suzanne M Leal. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3):311–321, 2008.
[41] Ming Li, Zihuai He, Min Zhang, Xiaowei Zhan, Changshuai Wei, Robert C Elston, and Qing Lu. A generalized genetic random field method for the genetic association analysis of sequencing data. Genetic Epidemiology, 38(3):242–253, 2014.
[42] Dawei Liu, Xihong Lin, and Debashis Ghosh. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics, 63(4):1079–1088, 2007.
[43] Xin Liu and Yongzhao Shao. Asymptotics for likelihood ratio tests under loss of identifiability. The Annals of Statistics, 31(3):807–832, 2003.
[44] Bo Eskerod Madsen and Sharon R Browning. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics, 5(2):e1000384, 2009.
[45] Yuly Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109, 1996.
[46] Jon McClellan and Mary-Claire King. Genetic heterogeneity in human disease. Cell, 141(2):210–217, 2010.
[47] Charles E McCulloch, S R Searle, and John M Neuhaus. Generalized, linear, and mixed models. Wiley, Hoboken, NJ, 2nd edition, 2008.
[48] Shahar Mendelson. A few notes on statistical learning theory. In Advanced Lectures on Machine Learning, pages 1–40. Springer, 2003.
[49] Stephan Morgenthaler and William G Thilly. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615(1):28–56, 2007.
[50] NE Morton and CJ MacLean. Analysis of family resemblance. 3. Complex segregation of quantitative traits. American Journal of Human Genetics, 26(4):489, 1974.
[51] Yangling Mu and Fred H Gage. Adult hippocampal neurogenesis and its role in Alzheimer's disease. Molecular Neurodegeneration, 6(1):1, 2011.
[52] Wei Pan, Junghi Kim, Yiwei Zhang, Xiaotong Shen, and Peng Wei. A powerful and adaptive association test for rare variants. Genetics, 197(4):1081–1095, 2014.
[53] Yongjin Park and Manolis Kellis. Deep learning for regulatory genomics. Nature Biotechnology, 33(8):825, 2015.
[54] Long Qu, Tobias Guennel, and Scott L Marshall. Linear score tests for variance components in linear mixed models and applications to genetic association studies. Biometrics, 69(4):883–892, 2013.
[55] C Radhakrishna Rao. Estimation of heteroscedastic variances in linear models. Journal of the American Statistical Association, 65(329):161–172, 1970.
[56] C Radhakrishna Rao. Estimation of variance and covariance components—MINQUE theory. Journal of Multivariate Analysis, 1(3):257–275, 1971.
[57] C Radhakrishna Rao. Estimation of variance and covariance components in linear models. Journal of the American Statistical Association, 67(337):112–115, 1972.
[58] Havard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. CRC Press, 2005.
[59] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
[60] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[61] N Schuff, N Woerner, L Boreta, T Kornfield, LM Shaw, JQ Trojanowski, PM Thompson, CR Jack, MW Weiner, the Alzheimer's Disease Neuroimaging Initiative, et al. MRI of hippocampal volume loss in early Alzheimer's disease in relation to APOE genotype and biomarkers. Brain, 132(4):1067–1077, 2009.
[62] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, 316(5829):1341–1345, 2007.
[63] John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[64] Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591, 1997.
[65] Xiaotong Shen and Wing Hung Wong. Convergence rate of sieve estimates. The Annals of Statistics, pages 580–615, 1994.
[66] Steven E Stern and Alan H Welsh. Likelihood inference for small variance components. Canadian Journal of Statistics, 28(3):517–532, 2000.
[67] Madhav Thambisetty, Andrew Simmons, Abdul Hye, James Campbell, Eric Westman, Yi Zhang, Lars-Olof Wahlund, Anna Kinsey, Mirsada Causevic, Richard Killick, et al. Plasma biomarkers of brain atrophy in Alzheimer's disease. PLoS ONE, 6(12):e28527, 2011.
[68] Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.
[69] Aad W van der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996.
[70] Vladimir Vapnik. Statistical learning theory, volume 3. Wiley, New York, 1998.
[71] Sheng Wang, Jian Peng, Jianzhu Ma, and Jinbo Xu. Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports, 6:18962, 2016.
[72] Yalu Wen, Xiaoxi Shen, and Qing Lu. Genetic risk prediction using a spatial autoregressive model with adaptive lasso. Statistics in Medicine, 37(26):3764–3775, 2018.
[73] Halbert White. Learning in artificial neural networks: A statistical perspective. Neural Computation, 1(4):425–464, 1989.
[74] Halbert White. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3(5):535–549, 1990.
[75] Halbert White and Jeffrey Racine. Statistical inference, the bootstrap, and neural-network modeling with application to foreign exchange rates. IEEE Transactions on Neural Networks, 12(4):657–673, 2001.
[76] Halbert White and J Wooldridge. Some results on sieve estimation with dependent observations. In W.A. Barnett, J. Powell, and G. Tauchen, editors, Nonparametric and Semiparametric Methods in Economics, pages 459–493. Cambridge University Press, New York, 1991.
[77] Chien-Fu Wu. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, pages 501–513, 1981.
[78] Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93, 2011.
[79] Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.
[80] Hongtu Zhu and Heping Zhang.
Asymptotics for estimation and testing procedures under loss of identifiability. Journal of Multivariate Analysis, 97(1):19–45, 2006.