A NEURAL NETWORKS BASED METHOD WITH GENETIC DATA ANALYSIS OF COMPLEX DISEASES By Jinghang Lin A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics — Doctor of Philosophy 2021 ABSTRACT A NEURAL NETWORKS BASED METHOD WITH GENETIC DATA ANALYSIS OF COMPLEX DISEASES By Jinghang Lin The genetic etiologies of common diseases are highly complex and heterogeneous. Classic statistical methods, such as linear regression, have successfully identified numerous genetic variants associated with complex diseases. Nonetheless, for most complex diseases, the identified variants only account for a small proportion of heritability. Challenges remain to discover additional variants contributing to complex diseases. In this dissertation, we developed an expectile neural network (ENN) method and applied the method to genetic data analysis. ENN provides a comprehensive view of relationships between genetic variants and disease phenotypes and can be used to discover genetic variants predisposing to sub- populations (e.g., high-risk groups). We integrate the idea of neural networks into ENN, making it capable of capturing non-linear and non-additive genetic effects (e.g., gene-gene interactions). Through simulations, we showed that the proposed method outperformed an existing expectile regression when there exist complex relationships between genetic variants and disease phenotypes. We also applied the proposed method to the genetic data from the Study of Addiction: Genetics and Environment(SAGE), investigating the relationships of candidate genes with smoking quantity. Neural networks have been widely used in ap- plications. However, few studies have been focused on the statistical properties of neural networks. We further investigate the Asymptotic properties of ENN (e.g., consistency). Simulations have been conducted to test the validity of the theory. I dedicate this dissertation to my parents, Xianghua Chen and Xilong Lin for their endless love and support. iii ACKNOWLEDGMENTS There are many people who helped me along the way on this journey. Without their help, I could not complete this dissertation. I want to take a moment to thank them. First of all, I would like to express my sincere gratitude to my advisors Dr. Qing Lu and Dr. Yuehua Cui for their invaluable advice, continuous support, and patience during my PhD study. Their immense knowledge and plentiful experience have encouraged me in all the time of my academic research. They lead me into the area of statistical genetics and train me to be an independent researcher. I would also like to extend my sincere thanks to my dissertation committee members, Dr. Hyokyoung (Grace) Hong and Dr. Haolei Weng. Their comments and suggestions are beneficial to my research. My special thanks to Dr. Guowei Wei for his help in my job search. I am deeply grateful to Dr. Xiaoxi Shen and Dr. Xiaoran Tong for their insight in theory and computational support. During my PhD study, I made a lot of friends. My special thanks to my friends: Steven Gagnon, Tengfei Ma, Peide Li, Zihuan Liu, Dr. Cheuk (Ken) Lee for their constant help in my life and study. I would like to express my sincere gratitude to group members in Dr. Qing Lu’s group: Shan Zhang, Chang Jiang, Yuan Zhou, Tingting Hou, Mingsheng Tang for creating a positive research atmosphere. My special thanks to my girlfriend Dr. Liping Sun for her love and accompany. 
Last but not least, I would like to express my sincere thanks to my parents for their support and endless love for me. I would never make this journey without them. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 A review of basic human genetics . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Statistical learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.1 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.2 Artificial intelligence in healthcare . . . . . . . . . . . . . . . . . . . . 7 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter 2 Expectile Neural Networks for Genetic Data Analysis of Com- plex Diseases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Expectile regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Expectile neural network . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.3 Theoretical result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.1 Simulation I - nonlinear relationship . . . . . . . . . . . . . . . . . . 20 2.4.2 Simulation II - interactions among SNPs . . . . . . . . . . . . . . . . 23 2.4.3 Simulation III - interactions between genes . . . . . . . . . . . . . . . 25 2.5 Real data applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.5.1 The relationship between candidate SNPs with smoking quantities . . 28 2.5.2 Gene-gene interactions between the CHRNA5-CHRNA3-CHRNB4 gene cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Chapter 3 Asymptotic Theory of Expectile Neural Networks . . . . . . . 36 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 Method of sieves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.5 Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.6.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.6.1.1 Simulation results of consistency with τ = 0.5 . . . . . . . . 56 3.6.1.2 Simulation results of consistency with τ = 0.75 . . . . . . . 58 3.6.2 Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 v 3.6.2.1 Simulation result of normality with τ = 0.5 . . . . . . . . . 61 3.6.2.2 Simulation result of normality with τ = 0.75 . . . . . . . . . 63 3.7 Summary and discussion . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 63 Chapter 4 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 66 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Appendix A Technical Details of Chapter 2 . . . . . . . . . . . . . . . . . . . . . 70 Appendix B Technical Details of Chapter 3 . . . . . . . . . . . . . . . . . . . . . . 76 Appendix C Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . 78 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 vi LIST OF TABLES Table 2.1: The accuracy performance of two models built by ENN and ER based on 149 candidate SNPs and 3 covariates . . . . . . . . . . . . . . . . . . . . . 28 Table 2.2: Evaluating a pairwise interaction between CHRNA5 and CHRNA3 by using ENN and ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Table 2.3: Evaluating a pairwise interaction between CHRNA5 and CHRNB4 by using ENN and ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Table 2.4: Evaluating a pairwise interaction between CHRNA3 and CHRNB4 by using ENN and ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Table C.1: Real data application result of CHRNA5 . . . . . . . . . . . . . . . . . . 80 Table C.2: Real data application result of CHRNA3 . . . . . . . . . . . . . . . . . . 80 Table C.3: Real data application result of CHRNB4 . . . . . . . . . . . . . . . . . . . 80 Table C.4: Real data application result of ADNI . . . . . . . . . . . . . . . . . . . . . 82 vii LIST OF FIGURES Figure 1.1: A graphical representation of Chromosome, DNA and gene. Credit to Genetic Alliance UK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Figure 1.2: Similarity between biological and artificial neural networks . . . . . . . . 5 Figure 2.1: Quantiles and expectiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Figure 2.2: A graphical representation of expectile neural network . . . . . . . . . . 15 Figure 2.3: Performance comparison between ENN and ER under various relation- ships between genotypes and phenotypes and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Figure 2.4: Performance comparison between ENN and ER for different types of in- teractions and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) . . . 24 Figure 2.5: An alternative architecture for gene-gene interaction analyses . . . . . . 25 Figure 2.6: Performance comparison between ENN with a fully connected architecture and ENN with a non-fully connected architecture for gene-gene interaction analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 2.7: A comprehesive view of the conditional distribution of smoking quantity for five expectile levels (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) . . . . . . . . . 29 Figure 2.8: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNA3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Figure 2.9: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNB4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Figure 2.10: The conditional distribution of CPD considering the interaction between CHRNB4 and CHRNA3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 
34 Figure 3.1: Comparison between the true function f0 and fitted functions under dif- ferent sample sizes, where f0 is a neural network with one single hidden layer and two hidden units τ = 0.5. . . . . . . . . . . . . . . . . . . . . . 57 Figure 3.2: Comparison between the true function f0 = x3 + 1 and fitted functions under different sample sizes with τ = 0.5. . . . . . . . . . . . . . . . . . . 57 viii Figure 3.3: Comparison between the true function f0 = sin(x) + 2exp((−16)x2 ) and fitted functions under different sample sizes with τ = 0.5. . . . . . . . . . 58 Figure 3.4: Comparison between the true function f0 and fitted functions under dif- ferent sample sizes, where f0 is a neural network with one single hidden layer and two hidden units with τ = 0.75. . . . . . . . . . . . . . . . . . 59 Figure 3.5: Comparison between the true function f0 = x3 + 1 and fitted functions under different sample sizes with τ = 0.75. . . . . . . . . . . . . . . . . . 59 Figure 3.6: Comparison between the true function f0 = sin(x) + 2exp((−16)x2 ) and fitted functions under different sample sizes with τ = 0.75. . . . . . . . . 60 Figure 3.7: Q-Q plot with different sample sizes, where the true function f0 is a neural network with one single hidden layer and two hidden units with τ = 0.5 . 61 Figure 3.8: Q-Q plot with different sample sizes, where the true function is f0 = x3 +1 with τ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Figure 3.9: Q-Q plot with different sample sizes, where the true function is f0 = sin(x) + 2exp((−16)x2 ) with τ = 0.5. . . . . . . . . . . . . . . . . . . . . 62 Figure 3.10: Q-Q plot with different sample sizes, where the true function f0 is a neural network with one single hidden layer and two hidden units with τ = 0.75. 63 Figure 3.11: Q-Q plot with different sample sizes, where the true function is f0 = x3 +1 with τ = 0.75. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Figure 3.12: Q-Q plot with different sample sizes, where the true function is f0 = sin(x) + 2exp((−16)x2 ) with τ = 0.75. . . . . . . . . . . . . . . . . . . . 64 ix Chapter 1 Introduction 1.1 Overview With the development of biotechnology, especially next-generation sequencing technologies (NGS), it is easy to sequence an entire human genome. New technologies arising from the Human Genome Project and HapMap Project have generated a surge of methodological development for unsolved problems in human genetics. To find genetic variations associated with a particular disease, a genome-wide association study (GWAS) that involves rapidly scanning markers across the complete sets of DNA, or genomes, can be adopted[1]. GWAS investigates the entire genome and identify SNPs and other variants in DNA associated with a disease, but they cannot infer which genes are causal. Once new genetic associations are identified, researchers can use the information to understand, treat and prevent the disease[2]. Successful GWAS has been conducted to identify genetic variations that contribute to the risk of type 2 diabetes, Parkinson’s disease, heart disorders, obesity, Crohn’s disease and prostate cancer, as well as genetic variations that influence response to anti-depressant medications[5; 6]. Such research lays the groundwork for personalized medicine. Based on prior knowledge of a gene’s biological function on the trait or disease, candidate genes are most often studied in risk prediction research[3]. 
In risk prediction research, we are interested in developing a new genetic risk prediction model to identify the high-risk 1 individuals for certain diseases. If we could predict high-risk individuals at the early stage, targeted screening and appropriate intervention methods can be used to reduce mortality and morbidity[4]. However, there are tremendous analytic and computational challenges when we implement a risk prediction model. Genetic data is high-dimensional. For example, there are millions of single nucleotide polymorphisms (SNP), and the signal-to-noise ratio of genetic data is quite low, which makes us hard to capture underlying genetic effects. Moreover, the study sample is massive (e.g., a million samples in the UK Biobank), which brings the computational issue. In this chapter, we will first review some basic knowledge of human genetics in section 1.2. In section 1.3, we will briefly introduce the neural network and its application in healthcare. We give the overall organization of this dissertation in section 1.4. 1.2 A review of basic human genetics In the human genome, the genetic material is stored on chromosomes in the nucleus of the cell. There are 23 pairs of chromosomes in the human genome: 22 pairs of them are autosomal and the 23rd pair is the sex chromosomes. For the sex chromosomes, males have one X and Y, while females have two non-identical copies of the X chromosome. Each chromosome is composed of long strands of deoxyribonucleic acid (DNA), which determines how proteins are manufactured in the human body. Genes are segments of DNA that code for specific proteins that function in one or more types of cells in the body. These proteins control how our body grows and works; they are also responsible for many of our characteristics, such as our eye color, blood type or height. Genes are the basic physical units of inheritance, which are passed from parents to offspring 2 and contain the information needed to specify traits. Most parts of DNA are the same in all people, but a small proportion of DNA (less than 1 percent of the total DNA) are different between people. These differences contribute to each person’s unique physical features. An allele is one of two or more versions of a gene. An individual inherits two alleles, one from each parent. Figure 1.1: A graphical representation of Chromosome, DNA and gene. Credit to Genetic Alliance UK SNPs are the most common type of genetic variation among people, which are typically coded as the number of minor frequent alleles (e.g., AA=2, Aa=1, aa=0). A trait is any gene- determined characteristic and is often determined by more than one gene. The genotypes for the traits are often not observable and should be inferred from linked markers. In statistical genetics, we intend to construct a statistical model that connects genotypes and phenotypes[7]. 3 1.3 Statistical learning We give a brief introduction of the statistical learning framework. Suppose X stands for the vector space of input and Y for the vector space of output. In statistical learning, we assume that there is an underlying unknown probability distribution over the product space Z = X × Y . The training set D = {(x1 , y1 ), ..., (xn , yn )} comprise of n samples from the probability distribution. The goal of statistical learning is to find the unknown function f : X → Y from the data D. We start with a set of candidate hypotheses H = {h1 , h2 ..., }, which are likely to represent f . 
The hypothesis space is the space of functions that the algorithm will search through. We want to select a hypothesis f from H. The way we do this is called a learning algorithm. Let L(f (x, y) be the loss function that is a metric of the difference between the predicted value f (x) and the observed value y. The problem of statistical learning is to minimize the expected risk: Z R(f ) = L(f (x, y)dF (x, y). Since the probability distribution F (x, y) is unknown, a proxy measure for the expected risk must be used. We try to minimize empirical risk: n 1X Remp (f ) = L(f (xi , yi ). n i=1 1.3.1 Neural network The basis of the biological neural networks is the nerve cells, which is composed of a cell body, a dendrite and an axon. At the high-level view, incoming stimuli are transmitted to the cell body via dendrites. Outputs generated after operations in the cell body are transmitted to 4 other nerve cells via axons. In the neural network model, it imitates the functioning of the human brain. The biological nervous system in the human body consists of a three-layered structure that includes receiving data, interpreting them, and making decisions. A neuron model is composed of three layers: input layer, hidden layer and output layer. Here we give a graphical representation of similarity between biological and artificial neural networks with one hidden layer in Figure 1.2[58]. x1 , ..., xm are input units, which mimic the dendrites of a neuron. Σ is a computation unit, which is involved in the same role in the cell body. The computation unit is the most important part in neural networks, which is the linear combination of inputs units and bias and then apply the activation function. The number of computation units and the type of activation function are crucial in building models. Figure 1.2: Similarity between biological and artificial neural networks Common activation functions of neural networks used in perceptrons and neural networks 5 are • Rectified Linear Unit (ReLU): σ(x) = x+ = max{x, 0}, • Standard Sigmoid: σ(x) = (1 + e−x )−1 , • Hyperbolic Tangent (Tanh): ex − e−x σ(x) = tanh(x) = x . e + e−x The output layer consists of a single layer, where the generated data are transmitted to the outside world. This is analogous to the axon of a neuron. Neural networks with multiple hidden layers are called deep neural networks. Deep neural networks contain multiple non-linear hidden layers and this enables them learn very compli- cated relationships between inputs and outputs. For most data sets, neural networks with one hidden layer are enough to build a decent model. While various theoretical perspectives have been developed to explain why deep learning is successful, the general consensus of the community is to attribute the success to the joint forces of straightforward neural model- ing, simple learning techniques, the availability of big data and the hardware revolution in high-performance computing[40]. Deep neural networks are a powerful tool in analyzing a large dataset. However, overfitting is a serious issue in deep neural networks. Dropout is a technique to address the overfitting problem. Dropout randomly drops units (along with 6 their connections) from the neural network during training[39]. By reducing the number of parameters, the model performance of a deep neural network can be improved. Deep learning has been implemented in many software frameworks, such as Tensorflow and Pytorch[62; 63]. 
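As a concrete illustration, the sketch below uses PyTorch to fit a one-hidden-layer feed-forward network with a ReLU activation and dropout on synthetic regression data. It is a minimal, hypothetical example: the layer sizes, learning rate, and dropout probability are illustrative choices only, not settings used elsewhere in this dissertation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data: 200 samples, 10 covariates.
X = torch.randn(200, 10)
y = X[:, :3].sum(dim=1, keepdim=True) ** 2 + 0.1 * torch.randn(200, 1)

# One hidden layer with ReLU activation and dropout for regularization.
model = nn.Sequential(
    nn.Linear(10, 16),   # input layer -> hidden layer
    nn.ReLU(),           # activation in the hidden layer
    nn.Dropout(p=0.2),   # randomly drops hidden units during training
    nn.Linear(16, 1),    # hidden layer -> single output unit
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()   # squared loss for mean regression

model.train()
for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()      # back-propagate the gradients
    optimizer.step()

model.eval()             # dropout is switched off at evaluation time
print(loss_fn(model(X), y).item())
```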
Those frameworks offer building blocks for designing, training and validating deep neural networks, through a high-level programming language, like Python. They also provide a clear and concise way to simplify the implementation of complex and large-scale deep learning models by using a collection of pre-built and optimized components. It is worthwhile to mention a well-known result of neural networks: the universal approx- imation theorem. A neural network with one hidden layer could approximate any continuous function[61]. Theorem 1.3.1 (Universal Approximation Theorem). For every continuous function f : [a, b]d → R and for every  > 0, there exists a neural network with one hidden layer ψ(x) such that sup |f (x) − ψ(x)| < . x∈[a,b]d 1.3.2 Artificial intelligence in healthcare Deep learning or AI has been applied to many applications, such as natural language pro- cessing and computer vision. AI also holds great promise for healthcare. With the develop- ment of biotechnology, healthcare data has a large size and complexity that traditional data management tools cannot store or process it efficiently. Many successful AI applications in healthcare have been conducted. For example, AI can be used to optimize the care trajectory of chronic disease patients, suggest precision therapies for complex illnesses, reduce medical errors, and improve subject enrollment into clinical trials[8]. Fakoor et al. showed that how 7 unsupervised feature learning can be used for cancer detection and cancer type analysis from gene expression data[9]. Krittanawong et al. gave a glimpse of AI’s application in cardio- vascular clinical care and discussed its potential role in facilitating precision cardiovascular medicine[12]. Pham et al used a deep learning approach to read medical records, store previ- ous illness history, infer current illness states and predict future medical outcomes[59]. Plis et al. applied deep learning methods to learn physiologically important representations and detect latent relations in neuroimaging data[60]. There is a great promise that the applica- tions of AI can provide substantial improvement in all areas of healthcare from diagnostics to treatments. Although there are many instances in which AI can perform healthcare tasks better than humans, implementation will prevent large-scale automation of healthcare pro- fessional jobs for a considerable period[76]. However, AI will not take over the jobs which require unique human skills such as empathy and persuasion. 1.4 Organization The dissertation is organized as follows. In chapter 2, we develop a neural-network-based method called expectile neural networks. In chapter 3, The asymptotic properties of ENN are discussed. In chapter 4, we summarize this dissertation and discuss some potential future work. 8 Chapter 2 Expectile Neural Networks for Genetic Data Analysis of Complex Diseases 2.1 Overview The genetic etiologies of common diseases are highly complex and heterogeneous. Classic statistical methods, such as linear regression, have successfully identified numerous genetic variants associated with complex diseases. Nonetheless, for most complex diseases, the identified variants only account for a small proportion of heritability. Challenges remain to discover additional variants contributing to complex diseases. Expectile regression is a generalization of linear regression and provides completed information on the conditional distribution of a phenotype of interest. 
While expectile regression has many nice proper- ties and holds great promise for genetic data analyses (e.g., investigating genetic variants predisposing to a high-risk population), it has been rarely used in genetic research. In this chapter, we develop an expectile neural network (ENN) method for genetic data analyses of complex diseases. Similar to expectile regression, ENN provides a comprehensive view of relationships between genetic variants and disease phenotypes and can be used to discover 9 genetic variants predisposing to sub-populations (e.g., high-risk groups). We further inte- grate the idea of neural networks into ENN, making it capable of capturing non-linear and non-additive genetic effects (e.g., gene-gene interactions). Through simulations, we showed that the proposed method outperformed an existing expectile regression when there exist complex relationships between genetic variants and disease phenotypes. We also applied the proposed method to the genetic data from the Study of Addiction: Genetics and Environ- ment(SAGE), investigating the relationships of candidate genes with smoking quantity. 2.2 Introduction Converging evidence suggests that the genetic etiologies of complex diseases are highly het- erogeneous [13; 14] and various genetic factors and environmental determinants could play different roles in subgroups of the population. Linear regression has been commonly used in genetic studies to investigate the effects of genetic variants on the mean of a continuous phenotype. However, if we are interested in a complete view of genetic effects across the entire distribution of phenotypes or are interested in investigating genetic contribution to a sub-population(e.g., a high-risk population), quantile regression and expectile regression are great alternative choices [15; 16]. Quantile regression generalizes median regression and has been widely used in fields such as economics [17], medicine [18; 19] and environmental science [20] to study entire conditional distributions of responses given covariates. While quantile regression has many good properties (e.g., being robust to distribution assumption and outlies), as pointed out by Newey and Powell [16], quantile regression has several lim- itations. First, quantile regression uses the check function with the absolute least error as loss function, which is not continuously differentiable and is computationally difficult for pa- 10 rameter estimation. Second, quantile regression is relatively inefficient for error distributions that are close to Gaussian or have low densities at the corresponding percentile. Third, it is challenging to estimate the density function values of quantile regression. To address these issues, Newey and Powell [16] proposes expectile regression, which uses the sum of asymmetric residual squares as the loss function. Since the loss function is convex and differentiable, expectile regression has a computational advantage over quantile regression. Similar to quantile regression, expectile regression makes no assumption on error distribution (e.g., homoscedasticity) and can be used to study the entire distribution of the responses. Expectile regression can be viewed as a generalization of linear regression. A typical expectile regression assumes a linear relationship between the expectile and the covariates, which may not be suitable for genetic data analysis as genetic variants likely influence phenotypes in a complicated manner (e.g., through interactions) [21]. 
Simply considering linear and additive genetic effects can’t fully take this complexity into account. In this chapter, we integrate the idea of neural networks into expectile regression and de- velop an expectile neural network (ENN) method to model the complex relationship between genotypes and phenotypes. While several methods have been developed to integrate neural networks into quantile regression[22; 23; 24], few studies have been focused on investigating nonlinear expectile regressions, especially using neural networks. Compared to quantile re- gression neural networks(QRNN), ENN has several advantages. The empirical loss function in ENN is differentiable everywhere. Moreover, ENN can detect the heteroscedasticity in the data since ENN is more sensitive to extreme values than QRNN[25; 26; 27; 28; 29]. The rest of the chapter is organized as follows: in Section 2, we review expectile regres- sion and propose an ENN method. We then give an inequality that bounds the integrated squared error of an expectile function estimator in terms of risk functions. The proof of 11 inequality is detailed in the Appendix. Simulations were conducted in Section 3 to evalu- ate the performance of the new method. In Section 4, we applied ENN to the SAGE data, studying genetic contribution to smoking quantity. We provide the summary and concluding remarks in Section 5. 2.3 Method In this section, we briefly introduce expectile regression and then propose an expectile neural network. Suppose we have n samples,{(xi , yi ), i = 1, ..., n}, where xi = (1, xi,1 , ..., xi,p )T and yi denote a p−dimensional covarites and the response for the ith sample, respectively. In this chapter, the covariates are primarily genetic variants, such as single nucleotide poly- morphisms (SNPs), which are typically coded as the number of minor frequent allele (e.g., AA=2, Aa=1, aa=0). The covariates xi can also include personal characteristics (e.g., gen- der) and environmental determinants. The response yi is the set of observable characteristics of an individual in genetics. For example, yi could be the type of diabetes, or the height of an individual. By building models between xi and yi , we tend to explore the relationship of candidate genes and certain disease. 2.3.1 Expectile regression Given the data, linear regression is commonly used to model the relationship between the covariates and the mean response. However, if we want to explore a complete relationship between the covarites and the response (e.g., genetic contribution to a high-risk population), an expectile regression can be used. To simplify the notation, we denote expectile regression 12 as ER. The expectile regression for the τ −expectile can be expressed as, Expectile(τ ) = xT β̂, (2.1) where β̂ is the estimator of coefficients β = (β0 , β1 , ..., βp )T . The expectile is also closely related to two commonly used measures in mathematical finance, value at risk and expected shortfall. The regression parameters, β̂, can be obtained by minimizing an asymmetric L2 loss function, n 1X RLτ (β; τ ) = Lτ (yi , xi T β), 0 < τ < 1, (2.2) n i=1 where Lτ (·) is asymmetric squared loss with convex form  (1 − τ )(yi − xi T β)2 , if yi < xi T β   T L(yi , xi β) = (2.3) τ (yi − xi T β)2 , if yi ≥ xi T β.   Minimizing asymmetically weighted sums of squared errors yields the the expectiles. If we minimize sums of asymmetrically weighted absolute errors, the estimators are quantiles. 
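To make the estimation step concrete, the following self-contained sketch (not the chapter's own implementation) fits a linear τ-expectile regression by iteratively reweighted least squares: residuals above the current fit receive weight τ and residuals below it receive weight 1 − τ, exactly as in the asymmetric squared loss (2.3). The data-generating model and all settings in the usage example are hypothetical.

```python
import numpy as np

def expectile_regression(X, y, tau, n_iter=100, tol=1e-8):
    """Fit a linear tau-expectile regression by iteratively
    reweighted least squares on the asymmetric squared loss (2.3)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])                # add an intercept
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]         # least-squares start
    for _ in range(n_iter):
        resid = y - Xd @ beta
        w = np.where(resid >= 0, tau, 1.0 - tau)         # asymmetric weights
        sw = np.sqrt(w)
        # Weighted least-squares step with the current weights.
        beta_new = np.linalg.lstsq(sw[:, None] * Xd, sw * y, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Toy example with heteroscedastic noise: upper and lower expectiles differ.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(500, 5)).astype(float)      # SNP-like covariates
y = X @ rng.uniform(-1, 1, 5) + (1 + X[:, 0]) * rng.normal(size=500)
for tau in (0.1, 0.5, 0.9):
    print(tau, expectile_regression(X, y, tau)[:3].round(2))
```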
In contrast to the quantiles, expectiles have a more global dependence on the form of the distribution. Shifting mass in the lower tail of a distribution has no impact on the quantiles of the upper tail, but it will affect all expectiles. We cite Figure 2.1 to show the relationship between quantiles and expectiles [54]. For a model with a large p, a penalty term can be added to the risk function to reduce the model complexity,

$$R_{L_\tau}(\beta;\tau) = \frac{1}{n}\sum_{i=1}^{n} L_\tau(y_i - x_i^T\beta) + \lambda\sum_{i=1}^{p}\beta_i^2. \qquad (2.4)$$

τ is a hyperparameter between 0 and 1. By tuning τ, we can obtain different conditional distributions of the responses, similar to quantile regression; quantile regression, however, uses an asymmetric absolute value loss. When τ = 0.5, the corresponding expectile regression degenerates to a standard linear regression. Therefore, expectile regression can also be viewed as a generalization of linear regression: just as quantile regression generalizes median regression, expectile regression generalizes mean regression.

Figure 2.1: Quantiles and expectiles.

2.3.2 Expectile neural network

A typical expectile regression model focuses on linear relationships between covariates and responses. In reality, the underlying relationship could be non-linear and involve complicated interactions among covariates. In order to model complex relationships between covariates and responses, we integrate the idea of neural networks into expectile regression and propose the ENN method. A neural network is a powerful nonlinear approximator: every continuous function can be approximated well by a neural network with one hidden layer [33]. We do not assume a particular functional form for the covariates and use neural networks to approximate the underlying expectile regression function. ENN can be considered as a nonparametric expectile regression, or equivalently as a neural network with an asymmetric L2 loss function. We illustrate ENN with one hidden layer; the method can be easily extended to an expectile regression deep neural network with multiple layers.

Figure 2.2: A graphical representation of expectile neural network

Given the covariates $x_t$, we first build the hidden nodes $h_{q,t}$,

$$h_{q,t} = f^{(1)}\Big(\sum_{p=1}^{P} x_{p,t}\, w_{pq}^{(1)} + b_q^{(1)}\Big), \quad q = 1, \ldots, Q, \; t = 1, \ldots, n, \qquad (2.5)$$

where Q is the number of nodes in the first hidden layer, $w_{pq}^{(1)}$ denotes the weights and $b_q^{(1)}$ denotes the bias; $f^{(1)}$ is the activation function for the hidden layer, which can be a sigmoid function, a hyperbolic tangent function, or a rectified linear unit (ReLU) function. Similar to hidden nodes in neural networks, the hidden nodes in ENN can learn complex features from the covariates x, which makes ENN capable of modelling non-linear and non-additive effects. Based on these hidden nodes, we can model the conditional τ-expectile, $\hat{y}_\tau(t)$,

$$\hat{y}_\tau(t) = f^{(2)}\Big(\sum_{q=1}^{Q} h_{q,t}\, w_q^{(2)} + b^{(2)}\Big), \qquad (2.6)$$

where $f^{(2)}$, $w_q^{(2)}$, and $b^{(2)}$ are the activation function, weights, and bias in the output layer, respectively. $f^{(2)}$ can be an identity function, a sigmoid function, or a ReLU function. To illustrate the structure of ENN, a graphical representation of ENN is given in Figure 2.2. From equations (2.5) and (2.6), we have the ENN model:

$$\hat{y}_\tau(t) = f^{(2)}\Big(\sum_{q=1}^{Q} f^{(1)}\Big(\sum_{p=1}^{P} x_{p,t}\, w_{pq}^{(1)} + b_q^{(1)}\Big) w_q^{(2)} + b^{(2)}\Big). \qquad (2.7)$$

If we choose τ = 0.5 and take both $f^{(1)}$ and $f^{(2)}$ to be the identity function, ENN reduces to linear regression.
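Written out in code, the model in (2.7) is simply a one-hidden-layer feed-forward map. The following sketch is illustrative only (the variable names and sizes are hypothetical): it computes the conditional τ-expectile for a matrix of genotypes, using a sigmoid hidden activation and an identity output activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enn_forward(X, W1, b1, w2, b2, f1=sigmoid, f2=lambda z: z):
    """One-hidden-layer ENN of equation (2.7).

    X  : (n, P) covariate matrix (e.g., SNPs coded 0/1/2)
    W1 : (P, Q) hidden-layer weights w_pq^(1)
    b1 : (Q,)   hidden-layer biases  b_q^(1)
    w2 : (Q,)   output weights       w_q^(2)
    b2 : scalar output bias          b^(2)
    """
    H = f1(X @ W1 + b1)        # hidden nodes h_{q,t}, equation (2.5)
    return f2(H @ w2 + b2)     # conditional tau-expectile, equation (2.6)

# Tiny illustration: 4 SNPs, 3 hidden units.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 4)).astype(float)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
w2, b2 = rng.normal(size=3), 0.0
print(enn_forward(X, W1, b1, w2, b2))
```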
To estimate $w_{pq}^{(1)}$, $b_q^{(1)}$, $w_q^{(2)}$, and $b^{(2)}$, we minimize the empirical risk function

$$R(\tau) = \frac{1}{n}\sum_{i=1}^{n} L_\tau\big(y_i, f(x_i)\big), \qquad (2.8)$$

where

$$L_\tau\big(y_i, f(x_i)\big) = \begin{cases} (1-\tau)\,(y_i - f(x_i))^2, & \text{if } y_i < f(x_i), \\ \tau\,(y_i - f(x_i))^2, & \text{if } y_i \geq f(x_i). \end{cases} \qquad (2.9)$$

The model tends to overfit as the number of covariates increases. To address the overfitting issue, an L2 penalty is added to the risk function,

$$R(\tau) = \frac{1}{n}\sum_{i=1}^{n} L_\tau\big(y_i, f(x_i)\big) + \lambda\Big(\sum_{p=1}^{P}\sum_{q=1}^{Q}\big(w_{pq}^{(1)}\big)^2 + \sum_{q=1}^{Q}\big(w_q^{(2)}\big)^2\Big). \qquad (2.10)$$

The loss function for ENN is differentiable everywhere, which gives us a computational advantage. Even so, a closed-form estimator like that of linear regression is not available because of the indicator function implicit in the asymmetric weights. We can obtain the ENN estimator by using gradient-based optimization algorithms (e.g., the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm). In numerical optimization, the BFGS algorithm is an iterative method for solving unconstrained nonlinear optimization problems [34].

2.3.3 Theoretical result

Intuitively, for a fixed τ, the estimation error of the τ-expectile is related to the risk function. To state the result, some notation is changed. We give one theoretical result showing that the error of the τ-expectile is bounded, above and below, in terms of the excess risk $R_{L_\tau,P}(f) - R^*_{L_\tau,P}$. In ENN, the τ-expectile $f^*_{L_\tau,P}$ can be estimated by minimizing the asymmetric least squares (ALS) loss,

$$R^*_{L_\tau,P} = \inf\Big\{ R_{L_\tau,P}(f) = \int_{X\times Y} L_\tau\big(y, f(x)\big)\, dP(x,y) \;\Big|\; f: X \to \mathbb{R} \text{ measurable} \Big\},$$

where P is the distribution on X × Y and f : X → R is some predictor. The following theorem describes the upper and lower bounds of the error of $f^*_{L_\tau,P}$.

Theorem 2.3.1. Let $L_\tau$ be the ALS loss function and P be the distribution on X × Y. We further assume that $f^*_{L_\tau,P} < \infty$ is the τ-expectile for fixed τ ∈ (0, 1). Then, for an arbitrary neural network function f, we have

$$C_\tau^{-1/2}\big(R_{L_\tau,P}(f) - R^*_{L_\tau,P}\big)^{1/2} \;\leq\; \|f - f^*_{L_\tau,P}\|_{L_2(P_x)} \;\leq\; c_\tau^{-1/2}\big(R_{L_\tau,P}(f) - R^*_{L_\tau,P}\big)^{1/2},$$

where $c_\tau = \min\{\tau, 1-\tau\}$ and $C_\tau = \max\{\tau, 1-\tau\}$.

Proof of this theorem can be found in the appendix of the chapter.

2.4 Simulation

Simulation studies were conducted to compare the performance of ENN and ER under different settings. The genetic data used in the simulation are the real sequencing data from the 1000 Genomes Project, located on Chromosome 17: 7344328-8344327 [30]. In total, 1000 replicates were simulated for each simulation setting. In each replicate, we randomly selected a number of samples and SNPs from the 1000 Genomes Project based on the simulation settings. Given the genotypes, we further simulated the phenotype by using different linear/non-linear functions or by assuming different types of interactions among SNPs or genes. We divided the samples into training, validation, and testing sets with the ratio 3:1:1. ENN and ER were applied to the training set to build models. While a variety of activation functions can be used in ENN, we chose ReLU due to its performance and computational advantage [10]. Since the loss function of ENN is differentiable, we use the quasi-Newton BFGS optimization algorithm to estimate the parameters in ENN. We chose the starting point carefully to avoid local minima. To select a proper starting point, we generated several sets of initial values from U[−1, 1], ran the algorithm for a few steps from each, and chose the set achieving the smallest loss as the initial values.
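The sketch below shows one way this estimation step could look in code: the penalized risk (2.10) with a ReLU hidden layer and an identity output is minimized by SciPy's quasi-Newton BFGS routine, using the multi-start strategy just described (short trial runs from U[−1, 1] starting values, then a full run from the best starting point). The function names, the number of hidden units, and the simulated data are hypothetical; this is a sketch, not the dissertation's own implementation.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(theta, P, Q):
    """Unpack a flat parameter vector into (W1, b1, w2, b2)."""
    W1 = theta[:P * Q].reshape(P, Q)
    b1 = theta[P * Q:P * Q + Q]
    w2 = theta[P * Q + Q:P * Q + 2 * Q]
    b2 = theta[-1]
    return W1, b1, w2, b2

def penalized_risk(theta, X, y, tau, lam, P, Q):
    """Penalized empirical risk (2.10): asymmetric squared loss plus L2 penalty."""
    W1, b1, w2, b2 = unpack(theta, P, Q)
    H = np.maximum(X @ W1 + b1, 0.0)            # ReLU hidden layer
    fhat = H @ w2 + b2                          # identity output layer
    resid = y - fhat
    w = np.where(resid >= 0, tau, 1.0 - tau)    # asymmetric weights
    return np.mean(w * resid ** 2) + lam * (np.sum(W1 ** 2) + np.sum(w2 ** 2))

def fit_enn(X, y, tau, lam=0.1, Q=3, n_starts=10, seed=0):
    """Multi-start quasi-Newton (BFGS) estimation of the ENN parameters."""
    n, P = X.shape
    rng = np.random.default_rng(seed)
    dim = P * Q + 2 * Q + 1
    starts = [rng.uniform(-1, 1, dim) for _ in range(n_starts)]
    # Run a few BFGS steps from each candidate start; keep the best start.
    trials = [(minimize(penalized_risk, t0, args=(X, y, tau, lam, P, Q),
                        method="BFGS", options={"maxiter": 20}).fun, t0)
              for t0 in starts]
    _, best_t0 = min(trials, key=lambda t: t[0])
    # Full BFGS run from the selected starting values until convergence.
    return minimize(penalized_risk, best_t0, args=(X, y, tau, lam, P, Q),
                    method="BFGS")

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 5)).astype(float)
y = np.sin(X @ rng.uniform(-1, 1, 5)) + rng.normal(size=300)
print(fit_enn(X, y, tau=0.75).fun)
```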
Based on the initial values, the quasi- Newton BFGS optimization algorithm is implemented to iteratively estimate the parameters until the convergence criterion is satisfied. The models built on the training set were then applied to the validation set to choose the most parsimonious model with the optimal tuning parameter (i.e., λ). To choose the best λ, we use the grid search with different values of 0,0.1,1,10,100. This final model was then evaluated on the testing set by using the mean squared error (MSE). We chose the number of hidden nodes with smallest MSE value by 19 doing simulation. We simplify those terms: expectile neural network, expectile regression, training data and testing data as ENN, ER, TR, TS in three simulations. 2.4.1 Simulation I - nonlinear relationship In simulation I, we varied the relationships between genotypes and phenotypes. Since the existence of hyperparameter τ , we compared the performances of ENN with ER. If we wanted to compare with other model, we need to fix τ . The existence of τ gived us a complete view of genetic effects across the entire distribution of phenotypes, like quantile regression. If τ is close to 0 or 1, we could investigate genetic contribution to high-risk individuals. Specially, we considered the following four nonlinear functions as true functions to simulate the relationship between genotypes and phenotypes. For comparison purpose, we also include a linear function. We compare ENN with ENN under four different nonlinear functions: hyperbolic function, mixed function, quadratic function, cubic function. 1. linear function: y = α + , α = xT β, 2. Hyperbolic function: |α| y= + , α = xT β, (1 + |α|) 3. Mixed function: y = sin(α) + 2 ∗ exp(−16α2 ) + , α = xT β, 4. Quadratic function: y = α2 + , α = xT β, 20 5. Cubic function: y = α3 + , α = xT β, where x is the vector of SNPs (coded as 0, 1 or 2), β represents the genetic effects generated from the uniform distribution of U (−1, 1), and  ∼ N (0, 1). Totally 1000 replicates were simulated by setting  with different seed. For each replicate, We randomly choose 500 samples and 50 SNPs from the 1000 Genomes Project. For each nonlinear function, we choose five different value τ of 0.1, 0.25, 0.5, 0.75, 0.9 in order to get different expectiles. To have better readability, the columns of validation data are not shown. 21 Figure 2.3: Performance comparison between ENN and ER under various relationships be- tween genotypes and phenotypes and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) 22 The results from the simulation I are summarized in Figure 2.3. ENN outperforms ER in terms of MSE under four different nonlinear relationships, and has comparable performance with ER when the underlying relationship is linear. The pattern is consistent across different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9). While ENN outperforms ER for all four non-linear cases, ENN attains its best performance relative to ER when the underlying relationship is a high-order polynomial function (i.e., a cubic function). From the simulation result, ENN has advantages to explore the underlying nonliner relationship between genetic variants and certain disease. By fixing τ as 0.1 or 0.9, we could apply ENN into real data to identify high-risk individuals. 2.4.2 Simulation II - interactions among SNPs Increasing empirical evidence from model organisms and human studies suggests that in- teractions among loci contribute broadly to complex traits[36; 37; 38]. 
In simulation II, we considered three different interactions scenarios that attempt to mimic simple biological mechanisms. Those three types of interactions included a two-way multiplicative interaction, a two-way threshold interaction, and a three-way interactions [14]. Similar to simulation I, we simulated 1000 replicates for each type of interaction. We use the same structure of ENN like simulatino II. For each replicate, 500 samples and 50 SNPs were chosen from the 1000 Genomes Project. Among the 50 SNPs, we randomly selected 20% of SNPs and simulated different types of interactions among the selected SNPs. Based on the simulated data, we compared MSEs of ENN and ER. For the comparison purpose, we also included a baseline model without any interaction. Only training and testing data are shown. 23 Figure 2.4: Performance comparison between ENN and ER for different types of interactions and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) The results of the simulation II are summarized in Figure 2.4. Overall, ENN outperforms ER under all three interaction scenarios due to its ability of taking interactions into account. Among all interaction models, ENN attains its best performance relative to ER when there are three-way interactions. ENN also has more advantage over ER at the upper and lower expectiles (e.g., 0.1 and 0.9). When there is no interaction, ENN has comparable performance with ER. 24 2.4.3 Simulation III - interactions between genes Following the identification of several disease-associated polymorphisms by whole genome association analysis, investigating interactions among two or more than two genes is often interested in genetic studies[35]. Detecting gene-gene interaction will allow us to elucidate the biological and biochemical pathways underpinning disease. Figure 2.5: An alternative architecture for gene-gene interaction analyses While a fully connected neural network can be built on all SNPs in the genes of interest, a neural network with a simpler architecture reflecting the underlying genetic data structure can be used to reduce the model’s complexity and improve the model’s performance. In this simulation, We illustrate the idea by modeling interactions between two genes with a non- fully connected architecture. In the non-fully connected architecture, the hidden units are only locally connected to SNPs in one gene (Figure 2.5). By using this simple architecture, we can reduce the number of parameters and build ”gene-specific” hidden units to capture abstract features of a specific gene. To evaluate the performance of such an architecture, we 25 simply simulated four SNPs for each gene, considered a two-way multiplicative interaction between two genes, and compared ENN with the non-fully connected architecture to ENN with a fully connected architecture. Figure 2.6: Performance comparison between ENN with a fully connected architecture and ENN with a non-fully connected architecture for gene-gene interaction analyses Figure 2.6 summarizes the results from simulation III. The results show that ENN with the non-fully connected architecture attains lower MSE than ENN with the fully-connected architecture. As expected, the non-fully connected architecture requires fewer parameters and more reflects the underlying genetic data structure (i.e., genes are separate functional units), and therefore attains better performance than the fully-connected architecture. By reducing the number of parameters, we have more computational advantage. 
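To make the idea concrete, here is a hedged PyTorch sketch of such a gene-specific architecture (the class name, gene sizes, hidden-unit counts, and training settings are illustrative assumptions, not the configuration used in the simulation): each gene's SNPs feed only into that gene's hidden units, the gene-level features are combined in the output layer, and the network is trained with the asymmetric squared loss (2.9).

```python
import torch
from torch import nn

class GeneSpecificENN(nn.Module):
    """Non-fully connected ENN: each gene has its own hidden units
    (as in Figure 2.5); the output layer combines the gene-level features."""

    def __init__(self, snps_per_gene=(4, 4), hidden_per_gene=2):
        super().__init__()
        # One small hidden block per gene, connected only to that gene's SNPs.
        self.gene_blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(p, hidden_per_gene), nn.ReLU())
             for p in snps_per_gene]
        )
        self.sizes = list(snps_per_gene)
        self.out = nn.Linear(hidden_per_gene * len(snps_per_gene), 1)

    def forward(self, x):
        # Split the SNP matrix by gene and pass each piece through its block.
        pieces = torch.split(x, self.sizes, dim=1)
        h = torch.cat([blk(p) for blk, p in zip(self.gene_blocks, pieces)], dim=1)
        return self.out(h)

def expectile_loss(pred, target, tau):
    """Asymmetric squared loss (2.9)."""
    resid = target - pred
    weight = tau * (resid >= 0).float() + (1.0 - tau) * (resid < 0).float()
    return torch.mean(weight * resid ** 2)

# Toy usage: two genes with four SNPs each and a multiplicative interaction.
torch.manual_seed(0)
x = torch.randint(0, 3, (200, 8)).float()
y = (x[:, :4].sum(1) * x[:, 4:].sum(1)).unsqueeze(1) + torch.randn(200, 1)
model = GeneSpecificENN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = expectile_loss(model(x), y, tau=0.75)
    loss.backward()
    opt.step()
print(loss.item())
```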
2.5 Real data applications Tobacco use is the leading cause of preventable disease and death in the United States. In 2019, nearly 34 million adults currently smoked cigarettes. More than 16 million Americans 26 are related to a disease caused by smoking. More than 300 billion a year are spent in direct medical care for adults or in lost productivity due to premature death and exposure to secondhand smoke in United States. More than 7 million deaths per year are caused by tobacco use in the world(https://www.cdc.gov/tobacco/data_statistics/index.htm). Predicting high-risk individuals at early stage so that appropriate prevention methods can be used to reduce mortality and morbidity. In this section, we applied ENN into analyzing two real data set. The first one is to explore genetic effects on nicotine dependence. In the second real data analysis, we take gene-gene interactions into consideration. Since the existence of hyperparameter τ , we choose ER as baseline. Five different τ values 0.1, 0.25, 0.5, 0.75, 0.9 are chosen. We use mean square error(MSE) as metrics to measure the performance of ENN and ER. 27 Table 2.1: The accuracy performance of two models built by ENN and ER based on 149 candidate SNPs and 3 covariates ENN ER τ Train Test Train Test 0.1 409.612 678.331 504.215 694.809 0.25 346.118 579.164 394.836 588.759 0.5 358.783 502.752 342.144 535.925 0.75 344.399 604.969 421.955 613.676 0.9 570.994 809.733 699.654 882.781 2.5.1 The relationship between candidate SNPs with smoking quan- tities We applied both ENN and ER to the genetic data from the Study of Addiction: Genet- ics and Environment(SAGE). The participants of the SAGE are selected from three large and complementary studies: the Family Study of Cocaine Dependence(FSCD), the Collab- orative Study on the Genetics of Alcoholism(COGA), and the Collaborative Genetic Study of Nicotine Dependence(COGEND). In this application, we selected 155 SNPs, which were previously shown to have a potential role in nicotine dependence. After quality control, 149 SNPs remained for the analysis. There are a total of 3897 samples in the SAGE data from different ethnic groups. We only included 3888 Caucasian and African American samples due to the small sample size of other ethnic groups. Our interest is to use ENN and ER to build models on 149 SNPs, 3 covariates (i.e., sex, age, and race), and smoking quantities, which is measured by the largest number of cigarettes smoked in 24 hours. We divided the whole sample into the training, validation and test samples in the ratio of 3:1:1 to build the models, select the turning parameter, and evaluate the models, respectively. Table 2.1 summarizes MSE of the models built by ENN and ER for five expectile levels (i.e., τ = 0.1, 0.25, 0.5, 0.75, and 0.9). For readability, MSE of validation data is omitted. 28 Table 2.1 shows that ENN outperforms ER, indicating the possibility of non-linear or non- additive effects among candidate SNPs and covariates. Figure 2.7: A comprehesive view of the conditional distribution of smoking quantity for five expectile levels (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) To provide a comprehensive view of the conditional distribution of smoking quantity, we ordered the expectiles estimated from ENN from lowest to highest and plotted their values for all five expectile levels. Figure 2.7 shows that the distributions of estimated expectiles are different across five expectile levels. Under different expectile levels, different expectiles are predicted. 
When τ = 0.5, ENN models the mean response, in which the estimated expectiles are similar for all individuals. Nonetheless, for high expectile levels (e.g., τ = 0.9), the estimated expectiles vary among individuals and high-ranked individuals have much higher expectiles than low-ranked individuals. ENN gives us more information compared to linear regression which only shows predicted value with τ = 0.5. 29 2.5.2 Gene-gene interactions between the CHRNA5-CHRNA3- CHRNB4 gene cluster Based on previous genome-wide association studies, variants in the CHRNA5-CHRNA3- CHRNB4 gene cluster on chromosome 15 that encode the α5, α3 and β4 subunits of the nicotinic acetylcholine receptor (nAChRs) are associated with nicotine dependence (ND) in European Americans (EAs) or others of European origin[31]. In the second data analysis, we focused on the CHRNA5-CHRNA3-CHRNB4 gene cluster, and evaluated potential interac- tions by using ENN and ER. We consider three pairwise interactions between CHRNA5 and CHRNA3, CHRNA5 and CHRNB4, CHRNA3 and CHRNB4. The phenotype of interest in this analysis is the number of cigarettes smoked per day (CPD), which has been popularly used in the genetic study of nicotine dependence. 30 Table 2.2: Evaluating a pairwise interaction between CHRNA5 and CHRNA3 by using ENN and ER ENN ER τ Train Test Train Test 0.1 1.106 2.022 1.183 2.036 0.25 0.994 1.699 1.027 1.737 0.5 0.896 1.266 0.908 1.304 0.75 1.148 1.045 1.136 1.066 0.9 2.015 1.335 2.069 1.357 Table 2.3: Evaluating a pairwise interaction between CHRNA5 and CHRNB4 by using ENN and ER ENN ER τ Train Test Train Test 0.1 1.139 2.020 1.186 2.049 0.25 0.980 1.701 1.029 1.735 0.5 0.901 1.277 0.908 1.305 0.75 1.149 1.047 1.136 1.071 0.9 2.054 1.318 2.070 1.351 Tables 2.2-2.4 summarize MSE of the interaction models built by using ENN and ER for five expectile levels. For all 3 scenarios, expectile neural network outperforms expectile regression in terms of MSE slightly because the signal-to-noise ratio of genetic data is low. To graphically view the conditional distribution of CPD, we ranked the expectiles esti- mated from ENN and plotted the values against the estimated expectiles (Figures 2.8-2.10). Table 2.4: Evaluating a pairwise interaction between CHRNA3 and CHRNB4 by using ENN and ER ENN ER τ Train Test Train Test 0.1 1.133 2.019 1.183 2.035 0.25 0.979 1.683 1.020 1.696 0.5 0.892 1.278 0.896 1.279 0.75 1.150 1.048 1.128 1.081 0.9 2.020 1.342 2.040 1.386 31 Overall, the estimated expectiles tends to be similar when τ = 0.5 (i.e., mean), while they are quite different for high expectile levels (e.g., τ = 0.9). This suggest that the gene-gene interactions may play a more important role in models with high expectiles than the mean models. 32 Figure 2.8: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNA3 2.6 Summary and discussion In this chapter, we develop an ENN method, which inherits advantages from both neural networks and expectile regression. Using the hierarchical structure from neural networks, ENN can learn complex and abstract features from genotypes, making it suitable for modeling the complex relationship between genotypes and phenotype. Similar to ER, ENN can also explore the conditional distribution and provide a comprehensive view of the genotype- phenotype relationship. Through simulations and a real data application, we demonstrate that ENN outperforms ER when there are non-additive and non-linear effects. 
Evidence also suggests that ENN has more advantages than ER when the model involves high-order interaction effects or non-linear effects. This may suggest ENN has improved performance when the underlying genotype-phenotype relationships become more complicated. The real data analysis shows 33 Figure 2.9: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNB4 Figure 2.10: The conditional distribution of CPD considering the interaction between CHRNB4 and CHRNA3 34 that genetic effects can vary among different expertiles. Compared to the classical linear regression, ENN provides us more information about the genotype-phenotype relationship via the conditional distributions for different expectile levels. While regularization has been incorporated into ENN to avoid overfitting, ENN can still be subject to overfitting when the number of SNPs becomes extremely large (e.g., one million). To deal with such a large number of SNPs, we can model the overall genetic effect as a random effect and extend ENN, which is an interesting topic for future work. 35 Chapter 3 Asymptotic Theory of Expectile Neural Networks In the previous chapter, we focus on introducing the ENN model and providing an inequality that bounds the integrated squared error of an expectile function estimator. Statistical prop- erties of ENN (e.g., consistency) are also important topics that worth further investigation. In this chapter, we study the asymptotic properties of expectile neural networks, including consistency and normality. In ENN, we use the asymmetric square loss as the loss function. When the size of parameters is too large, the standard maximum likelihood procedures may not work. Therefore, we use the sieve method to constrain the parameter space of ENN, and prove the consistency and normality under the nonparametric regression framework. 3.1 Introduction Neural networks have been widely used in industry and academy. However, the theoreti- cal properties of neural networks have not been thoroughly studied. For a typical artificial neural network, we use the squared loss function to estimate parameters. A general result for the asymptotic normality of squared loss function could be find[52]. By the universal approximation theorem, a neural network with one hidden layer can approximate any con- tinuous functions[43]. In this chapter, we use the asymmetric squared loss function, which 36 gives us a comprehensive view of conditional distribution and computation advantage. In statistics, fitting a neural network can be considered as a parametric nonlinear regression problem, r αj σ(γjT xi + γ0,j ), X yi = α0 + j=1 where 1 , . . . , n are i.i.d. random errors with E[] = 0 and E[2 ] = σ 2 < ∞ and σ(z) = 1/(1+ e−z ). However, it is impractical to fix the number of hidden units r,. If we do not fix r, the parameter in unidentifiable. Fukumizu (1996)[55] and Fukumizu et al. (2003) [56] provided an example to illustrate the unidentifiable issue. If the true function is f0 (x) = ασ(γx) with one hidden unit, we fit the model using a neural network with two hidden units. Then, any parameter Θ = [α0 , α1 , . . . , αr , γ0,1 , . . . , γ0,r , γ1T , . . . , γrT ]T in the following set {Θ : γ1 = γ, α1 = α, γ0,1 = γ0,2 = α2 = α0 = 0}∪ {Θ : γ1 = γ2 = γ, γ0,1 = γ0,2 = α0 = 0, α1 + α2 = α} realizes the true function f0 (x). Therefore, when the number of hidden units is unknown, the parameters in this parametric nonlinear regression problem are unidentifiable. 
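The point is easy to verify numerically. The short sketch below (illustrative only; the particular values of α and γ are arbitrary) evaluates the two-hidden-unit network at two parameter settings drawn from the sets above — one with the second unit switched off, one with the weight α split across two identical units — and confirms that both reproduce f0(x) = ασ(γx) exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def net2(x, a0, a1, a2, g01, g02, gamma1, gamma2):
    """Two-hidden-unit network a0 + a1*sigma(gamma1*x + g01) + a2*sigma(gamma2*x + g02)."""
    return a0 + a1 * sigmoid(gamma1 * x + g01) + a2 * sigmoid(gamma2 * x + g02)

# True one-hidden-unit function f0(x) = alpha * sigma(gamma * x).
alpha, gamma = 1.5, 2.0
x = np.linspace(-3, 3, 200)
f0 = alpha * sigmoid(gamma * x)

# Two very different parameter settings that realize the same f0:
fit_a = net2(x, 0.0, alpha, 0.0, 0.0, 0.0, gamma, -7.3)                  # second unit switched off
fit_b = net2(x, 0.0, 0.4 * alpha, 0.6 * alpha, 0.0, 0.0, gamma, gamma)   # weight alpha split over two copies

print(np.max(np.abs(fit_a - f0)))   # ~0 up to floating point
print(np.max(np.abs(fit_b - f0)))   # ~0 up to floating point
```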
To address this issue, we can consider the neural network in the nonparametric setting. We assume that the true nonparametric regression model is as follows: yi = f0 (xi ) + i , where 1 , ..., n are i.i.d random variables defined on (Ω, A, P) with E() = 0 and E(2 ) = σ 2 < ∞. f0 ∈ F is an unknown function, where F is the class of continuous function with compact support. However, if the complexity of F is large, the estimator may be 37 inconsistent[48]. The standard and penalized maximum likelihood procedures may be ineffi- cient, whereas the method of sieves may be able to overcome this difficulty[52]. The method of sieves provides one way to tackle such difficulties by optimizing an empirical criterion over a sequence of approximating parameter spaces (i.e., sieves). The sieves are less complex but are dense in the original space, and the resulting optimization problem becomes well-posed. To address this issue, we constrain the class of F and use method of sieves to prove the normality of ENN. 3.2 Method of sieves Sieve is a sequence of increasing functions that can be used to reduce the number of param- eters. Sieve plays an important role in infinite-dimensional unknown parameter, such as in a nonparametric or semiparametric model. When the method of sieves is implemented, a nonparametric or semiparametric estimation problem is often reduced to a parametric one. However, to obtain the desired theoretical properties of the estimator, it is necessary that the number of parameters increases slowly with the sample size[42]. We consider a sequence of function classes, F1 ⊆ F2 ⊆ · · · ⊆ Fn ⊆ Fn+1 ⊆ · · · ⊆ F, S∞ approximating F in the sense that n=1 Fn is dense in F, that is for each f ∈ F, there exists πn f ∈ Fn such that d(f, πn f ) → 0 as n → ∞, where d(·, ·) is some pseudo-metric defined on F. The method of sieves consists of two key ingredients: a loss function and sieve parameter spaces (a sequence of approximating spaces). Both loss function and the sieve parameter spaces are flexible. Almost all of the classical loss functions, so long as they allow for 38 identification, can be used as loss functions in the method of sieve estimation. Therefore, the main challenge is the choice of sieve parameter spaces. In this chapter, we focus on the sieve of neural networks with one hidden layer and the sigmoid activation function. rn   rn γjT x + γ0,j Rd , αj , γ0,j X X Frn = {α0 + αj σ : γj ∈ ∈ R, |αj | ≤ Vn j=1 j=0 (3.1) Xd for some Vn ≥ 4 and max |γi,j | ≤ Mn for some Mn > 0}, 1≤j≤rn i=0 where rn , Vn , Mn → ∞ as n → ∞. Frn has some important properties. For example, Frn is dense in F and f ∈ Frn has upper bound. When we consider the asymptotic properties of the sieve estimators, we use the pseudo-norm kf k2n = n−1 n 2 P i=1 f (xi ). With some abuse of notation, an approximate sieve estimator fˆn is defined to be Qn (fˆn ) ≤ inf Qn (f ) + Op (ηn ), (3.2) f ∈Fn where ηn → 0 as n → ∞. We refer the reader to Chen for more details in the method of sieves [42]. Since we use the asymmetric loss function, we establish the upper bounds for the empirical risk and the sample complexity based on the covering number and the Vapnik-Chervonenkis dimension [41]. The estimator of expectile neural networks can also be regarded as M-estimator[50]. 3.3 Existence Before we study the consistency and normality of ENN, it is crucial to ask if the sieve esti- mator based on neural networks exists. In this chapter, we focus on Frn as sieve estimator. 39 First, we show that any function in Frn has an upper bound. 
3.3 Existence

Before we study the consistency and normality of ENN, it is crucial to ask whether the sieve estimator based on neural networks exists. In this chapter, we focus on $\mathcal{F}_{r_n}$ as the sieve space. First, we show that any function in $\mathcal{F}_{r_n}$ has an upper bound.

Lemma 3.3.1. For each fixed $n$, $\sup_{f \in \mathcal{F}_{r_n}} \|f\|_\infty \le V_n$.

Proof. For any $f \in \mathcal{F}_{r_n}$ with a fixed $n$ and any $x \in \mathcal{X}$, we have
\[
|f(x)| = \Big| \alpha_0 + \sum_{j=1}^{r_n} \alpha_j \sigma\big(\gamma_j^T x + \gamma_{0,j}\big) \Big| \le |\alpha_0| + \sum_{j=1}^{r_n} |\alpha_j| \sigma\big(\gamma_j^T x + \gamma_{0,j}\big) \le \sum_{j=0}^{r_n} |\alpha_j| \le V_n.
\]
Since the right-hand side does not depend on $x$ and $f$, we have
\[
\sup_{f \in \mathcal{F}_{r_n}} \|f\|_\infty = \sup_{f \in \mathcal{F}_{r_n}} \sup_{x \in \mathcal{X}} |f(x)| \le V_n.
\]

Lemma 3.3.2. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Then for each fixed $n$, $\mathcal{F}_{r_n}$ is a compact set.

The proof of this lemma is in the appendix. This lemma tells us that $\mathcal{F}_{r_n}$ is compact in $C(\mathcal{X})$, the set of all continuous functions on $\mathcal{X}$. We use Theorem 3.3.1 to show the existence of the ENN estimator [47].

Theorem 3.3.1. Let $(\Omega, \mathcal{F}, P)$ be a complete probability space and $(\Theta, \rho)$ be a metric space. Let $\{\Theta_n\}$ be a sequence of compact subsets of $\Theta$ and let $Q_n : \Omega \times \Theta_n \to \mathbb{R}$ be measurable $\mathcal{F} \times \mathcal{B}(\Theta_n)/\mathcal{B}$. Assume that for each $\omega$ in $\Omega$, $Q_n(\omega, \cdot)$ is lower semicontinuous on $\Theta_n$, $n = 1, 2, \dots$. Then for each $n = 1, 2, \dots$, there exists $\hat{\theta}_n : \Omega \to \Theta_n$ measurable $\mathcal{F}/\mathcal{B}(\Theta_n)$ such that for each $\omega$ in $\Omega$, $Q_n(\omega, \hat{\theta}_n(\omega)) = \inf_{\theta \in \Theta_n} Q_n(\omega, \theta)$.

Proof. The empirical criterion of ENN is
\[
Q_n(f) = \frac{1}{n} \sum_{i=1}^{n} \big| \tau - 1_{\{y_i < f(x_i)\}} \big| \big(y_i - f(x_i)\big)^2,
\]
which is continuous, and hence lower semicontinuous, in $f$. Combining this with the compactness of $\mathcal{F}_{r_n}$ (Lemma 3.3.2), the conditions above are satisfied, so the ENN sieve estimator exists.

3.4 Consistency

Definition 3.4.1. Let $\epsilon > 0$ and let $\mathcal{G}$ be a set of functions from $\mathbb{R}^d$ to $\mathbb{R}$. A finite collection of functions $g_1, \dots, g_N : \mathbb{R}^d \to \mathbb{R}$ such that for every $g \in \mathcal{G}$ there is a $j = j(g) \in \{1, \dots, N\}$ with
\[
\|g - g_j\|_\infty = \sup_x |g(x) - g_j(x)| < \epsilon
\]
is called an $\epsilon$-cover of $\mathcal{G}$ with respect to $\|\cdot\|_\infty$.

Definition 3.4.2. Let $\epsilon > 0$, let $\mathcal{G}$ be a set of functions from $\mathbb{R}^d$ to $\mathbb{R}$, and let $N(\epsilon, \mathcal{G}, \|\cdot\|_\infty)$ be the size of the smallest $\epsilon$-cover of $\mathcal{G}$ with respect to $\|\cdot\|_\infty$, with $N(\epsilon, \mathcal{G}, \|\cdot\|_\infty) = \infty$ if no finite $\epsilon$-cover exists. Then $N(\epsilon, \mathcal{G}, \|\cdot\|_\infty)$ is called the $\epsilon$-covering number of $\mathcal{G}$.

To prove the uniform law of large numbers, we need Lemma 3.4.1 [44].

Lemma 3.4.1. For $n \in \mathbb{N}$, let $\mathcal{G}_n$ be a set of functions $g : \mathbb{R}^d \to [0, B]$ and let $\epsilon > 0$. Then
\[
P\Big\{ \sup_{g \in \mathcal{G}_n} \Big| \frac{1}{n} \sum_{i=1}^{n} g(Z_i) - E g(Z) \Big| > \epsilon \Big\} \le 2 N(\epsilon/3, \mathcal{G}_n, \|\cdot\|_\infty)\, e^{-\frac{2n\epsilon^2}{9B^2}}.
\]

Theorem 3.4.1 (Uniform law of large numbers). Let $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$, write $g_1(Z) = (y - f(x))^2 1_{\{y \ge f(x)\}}$ and $g_2(Z) = (y - f(x))^2 1_{\{y < f(x)\}}$ for the two parts of the asymmetric squared loss, and let $\mathcal{G}_{n,1}$ and $\mathcal{G}_{n,2}$ denote the corresponding function classes as $f$ ranges over $\mathcal{F}_{r_n}$. If $[r_n(d+2)+1] \log[r_n(d+2)+1] = o(n)$, then
\[
\sup_{f \in \mathcal{F}_{r_n}} \Big| \frac{1}{n} \sum_{i=1}^{n} \tau\big[g_1(Z_i) - E(g_1(Z_i))\big] + (1-\tau)\big[g_2(Z_i) - E(g_2(Z_i))\big] \Big| \to 0 \quad \text{a.s., } n \to \infty.
\tag{3.3}
\]

Proof. Since the loss function has two parts, the empirical risk can also be split into two parts:
\[
\sup_{f \in \mathcal{F}_{r_n}} \Big| \frac{1}{n} \sum_{i=1}^{n} \tau\big[g_1(Z_i) - E(g_1(Z_i))\big] + (1-\tau)\big[g_2(Z_i) - E(g_2(Z_i))\big] \Big|
\le \sup_{g_1 \in \mathcal{G}_{n,1}} \tau \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| + \sup_{g_2 \in \mathcal{G}_{n,2}} (1-\tau) \Big| \frac{1}{n} \sum_{i=1}^{n} g_2(Z_i) - E(g_2(Z_i)) \Big|.
\tag{3.4}
\]
We focus on the first part, since the result for the second part can be derived in the same manner; that is, we show
\[
\sup_{g_1 \in \mathcal{G}_{n,1}} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| \to 0.
\tag{3.5}
\]
For $B > 0$, let $G(x) = \sup_{g_1 \in \mathcal{G}_{n,1}} |g_1(x)|$ and $\mathcal{G}_B = \{g_1 1_{\{G < B\}} : g_1 \in \mathcal{G}_{n,1}\}$. If $g_1 \in \mathcal{G}_{n,1}$,
\[
\begin{aligned}
\Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big|
&\le \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) 1_{\{G(Z_i) \le B\}} \Big| + \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) 1_{\{G(Z_i) \le B\}} - E\big(g_1(Z) 1_{\{G(Z) \le B\}}\big) \Big| \\
&\quad + \Big| E\big(g_1(Z) 1_{\{G(Z) \le B\}}\big) - E(g_1(Z)) \Big| \\
&\le \frac{1}{n} \sum_{i=1}^{n} G(Z_i) 1_{\{G(Z_i) > B\}} + \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) 1_{\{G(Z_i) \le B\}} - E\big(g_1(Z) 1_{\{G(Z) \le B\}}\big) \Big| + E\big(G(Z) 1_{\{G(Z) > B\}}\big).
\end{aligned}
\tag{3.6}
\]
This implies
\[
\sup_{g_1 \in \mathcal{G}_{n,1}} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big|
\le \sup_{g_1 \in \mathcal{G}_B} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| + \frac{1}{n} \sum_{i=1}^{n} G(Z_i) 1_{\{G(Z_i) > B\}} + E\big(G(Z) 1_{\{G(Z) > B\}}\big).
\tag{3.7}
\]
Based on $E(G(Z)) < \infty$ and the strong law of large numbers,
\[
\frac{1}{n} \sum_{i=1}^{n} G(Z_i) 1_{\{G(Z_i) > B\}} \to E\big(G(Z) 1_{\{G(Z) > B\}}\big) \quad \text{a.s. as } n \to \infty,
\]
and as $B \to \infty$, $E(G(Z) 1_{\{G(Z) > B\}}) \to 0$. Therefore, we only need to show
\[
\sup_{g_1 \in \mathcal{G}_B} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| \to 0.
\]
Recall that if $g$ is a function $g : \mathbb{R}^d \to [0, B]$, then by Hoeffding's inequality,
\[
P\Big\{ \Big| \frac{1}{n} \sum_{j=1}^{n} g(Z_j) - E(g(Z)) \Big| > \epsilon \Big\} \le 2 e^{-\frac{2n\epsilon^2}{B^2}}.
\tag{3.8}
\]
By Lemma 3.4.1, we have
\[
P\Big\{ \sup_{g_1 \in \mathcal{G}_{n,1}} \Big| \frac{1}{n} \sum_{j=1}^{n} g_1(Z_j) - E(g_1(Z)) \Big| > \epsilon \Big\} \le 2 N(\epsilon/3, \mathcal{G}_{n,1}, \|\cdot\|_\infty)\, e^{-\frac{2n\epsilon^2}{9B^2}}.
\tag{3.9}
\]
We use the upper bound on the covering number from Theorem 14.5 in Anthony and Bartlett [41],
\[
N(\epsilon/3, \mathcal{F}_{r_n}, \|\cdot\|_\infty) \le \left( \frac{12 e [r_n(d+2)+1] (\tfrac{1}{4} V_n)^2}{\epsilon (\tfrac{1}{4} V_n - 1)} \right)^{r_n(d+2)+1}.
\tag{3.10}
\]
Recall the definition of the covering number: $N(\epsilon/3, \mathcal{F}_{r_n}, \|\cdot\|_\infty) = N$ is the minimum number such that there exist functions $f_1, \dots, f_N$ with the property that for every $f \in \mathcal{F}_{r_n}$ there is a $j = j(f) \in \{1, \dots, N\}$ such that $\sup_x |f(x) - f_j(x)| < \epsilon$. Since $f(x)$ and $f_j(x)$ are close enough, $y - f(x)$ and $y - f_j(x)$ are either both negative or both positive, and in that situation
\[
\begin{aligned}
\sup_x \big| (y - f(x))^2 1_{\{y - f(x) \ge 0\}} - (y - f_j(x))^2 1_{\{y - f_j(x) \ge 0\}} \big|
&\le \sup_x \big| (y - f(x))^2 - (y - f_j(x))^2 \big| \\
&= \sup_x \big| 2y(f_j(x) - f(x)) + (f(x) - f_j(x))(f(x) + f_j(x)) \big| \\
&< 2(M_1 + M_2)\epsilon.
\end{aligned}
\tag{3.11}
\]
Since $y$ is bounded on $\mathcal{G}_B$ and any function in $\mathcal{F}_{r_n}$ is bounded, there exist $M_1$ and $M_2$ such that $|y| < M_1$ and $|f| < M_2$. So $N(\epsilon/3, \mathcal{G}_{n,1}, \|\cdot\|_\infty) \le N(\epsilon/3, \mathcal{F}_{r_n}, \|\cdot\|_\infty)$. If $[r_n(d+2)+1] \log[r_n(d+2)+1] = o(n)$, then
\[
\sum_{n=1}^{\infty} \exp\left\{ [r_n(d+2)+1] \log\left( \frac{12 e [r_n(d+2)+1] (\tfrac{1}{4} V_n)^2}{\epsilon (\tfrac{1}{4} V_n - 1)} \right) \right\} \cdot e^{-\frac{2n\epsilon^2}{9B^2}} < \infty,
\tag{3.12}
\]
and (3.5) follows from the Borel-Cantelli lemma.

Since we have proven the uniform law of large numbers, we use it to show the consistency of the ENN estimator. We rewrite the population loss criterion function as
\[
\bar{Q}_n(f) = \frac{1}{n} \sum_{i=1}^{n} E\Big[ \big| \tau - 1_{\{y_i < f(x_i)\}} \big| \big(y_i - f(x_i)\big)^2 \Big].
\]
The following general result gives sufficient conditions for the consistency of sieve estimators.

Theorem 3.4.2. Suppose that for every $\epsilon > 0$,
\[
P\Big( \omega : \sup_{\theta \in \Theta_n} |Q_n(\omega, \theta) - \bar{Q}(\theta)| > \epsilon \Big) \to 0 \quad \text{as } n \to \infty,
\qquad \text{and} \qquad
\inf_{\theta \in \eta^c(\theta_0, \epsilon)} \bar{Q}(\theta) - \bar{Q}(\theta_0) > 0,
\]
where $\eta^c(\theta_0, \epsilon)$ denotes the complement of the $\epsilon$-neighborhood of $\theta_0$. If $\{\Theta_n\}$ is an increasing sequence and $\cup_n \Theta_n$ is dense in $\Theta$, then $\rho(\hat{\theta}_n, \theta_0) \xrightarrow{P} 0$.

Lemma 3.4.2. Suppose $\frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2$ for $\frac{1}{2} < \tau < 1$. Then
\[
\inf_{f : \|f - f_0\|_n \ge \epsilon} \bar{Q}_n(f) - \bar{Q}_n(f_0) > 0.
\]

Proof. Recall that
\[
\bar{Q}_n(f) = \frac{1}{n} \sum_{i=1}^{n} E\Big[ \big| \tau - 1_{\{y_i < f(x_i)\}} \big| \big(y_i - f(x_i)\big)^2 \Big].
\]
Expanding $\bar{Q}_n(f) - \bar{Q}_n(f_0)$ with $y_i = f_0(x_i) + \epsilon_i$ shows that, for $0 < \tau \le \frac{1}{2}$,
\[
\bar{Q}_n(f) - \bar{Q}_n(f_0) \ge \tau \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > 0.
\tag{3.18}
\]
If $\tau > \frac{1}{2}$,
\[
\begin{aligned}
\bar{Q}_n(f) - \bar{Q}_n(f_0)
&\ge \tau \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 + (1 - 2\tau) \frac{1}{n} \sum_{i=1}^{n} E\big[\epsilon_i^2\big] + (1 - 2\tau) \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 \\
&= (1 - \tau) \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 + (1 - 2\tau)\sigma^2 \\
&> 0, \quad \text{if } \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2.
\end{aligned}
\tag{3.19}
\]
Therefore,
\[
\inf_{f : \|f - f_0\|_n \ge \epsilon} \bar{Q}_n(f) - \bar{Q}_n(f_0) > 0.
\tag{3.20}
\]

Since the conditions of Corollary 2.6 are satisfied, we have the consistency of the ENN sieve estimator.

Theorem 3.4.3. Under the notation given above, if $\frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2$ for $\frac{1}{2} < \tau < 1$ and $[r_n(d+2)+1] \log[r_n(d+2)+1] = o(n)$, then
\[
\|\hat{f}_n - f_0\|_n \xrightarrow{P} 0.
\]

Proof. By Theorem 3.4.2, Theorem 3.4.1, Lemma 3.4.2, and Lemma 3.3.2, we have $\|\hat{f}_n - f_0\|_n \xrightarrow{P} 0$.

3.5 Normality

We use the following theorem to prove the normality of the ENN sieve estimator [48].

Theorem 3.5.1. Suppose that $\mathcal{F}$ is a $P$-Donsker class of measurable functions and $\hat{f}_n$ is a sequence of random functions taking values in $\mathcal{F}$ such that
\[
\int \big( \hat{f}_n(x) - f_0(x) \big)^2 dP(x) \xrightarrow{P} 0
\]
for some $f_0 \in L_2(P)$. Then
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big( (\hat{f}_n - f_0)(X_i) - P(\hat{f}_n - f_0) \Big) \xrightarrow{P} 0,
\]
and
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(X_i) - P\hat{f}_n \big) \xrightarrow{d} N\big(0,\, P f_0^2 - (P f_0)^2\big).
\]

From Theorem 3.5.1, we need to check two conditions: $\mathcal{F}_{r_n}$ is a $P$-Donsker class, and $\int (\hat{f}_n(x) - f_0(x))^2 dP(x) \xrightarrow{P} 0$. Next, we give the definition of the Donsker class. In short, a class $\mathcal{F}$ of measurable functions $f$ is called Donsker if the sequence of empirical processes $\sqrt{n}(\mathbb{P}_n f - Pf)$ converges in distribution to a tight limit process. We now review the formal definition. Let $(\mathcal{X}, \mathcal{A}, P)$ be a probability space and let $G_P$ be a Gaussian process with zero mean and covariance $E[G_P(f) G_P(g)] = P(fg) - Pf \cdot Pg$.
We say that a class $\mathcal{F} \subset L_2(\mathcal{X}, \mathcal{A}, P)$ is a $G_P$BUC class if and only if the process $G_P(f, \omega)$ can be chosen so that for all $\omega$ the sample functions $f \mapsto G_P(f, \omega)$, $f \in \mathcal{F}$, are bounded and continuous for $\rho_P$.

Definition 3.5.1 (Donsker class). A class $\mathcal{F} \subset L_2(\mathcal{X}, \mathcal{A}, P)$ is called a Donsker class if and only if it is a $G_P$BUC class and there are processes $Y_j(f, \omega)$, $f \in \mathcal{F}$, $\omega \in \Omega$, where the $Y_j$ are independent copies of $G_P$ with $f \mapsto Y_j(f, \omega)$ bounded and $\rho_P$-uniformly continuous on $\mathcal{F}$ for each $j$, such that for every $\epsilon > 0$,
\[
P^*\Big\{ n^{-1/2} \max_{m \le n} \sup_{f \in \mathcal{F}} \Big| \sum_{j=1}^{m} \big( f(X_j) - Pf - Y_j(f) \big) \Big| > \epsilon \Big\} \to 0 \quad \text{as } n \to \infty.
\]

It is not convenient to check whether a class of functions is Donsker directly from the definition. A sufficient condition for a class to be Donsker is that it does not grow too fast, where the growth can be measured by the bracketing integral
\[
J_{[\,]}\big(\delta, \mathcal{F}, L_2(P)\big) = \int_0^{\delta} \sqrt{\log N_{[\,]}\big(\epsilon, \mathcal{F}, L_2(P)\big)}\, d\epsilon,
\]
where $N_{[\,]}(\epsilon, \mathcal{F}, L_2(P))$ is the bracketing number. If this integral is finite, then the class $\mathcal{F}$ is a Donsker class.

Theorem 3.5.2. $\mathcal{F}_{r_n}$ is a Donsker class.

Proof. By using the uniform covering number result for neural networks, we have
\[
N(\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty) \le \left( \frac{4 e [r_n(d+2)+1] (\tfrac{1}{4} V_n)^2}{\epsilon (\tfrac{1}{4} V_n - 1)} \right)^{r_n(d+2)+1}.
\]
By using the relationship between the packing number and the covering number, for small enough $\epsilon$ we have
\[
\begin{aligned}
\log N_{[\,]}\big(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big)
&\le \log\Big( 2 N\big(\epsilon/2, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big) \Big)
\le 2 \log N\big(\epsilon/2, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big) \\
&\le 2 [r_n(d+2)+1] \Big( \log \tilde{A}_{r_n, V_n, d} + \log \frac{1}{\epsilon} \Big),
\end{aligned}
\]
where $\tilde{A}_{r_n, V_n, d} = \frac{2 e V_n^2 [r_n(d+2)+1]}{V_n - 4}$. Letting
\[
A_{r_n, V_n, d} = [r_n(d+2)+1] \log \tilde{A}_{r_n, V_n, d} - [r_n(d+2)+1]
= [r_n(d+2)+1] \Big( \log \frac{2 e V_n^2 [r_n(d+2)+1]}{V_n - 4} - 1 \Big)
= [r_n(d+2)+1] \log \frac{2 V_n^2 [r_n(d+2)+1]}{V_n - 4},
\]
and using $V_n^2 - e V_n + 4e \ge 0$ for all $V_n$, we have
\[
\log \frac{2 V_n^2 [r_n(d+2)+1]}{V_n - 4} \ge \log \frac{V_n^2}{V_n - 4} \ge \log \frac{e(V_n - 4)}{V_n - 4} = 1.
\]
Then
\[
\begin{aligned}
\log N_{[\,]}\big(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big)
&\le 2 [r_n(d+2)+1] \Big( \log \tilde{A}_{r_n, V_n, d} + \log \frac{1}{\epsilon} \Big) \\
&\le 2 \tilde{A}_{r_n, V_n, d} + 2 [r_n(d+2)+1] \Big( \frac{1}{\epsilon} + 1 \Big) \quad (\text{since } \log x \le x - 1 \text{ for all } x > 0) \\
&\le 2 \tilde{A}_{r_n, V_n, d} \Big( 1 + \frac{1}{\epsilon} \Big).
\end{aligned}
\]
Since $\mathcal{F}_{r_n}$ is uniformly bounded by $V_n$, it is clear that $N_{[\,]}(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty) = 1$ for all $\epsilon \ge V_n$. Therefore, for each fixed $n$, we have
\[
\int_0^{\infty} \Big( \log N_{[\,]}\big(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big) \Big)^{1/2} d\epsilon
\lesssim \int_0^{V_n} \Big( 1 + \frac{1}{\epsilon} \Big)^{1/2} d\epsilon < \infty.
\]
Then $\mathcal{F}_{r_n}$ is a Donsker class. More details about the Donsker class can be found in van der Vaart and Wellner [49].

It is easy to check that the sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$ is a squashing function, since it is nondecreasing with $\lim_{x \to \infty} \sigma(x) = 1$ and $\lim_{x \to -\infty} \sigma(x) = 0$. We use Theorem 3.5.3 to verify that $\int (\hat{f}_n(x) - f_0(x))^2 dP(x) \xrightarrow{P} 0$.

Theorem 3.5.3. Let $\sigma$ be a squashing function. For each probability measure $\mu$ on $\mathbb{R}^d$, each measurable $f : \mathbb{R}^d \to \mathbb{R}$ with $\int |f(x)|^2 \mu(dx) < \infty$, and each $\epsilon > 0$, there exists a neural network $h(x)$ in
\[
\Big\{ h(x) = \sum_{i=1}^{k} c_i \sigma\big(a_i^T x + b_i\big) + c_0 : k \in \mathbb{N},\ a_i \in \mathbb{R}^d,\ b_i, c_i \in \mathbb{R} \Big\}
\]
such that
\[
\int |f(x) - h(x)|^2 \mu(dx) < \epsilon.
\]

Next, we establish the asymptotic normality of ENN. We assume that $f_0 \in \mathcal{F}$, where $\mathcal{F}$ is the class of continuous functions with compact support, and $f_0$ is the function to be estimated.

Theorem 3.5.4. Suppose $\hat{f}_n(x) \in \mathcal{F}$ is a sequence of random functions and $\int |f_0(x)|^2 dP(x) < \infty$. If the conditions for consistency hold, then
\[
\int \big( \hat{f}_n(x) - f_0(x) \big)^2 dP(x) \xrightarrow{P} 0,
\]
and for $f_0 \in L_2(P)$,
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big( (\hat{f}_n - f_0)(X_i) - P(\hat{f}_n - f_0) \Big) \xrightarrow{P} 0,
\qquad
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(X_i) - P\hat{f}_n \big) \xrightarrow{d} N\big(0,\, P f_0^2 - (P f_0)^2\big).
\]

Proof. Taking $\pi_{r_n} f_0 \in \mathcal{F}_{r_n}$, we have
\[
\|\hat{f}_n - f_0\|_2 \le \|\hat{f}_n - \pi_{r_n} f_0\|_2 + \|\pi_{r_n} f_0 - f_0\|_2.
\tag{3.21}
\]
Using the consistency result for ENN, we have $\|\hat{f}_n - \pi_{r_n} f_0\|_2 \xrightarrow{P} 0$. From Theorem 3.5.3, $\|\pi_{r_n} f_0 - f_0\|_2 < \epsilon$. Therefore,
\[
\int \big( \hat{f}_n(x) - f_0(x) \big)^2 dP(x) \xrightarrow{P} 0.
\]
Based on Theorem 3.5.1, we obtain the normality result:
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big( (\hat{f}_n - f_0)(X_i) - P(\hat{f}_n - f_0) \Big) \xrightarrow{P} 0,
\]
and
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(X_i) - P\hat{f}_n \big) \xrightarrow{d} N\big(0,\, P f_0^2 - (P f_0)^2\big).
\]

3.6 Simulation

To validate the theoretical properties of ENN, we ran simulations on the consistency and normality of ENN. We obtained the ENN estimator by using gradient-based optimization algorithms (e.g., the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm). The response was simulated through the following equation:
\[
y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \dots, n,
\tag{3.22}
\]
where $x_1, \dots, x_n \overset{i.i.d.}{\sim} N(0, 1)$ and $\epsilon_1, \dots, \epsilon_n \overset{i.i.d.}{\sim} N(0, 0.1^2)$. For the true function $f_0$, we consider three different nonlinear functions:

1. a neural network with a single hidden layer and two hidden units,
2. a polynomial function: $f_0(x) = x^3 + 1$,
3. a complex nonlinear function: $f_0(x) = \sin(x) + 2\exp(-16x^2)$.

3.6.1 Consistency

In this section, we used simulations to check the validity of the consistency result in Section 3.4. Since $\tau$ lies between 0 and 1, and ENN with $0.5 < \tau < 1$ requires one more condition than ENN with $0 < \tau \le 0.5$, we mainly considered ENN with $\tau = 0.5$ and $\tau = 0.75$. For ENN with $\tau = 0.75$, we made $\sigma^2$ smaller (e.g., $\sigma^2 = 0.01$) to satisfy the condition $\frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2$ for $\frac{1}{2} < \tau < 1$.

3.6.1.1 Simulation results of consistency with τ = 0.5

We chose five different sample sizes: 50, 100, 200, 500, and 1000. From Figure 3.1 to Figure 3.3, the fitted curve gets closer to the true function as the sample size increases.

Figure 3.1: Comparison between the true function $f_0$ and fitted functions under different sample sizes, where $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.5$.

Figure 3.2: Comparison between the true function $f_0 = x^3 + 1$ and fitted functions under different sample sizes with $\tau = 0.5$.

Figure 3.3: Comparison between the true function $f_0 = \sin(x) + 2\exp(-16x^2)$ and fitted functions under different sample sizes with $\tau = 0.5$.

3.6.1.2 Simulation results of consistency with τ = 0.75

We also chose five different sample sizes: 50, 100, 200, 500, and 1000. From Figure 3.4 to Figure 3.6, the fitted curve gets closer to the true function as the sample size increases. Overall, the simulation results are consistent with the theoretical findings.

Figure 3.4: Comparison between the true function $f_0$ and fitted functions under different sample sizes, where $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.75$.

Figure 3.5: Comparison between the true function $f_0 = x^3 + 1$ and fitted functions under different sample sizes with $\tau = 0.75$.

Figure 3.6: Comparison between the true function $f_0 = \sin(x) + 2\exp(-16x^2)$ and fitted functions under different sample sizes with $\tau = 0.75$.

3.6.2 Normality

In this section, we demonstrate the asymptotic normality derived in Theorem 3.5.4. The same true functions were used, but the random errors were sampled from the standard normal distribution. We used
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(x_i) - f_0(x_i) \big)
\tag{3.23}
\]
as the test statistic to draw the Q-Q plots. We varied the sample size (i.e., 50, 100, 200, 300, 400, and 500) when evaluating the three nonlinear functions.
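The following Python sketch mirrors the structure of these experiments; it is a minimal illustration only, and the number of hidden units, the random restarts, and the use of scipy's BFGS routine are assumptions rather than the exact implementation behind the figures. It simulates data from model (3.22) with the third true function, fits a small ENN by minimizing the asymmetric squared loss, and computes the statistic in (3.23).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enn_predict(theta, x, r):
    # theta packs [alpha_0, alpha_1..r, gamma_{0,1}..gamma_{0,r}, gamma_1..r] for scalar input x
    alpha0, alpha = theta[0], theta[1:r + 1]
    gamma0, gamma = theta[r + 1:2 * r + 1], theta[2 * r + 1:]
    return alpha0 + sigmoid(np.outer(x, gamma) + gamma0) @ alpha

def expectile_loss(theta, x, y, r, tau):
    resid = y - enn_predict(theta, x, r)
    w = np.where(resid < 0, 1.0 - tau, tau)     # |tau - 1{y < f(x)}|
    return np.mean(w * resid ** 2)

def fit_enn(x, y, r=3, tau=0.75, restarts=5):
    best = None
    for _ in range(restarts):                   # BFGS from several random starts
        theta0 = rng.normal(scale=0.5, size=3 * r + 1)
        res = minimize(expectile_loss, theta0, args=(x, y, r, tau), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best.x

f0 = lambda x: np.sin(x) + 2.0 * np.exp(-16.0 * x ** 2)     # third true function
for n in (50, 200, 1000):
    x = rng.normal(size=n)
    y = f0(x) + rng.normal(scale=0.1, size=n)               # model (3.22)
    theta_hat = fit_enn(x, y, tau=0.75)
    err = np.mean((enn_predict(theta_hat, x, 3) - f0(x)) ** 2)
    stat = np.sum(enn_predict(theta_hat, x, 3) - f0(x)) / np.sqrt(n)   # statistic (3.23)
    print(n, round(err, 4), round(stat, 4))

Repeating the last loop over many Monte Carlo replicates (with standard normal errors, as in the normality study) and passing the collected statistics to scipy.stats.probplot would produce Q-Q plots analogous to Figures 3.7-3.12.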
Figure 3.7: Q-Q plot with different sample sizes, where the true function $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.5$.

3.6.2.1 Simulation results of normality with τ = 0.5

From Figure 3.7 to Figure 3.9, the data points fall roughly along a straight line; that is, the test statistic
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(x_i) - f_0(x_i) \big)
\tag{3.24}
\]
fits the normal distribution well.

Figure 3.8: Q-Q plot with different sample sizes, where the true function is $f_0 = x^3 + 1$ with $\tau = 0.5$.

Figure 3.9: Q-Q plot with different sample sizes, where the true function is $f_0 = \sin(x) + 2\exp(-16x^2)$ with $\tau = 0.5$.

3.6.2.2 Simulation results of normality with τ = 0.75

From Figure 3.10 to Figure 3.12, we used the same test statistic and the data points again fall roughly along a straight line. Based on the simulation results, we demonstrated the validity of the normality of ENN.

Figure 3.10: Q-Q plot with different sample sizes, where the true function $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.75$.

Figure 3.11: Q-Q plot with different sample sizes, where the true function is $f_0 = x^3 + 1$ with $\tau = 0.75$.

Figure 3.12: Q-Q plot with different sample sizes, where the true function is $f_0 = \sin(x) + 2\exp(-16x^2)$ with $\tau = 0.75$.

3.7 Summary and discussion

In this chapter, we study the consistency and normality of ENN sieve estimators with one hidden layer. To overcome the issue of unidentifiability, we use the method of sieves to narrow down the choice of parameter space. The covering number is used to control the complexity of the rich class $\mathcal{F}_{r_n}$; by establishing an upper bound for the covering number of $\mathcal{F}_{r_n}$, we prove the consistency and normality of ENN. To check the validity of the theoretical results, we also ran simulations based on the conditions of the theorems. If we choose $\tau = 0.5$, then ENN becomes the traditional neural network. The ENN method inherits advantages from both neural networks and expectile regression. Using the hierarchical structure of neural networks, ENN can learn complex and abstract features from covariates, making it suitable for modeling the complex relationship between covariates and the response by tuning the hyperparameter $\tau$.

Although we focus on one-hidden-layer neural network sieve estimators with the sigmoid activation function in this chapter, the results can be extended to other neural networks and activation functions. For instance, they can potentially be extended to other popular activation functions (e.g., the rectified linear unit). Deep neural network structures are commonly used in convolutional neural networks and recurrent neural networks; therefore, it is worthwhile to investigate the asymptotic theory of different neural network architectures.

It may also be worthwhile to take the regularization of neural networks into consideration. Since the number of parameters in deep neural networks is large, overfitting is common in practice. Dropout is a regularization approach for neural networks that randomly drops hidden units during training [53]. In statistics, it is also common to add a penalty term to the loss to avoid overfitting. Establishing the asymptotic theory of neural networks with regularization is crucial when we apply neural networks to real data analysis. We will consider this problem in the future.
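As a small illustration of the penalty-based regularization mentioned above, one could add an $\ell_2$ penalty on all network weights to the ENN objective. The sketch below is a hypothetical variant of ours; the penalty form and the tuning constant lam are assumptions, and no asymptotic theory is claimed for it here.

import numpy as np

def penalized_expectile_loss(theta, x, y, r, tau, lam, predict):
    # Asymmetric squared (expectile) loss plus an l2 penalty on all parameters.
    # `predict(theta, x, r)` can be any one-hidden-layer forward pass, such as the
    # enn_predict function sketched in Section 3.6; `lam` is a hypothetical tuning constant.
    resid = y - predict(theta, x, r)
    w = np.where(resid < 0, 1.0 - tau, tau)
    return np.mean(w * resid ** 2) + lam * np.sum(theta ** 2)

In practice lam would be chosen on a validation set, and extending the sieve theory to such penalized estimators is part of the future work described above.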
Chapter 4

Summary and Discussion

This dissertation focuses mainly on developing a neural-network-based method, ENN, with applications to risk prediction with genetic data. We also study the statistical properties of ENN, including consistency and normality.

In Chapter 2, we develop a neural-network-based method called ENN. To demonstrate the performance of ENN, we run three different simulation settings: nonlinear effects, interactions among SNPs, and interactions between genes. If there are nonlinear or high-order interaction effects in genetic data, ENN outperforms ER. To model more complex relationships between genotypes and phenotypes, we change the architecture of ENN to a non-fully-connected architecture. By tuning the hyperparameter $\tau$, ENN can provide a comprehensive view of the genotype-phenotype relationship at different expectile levels. Different expectile levels could also help us identify high-risk individuals for certain diseases, especially at the low expectile level ($\tau = 0.1$) and the high expectile level ($\tau = 0.9$).

Through two real data applications, we also demonstrate that ENN outperforms ER when the underlying genotype-phenotype relationships become complicated. Genetic effects vary across expectiles, which provides us more information about the genotype-phenotype relationship via the conditional distributions. Studying different expectile levels may also help us predict high-risk individuals, since genetic variations can have large effects on a particular disease.

In Chapter 3, we study the consistency and normality of ENN sieve estimators with one hidden layer. We consider neural networks in a nonparametric regression framework to avoid the issue of unidentifiability. The method of sieves is used to narrow down the choice of parameter space. To measure the complexity of neural networks, we use the covering number. By establishing an upper bound for the covering number of $\mathcal{F}_{r_n}$, we first prove the uniform law of large numbers for ENN. Under some regularity conditions, we also prove the consistency and normality of ENN. Simulations have been conducted to test the validity of the theoretical results.

Most complex diseases are not only explained by genetic effects but can also be influenced by environmental determinants, which can be physical, chemical, or biological factors, behavior patterns, or life events. A small difference in one person's genes can cause them to respond differently from another person to the same environmental exposure. As a result, some people may develop the disease after being exposed to the environment while others may not. Therefore, it is worthwhile to take environmental determinants into consideration. In the future, we could apply ENN to study a disease with a potential gene-environment interaction component. By doing this, we could gain a better understanding of the disease and increase prediction accuracy.

Many researchers focus on improving the prediction accuracy of neural network estimators, while statistical inference based on neural network estimators has not been fully studied. Building on the asymptotic properties of ENN established here, it is worthwhile to investigate statistical inference for ENN. We could also incorporate other machine learning techniques into ENN. For many genetic datasets, Caucasian samples are larger than African samples. Due to the limited African samples, we could first train ENN on Caucasian samples and obtain the estimator, which can then be used to improve prediction accuracy in African samples.
We also apply ENN to real data with transfer learning, which is described in the appendix. By using this technique, we could improve the performance of ENN.

APPENDICES

Appendix A

Technical Details of Chapter 2

Proof of Theorem 2.3.1

Theorem A.0.1. Let $L_\tau : Y \times \mathbb{R} \to [0, \infty)$ be the asymmetric least squares loss function and $Q$ be a distribution on $Y = [-M, M]$. Then the inner $L_\tau$-risk of $Q$ can be defined as
\[
C_{L_\tau, Q}(t) = \int_Y L_\tau(y, t)\, dQ(y), \quad t = f(x_i) \in \mathbb{R},
\]
and the minimal inner $L_\tau$-risk is $C^*_{L_\tau, Q} = \inf_{t \in \mathbb{R}} C_{L_\tau, Q}(t)$.

Lemma A.0.1. Let $L_\tau$ be the asymmetric least squares loss function and $Q$ be a distribution on $\mathbb{R}$ with $C^*_{L_\tau, Q} < \infty$. For a fixed $\tau \in (0, 1)$ and for all $t \in \mathbb{R}$, we have
\[
c_\tau (t - t^*)^2 \le C_{L_\tau, Q}(t) - C^*_{L_\tau, Q} \le C_\tau (t - t^*)^2,
\]
where $c_\tau = \min\{\tau, 1 - \tau\}$, $C_\tau = \max\{\tau, 1 - \tau\}$, and $t^*$ is the $\tau$-expectile.

Proof. Let us fix $\tau \in (0, 1)$. We use the result obtained in Newey and Powell [16]. For a distribution $Q$ on $\mathbb{R}$ satisfying $C^*_{L_\tau, Q} < \infty$, the $\tau$-expectile $t^*$ is the only solution of
\[
\tau \int_{y \ge t^*} (y - t^*)\, dQ(y) = (1 - \tau) \int_{y < t^*} (t^* - y)\, dQ(y).
\tag{A.1}
\]

Appendix B

Technical Details of Chapter 3

Proof of Lemma 3.3.2 (concluding step). For any $\epsilon > 0$, by choosing $\delta = \epsilon / \big( 4 (1 \vee \|x\|_\infty)\, r_n (d+1) \big)$, we observe that when $\|\theta_{1,n} - \theta_{2,n}\|_2 < \delta$, we have $\|H(\theta_{1,n}) - H(\theta_{2,n})\|_n < \epsilon$, which implies that $H$ is a continuous map and hence $\mathcal{F}_{r_n}$ is a compact set for each fixed $n$.

Appendix C

Supplementary Materials

Expectile neural networks with transfer learning

Normally, machine learning models focus on a single, specific task. If we have two related tasks, one task can inherit some information from the other; we call this technique transfer learning. Transfer learning focuses on storing the knowledge gained by solving one problem and applying that knowledge to a different but related problem. It is easier to transfer knowledge if the tasks are closely related. Transfer learning has been applied in a wide range of areas, such as natural language processing (NLP) [69] and medical imaging [66].

Transfer learning can be applied in both classification and regression scenarios. For example, Salaken et al. propose seeded transfer learning in a regression context to improve prediction performance in the target domain [71]. Many approaches can be used to implement transfer learning. Yosinski et al. show how lower layers in neural networks act as conventional computer-vision feature extractors, such as edge detectors, while the final layer works toward task-specific features [65]. Rosenstein et al. use a naive Bayes classification algorithm to detect, perhaps implicitly, when the inductive bias learned from auxiliary tasks will actually hurt performance on the target task [68].

In this appendix, we focus on applying the transfer learning technique to expectile neural networks. We focus on parameter transfer or instance reweighting. This approach works on the assumption that the models for related tasks share some parameters. There are several advantages to doing this. First, if the initial task and the target task are related, we can improve our results. Second, since we inherit information from the initial task, the number of parameters to be estimated for the target task is reduced, which gives us computational advantages, especially for large datasets.

Real data application

In this section, we integrate expectile neural networks and transfer learning to improve prediction performance. To verify whether transfer learning works, we use two real data sets to compare the performance of ENN with transfer learning and ENN without transfer learning.
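To fix ideas before the applications, the following Python sketch implements the parameter-transfer idea in its simplest form: fit an ENN on a source phenotype, keep the input-to-hidden weights, and refit only the output layer on the target phenotype. Everything here is an illustrative assumption (toy genotype data, network size, scipy's BFGS optimizer); the actual steps used with the SAGE and ADNI data are described in the applications below.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden(X, W, b):
    return sigmoid(X @ W + b)                  # shared input-to-hidden layer

def expectile_loss(resid, tau):
    w = np.where(resid < 0, 1.0 - tau, tau)
    return np.mean(w * resid ** 2)

def fit_full(X, y, r, tau):
    # fit all weights on the source task
    d = X.shape[1]
    def unpack(theta):
        W = theta[:d * r].reshape(d, r)
        b = theta[d * r:d * r + r]
        beta = theta[d * r + r:]               # beta[0] is the output bias
        return W, b, beta
    def obj(theta):
        W, b, beta = unpack(theta)
        return expectile_loss(y - (beta[0] + hidden(X, W, b) @ beta[1:]), tau)
    theta0 = rng.normal(scale=0.1, size=d * r + r + r + 1)
    return unpack(minimize(obj, theta0, method="BFGS").x)

def fit_output_layer(H, y, tau):
    # refit only the hidden-to-output weights on the target task
    def obj(beta):
        return expectile_loss(y - (beta[0] + H @ beta[1:]), tau)
    return minimize(obj, rng.normal(scale=0.1, size=H.shape[1] + 1), method="BFGS").x

n, d, r, tau = 300, 10, 4, 0.5
X = rng.binomial(2, 0.3, size=(n, d)).astype(float)        # toy SNP genotypes (0/1/2)
y_source = X @ rng.normal(size=d) + rng.normal(size=n)     # toy source phenotype (e.g., smoking)
y_target = 0.8 * y_source + rng.normal(size=n)             # toy related target phenotype (e.g., drinking)
W, b, _ = fit_full(X, y_source, r, tau)
beta_target = fit_output_layer(hidden(X, W, b), y_target, tau)

Using the source-task estimates only as initial values, as in the algorithm described next, is a gentler alternative to freezing the first layer outright.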
First real data application

Intuitively, participants in this study who have nicotine addiction also tend to be addicted to drinking. We applied ENN to the genetic data from the Study of Addiction: Genetics and Environment (SAGE). The participants of SAGE were selected from three large, complementary studies: the Family Study of Cocaine Dependence (FSCD), the Collaborative Study on the Genetics of Alcoholism (COGA), and the Collaborative Genetic Study of Nicotine Dependence (COGEND).

We choose max cigs as the smoking quantity, measured by the largest number of cigarettes smoked in 24 hours (range 0-240). We choose max drinks as the drinking quantity, measured by the largest number of alcoholic drinks consumed in 24 hours (range 0-258). To obtain better performance, we transfer smoking-related information to drinking-related information, using the following algorithm. First, we choose max cigs as the phenotype and obtain the estimator of the expectile neural network. Second, we use the estimator obtained from the first step as the initial value (the transfer learning part). Third, we choose max drinks as the new phenotype, keep the parameters from the input layer to the hidden layer, and then train the expectile neural network again. Finally, we compare two models: ENN with transfer learning and ENN without transfer learning. We divide the data into three parts: training (60%), validation (20%), and testing (20%). We obtain the following results.

Table C.1: Real data application result of CHRNA5

              ENN.tsf               ENN
  τ       Train      Test       Train      Test
  0.1     551.83     605.79     546.90     672.44
  0.25    325.84     439.18     321.94     473.10
  0.5     282.57     433.058    275.83     444.16
  0.75    304.81     484.60     297.81     487.44
  0.9     347.17     544.24     339.79     549.08

Table C.2: Real data application result of CHRNA3

              ENN.tsf               ENN
  τ       Train      Test       Train      Test
  0.1     554.11     605.10     533.04     753.96
  0.25    325.71     441.47     311.85     517.05
  0.5     281.20     439.40     260.45     491.62
  0.75    304.60     486.80     292.63     502.86
  0.9     350.01     558.95     335.89     573.92

Table C.3: Real data application result of CHRNB4

              ENN.tsf               ENN
  τ       Train      Test       Train      Test
  0.1     558.39     622.18     564.57     673.97
  0.25    327.63     448.63     325.48     473.34
  0.5     283.28     435.11     270.50     453.76
  0.75    306.05     488.15     303.02     489.71
  0.9     349.24     544.85     343.52     553.41

Tables C.1-C.3 summarize the MSE of ENN with transfer learning (ENN.tsf) and ENN without transfer learning for five different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9). From these three tables, we see that the expectile neural networks with transfer learning outperform the expectile neural networks without transfer learning.

Second real data application

In this real data application, we apply our method to the Alzheimer's Disease Neuroimaging Initiative (ADNI), which is a multisite study that aims to improve clinical trials for the prevention and treatment of Alzheimer's disease. The APOE allele is the most important genetic risk factor for Alzheimer's disease [67]. We focus our ENN model on the APOE gene. After quality control, 168 SNPs remained for the analysis. We only included 699 Caucasian and African American individuals due to the small sample size of other ethnic groups. To improve the performance of ENN, we also included 3 covariates in the analysis: sex (male = 1, female = 2), age, and education. The hippocampus is the part of the brain associated with memory. Alzheimer's disease usually damages the hippocampus first, leading to memory loss and disorientation. A study shows that hippocampal volume and ratio were reduced by 25% in Alzheimer's disease [72].
The Mini-Mental State Examination (MMSE) is a 30-point questionnaire that is used extensively in clinical and research settings to measure cognitive impairment. For more information, refer to https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?id=phd001525.1. We transfer Hippocampus bl to MMSE. To obtain stable performance, we randomly split the dataset 50 times and average the results. From Table C.4, the expectile neural network with transfer learning outperforms the expectile neural network without transfer learning under different values of τ.

Table C.4: Real data application result of ADNI

              ENN.tsf            ENN
  τ       Train     Test     Train     Test
  0.1     8.21      8.40     8.65      9.50
  0.25    5.00      5.17     5.30      6.78
  0.5     4.11      4.31     4.30      4.82
  0.75    4.67      4.88     4.85      6.86
  0.9     5.87      6.10     5.99      6.69

Summary and discussion

From these two real data applications, transfer learning improves the performance of expectile neural networks. However, in our experience, transfer learning relies heavily on the data. If the data do not fit the model, negative transfer happens, where the transfer of knowledge from the source to the target does not lead to any improvement but rather causes a drop in the overall performance on the target task.

BIBLIOGRAPHY

[1] Genome-Wide Association Studies. National Human Genome Research Institute. https://www.genome.gov/about-genomics/fact-sheets/Genome-Wide-Association-Studies-Fact-Sheet.

[2] Manolio TA. Genome wide association studies and assessment of the risk of disease. The New England Journal of Medicine, 363(2):166-76, 2010.

[3] Kwon JM, Goate AM. The candidate gene approach. Alcohol Research & Health, 24(3):164-8, 2000.

[4] Xuexia Wang, Michael J Oldani, Xingwang Zhao, Xiaohui Huang, Dajun Qian. A Review of Cancer Risk Prediction Models with Genetic Variants. Cancer Inform, 13(2):19-28, 2014.

[5] The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.

[6] Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, Prokunina-Olsson L, Ding CJ, Swift AJ, Narisu N, Hu T, Pruim R, Xiao R, Li XY, Conneely KN, Riebow NL, Sprau AG, Tong M, White PP, Hetrick KN, Barnhart MW, Bark CW, Goldstein JL, Watkins L, Xiang F, Saramies J, Buchanan TA, Watanabe RM, Valle TT, Kinnunen L, Abecasis GR, Pugh EW, Doheny KF, Bergman RN, Tuomilehto J, Collins FS, Boehnke M. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, 316(5829):1341-5, 2007.

[7] Nan M. Laird, Christoph Lange. The Fundamentals of Modern Statistical Genetics. Springer-Verlag, 2011.

[8] Miller DD, Brown EW. Artificial Intelligence in Medical Practice: The Question to the Answer? Am J Med, 131(2):129-133, 2018.

[9] Fakoor R, Ladhak F, Nazi A, Huber M. Using deep learning to enhance cancer diagnosis and classification. Proceedings of the 30th International Conference on Machine Learning, 2013.

[10] Goodfellow I, Bengio Y, Courville A. Deep Learning. The MIT Press, 96-161, 2016.

[11] Le Cun Y, Bengio Y, Hinton G. Deep learning. Nature, 521:436-444, 2015.

[12] Krittanawong C, Zhang H, Wang Z, Aydar M, Kitai T. Artificial Intelligence in Precision Cardiovascular Medicine. J Am Coll Cardiol, 69(21):2657-2664, 2017.

[13] McClellan J, King MC. Genetic heterogeneity in human disease. Cell, 141(2):210-7, 2010.

[14] Marchini J, Donnelly P, Cardon LR.
Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet, 37(4):413-7, 2005. [15] R. Koenker, G.W. Bassett Jr. Regression quantiles. Econometrica, 46(1):33-50, 1978. [16] W. Newey, J. Powell. Asymmetric least squares estimation and testing. Econometrica, 55(4):819-847, 1987. [17] Moshe Buchinsky. Quantile regression, Box-Cox transformation model, and the U.S. wage structure, 1963–1987. Journal of Econometrics, 65(1):109-154, 1995. [18] John Crowley, Marie Hu. Covariance Analysis of Heart Transplant Survival Data. Jour- nal of the American Statistical Association, 72-357, 1977. [19] Stuart R. Lipsitz Garrett M. Fitzmaurice Geert Molenberghs Lue Ping Zhao. Quantile Regression Methods for Longitudinal Data with Drop-outs: Application to CD4 Cell Counts of Patients Infected with the Human Immunodeficiency Virus. Jornal of the Royal Statistical Society: Applied Statistics Series C, 46(4):463-476, 1997. [20] G.R.PandeyaV, T.V.Nguyenb. A comparative study of regression based methods in regional flood frequency analysis. Journal of Hydrology, 225:92-101, 1999. [21] H. J. Cordell. Detecting gene-gene interactions that underlie human diseases, Nat. Rev. Genet. 10:392–404, 2009. [22] A. Cannon. Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. A.J. Stoch Environ Res Risk Assess, 32:3207, 2018. [23] A. Cannon. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Computers & Geosciences, 37:1277-1284, 2011. [24] J. Taylor. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19:299-311, 2000. [25] C. Jiang, M. Jiang, Q. Xu, X. Huang. Expectile regression neural network model with applications. Neurocomputing, 247:73-86, 2017. [26] L. Liao, C. Park, H. Choi. Penalized expectile regression: an alternative to penalized quantile regression. Ann Inst Stat, 71:409–438, 2018. 85 [27] L. Waltrup, F. Sobotka, T. Kneib, G. Kauermann. Expectile and quantile regression- David and Goliath? Statistical Modelling, 15(5): 433–456, 2015. [28] M. Kim, S. Lee. Nonlinear expectile regression with application to Value-at-Risk and expected shortfall estimation. Computational Statistics and Data Analysis, 94:1-19, 2016. [29] Q. Yao, H. Tong. Asymmetric least squares regression estimation: a nonparametric approach. Journal of Nonparametric Statistics, 6:2-3, 1996. [30] Durbin, R., Altshuler, D., Durbin, R. et al. A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073, 2010. [31] Li MD, Xu Q, Lou XY, Payne TJ, Niu T, Ma JZ. Association and interaction anal- ysis of variants in CHRNA5/CHRNA3/CHRNB4 gene cluster with nicotine depen- dence in African and European Americans. Am J Med Genet B Neuropsychiatr Genet, 153B(3):745–756, 2010. [32] M. Farooq, I. Steinwart. Learning rate for kernel-based expectile regression. Mach Learn- ing, 108: 203–227, 2019. [33] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 251-257, 1991. [34] Fletcher, Roger, Practical methods of optimization(2nd ed.), New York: John Wiley & Sons, 1987. [35] Heather J. Cordell, Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet, 10(6):392–404, 2009. [36] Mackay, T.F. Quantitative trait loci in Drosophila. Nat. Rev. Genet, 2:11–20, 2001. [37] Routman EJ, Cheverud JM. 
Gene effects on a quantitative trait: Two-locus epistatic effects measured at microsatellite markers and at estimated QTL. Evolution, 51: 1654–1662, 1997. [38] Zerba, K.E., Ferrell, R.E. & Sing, C.F. Complex adaptive systems and human health: the influence of common genotypes of the apolipoprotein E (ApoE) gene polymorphism and age on the relational order within a field of lipid metabolism traits. Hum. Genet, 107: 466–475, 2000. [39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhut- dinovDropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15: 1929-1958, 2014. 86 [40] Chengxi Ye, Yezhou Yang, Cornelia Fermuller, Yiannis Aloimonos. On the Importance of Consistency in Training Deep Neural Networks, arXiv:1708.00631, 2017. [41] Anthony, M. and Bartlett, P.L., Neural network learning: Theoretical foundations, Cambridge university press, 2009. [42] X Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics, 2007. [43] Kurt Hornik, Maxwell Stinchcombe, Halbert White. Multilayer feedforward networks are universal approximators. Neural newtorks, 2(5):359-366, 1989. [44] László Györfi. A Distribution-Free Theory of Nonparametric Regression. Springer New York, 2006. [45] Jinghang Lin, Xiaoran Tong, Chenxi Li, Qing Lu. Expectile Neural Networks for Genetic Data Analysis of Complex Diseases, arXiv:2010.13898, 2020. [46] Grenander. Abstract Inference. Wily, New York, 1981. [47] White, H. and Wooldridge, J. Some results on sieve estimation with dependent obser- vations. In Nonparametric and Semiparametric Methods in Economics (W. A. Barnett, J. Powell and G. Tauchen, eds.) 459-493. Cambridge University Press New York. 1991. [48] Van der Vaart. Asymptotic Statistics, Cambridge University Press, 1998. [49] Van der Vaart, Jon A. Wellner. Weak convergence and empirical processes. Springer, 1996. [50] Van de Geer. Empirical Processes in M-estimation. Cambridge university press, 2020. [51] Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, Qing Lu. Asymptotic Properties of Neural Network Sieve Estimators, arXiv:1906.00875, 2019. [52] Xiaotong Shen, On Methods of sieves and penalization. The Annals of Statistics, 25(6):2555-2591, 1997. [53] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in neural information processing sys- tems, 2012. [54] Koenker, Roger. Quantile regression. Cambridge University Press, 2005. [55] Kenji Fukumizu. A regularity condition of the information matrix of a multilayer per- ceptron network. Neural networks, 9(5):871–879, 1996. 87 [56] Kenji Fukumizu et al. Likelihood ratio of unidentifiable models and multilayer neural networks. The Annals of Statistics, 31(3):833–851, 2003. [57] Hongtu Zhu and Heping Zhang. Asymptotics for estimation and testing procedures under loss of identifiability. Journal of Multivariate Analysis, 97(1):19–45, 2006. [58] Ergün Akgün, Metin Demir. Modeling Course Achievements of Elementary Education Teacher Candidates with Artificial Neural Networks. International Journal of Assess- ment Tools in Education, 2018. [59] T. Pham, T. Tran, D. Phung, S. Venkatesh. Predicting healthcare trajectories from medical records: a deep learning approach. J Biomed Inform, 69:218-229, 2017. [60] Plis, Sergey M. and Hjelm, Devon R. and Salakhutdinov, Ruslan and Allen, Elena A. and Bockholt, Henry J. and Long, Jeffrey D. and Johnson, Hans J. and Paulsen, Jane S. 
and Turner, Jessica A. and Calhoun, Vince D. Deep learning for neuroimaging: a validation study. Front. Neurosci, 229: 8, 2014. [61] Devroye, Luc and Györfi, László and Lugosi, Gábor. A probabilistic theory of pattern recognition,Springer Science & Business Media, 2013. [62] Martin Abadi and Paul Barham and Jianmin Chen and Zhifeng Chen and Andy Davis and Jeffrey Dean and Matthieu Devin and Sanjay Ghemawat and Geoffrey Irving and Michael Isard and Manjunath Kudlur and Josh Levenberg and Rajat Monga and Sherry Moore and Derek G. Murray and Benoit Steiner and Paul Tucker and Vijay Vasudevan and Pete Warden and Martin Wicke and Yuan Yu and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation, 2016. [63] Adam Paszke and S. Gross and Francisco Massa and A. Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Z. Lin and N. Gimelshein and L. Antiga and Alban Desmaison and Andreas Köpf and Edward Yang and Zach DeVito and Martin Raison and Alykhan Tejani and Sasank Chilamkurthy and Benoit Steiner and Lu Fang and Junjie Bai and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS, 2019. [64] Jindong Wang, Yiqiang Chen, Han Yu, Meiyu Huang, and Qiang Yang. Easy Trans- fer Learning By Exploiting Intra-domain Structures. arXiv preprint arXiv:1904.01376, 2019. [65] J Yosinski, J Clune, Y Bengio, H Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 3320-3328, 2014. [66] Amin Khatami, Morteza Babaie, H.R. Tizhoosh, Abbas Khosravi, Thanh Nguyen, Saeid Nahavandi. A sequential search-space shrinking using CNN transfer learning and a 88 radon projection pool for medical image retrieval. Expert Systems with Applications, 100:224–233, 2018. [67] Liu Y, Tan L, Wang HF, Liu Y, Hao XK, Tan CC, Jiang T, Liu B, Zhang DQ, Yu JT; Alzheimer’s Disease Neuroimaging Initiative. Multiple Effect of APOE Genotype on Clinical and Neuroimaging Biomarkers Across Alzheimer’s Disease Spectrum. Mol Neurobiol, 53(7):4539-47, 2016. [68] M.T. Rosenstein, Z. Marx and L.P. Kaelbling, To Transfer or Not to Transfer. Neural Information Processing Systems, Workshop Inductive Transfer: 10 Years Later, 2005. [69] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020. [70] S. J. Pan, Q. Yang. A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering, 22(10): 1345-1359, 2010. [71] S.M. Salaken, A. Khosravi, T. Nguyen, S. Nahavandi. Seeded transfer learning for re- gression problems with deep learning. Expert Syst. Appl, 115:565-577, 2019. [72] Vijayakumar A . Comparison of hippocampal volume in dementia subtypes. ISRN Ra- diol, 2012. [73] Yoshua Bengio. Deep Learning of Representations for Unsupervised and Transfer Learn- ing. Proceedings of ICML Workshop on Unsupervised and Transfer Learning, JMLR Workshop and Conference Proceedings, 27:17-36, 2012. [74] Ye Jia, Yu Zhang, Ron J Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Igna-cio Lopez Moreno et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Conference on Neural In- formation Processing Systems, 2018. [75] Zhilin Yang, Ruslan Salakhutdinov, William W Cohen. 
Transfer learning for sequence tagging with hierarchical recurrent networks.ICLR, 2017. [76] Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 6(2):94-98, 2019. [77] Chien-Fu Wu. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, 501–513, 1981. [78] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001. 89