NEURAL NETWORKS MODELS WITH APPLICATIONS TO GENETIC STUDIES By Shan Zhang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics — Doctor of Philosophy 2021

ABSTRACT

NEURAL NETWORKS MODELS WITH APPLICATIONS TO GENETIC STUDIES By Shan Zhang

Artificial intelligence (AI) is a thriving research field with many successful applications in areas such as computer vision and speech recognition. Machine learning methods, such as neural networks (NN), play a central role in modern AI technology. When it comes to genetic studies, however, a vanilla NN model struggles with small genetic effects because it does not account for the underlying genetic structure (e.g., linkage disequilibrium). Furthermore, in emerging research fields (e.g., imaging genetics), researchers need to deal with different types of data. To address these challenges, I propose a functional neural networks (FNN) method that combines the functional linear model with the neural network structure. FNN uses a series of basis functions to model genetic effects within gene sequences and phenotype data with spatial-temporal structure. Through simulations and real data applications, I demonstrate the advantages of FNN for high-dimensional genetic data analysis in terms of robustness and accuracy. The source code of FNN is available at https://github.com/szhang0629/FNN. The conditions for the consistency of the proposed FNN model are also provided.

Transfer learning has been widely used in text and image classification and has demonstrated outstanding performance in applications. Nonetheless, it has rarely been used in genetic research, and its performance on genetic data is still largely unknown. I use vanilla neural networks to investigate whether the knowledge learned from UK Biobank databases or Caucasian samples can be used to facilitate genetic research in small-scale studies or in minority populations. The experiments show that transfer learning is useful for most genes.

I dedicate this dissertation to my parents, Houfa Zhang, Rongqun Shan and my friends, Xiaoxi, Peng, Xiaoran, Yuan, Chang and many others.

ACKNOWLEDGMENTS

Here, I would like to express my deepest gratitude to my advisors Dr. Qing Lu and Dr. Yuehua Cui for their support and guidance toward my studies and research. Dr. Lu and Dr. Cui are extremely kind and knowledgeable. They always provide valuable insights and suggestions on the improvements of my work. I would also like to extend my sincere thanks to my dissertation committee members, Dr. Lyudmila Sakhanenko and Dr. Haolei Weng. Their comments and suggestions have been beneficial to my studies. I am also grateful for the help I received from all the professors in the Department of Statistics and Probability. During my Ph.D. studies, I took most of the courses offered by our department. During my five years at Michigan State University, I made many friends, and because of them I never felt lonely in these years. Many thanks to my senior classmates: Dr. Jingyi Zhang, Dr. Xiaoran Tong and Dr. Liangliang Zhang. They have become successful faculty members and statisticians in big companies. My thanks also go to my friends: Jian Song, Ken Lee, Runze Su, Peide Li, Chang Jiang, Yuan Zhou, Jinghang Lin and Di Wu. Without your help, I could not be who I am now, and I sincerely wish all of you a wonderful future. Last but not least, I would like to express my sincere thanks to my parents for their support and confidence in me.
I would like to treat this dissertation as my unique gift to both of you. iv TABLE OF CONTENTS LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Genetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 2 Functional Neural Networks for High-dimensional Genetic Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Functional Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1.1 FLM for a scalar phenotype . . . . . . . . . . . . . . . . . . 8 2.2.1.2 FLM for complex phenotypes . . . . . . . . . . . . . . . . . 9 2.2.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Functional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Simulation 1: The effect of the phenotype dimension . . . . . . . . . 17 2.3.2 Simulation 2: The effect of the underlying function . . . . . . . . . . 19 2.3.3 Simulation 3: The effect of the input dimension . . . . . . . . . . . . 20 2.4 Real Data Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.1 The role of CHRNA5, race, gender, and age in predicting cigarette smoking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.2 The role of APOE in predicting hippocampus volume change over time 25 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Chapter 3 Consistency of Functional Neural Networks . . . . . . . . . . . . 30 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Universal Approximation Theorem . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.1 Definitions for New Construction . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Step One . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Step Two . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.4 Covering Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.1 Smooth Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2 Deductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Chapter 4 Transfer Learning for Genetic Studies . . . . . . . . . . . . . . . 59 v 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2.2 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 64 4.2.3 Technical Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.3.2.1 Constraint regularization . . . . . . . . . . . . . . . . . . . 69 4.3.2.2 Parameter Regularization . . . . . . . . . . . . . . . . . . . 70 4.3.2.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.3 Cross-ethnicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.4 Cross-Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Chapter 5 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 APPENDIX A Solutions of FLM . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 APPENDIX B Functional Neural Networks . . . . . . . . . . . . . . . . . . . . . 81 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 vi LIST OF FIGURES Figure 2.1: An illustration of neural networks and functional neural networks . . . . 14 Figure 2.2: Accuracy comparison of three methods for three different types of pheno- types (i.e., scalar, vector, and matrix) and three different levels of noise (i.e., σ =0.3, 0.6, and 1.2) . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Figure 2.3: Accuracy comparison of three methods for different underlying functions 21 Figure 2.4: Accuracy comparison of three methods for different numbers of input vari- ables (i.e., 80, 200, and 500) . . . . . . . . . . . . . . . . . . . . . . . . . 22 Figure 2.5: The role of CHRNA5, race, gender, and age in predicting smoking quantity . "NN" and "FNN" refer to the neural networks method and the func- tional neural networks method, while the number "1" and "2" indicate the number of hidden layers. . . . . . . . . . . . . . . . . . . . . . . . . . 24 Figure 2.6: Prediction accuracy of hippocampus volume change over time based on the gene APOE. "NN" and "FNN" refer to the neural networks method and functional neural networks method, respectively. The number "1" or "2" indicates the number of hidden layers. . . . . . . . . . . . . . . . . . 26 Figure 4.1: Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Figure 4.2: Transfer Learning in neural networks . . . . . . . . . . . . . . . . . . . . 66 Figure 4.3: Constraint Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Figure 4.4: Parameter Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Figure 4.5: Transfer the knowledge from the British population to the Black population 72 Figure 4.6: Transfer the knowledge from the British population to the Irish population 72 Figure 4.7: Transfer the knowledge from UK Biobank to SAGE . . . . . . . . . . . . 73 vii Chapter 1 Introduction 1.1 Overview In recent years, we have witnessed the emerging development of artificial intelligence (AI). In 2016, the AI AlphaGO beats professional Go player and dominates the game ([35]). Chatbot has become a basic tool for customer service in many companies ([17]). 
In the field of high technology, Tesla is a leading electric vehicle company in self-driving technology, and its Autopilot technology is based on AI ([5]). With these successful applications, AI has become a thriving research field that attracts researchers from many disciplines.

Although AI technology has achieved impressive performance in gaming, natural language processing, image classification, and many other areas, it has been less widely used in genetic studies, partially due to the high noise and small sample size of genetic data. A large number of samples is normally required to train a good AI model. While it is easy to find millions of pictures of cars on the Internet, it is time consuming and costly to collect genetic data, along with individuals' clinical records and demographic information. Most genetic studies have a sample size of around several thousand, which is underpowered for training a sophisticated AI model. The dimension of genetic data is also huge: the genetic information is stored in the nucleotide sequences of DNA, which contain over one billion nucleotides. Even if we focus on only one gene, it can still contain over ten thousand nucleotides. It is hard to build an AI model with ten thousand variables given a limited sample size. Finally, the genetic information alone can only explain a limited portion of the variance of phenotypes. Complex phenotypes, such as height, are determined by both genetic and environmental factors. An AI model attempts to explain all of the variance with the given genetic data, which can lead to excellent prediction on the training dataset but low performance on the testing dataset. Many regularization methods are available to alleviate this problem, but overfitting remains a challenging problem for AI methods, especially in the high-dimensional setting.

In this chapter, the background of genetic research is introduced in section 1.2. In section 1.3, the vanilla neural networks models are briefly introduced. The objective and organization of the dissertation are given in section 1.4.

1.2 Genetic Data

I start with a brief introduction to genetic data. Although DNA has a huge number of nucleotides, over 99% of them are the same among all human beings. Therefore, I only take nucleotides with variation into consideration; these are called single nucleotide polymorphisms (SNPs). Most SNPs have two alleles. Assuming an additive mode of inheritance, SNPs can be coded as {0, 1, 2} based on the number of minor alleles. In my dissertation, I focus on studying genes, which involve multiple SNPs (e.g., around 30 SNPs). In a gene, linkage disequilibrium (LD) exists among SNPs [7]. Based on LD, I can assume the genetic effect of an SNP is related to that of its nearby SNPs. Under this assumption, I develop a functional neural network method, which is introduced in detail in Chapter 2.

A variety of phenotypes, such as brain volume, weight, or smoking frequency, are used in genetic studies. If age is considered as the index, these phenotypes can be naturally modeled as a function of age. The temporal relationship within the phenotype can be considered by assuming the continuity of the function. This concept is used in the model proposed in this dissertation.

1.3 Neural Networks

The neural networks (NN) model plays a fundamental role in the AI field. A basic NN model f : R^d → R can be realized by

f(x) = \sum_{j=1}^{J} a_j \, \sigma(w_j x + b_j) + c,

where J, a_j, c, w_j and b_j are parameters of the NN model. Among them, w_j is a d-dimensional vector while all other parameters are scalars.
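For illustration only, the following minimal NumPy sketch evaluates this one-hidden-layer model; the function and variable names are my own and are not taken from the dissertation's code.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid squashing function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def nn_forward(x, W, b, a, c):
    """One-hidden-layer NN: f(x) = sum_j a_j * sigma(w_j . x + b_j) + c.

    x : (d,) input vector
    W : (J, d) matrix whose j-th row is the vector w_j
    b : (J,) hidden biases,  a : (J,) output weights,  c : scalar intercept
    """
    hidden = sigmoid(W @ x + b)      # J hidden-unit activations
    return float(a @ hidden + c)

# toy usage with random parameters (illustrative values only)
rng = np.random.default_rng(0)
d, J = 5, 8
x = rng.normal(size=d)
print(nn_forward(x, rng.normal(size=(J, d)), rng.normal(size=J),
                 rng.normal(size=J), 0.0))
```

Training such a model amounts to choosing (W, b, a, c) to minimize a loss, typically by gradient descent.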
Based on the universal approximation theorem, given enough hidden units, the NN model can approximate any continuous function from R^d to R on a compact support. The domain of the NN function is a finite-dimensional space. In this dissertation, a new neural network model defined on a functional space is proposed, and its statistical properties, such as consistency, are developed.

1.4 Organization

The remainder of the dissertation is organized as follows. In Chapter 2, I propose a new neural network model called the functional neural networks model, which can deal with functional data and take the spatial and temporal relationships into consideration. In Chapter 3, I prove the consistency of the proposed model. In Chapter 4, I apply the transfer learning idea to genetic studies. In Chapter 5, I discuss possible extensions of these methods and future work.

Chapter 2 Functional Neural Networks for High-dimensional Genetic Data Analysis

2.1 Introduction

With the recent advancement in high-throughput technologies, genome-wide association studies (GWAS) and sequencing studies have been commonly adopted to uncover new genetic variants predisposing to common complex diseases ([15], [32]). During the past decade, thousands of genetic variants have been identified through GWAS and sequencing studies, some with compelling biological plausibility for a role in disease etiology ([22]). Despite such success, for most complex diseases, identified genetic variants are associated with small effect sizes and only explain a small fraction of total heritability ([20]). Many common diseases are influenced by the interplay of multiple genes and other risk factors (e.g., environmental determinants) in a complex manner ([4]). This complexity, however, has not been taken fully into account by many existing approaches, which often assume genetic variants are related to disease phenotypes in an additive and linear manner.

Machine learning methods, such as the neural network (NN) method, hold great promise for genetic research ([6]). Based on its hierarchical structure, NN learns complicated features from simpler ones, making it capable of capturing non-linear and non-additive genetic effects (e.g., gene-gene interaction effects). Given its exceptional performance, NN has been increasingly used in genomics, especially in regulatory genomics, variant calling, and pathogenicity scores ([39]). Despite its successful applications in these areas, the use of NN in revealing the complex relationships between genetic variants and common diseases is still limited.

While the great potential of NN in genetic data analysis of complex diseases is reasonably expected, the high dimensionality of the genetic data and the complexity of genetic structure bring tremendous analytic challenges. High-throughput technology allows us to simultaneously evaluate the role of thousands or even millions of variants in complex diseases ([10]). Nevertheless, fitting an NN on such a large number of genetic variables could bring a serious overfitting issue. In genetic studies, linkage disequilibrium (LD) exists among neighboring variants, and disease-associated variants are often in an LD block. Considering the underlying genetic structure can help us combat the overfitting issue. Moreover, different types of phenotypes are often collected in a study.
Besides the measurement of a disease phenotype, which is typically a scalar variable, researchers are also interested in studying a vector of variables (e.g., the progression of disease measured over time). With the rapidly evolving technologies and ever-decreasing cost, studies are starting to collect omics and imag- ing data, which are often high-dimensional and stored as matrices or tensors. While omics and imaging data provide us a great opportunity to study complex diseases, few methods are currently available for high-dimensional genetic data analysis of complex phenotypes (e.g., vectors and matrices) with the consideration of non-linear and non-additive effects. To address these emerging challenges and facilitate the high-dimensional genetic data analysis of complex diseases, I propose a functional neural network (FNN) that inherits the strengths from both NN and the functional linear model (FLM). FLM is a popularly used method in functional data analysis (FDA) ([25]; [24]), which deals with data in the form of 6 functions. FLM has been commonly used to analyze data measured over time and has been increasingly used in high-dimensional data analysis, such as genetic data analysis ([7]; [33]) and imaging data analysis ([36]). Specifically, with the high-dimensional genetic variants as the input layer and various types of phenotypes as the output layer, the proposed FNN uses a series of basis functions to obtain a representation of functional data in each layer ([27]), and further builds multiple hidden layers via functional linear models. FNN has a number of attractive features: 1) it has a built-in facility to account for the underlying structures of complex phenotypes and genetic data (e.g., LD), which helps capture disease- related variants and overcome the curse of dimensionality; 2) it uses a functional neural network to model the complex relationship between genetic variants and disease phenotypes; and 3) it can be used to analyze different types of phenotypes (e.g., scalar, vector, and matrix). Through simulations and two real data applications, I show that FNN outperforms conventional methods, such as NN and FLM. I have also shown, at certain conditions, FNN can be simplified to NN. In other words, FNN can be viewed as the generalization of NN to high-dimensional data with complex phenotypes. The remaining chapter is organized as follows: the FNN method is proposed in Section 2. Section 3 uses simulations to compare the performance of FNN with those of FLM and NN. In Section 4, I illustrate the methods via two real data applications. Summary and discussion are provided in Section 5. The technical details of the methods are described in the Appendix. 7 2.2 Method In this section, I introduce NN and two types of FLMs for different types of phenotypes at first. Based on the concepts of FLM and NN, I develop FNN for high-dimensional genetic data analysis. Throughout the Chapter 2, I use lower case letters, bold lower case letters, upper case letters and bold upper case letters to denote scalars, vectors, matrices and functions respec- tively when it comes to Roman alphabet. For example, b is a scalar. b is a vector with bj as the j-th element of b. B is a matrix with Bi as the ith row of matrix B. Bij denotes a scalar whose value is the j-th element of Bi . B is a function. Due to large amount of parameters and functional representations, Greek alphabets are also used to denote values or functions. 
2.2.1 Functional Linear Model

This section introduces functional linear models for a scalar phenotype and for complex phenotypes separately. The details of their solutions are provided in Appendix A.

2.2.1.1 FLM for a scalar phenotype

I briefly introduce FLM in the setting of genetic data analysis with a scalar phenotype ([26], [34]). Let y_i and Z_i = (z_{i1}, ..., z_{iq}) denote a scalar phenotype (e.g., systolic blood pressure) and quantitative/qualitative covariates (e.g., age and gender) for the i-th individual. Let [G_{i1}, G_{i2}, ..., G_{ip}] be the single-nucleotide variants (SNVs) in an SNV set (e.g., a gene), coded additively (i.e., G_{ik} ∈ {0, 1, 2}, referring to the number of minor alleles), and let t_k be the corresponding position scaled into [0, 1]. Given the genotypes and the positions, I model the genetic variant function G_i(t), t ∈ [0, 1], as a linear combination of Dirac delta functions, i.e.,

G_i(t) = \sum_{k=1}^{p} G_{ik} \, \delta_{t_k}(t), \quad (2.1)

where \delta_{t_k}(t) satisfies \int_{-\infty}^{\infty} f(t)\,\delta_{t_k}(t)\,dt = f(t_k). The functional linear model can then be used to model the relationship between the scalar phenotype y_i and the genetic variant function G_i(t) with the consideration of Z_i:

\hat{y}_i = \theta_0 + Z_i \theta + \int G_i(t)\,\beta(t)\,dt, \quad (2.2)

where \beta(t) is the coefficient function measuring the genetic effects in a genetic region. The parameters \theta_0 and \theta are the intercept and the coefficients of the covariates, respectively.

2.2.1.2 FLM for complex phenotypes

FLM can be easily extended to handle complex phenotypes (e.g., disease phenotypes measured over time) by modeling complex phenotypes as functions. For instance, I can write the phenotype as Y_i(s) and use the following model ([14]) to consider the spatial/temporal dependence at location/time s,

\hat{Y}_i(s) = Z_i \theta + \alpha_0(s) + \int \alpha(s, t)\,G_i(t)\,dt, \quad (2.3)

where \alpha(s, t) is a bivariate function and \alpha_0(s) is an intercept function. The observed phenotype is treated as the function realization at discrete points s_{ij}, denoted as

\hat{y}_{ij} = \hat{Y}_i(s_{ij}) = Z_i \theta + \alpha_0(s_{ij}) + \int \alpha(s_{ij}, t)\,G_i(t)\,dt. \quad (2.4)

FLM has many desirable features for high-dimensional genetic data analysis. By considering the effects of SNVs as functions, I can utilize information from adjacent SNVs (i.e., nearby SNVs tend to have similar effects due to LD) and reduce the number of parameters, which helps us capture true signals and overcome the curse of dimensionality. Moreover, FLM has its own unique way of handling measurement errors and missing data, which are quite common in genetic data ([34]). Despite these advantages, FLM is not well suited to modeling complex relationships between SNVs and phenotypes or inter-relationships among SNVs (e.g., interactions). In order to address these issues, I integrate the idea of NN into FLM to improve its capacity for modeling complex non-linear and non-additive genetic effects.
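Because G_i(t) is a sum of Dirac delta functions, the integral in (2.2) reduces to a finite sum over the SNV positions, and expanding β(t) in a basis turns the FLM into an ordinary linear model. The dissertation's actual estimator is given in Appendix A; the sketch below is my own illustrative code (an unpenalized least-squares fit with no smoothness penalty) and is only meant to make this reduction concrete.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_basis=10, degree=3):
    """Evaluate n_basis clamped B-spline basis functions (degree 3 = 4th order) at x."""
    inner = np.linspace(0.0, 1.0, n_basis - degree + 1)
    knots = np.concatenate(([0.0] * degree, inner, [1.0] * degree))
    return np.column_stack(
        [BSpline(knots, np.eye(n_basis)[m], degree)(x) for m in range(n_basis)]
    )

def fit_flm_scalar(y, Z, G, t_k, n_basis=10):
    """Least-squares fit of y ~ theta0 + Z theta + sum_k G_ik * beta(t_k),
    with beta(t) expanded in a B-spline basis (smoothness penalty omitted)."""
    Phi = bspline_basis(t_k, n_basis)                 # (p, n_basis): B_m(t_k)
    X = np.column_stack([np.ones(len(y)), Z, G @ Phi])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, Phi

# toy data: n subjects, q covariates, p SNVs at positions t_k inside (0, 1)
rng = np.random.default_rng(1)
n, q, p = 200, 2, 30
Z = rng.normal(size=(n, q))
G = rng.integers(0, 3, size=(n, p)).astype(float)
t_k = np.sort(rng.uniform(0.05, 0.95, p))
y = G.mean(axis=1) + rng.normal(scale=0.5, size=n)
coef, Phi = fit_flm_scalar(y, Z, G, t_k)
beta_hat = Phi @ coef[1 + q:]          # estimated beta(t) evaluated at the SNV positions
```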
2.2.2 Neural networks

Neural networks can be viewed as multi-stage nonlinear regression models. A D-layer NN model F_N : R^{p+q} → R can be expressed in a recursive manner:

X_i^{(1)} = \sigma(b^{(1)} + Z_i \theta + G_i W^{(1)}),
X_i^{(d)} = \sigma(X_i^{(d-1)} W^{(d)} + b^{(d)}), \quad d = 2, \dots, D-1,
\hat{y}_i = f(X_i^{(D-1)} W^{(D)} + b^{(D)}),

where Z = \{z_{ik}\}_{n \times q}, G = \{G_{ik}\}_{n \times p} = \{G_i(t_k)\}_{n \times p} and \{X^{(d)}, d = 1, ..., D-1\} are the covariates, the genetic variables (e.g., SNVs) and the hidden layers, respectively. \sigma is the activation function (e.g., the sigmoid function) and f is the link function (e.g., the identity function). W^{(d)} and b^{(d)} are the weight matrix and the bias vector for the d-th layer. The details of NN are described in Goodfellow et al. [9].

An L2 penalty is commonly used in NN to control the model complexity. The penalized mean square error loss function is thus defined as

R(W, b) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{d=1}^{D} \left( \|W^{(d)}\|_F^2 + \|b^{(d)}\|_2^2 \right).

The back-propagation and gradient descent algorithms can then be used to solve for (W, b). In this chapter, the learning rate of NN is tuned by using the adaptive optimization method Adadelta ([37]). The number of hidden units and the parameter λ are also tuning parameters in this chapter. To determine the value of λ, I use the hyperparameter optimization algorithm introduced in [9]. Specifically, I choose the value of λ on a logarithmic scale so that the range of values is wide enough, i.e., \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}. The training data is split into a subtrain dataset and a validation dataset in the ratio of 4 : 1. I apply NN to the subtrain dataset with all possible λ values, and choose the best λ value based on the mean square error in the validation dataset. I then use the selected λ value to train the final model on the whole training set and obtain the model parameters. The same procedure can be used to determine the number of hidden units, which is chosen from \{2, 4, 8, 16, 32, 64\}.
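A minimal sketch of this validation-based selection of λ is given below. It is illustrative only: `train_nn` stands in for an arbitrary penalized NN training routine and is a hypothetical helper, not a function from the dissertation or from any particular library.

```python
import numpy as np

def select_lambda(X, y, train_nn,
                  lambdas=(1e-2, 10 ** -1.5, 1e-1, 10 ** -0.5, 1.0), seed=0):
    """Choose the L2 penalty weight by a 4:1 subtrain/validation split.

    train_nn(X, y, lam) is assumed to fit a penalized NN and return a callable
    model such that model(X_new) yields predictions (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_sub = int(0.8 * len(y))                     # 4:1 split
    sub, val = idx[:n_sub], idx[n_sub:]
    val_mse = []
    for lam in lambdas:
        model = train_nn(X[sub], y[sub], lam)
        val_mse.append(np.mean((model(X[val]) - y[val]) ** 2))
    best = lambdas[int(np.argmin(val_mse))]
    # refit on the whole training set with the selected penalty
    return best, train_nn(X, y, best)
```

The same loop, run over {2, 4, 8, 16, 32, 64}, can be used to select the number of hidden units.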
2.2.3 Functional neural networks

Functional neural networks can be viewed as an extension of NN in which the weights and biases are modeled as functions. By treating the weights and biases as functions, FNN can consider LD among SNVs and the spatial-temporal relations of phenotypes, as well as reduce the number of parameters.

In FNN, a genetic variant function G_i(t) based on the SNVs is first obtained as described in (2.1). Given the genetic variant function G_i(t), the first hidden layer X_i^{(1)}(t^{(1)}) can be constructed using equation (2.5). The additional D − 2 hidden layers, X_i^{(d)}(t^{(d)}), are then built recursively with different functional coefficients as shown in (2.6). Depending on the type of phenotype, equation (2.7) or (2.8) can be used to fit a scalar phenotype or a complex phenotype, respectively. The FNN model can thus be expressed as:

X_i^{(1)}(t^{(1)}) = \sigma\left( Z_i \theta + \alpha_0^{(1)}(t^{(1)}) + \int \alpha^{(1)}(t^{(1)}, t)\,G_i(t)\,dt \right), \quad (2.5)

X_i^{(d)}(t^{(d)}) = \sigma\left( \alpha_0^{(d)}(t^{(d)}) + \int \alpha^{(d)}(t^{(d)}, t^{(d-1)})\,X_i^{(d-1)}(t^{(d-1)})\,dt^{(d-1)} \right), \quad 1 < d < D, \quad (2.6)

\hat{y}_i = f\left( a_0^{(D)} + \int \alpha^{D}(t^{(D-1)})\,X_i^{(D-1)}(t^{(D-1)})\,dt^{(D-1)} \right), \quad (2.7)

\hat{y}_{ij} = \hat{Y}_i(s_{ij}) = f\left( \alpha_0^{(D)}(s_{ij}) + \int \alpha^{(D)}(s_{ij}, t^{(D-1)})\,X_i^{(D-1)}(t^{(D-1)})\,dt^{(D-1)} \right), \quad (2.8)

where \sigma and f are the activation function and the link function. In this chapter, I use the sigmoid function as the activation function and the identity function as the link function. \alpha_0^{(1)}(t^{(1)}), \alpha_0^{(d)}(t^{(d)}), a_0^{(D)} and \alpha_0^{(D)}(s_{ij}) are, respectively, the bias functions (or bias) for the input layer, the hidden layers, the output layer with a scalar phenotype and the output layer with a complex phenotype, where t, t^{(d)} and s_{ij} are, respectively, the coordinate systems for the genetic variant function (e.g., locus positions), the d-th layer and the response function (e.g., time). \alpha^{(1)}(t^{(1)}, t), \alpha^{(d)}(t^{(d)}, t^{(d-1)}), \alpha^{D}(t^{(D-1)}) and \alpha^{(D)}(s_{ij}, t^{(D-1)}) are the weight functions for the input layer, the hidden layers, the output layer with a scalar phenotype and the output layer with a complex phenotype, respectively. The explicit forms of the bias and weight functions are written below:

\alpha_0^{(d)}(t^{(d)}) = \sum_{k^{(d)}=1}^{l^{(d)}} b_{k^{(d)}}^{(d)} \, \beta_{k^{(d)}}^{(d)}(t^{(d)}),

\alpha^{(d)}(t^{(d)}, t^{(d-1)}) = \sum_{k^{(d)}=1}^{l^{(d)}} \sum_{k^{(d-1)}=1}^{l^{(d-1)}} W_{k^{(d-1)} k^{(d)}}^{(d)} \, \beta_{k^{(d-1)}}^{(d-1)}(t^{(d-1)}) \, \beta_{k^{(d)}}^{(d)}(t^{(d)}),

where (W, b) = \{W^{(d)}, b^{(d)} \mid d = 1, \dots, D\} are the parameters of interest, and \{\beta_{k^{(d)}}^{(d)}(t^{(d)}) \mid k^{(d)} = 1, \dots, l^{(d)}\} is the predetermined set of basis functions for the d-th layer.

When Y is a scalar, a_0^{(D)} is a scalar and \alpha^{(D)} is a univariate function. If Y is a vector (e.g., measurements over different time points), then \alpha_0^{(D)} is a univariate function and \alpha^{(D)} is a bivariate function. The model can be extended to a variety of phenotypes. For instance, if Y is a matrix, then the weight function can be expressed as a Cartesian product of three basis function systems.

As shown in Figure 1, NN and FNN have different types of weights and biases. While the weights and biases in NN are vectors or scalars, the weights and biases in FNN are functions. Correspondingly, NN uses multiplication by the weight matrix W^{(d)}, while FNN performs integration against the functional coefficient \alpha^{(d)}. In practice, the integration is realized by numerical integration in the form of a summation. In this sense, the weights and biases in NN are independent values, while those in FNN are realizations of a parameterized curve.

Figure 2.1: An illustration of neural networks and functional neural networks

To estimate the parameters in the FNN model, I define the loss function L(y, \hat{y}), the cost function R(\alpha, \alpha_0) and the objective function O(\alpha, \alpha_0) as follows:

L(y, \hat{y}) = (y - \hat{y})^2,
R(\alpha, \alpha_0) = \sum_{i,j} L(y(t_{ij}), \hat{y}(t_{ij})),
O(\alpha, \alpha_0) = R(\alpha, \alpha_0) + \lambda J(\alpha, \alpha_0).

A penalty term J(\alpha, \alpha_0) is added to the error function to address the overfitting issue raised by the high-dimensional genetic data. Instead of imposing a penalty (e.g., the lasso penalty) on the parameters as in NN, I impose a penalty on the second-order derivatives of the hidden functions to control their smoothness:

J(\alpha, \alpha_0) = \sum_{d=1}^{D} J^{(d)}(\alpha^{(d)}, \alpha_0^{(d)}),

J^{(d)}(\alpha^{(d)}, \alpha_0^{(d)}) = \int \int \left[ \left( \frac{\partial^2 \alpha^{(d)}}{\partial t^{(d)\,2}} \right)^2 + \left( \frac{\partial^2 \alpha^{(d)}}{\partial t^{(d-1)\,2}} \right)^2 \right] dt^{(d)}\,dt^{(d-1)},

where λ is a hyper-parameter determined by cross-validation. The penalized loss function can be optimized via a back-propagation algorithm, which is described in Appendix B.
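To make the construction above concrete, the following sketch (my own simplification, not the released FNN implementation) evaluates one hidden functional layer of the form (2.6) on a grid: the coefficient function α^{(d)} is represented through basis matrices, and the integral is approximated by a trapezoidal sum, as described in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def functional_layer(X_prev, t_prev, B_prev, B_cur, W, b):
    """One hidden layer of FNN evaluated on a grid (simplified sketch).

    X_prev : (n, m_prev) previous layer evaluated at grid points t_prev
    B_prev : (m_prev, l_prev) previous-layer basis functions evaluated at t_prev
    B_cur  : (m_cur, l_cur)   current-layer basis functions on the current grid
    W      : (l_cur, l_prev)  coefficients of alpha(t_cur, t_prev)
    b      : (l_cur,)         coefficients of the bias function
    Returns the current layer evaluated on its grid, shape (n, m_cur).
    """
    # alpha(t_cur, t_prev) on the grid: (m_cur, m_prev)
    alpha = B_cur @ W @ B_prev.T
    # trapezoidal approximation of  integral alpha(t_cur, t) * X(t) dt
    integral = np.trapz(alpha[None, :, :] * X_prev[:, None, :], t_prev, axis=2)
    bias = B_cur @ b                     # bias function on the current grid
    return sigmoid(integral + bias)
```

Stacking such layers, with the output layer (2.7) or (2.8) on top and the roughness penalty J added to the squared-error loss, gives the model that is trained by back-propagation.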
2.3 Simulations

Simulations were conducted to compare the performance of FNN with that of FLM and NN. In order to mimic the real structure of genetic data (e.g., allele frequencies and LD), the genetic data were obtained from the real sequencing data located on Chromosome 17: 7344328 - 8344327 from the 1000 Genomes Project [3]. Specifically, I randomly chose a segment of SNVs, 200 training samples, and 50 testing samples. Due to the intensive computation time of analyzing complex phenotypes (e.g., matrices), 100 replicates were simulated to illustrate the performance.

Based on the genetic data, we simulated different types of phenotypes (i.e., scalars, vectors, and matrices), and evaluated both linear and non-linear relationships between genotypes and phenotypes. To reflect real disease scenarios, we added random noise to the simulated data, and compared the performance of the three methods by gradually increasing the noise variance level. A general form of the simulation model is given below,

Y_{ij} = Y_i(s_{ij}) + \varepsilon_{ij}, \quad j = 1, 2, \dots, m,

where Y_i(s) = F(s; G_i(t)) corresponds to a function-to-function mapping L^2([0, 1]) → L^2([0, 1]). It models a functional phenotype Y_i(s) based on the genetic variant function G_i(t), where s is the coordinate system of the phenotype function and s_{ij} is a point in that coordinate space. The noise \varepsilon_{ij} follows an i.i.d. normal distribution, N(0, \sigma^2), and m is the dimension of the phenotype. The explicit expression of the model is given in the following simulations. In all simulations, the function has K points as discrete realizations of the function, and the points are randomly generated from [0, 1]. When m = 1, Y_i(x) is a scalar. For vector types of phenotypes, I set m = 20 (i.e., the phenotype of the i-th individual is a vector of 20 elements). In the matrix setting, I apply an outer product of two functions in the form of (2.9). The discrete observational points are randomly generated from a rectangle in [0, 1] × [0, 1], where m = 400.

For simplicity, I compare NN and FNN with one hidden layer. The number of hidden units in NN, the number of basis functions for FLM and the number of basis functions for the hidden layer in FNN are fixed at 40. The numbers of basis functions of the first layer and the last layer of FNN are 60 and 20, respectively. The basis function system used in FLM and FNN is a fourth-order B-spline basis function system. I use an adaptive method, Adadelta [37], to determine the learning rate. The hyperparameter λ is chosen from \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\} and is determined by the validation process.

2.3.1 Simulation 1: The effect of the phenotype dimension

In this simulation, I evaluated the performance of the three methods under different types of phenotypes (i.e., scalar, vector, and matrix). For each replicate, I randomly chose 200 samples as training data and 50 samples as testing data, each with 200 SNVs obtained from the 1000 Genomes Project. Based on the genotypes, I simulated phenotypes using the following model,

F(x; G(t)) = \sum_{l=1}^{20} \int_0^1 c_l \, G^{e_l}(t)\, B_{1l}(t)\, B_{2l}(x)\,dt, \quad (2.9)

where B_{1l} and B_{2l} are two 4th-order B-spline basis functions with 10 knots randomly generated from [0, 1]. The coefficients c_l and e_l are randomly generated from the uniform distributions U[-2, 2] and U[1/3, 3], respectively.
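For reference, the following sketch generates phenotypes from one plausible reading of model (2.9). It is my own illustrative code: in particular, the integral over t is approximated by an average over the observed SNV positions, and each B_{1l}, B_{2l} is taken to be a single random cubic B-spline basis element, which may differ from the exact discretization and knot construction used in the dissertation.

```python
import numpy as np
from scipy.interpolate import BSpline

def random_basis_element(rng, degree=3):
    """A single 4th-order (cubic) B-spline basis element with random knots in [0, 1]."""
    knots = np.sort(rng.uniform(0.0, 1.0, degree + 2))
    return BSpline.basis_element(knots, extrapolate=False)

def simulate_phenotype(G, t, s_grid, L=20, rng=None):
    """Simulate Y_i(s) on s_grid from genotypes G (n, p) observed at positions t (p,)."""
    rng = np.random.default_rng() if rng is None else rng
    Y = np.zeros((G.shape[0], len(s_grid)))
    for _ in range(L):
        c = rng.uniform(-2.0, 2.0)
        e = rng.uniform(1.0 / 3.0, 3.0)
        B1 = np.nan_to_num(random_basis_element(rng)(t))       # B_1l at the SNV positions
        B2 = np.nan_to_num(random_basis_element(rng)(s_grid))  # B_2l at the phenotype grid
        # crude Riemann approximation of the integral over t
        Y += c * ((G ** e) @ B1 / len(t))[:, None] * B2[None, :]
    return Y

# toy usage: add N(0, sigma^2) noise to the simulated signal
rng = np.random.default_rng(2)
G = rng.integers(0, 3, size=(200, 200)).astype(float)
t, s_grid = np.sort(rng.uniform(0, 1, 200)), np.sort(rng.uniform(0, 1, 20))
Y = simulate_phenotype(G, t, s_grid, rng=rng) + rng.normal(scale=0.3, size=(200, 20))
```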
To compare the performance of the three methods, I computed the mean squared error (MSE) on both training and testing samples for three different types of phenotypes (i.e., scalar, vector, and matrix) and three different levels of noise variance (i.e., σ = 0.3, 0.6, and 1.2). The left and right groups of box plots in each panel of Figure 2 summarize the MSE of the three methods calculated from the training data and the testing data, respectively. Overall, FNN attains performance better than or comparable to FLM and NN in the testing data. NN performs the best in the training data but is subject to low performance in the testing data, which can potentially be explained by overfitting caused by hundreds of genetic variables. The overfitting issue becomes more serious as the noise level increases. On the other hand, FLM provides robustness against the overfitting issue due to its simplicity and fewer parameters. However, the linear structure of FLM fails to capture complex features of the data, leading to low performance in both the training data and the testing data. FNN has a good balance between bias and variance. It is able to capture complex features of the data while remaining robust to the overfitting issue.

Figure 2.2: Accuracy comparison of three methods for three different types of phenotypes (i.e., scalar, vector, and matrix) and three different levels of noise (i.e., σ = 0.3, 0.6, and 1.2)

2.3.2 Simulation 2: The effect of the underlying function

I further evaluated the performance of the three methods under different underlying functions (i.e., polynomial, logistic, and linear functions). I denote the polynomial function by "polyno", which is described in equation (2.9). In addition, we simulated a "logist" function, which performs the logit transformation on the genetic variables,

F(x; G(t)) = \sum_{l=1}^{20} \int_0^1 c_l \left(1 - \exp(e_l G(t))\right) \cos(a_l t + u_l) \cos(b_l x + v_l)\,dt, \quad (2.10)

where a_l, b_l, c_l ~ U[-12, 12], e_l ~ U[0.1, 1], and u_l, v_l ~ U[-π, π]. To evaluate the methods' performance in a linear setting, we also simulated a linear function denoted by "linear",

F(x; G(t)) = \sum_{l=1}^{20} \int_0^1 \left( c_l G(t) \cos(a_l t + u_l) \cos(b_l x + v_l) + d_l G(t) B_{1l}(t) B_{2l}(x) \right) dt, \quad (2.11)

where a_l, b_l, c_l, d_l ~ U[-12, 12], e_l ~ U[0.1, 1], and u_l, v_l ~ U[-π, π]; B_{1l} and B_{2l} are two fourth-order B-spline basis functions with 10 knots generated randomly from [0, 1].

As shown in Figure 3, overall, FNN has performance better than or at least comparable to the other two methods. As expected, FLM performs the best in the linear setting, although the performance of FNN is close to that of FLM. FNN is slightly better than NN and outperforms FLM in the non-linear settings. While NN also outperforms FLM in the non-linear settings, it suffers from the overfitting issue due to the high dimensionality of the data.

Figure 2.3: Accuracy comparison of three methods for different underlying functions

2.3.3 Simulation 3: The effect of the input dimension

In this simulation, I compared the performance of the three methods with an increased dimension of input variables (i.e., p = 80, 200, 500, and m = 20). The simulation setting is the same as the one used in simulation 2 with the polynomial function, except for the dimension of the input variables. As shown in Figure 4, when the number of input variables p and the noise level increase, all three methods have decreased accuracy in terms of testing MSE. Among the three methods, FLM has the most robust performance due to the linear structure it imposes. Nonetheless, such a structure also limits its performance for modeling non-linear effects. While NN is powerful for capturing non-linear effects, it suffers from the overfitting issue, especially with an increasing number of input variables and noise level. Overall, FNN attained the best performance. FNN inherits the advantages of both NN and FLM. It has the capacity to capture non-linearity, like NN, and is generally robust to an increased number of input variables and an increased noise level, like FLM.

Figure 2.4: Accuracy comparison of three methods for different numbers of input variables (i.e., 80, 200, and 500)

2.4 Real Data Applications

2.4.1 The role of CHRNA5, race, gender, and age in predicting cigarette smoking

Cigarette smoking is one of the leading causes of preventable disease, contributing to 5 million deaths worldwide each year ([21]).
During the last decade, a great deal of progress has been made in identifying genetic variants associated with smoking. Among those findings, the CHRNA5 gene has been identified and confirmed in several large-scale studies ([19]). In this application, I evaluate the role of CHRNA5, race, gender, and age in predicting cigarette smoking, measured by smoking frequency.

The genetic dataset to be analyzed is from a large-scale genetic study, the Study of Addiction: Genetics and Environment (SAGE). The participants of SAGE were unrelated individuals selected from three independent studies: COGEND, COGA, and FSCD. SAGE included smoking measurements and personal characteristics (e.g., age). Prior to the analysis, I re-assessed the quality of the genetic data. Markers and samples with low quality and high missingness were removed from the analysis. The final data include 2502 samples, among which 795 are African-American, 1707 are Caucasian, 1479 are female, and 1023 are male.

Since cigarette smoking depends on age, smoking quantity is considered as a function of age, and I model its relationship with CHRNA5, race, gender, and age. To evaluate the prediction performance of the three methods, I randomly split the whole dataset into training (80%) and testing (20%) data. By applying the methods to the training data, I built models and obtained the training MSE. The models were then evaluated on the testing data by calculating the testing MSE. To avoid chance findings due to random splitting, I repeated the procedure 100 times and obtained the average MSE values.

Figure 2.5: The role of CHRNA5, race, gender, and age in predicting smoking quantity. "NN" and "FNN" refer to the neural networks method and the functional neural networks method, while the numbers "1" and "2" indicate the number of hidden layers.

As shown in Figure 5, the FNN method attains higher prediction accuracy than FLM and NN in terms of testing MSE. In this analysis, I evaluated NN and FNN with one and two hidden layers. While adding additional layers could allow us to model more abstract and complex features from the data, the result suggests that the models with one hidden layer are sufficient to model the relationship between CHRNA5, race, gender, age and smoking quantity. Among all five models, the best model is FNN with one hidden layer.

2.4.2 The role of APOE in predicting hippocampus volume change over time

Accurately identifying high-risk Alzheimer's Disease (AD) individuals at an early stage is important for early AD prevention, as treatment prior to the onset of dementia can ensure that intervention occurs before irreversible neuronal death. In this application, I study the role of APOE in predicting hippocampus volume change over time, an important indicator for AD.

Data used in the application were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) ([28]). The ADNI was launched in 2003 as a public-private partnership. The primary goal of ADNI is to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease. It has three phases: ADNI1, ADNI GO, and ADNI2. ADNI includes standardized diagnostic assessments of AD (i.e., Case vs. Control).
DNA samples were obtained from 808 ADNI participants and were sent to Illumina, where non-CLIA whole-genome sequencing (WGS) was performed on each sample. MR image data (e.g., hippocampus volumes) and clinical data (e.g., biospecimens and cognitive tests) were also collected, both at baseline and through follow-up visits.

I downloaded and reformatted the analysis-ready ADNI dataset. Prior to the data analysis, I re-assessed the quality of the sequencing data. Markers and samples with a low proportion of successful genotype calls were removed from the analysis. After a careful quality assessment, I selected all SNVs in the APOE gene and evaluated its role in predicting hippocampus volume change over time (i.e., the phenotype is a longitudinal phenotype).

Figure 2.6: Prediction accuracy of hippocampus volume change over time based on the gene APOE. "NN" and "FNN" refer to the neural networks method and functional neural networks method, respectively. The number "1" or "2" indicates the number of hidden layers.

After removing samples with only one visit, I have 547 samples and 1434 measurements of hippocampus volume over time. The phenotype is the hippocampus volume change from the initial value. The initial year and the current year of investigation are taken as the time points of interest. For FLM and NN, to consider time-varying effects, I model the initial year and the current year as covariates in the models.

Figure 6 shows that FNN and FLM attain comparable performance, and both of them obtain higher accuracy than NN. Among all the models, FNN with 2 hidden layers has the best performance in terms of testing MSE. While there is no obvious evidence of overfitting for NN, the model built by NN could be too complicated to reflect the underlying relationship between APOE and hippocampus volume change over time, leading to sub-optimal performance.

2.5 Discussion

Ongoing GWAS and sequencing studies allow researchers to comprehensively investigate the role of a deep catalog of human genome variations in human complex diseases. Although these studies hold great promise for uncovering novel variants predisposing to human diseases, the high dimensionality of genetic data and the complex relationships between genotypes and disease phenotypes bring tremendous challenges to data analysis. To facilitate the gene discovery process, I develop an FNN model for high-dimensional genetic data analysis of complex phenotypes. FNN uses the hierarchy of the functional neural networks framework to learn complicated features from genetic data, making it feasible to capture non-linear genetic effects. As we observed from simulations and real data analyses, FNN attained performance better than or at least comparable to that of FLM and NN. When the underlying relationship between genotypes and disease phenotypes is nonlinear, both FNN and NN outperform FLM. Even when the underlying relation is linear, FNN has performance similar to FLM and attains better performance than NN.

In addition to its capacity to capture non-linear effects, FNN inherits advantages from FLM, which allow FNN to utilize prior information on the smoothness of effects to improve accuracy and robustness. By modeling the effects of SNVs as a function in the form of a combination of basis functions, FNN is able to take LD information from nearby SNVs into account and to reduce model complexity.
Similarly, by modeling complex phenotypes as functions, FNN reduces the number of parameters, making it applicable to high-dimensional 27 phenotypes (e.g., matrices). Through simulations, the FNN method can have a close perfor- mance to FLM in the linear setting, where NN was subject to an overfitting issue. Even in the nonlinear cases, FNN has demonstrated its robustness against overfitting and attained better or comparable performance than NN. FNN can be applied to different types of phenotypes (e.g., vectors and matrices), which make it useful for emerging genetic research (e.g., imaging genetics research or multi-omics research). Similar to FLM, FNN has a unique advantage for modeling complex phenotypes with temporal and spatial correlations (e.g., imaging or longitudinal phenotypes). Such an advantage has been shown in two real data applications. For instance, in the cigarette smoking application, I have also tried a traditional approach by treating age as a covariate in FNN. The result from such an approach is only slightly better than the result from NN. Therefore, utilizing the functional structure of phenotypes (i.e., considering smoking quantity as a function of age) can improve the performance of FNN significantly. Nonetheless, FNN is different from the classic functional model. Unlike FLM, FNN can handle the situation when the evaluation points are different among phenotypes (e.g., individuals’ phenotype data are collected at different time points or locations), which is very common in real scenarios. Methods have been developed to integrate the spline method into convolutional neural networks for image classification. In [8], B-splines are used for kernel construction in the convolutional layers to make the computation time independent from the kernel size, resulting in great performance in image classification. Different from the previous methods, FNN model is based on vanilla neural networks, which uses basis functions to model temporal and spatial correlations of the phenotype data (e.g., brain volumes measured over time) and LD structure of genetic data. The contribution of this chapter is thus to introduce a new tool, FNN, that addresses the ongoing challenges from high-dimensional genetic data analysis and 28 facilitates the gene discovery with complex phenotypes. FNN inherits advantages from both NN and FLM, which makes it capable of considering complex relationships (e.g., non-linear relationships) between genotypes and phenotypes while not substantially increasing model complexity. Therefore, compared to NN and FLM, it achieves a better bias-variance balance. These features make it ideal for analyzing high-dimensional genetic data that has a large number of SNVs, low signal-to-noise ratio, and complex relationships with phenotypes. In this chapter, I mainly focus on introducing the FNN model and illustrating FNN with simulations and real data applications. Statistical properties of FNN (e.g., consistency and rate of convergence) are important topics that may be worth further investigation in the future work. 29 Chapter 3 Consistency of Functional Neural Networks In the previous chapter, I introduced the FNN model and conducted simulations and real data analysis to evaluate its performance. While simulations and real data applications showed FNN had advantages over existing methods, it is also important to study the statistical property of FNN. In this chapter, the consistency of FNN will be proved. 
In the course of the proof, several existing lemmas and theorems are applied and play significant roles. I quote them and provide citations when I introduce them, before they are applied; their proofs can be found in the cited sources. For my own theorems, the proofs are given directly below the statements. The notation of the proposed model is changed relative to Chapter 2 for consistency with the quoted theorems and for ease of proof.

3.1 Introduction

Consistency is an important property of a statistical model. It determines whether the estimate converges to the true function as the sample size n increases. Since FNN is a new nonlinear model, it is also necessary to check its consistency.

In the proposed FNN, the underlying function can be any continuous function, which forms a large function space. In section 3.3, I prove that for any continuous function, there exists an FNN model that approximates the continuous function to any given precision. However, even if such an FNN model exists, it is uncertain whether it can be learned from the given data. If the space of candidate models is too large, it is hard to find the right one. To measure how large the function space of FNN is, I use the covering number to quantify its size in section 3.4. In section 3.5, I prove that the covering number of the proposed model is not too large, and that the estimated model deviates from the best possible model only at a controlled rate. Combining all these results, I prove the consistency of FNN.

3.2 Model

First, some definitions needed in the proof are introduced.

Definition 3.2.1. A function space B = \{B_i(t) \mid i \in \mathbb{N}^+\} is a basis function system in a compact space C if, for every \varepsilon > 0 and every continuous real-valued function f defined on C, there exists a function g(t) = \sum_{i=1}^{N} a_i B_i(t), a_i \in \mathbb{R}, satisfying |f(t) - g(t)| < \varepsilon on C.

In the following, the concept of smoothness is used to restrict the function space of linear combinations of basis functions. Therefore, the basis functions must be differentiable. In Chapter 2, B-spline basis functions are utilized as a basis function system, and they are differentiable. Moreover, a B-spline basis function can be selected that is differentiable to exactly a given order. By the Stone-Weierstrass theorem, monomial functions also form a basis function system. Other common basis function systems, such as the monomial basis or the Fourier basis, are also differentiable on a compact space. Based on the basis function system, I can define a function-to-function operator.

Definition 3.2.2. Let x(t) be a function in L^2([0, 1]). Define an operator M(\cdot) : L^2([0, 1]) \to L^2([0, 1]) as M(x(t)) = \int x(t)\,\alpha(s, t)\,dt + \beta(s), where \alpha(s, t) and \beta(s) are linear combinations of basis functions. To write this in a recursive pattern, denote x(t) = x_0(t_0) and

y_i(t_i) = M_i \circ x_{i-1}(t_{i-1}) = M_i(x_{i-1}(t_{i-1})) = \int x_{i-1}(t_{i-1})\,\alpha_i(t_i, t_{i-1})\,dt_{i-1} + \beta_i(t_i), \quad (3.1)

where

\alpha_i(t_i, t_{i-1}) = \sum_{j=1}^{J_i} \sum_{k=1}^{J_{i-1}} W_{i,j,k}\, B_{i-1,k}(t_{i-1})\, B_{i,j}(t_i),

\beta_i(t_i) = \sum_{j=1}^{J_i} b_{i,j}\, B_{i,j}(t_i),

and \{B_{i,j}(t) \mid j \in \mathbb{N}^+\} is a predetermined basis function system for each i, and W_{i,j,k} and b_{i,j} are parameters of the operator.

Definition 3.2.3. A function \Psi : \mathbb{R} \to [0, 1] is called a squashing function if it is non-decreasing, \lim_{\lambda \to \infty} \Psi(\lambda) = 1 and \lim_{\lambda \to -\infty} \Psi(\lambda) = 0. If x(t) is a function, \Psi \circ x(t) = \Psi(x(t)). To express this in a recursive way, I have x_i(t_i) = \Psi(y_i(t_i)).
The xi (ti ) and the yi (ti ) refers to the same object as in 3.1, therefore I can construct a 32 model by connecting these operator: Definition 3.2.4. A model F : L2 ([0, 1]) × [0, 1] → R is called FNN if it satisfies the form, F (x0 (t0 ), td ) = yd (td ) = Md ◦ Ψ ◦ Md−1 ◦ Ψ ◦ · · · M2 ◦ Ψ ◦ M1 ◦ x0 (t0 ), (3.2) where d is the number of hidden layers in FNN. The function space of all possible FNN satisfying the above form is denoted by Fd (Ψ). In the following, I use H to denote the model domain L2 ([0, 1]) × [0, 1]. Definition 3.2.5. For functions defined on L2 ([0, 1]), x(t) and y(t) are considered as in the same equivalence class if 01 |x(t) − y(t)|dt = 0. R The distance of elements in space H is defined by Z 1 d(hi , hj ) = |xi (t) − xj (t)|dt + |si − sj |, 0 where hi = (xi (t), si ), hj = (xj (t), sj ) ∈ H. Remark 3.2.1. Normally random variable are defined in a subset of Euclidean space, while the domain of the variable in the proposed FNN model involves a function space L2 ([0, 1]). There comes an issue on whether the definition of random variable is valid. From the section on convergence of sample distribution in [23], random variable is constructed on any separable metric space. Therefore the domain H is meaningful in terms of the space of random variable since L2 ([0, 1]) is separable. 33 3.3 Universal Approximation Theorem Hornik [12] proved that multi-layer forward neural networks are universal approximators in 1989. In this section, I prove that FNN is also a universal approximator based on approxi- mations. To outline the proof, I first construct another function space Gd (Ψ), and then prove that Gd (Ψ) is dense in all real continuous function defined on H following the idea in [12]. Finally, that Fd (Ψ) is dense is proved by showing that any function in Gd (Ψ) can be approximated by functions in Fd (Ψ). 3.3.1 Definitions for New Construction Definition 3.3.1. Let x(t) be a function on L2 ([0, 1]). Define an operator L(·) as L(x(t)) = R x(t)α(s, t)dt + β(s), where α(s, t) and β(s) are step functions. To express in a recursive pattern, I have yi (ti ) = Li ◦ xi−1 (ti−1 ) = Li (xi−1 (ti−1 )) Z = xi−1 (ti−1 )αi (ti , ti−1 )dti−1 + βi (ti ), where Ji Ji−1 X X αi (ti , ti−1 ) = Wi,j,k χE (ti−1 )χEi,j (ti ), i−1,k j=1 k=1 Ji X βi (ti ) = bi,j χEi,j (ti ). j=1 Among them, Wi is a matrix where the j-th row and k-th column element is Wi,j,k . bi is 34 a vector where the element in the j-th position is bi,j . E(·) are sub intervals of [0, 1], where Ei,j ∩ Ei,j 0 = ∅ if j 6= j 0 . χE is a characteristic function satisfying:   if x ∈ E  1  χE (t) =   0  if x ∈ / E. For every set E, denote that aE + b = {ax + b|x ∈ E} Definition 3.3.2. The function space Gd (Ψ) is a set of all possible functions f of the the following form: f (x0 (t0 ), td ) = yd (td ) = Ld ◦ Ψ ◦ Ld−1 ◦ Ψ ◦ . . . L2 ◦ Ψ ◦ L1 ◦ x0 (t0 ). An operator L(·) whose domain is a set of measurable functions is bounded by λ if R |L(x(t)) − L(y(t))| ≤ λ |x(t) − y(t)|dt. 3.3.2 Step One Recall that a family A of real functions defined on a set E is an algebra if A is closed under addition, multiplication, and scalar multiplication. A family A separates points on E if, for every x, y in E, x 6= y, there exists a function f in A such that f (x) 6= f (y). The family A vanishes at no point of E if, for each x in E, there exists f in A such that f (x) 6= 0. Theorem 3.3.1. (Stone-Weierstrass Theorem.) Let A be an algebra of real continuous functions on a compact set K. 
If A separates points on K and A vanishes at no point of K, 35 then the uniform closure B of A consists of all real continuous functions on K. In this chapter, the term C(H) is used to denote all real continuous functions defined on H. Definition 3.3.3. A subset S of a metric space (X, ρ) is ρ - dense in a subset T if, for every  > 0 and every t ∈ T , there is an s ∈ S such that ρ(s, t) < . Definition 3.3.4. A subset A of C is said to be uniformly dense on compacta in C if, for every compact subset K which is subset of the domain of C, A is ρK - dense in C, where ρK (f, g) = supx∈K |f (x) − g(x)|. By the definition, A is uniformly dense on compacta in C has the same meaning with that the uniform closure B of A consists of all real continuous functions on K. In order to apply the Stone-Weierstrass theorem on Gd (Ψ), I have to show that Gd (Ψ) is an algebra. Lemma 3.3.1. Gd (Ψ) is an algebra on H if Ψ is the cosine function. Proof. It is necessary to show that Gd (Ψ) is closed under addition, multiplication, and scalar multiplication on C if Ψ is the cosine function. (i) Gd (Ψ) is closed under the scalar multiplication: ∀f ∈ Gd (Ψ), λf can be represented by f in which Wd and bd are replaced by λWd and λbd without changing any other parameters. (ii) Gd (Ψ) is closed under addition: It is clear that G1 (Ψ) is closed under addition since it is a simple functional linear regression. 36 ∀f, f˜ ∈ Gd (Ψ), I need to prove that f + f˜ ∈ Gd (Ψ). For f˜, the corresponding functions αi (ti , ti−1 ) and βi (ti ) are denoted by α̃i (ti , ti−1 ) and β̃i (ti ). The function which equals to f + f˜ can be constructed with the following coefficient _ _ functions including α i and β i For i = 1, _ α1 (t1 , t0 ) := α1 (2t1 , t0 ) + α̃1 (2t1 − 1, t0 ), _ β1 (t1 ) := β1 (2t1 ) + β̃1 (2t1 − 1). For 1 < i < d, _ αi (ti , ti−1 ) := 2αi (2ti , 2ti−1 ) + 2α̃i (2ti − 1, 2ti−1 − 1), _ βi (ti ) := βi (2ti ) + β̃i (2ti − 1). For i = d, _ αd (td , td−1 ) := αi (td , 2td−1 ) + α̃d (td , 2td−1 − 1), _ βd (td ) := βd (2td ) + β̃d (2td − 1). (iii) Gd (Ψ) is closed under multiplication: ∀f, f˜ ∈ Gd (Ψ), yd−1 (td−1 ), ỹd−1 (td−1 ) are the corresponding object in the previous layer, i.e., Z f (x(t0 ), td ) = αd (td , td−1 )Ψ(yd−1 (td−1 ))dtd−1 + βd (td ) 37 Z f˜(x̃(t0 ), td ) = α̃d (td , td−1 )Ψ(ỹd−1 (td−1 ))dtd−1 + β̃d (td ). For f˜, the corresponding parameters Wi,j,k , bi,j and Ei,j are denoted by W̃i,j,k , b̃i,j and Ẽi,j . Since Gd (Ψ) is closed in addition, it satisfies to check each component of f · f˜ in the following: βd (td ) · β̃d (td ) Jd J˜d X X = bd,j χE (td ) · b̃d,j χẼ (td ) d,j d,j j=1 j=1 Jd J˜d X X = bd,j b̃d,j 0 χE (td ), d,j ∩Ẽd,j 0 j=1 j 0 =1 satisfies the form of βd (td ). Z β̃d (td ) · αd (td , td−1 )Ψ(yd−1 (td−1 ))dtd−1 Z = αd (td td−1 )β̃d (td )Ψ(yd−1 (td−1 ))dtd−1 Z X Jd Jd−1 J˜d X X = Wd,j,k χE (td−1 )χE (td ) · b̃d,j χẼ (td )Ψ(yd−1 (td−1 ))dtd−1 d−1,k d,j d,j j=1 k=1 j=1 Z Jd−1 Jd J˜d X XX = Wd,j,k b̃d,j 0 χE (td−1 )χE (td )Ψ(yd−1 (td−1 ))dtd−1 , d−1,j d,j ∩Ẽd,j 0 k=1 j=1 j 0 =1 R satisfies the form of αd (td , td−1 )Ψ(yd−1 (td−1 ))dtd−1 . R In the same manner, it is easy to show that βd (td ) · α̃d (td , td−1 )Ψ(ỹd−1 (td−1 ))dtd−1 R satisfies the form of α̃d (td , td−1 )Ψ(ỹd−1 (td−1 ))dtd−1 . 
38 By using the step function decomposition, I have, Jd−1 X yd−1 (td−1 ) = ck χE (td−1 ), d−1,k k=1 J˜d−1 X ỹd−1 (td−1 ) = c̃k χẼ (td−1 ), d−1,k k=1 where yd−1 (td−1 ) Z Jd−2 Jd−1 Jd−1 X X X = xd−2 (td−2 ) χE (td−1 )χE (td−2 )dtd−2 + bd−1,j χE (td−1 ) d−1,j d−2,k d−1,j k=1 j=1 j=1 Jd−1 Jd−2 Z X X = ( xd−2 (td−2 )χE (td−2 )dtd−2 + bd−1,j )χE (td−1 ). d−2,k d−1,j j=1 k=1 R Define |E| = χE (s)ds. By changing the parameter Ed−1,j to be independent with Ẽd−1,j without changing its length, the condition |Ed−1,j ||Ed−1,j 0 | = |Ed−1,j ∩Ed−1,j 0 | is satisfied. The transformed yd−1 (td−1 ) with the new Ed−1,j still satisfies the form of Gd−1 (Ψ). I used the transformed yd−1 (td−1 ) in the following proofs without altering its form. Z αd (td , td−1 )Ψ(x(td−1 ))dtd−1 Z X Jd Jd−1 Jd−1 X X = Wd,j,k χE (td−1 )χE (td ) · Ψ( ck χE (td−1 ))dtd−1 d−1,k d,j d−1,k j=1 k=1 k=1 Jd Jd−1 X X = Ψ(ck )Wd,j,k |Ed−1,k |χE (td ), d,j j=1 k=1 39 where Ψ is the cos function that satisfies Ψ(x)Ψ(y) = (Ψ(x + y) + Ψ(x − y))/2. Z Z αd (td , td−1 )Ψ(x(td−1 ))dtd−1 ] · α̃2 (td , td−1 )Ψ(x̃(td−1 ))dtd−1 Jd Jd−1 X X = Ψ(ck )Wd,j,k |Ed−1,k |χE (td )· d,j j=1 k=1 J˜d J˜d X X Ψ(c̃k0 )Wd,j 0 ,k0 |Ed−1,k0 |χE (td ) d,j 0 j 0 =1 k 0 =1 Jd Jd−1 J˜d J˜d−1 X X X X = Wd,j,k Wd,j 0 ,k0 |Ed−1,k ||Ed−1,k0 |Ψ(ck )Ψ(c̃k0 )χE (td ) d,j ∩Ẽd,j 0 j=1 k=1 j 0 =1 k 0 =1 Jd Jd−1 J˜d J˜d−1 X X X X 1 = Wd,j,k Wd,j 0 ,k0 |Ed−1,k ∩ Ed−1,k0 | 0 0 2 j=1 k=1 j =1 k =1 (Ψ(ck + c̃k0 ) + Ψ(ck − c̃k0 ))χE (td ) d,j ∩Ẽd,j 0 Jd Jd−1 J˜d J˜d−1 X X X X 1 = W W 0 0 |E ∩ Ed−1,k0 |Ψ(ck + c̃k0 )χE ∩Ẽ (td ) 0 0 2 d,j,k d,j ,k d−1,k d,j d,j 0 j=1 k=1 j =1 k =1 Jd Jd−1 J˜d J˜d−1 X X X X 1 + Wd,j,k Wd,j 0 ,k0 |Ed−1,k ∩ Ed−1,k0 |Ψ(ck − c̃k0 )χE ∩Ẽ (td ). 0 0 2 d,j d,j 0 j=1 k=1 j =1 k =1 Since x(td−1 ), x̃(td−1 ) ∈ Gd−1 (Ψ) and Gd−1 (Ψ) is closed in addition and scalar multi- plication, (x(td−1 ) + x̃(td−1 )), (x(td−1 ) − x̃(td−1 )) ∈ Gd−1 (Ψ). Z _ αd (td , td−1 )Ψ(x(td−1 ) + x̃(td−1 ))dtd−1 _ ˜ J d Jd−1 Jd−1 X X X _ = Ψ(ck + c̃k0 ) W d,j,k |Ed−1,k ∩ Ẽd−1,k0 |χ_ (td ). E d,j j=1 k=1 k 0 =1 _ _ By choosing E d,j := Ed,j ∩ Ẽd,j 0 , W d,j,K := 21 Wd,j,k Wd,j 0 ,k0 , the first term satisfies 40 R _ _ the form of α d (td , td−1 )Ψ( y d−1 (td−1 ))dtd−1 . The same condition applies to x(td−1 ) − x̃(td−1 ). Since Gd (Ψ) is closed under addition, the whole component belongs to Gd (Ψ). By using the Stone-Weierstrass theorem, if Gd (Ψ) separates points on K for any compact set K ⊆ H and vanishes at no point, I can have the following theorem. Theorem 3.3.2. Gd (Ψ) is uniformly dense on compacta in C(H) if G is the cosine function. Proof. It is easy to show that Gd (Ψ) vanishes at no point by using βd (td ) = 1 while keeping other functions as 0. If t̂d 6= t̃d , we can assume that t̂d < t̃d without loss of generality. The points can be separated by using βd (td ) := χ t̂ +t̃ (td ), while keeping other functions as 0. [ d d ,1] 2 It is sufficient to show that ∀x(t0 ), x̃(t0 ) ∈ C, ∃f ∈ Gd (Ψ) s.t. f (x(t0 ), td ) 6= f (x̃(t0 ), td ) if R1 0 |x(t0 ) − x̃(t0 )| > 0. Choose 1 α1 (t1 , t0 ) := (χx(t )>x̃(t ) − χx(t )x̃(t0 ) 0 0 x(t0 )x̃(t ) x(t0 ) − x̃(t0 )dt0 − x(t ) 1, choose 2 + 2e αi (ti , ti−1 ) := χ (t )χ (t ), e − 1 [0,1] i [0,1] i−1 e+1 βi (ti ) := χ (t ). e − 1 [0,1] i 41 Therefore f (x(t0 ), td ) = 0 and f (x̃(t0 ), td ) = −1, i.e., Gd (Ψ) separates points on compact function space C. The following lemma is quoted from the Lemma A.3 in [12]. Lemma 3.3.2. 
For every squashing function Ψ,  > 0, and M > 0, there is a function σ(x) Pq that satisfies σ(x) = j=1 bj Ψ(aj x + cj ); bj , aj , cj ∈ R, q ∈ N+ such that sup |σ(x) − cos(x)| < . x∈[−M,M ] The above lemma shows that the cosine function can be approximated by any squashing function with some linear transformations. Based on this lemma, the squashing function Ψ is not limited to the cosine function. Theorem 3.3.3. Gd (Ψ) is uniformly dense on compacta in C(H) if Ψ is a squashing func- tion. Proof. For each given f ∈ Gd (Ψ), a common bound B > 1 can be found for each operator Li (·). For any given compact set K ⊆ H, there exists M s.t. the values inside the cosine function are within [−M, M ]. I will compare the squashing function σ(·) defined in the previous lemma and the cosine function. In each step, there exists the difference between the inside functions, which we distinguish _ by ỹi (t) and y i (t). 42 For each δ > 0, choose  = δ(B − 1)/B d , _ |σ(ỹi (ti )) − cos( y i (ti ))| _ ≤|σ(ỹi (ti )) − cos(ỹi (ti ))| + | cos(ỹi (ti )) − cos( y i (ti ))| _ ≤δ(B − 1)/B d + |ỹi (ti )− y i (ti )| _ ≤δ(B − 1)/B d + B max |x̃i−1 (ti−1 )− x i−1 (ti−1 )|. 0≤ti−1 ≤1 _ Suppose that f˜ uses the squashing function σ and f uses the cosine function, we have _ |f˜(x0 (t0 ), td )− f (x0 (t0 ), td )| _ =|Ld (σ(ỹd−1 (td−1 ))) − Ld (cos( y d−1 (td−1 ))|) _ ≤B max |σ(ỹd−1 (td−1 )) − cos( y d−1 (td−1 ))| 0≤td−1 ≤1 _ ≤δ(B − 1)/B d−1 + B max |Ld−1 (σ(ỹd−2 (td−2 ))) − Ld−1 (cos( y d−2 (td−2 ))| 0≤td−2 ≤1 d−1 1/B i ≤ δ. X ≤δ(B − 1) i=1 It satisfies to verify that f˜ ∈ Gd (Ψ). Since aj L˜1 + cj is still an operator of L(·), it is sufficient to show that Pq L̃i+1 ( j=1 bj Ψ(L̃i,j (xi−1 (ti−1 )))) can be rewritten in the form Li+1 (Ψ(Li (xi−1 (ti−1 )))). Let X q αi (ti , ti−1 ) := α̃i (qti + (1 − j), ti−1 ), j=1 X q βi (ti ) := β̃i (qti + (1 − j)), j=1 43 X q αi+1 (ti+1 , ti ) := qbj α̃i+1 (ti+1 , qti + (1 − j)), j=1 βi (ti+1 ) := β̃i+1 (ti+1 ). we have f ∈ Gd (Ψ) that satisfies f = f˜, i.e., _ |f˜(x0 (t0 ), td )− f (x0 (t0 ), td )| ≤ δ. Gd (Ψ) is therefore uniformly dense on compacta in C(H). 3.3.3 Step Two I want to replace the step functions in Gd (Ψ) with the linear combinations of basis functions in Fd (Ψ). With the knowledge of real analysis, every measurable function is almost a con- tinuous function. By the definition of a basis function system, every continuous function can be approximated by linear combinations of basis functions. The conclusions are proceeded by these approximations. I quote the Lusin theorem below. The proof of theorem can be found in book [29]. Theorem 3.3.4. (Lusin Theorem.) Suppose f is measurable and finite valued on E with E of finite measure, for every  > 0, there exists a closed set F with F ⊆ E and m(E − F ) ≤  such that f |F is continuous. Denote G bd (Ψ) is the same as Gd (Ψ) except that the step functions in Gd (Ψ) are replaced by continuous functions. If G bd (Ψ) is ρ - dense in Gd (Ψ), I have the following theorem. 44 Theorem 3.3.5. G bd (Ψ) is uniformly dense on compacta in C(H) if Ψ is a Lipschitz con- tinuous squashing function. Proof. It is sufficient to prove that for every δ > 0, any compact set K ⊆ H and any f ∈ Gd (Ψ), ∃fb ∈ Gbd (Ψ), s.t. |fb(x0 (t0 ), td ) − f (x0 (t0 ), td )| < δ, for all (x0 (t0 ), td ) ∈ K. Since K is compact, there exists E > 1 and |x0 (t0 )| < E. A common bound B > 1 can be found for each operator Li (·) in function f . By the Lusin Theorem, ∀ > 0, there exists Fi, and Fi, 0 s.t. 
α bi (ti , ti−1 )χFi, = αi (ti , ti−1 )χFi, , βbi (ti )χF 0 = βi (ti )χF 0 . i, i, Find a common bound D for αi (ti , ti−1 ) and βi (ti ). The squashing function Ψ is Lipschitz continuous, therefore there exists L > 1 s.t. |Ψ(x) − Ψ(y)| ≤ L|x − y|. And the difference between each functional linear transformation can also be bounded: |Li+1 (xi (ti )) − L bi+1 (b xi (ti ))| ≤|Li+1 (xi (ti )) − Li+1 (b xi (ti ))| + |Li+1 (b xi (ti )) − L bi+1 (b xi (ti ))| Z 1 ≤B |xi (ti ) − xbi (ti )|dti + 4D(E + 1) 0 ≤B max |xi (ti ) − x bi (ti )| + 4D(E + 1), 0≤t ≤1 i 45 Combined the two inequalities and recursive representation of the two functions, |f (x0 (t0 ), td ) − fb(x0 (t0 ), td )| =|Ld (xd−1 (td−1 )) − L bd (bxd−1 (td−1 ))|) ≤4D(E + 1) + BL max |Ld−1 (xd−2 (td−2 )) − L bd−1 ((bxd−2 (td−2 ))| 0≤td−1 ≤1 d−1 (BL)i ≤ 4D(BL)d /(BL − 1). X ≤4D(E + 1) i=0 |fb(x(t)) − f (x(t))| < δ is satisfied if  < δ(BL − 1)/4D(E + 1)(BL)d . Following the same strategy of the last proof, I can show that Fd (Ψ) is ρ - dense in G bd (Ψ) and therefore Fd (Ψ) is a universal approximator. Theorem 3.3.6. Fd (Ψ) is uniformly dense on compacta in C(H) if Ψ is Lipschitz continuous squashing function. Proof. It is sufficient to show that for every δ > 0, any compact set K ⊆ H and any f ∈ Gd (Ψ), fb ∈ G bd (Ψ), ∃f ∈ Fd (Ψ), s.t. |f (x0 (t0 ), td ) − fb(x0 (t0 ), td )| < δ, for all (x0 (t0 ), td ) ∈ K. Since K is compact, there exists E > 1 and |x0 (t0 )| < E. Based on the property of basis function, ∀ > 0, we have αi (ti , ti−1 ) − α bi (ti , ti−1 ) ≤ , βi (ti ) − βbi (ti ) ≤ . 46 A common bound B > 1 can be found for each operator Li (·) in function f . |Mi+1 (xi (ti )) − L bi+1 (b xi (ti ))| ≤|Mi+1 (xi (ti )) − L bi+1 (xi (ti ))| + |L bi+1 (xi (ti )) − L bi+1 (b xi (ti ))| Z ≤(1 + E) + B |xi (ti ) − x bi (ti )|dti ≤(1 + E) + B max |xi (ti ) − x bi (ti )|. 0≤t ≤1 i Suppose that the squashing function Ψ is L-Lipschitz continuous, |f (x0 (t0 ), td ) − fb(x0 (t0 ), td )| =|Md (xd−1 (td−1 )) − L bd (bxd−1 (td−1 ))|) ≤(1 + E) + LB max |Md−1 (xd−2 (td−2 )) − L bd−1 ((b xd−2 (td−2 ))| 0≤td−1 ≤1 d−1 (LB)i ≤ (1 + E)(LB)d /(LB − 1). X ≤(1 + E) i=0 |fe(x(t)) − fb((x(t)))| < δ is satisfied if  < δ(LB − 1)/(1 + E)(LB)d . 47 3.4 Covering Number 3.4.1 Smooth Functions I introduce a class of functions on a bounded set χd in Rd quoted from [31]. A differential operator Dk for any vector k = (k1 , · · · , kd ) is defined as: ∂ k. Dk = k , k ∂x11 · · · ∂xdd Pd where k. = i=1 ki . Then, for a function f : χd → R, |Dk f (x) − Dk f (y)| kf ka = max sup |Dk f (x)| + max sup , k.≤a x k.=a x,y kx − yka−a where the suprema taken over all x, y in the interior of χd with x 6= y. a measures how smooth the function class is. α is the greatest integer smaller than a. a,d Let CM be the set of all continuous functions f : χd → R with kf ka ≤ M . To obtain the covering number of the function space, I quote the theorem 2.7.1 from [31] Theorem 3.4.1. Let χd be a bounded, convex subset of Rd with nonempty interior. There exists a constant K depending only on a and d such that a,d 1 log N (, C1 (χd ), k · k∞ ) ≤ Kλ(χ1d )( )d/a ,  for every  > 0, where λ(χ1d ) is the Lebesgue measure of the set {x : kx − χd k < 1}. It is easy to show that the value of λ(χ1d ) is 3 and 5 + π when χd is [0, 1] and [0, 1] × [0, 1]. 48 Since the two cases are the only cases that are considered in the proof, I have 1 log N (, C1a (χd ), k · k∞ ) ≤ K( )d/a ,  by replacing K with 9K. Since CMa (χ ) = {f |f /M ∈ C a (χ )}, we have d 1 d N (M , CM a (χ ), k · k ) = N (, C a (χ ), k · k ). 
d ∞ 1 d ∞ Finally, the covering number of function space N (M , CM a (χ ), k · k ) satisfies d ∞ a (χ ), k · k ) ≤ K( M d/a log N (, CM d ∞ ) .  To calculate the covering number of functions space within functional neural networks space Fd (Ψ), I have to define the distance of the space in the following: d(f, f˜) = max{sup |αi − α̃i |, sup |βi − β̃i |; i = 1, · · · , d}, where f, f˜ ∈ Fd (Ψ) and their corresponding coefficient functions are αi , βi ) and α̃i , β̃i ). 3.4.2 Deductions Since the domain H of our model is a compact set, the input function x0 (t0 ) should have a common boundary, which is denoted by E. The term Xi is used to denote the set consisting of all possible xi (ti ). I use the term Si to denote the set 49 Z 1  a,2 a/2,1 α(ti , ti−1 )xi (ti−1 )dti−1 + βi (ti )|α(ti , ti−1 ) ∈ CM , βi (ti ) ∈ CM . 0 Lemma 3.4.1. M 2/a log N∞ (M  + (Ei + 1)δ, Si ) ≤ log N∞ (, Xi−1 ) + 2K( ) , δ where the functions in Xi−1 is bounded by Ei . a,2 a/2,1 Proof. For each xi−1 (ti ) ∈ Xi−1 , αi (ti , ti−1 ) ∈ CM , βi (ti ) ∈ CM , we can find their representatives x̃i−1 (ti−1 ), α̃i (ti , ti−1 ), β̃i (ti ) from their corresponding -cover, δ-cover and δ-cover. Z 1 Z 1 αi (ti , ti−1 )xi−1 (ti−1 )dti−1 + βi (ti ) − α̃i (ti , ti−1 )x̃i−1 (ti−1 )dti−1 − β̃i (ti ) 0 0 Z 1 = (αi (ti , ti−1 ) − α̃i (ti , ti−1 ))xi−1 (ti−1 )dti−1 + 0 Z 1 α̃i (ti , ti−1 )(xi−1 (ti−1 ) − x̃i−1 (ti−1 ))dti−1 + βi (ti ) − β̃i (ti ) 0 ≤Ei δ + M  + δ. Therefore, the Cartesian product of the two covering is the (M  + (Ei + 1)δ)-covering of Si , i.e., log N∞ (M  + (Ei + 1)δ, Si ) a,2 a/2,1 ≤ log N∞ (, Xi−1 ) + log N∞ (δ, CM ) + log N∞ (δ, CM ) M 2/a ≤ log N∞ (, Xi−1 ) + 2K( ) . δ 50 E1 = E in the first layer. When the layer i > 1, the value is within the range of a squashing function, therefore Ei = 1. In the first layer, since X0 (t0 ) only consists of identity function without any parameters, (E+1)M 2/a log N∞ (, X0 ) = 0, log N∞ (δ, S1 ) ≤ 2K( δ ) . To illustrate the effect of nonlinear activation, I quote lemma 14.13 in [1]. Lemma 3.4.2. Suppose F is a class of real-valued functions in the vector space from domain X to Rp and σ : R → R is a element-wise function satisfying Lipschitz condition: |σ(x) − σ(y)| ≤ L|x − y|. Let σ(F ) denote the class {σ ◦ f : f ∈ F }, then N∞ (, σ ◦ F, m) ≤ N∞ (/L, F, m). Suppose the squashing function Ψ is L-Lipschitz, I have log N∞ (, Xi ) ≤ log N∞ (/L, Si ). Based on the results of the two lemmas, if i > 1, therefore M 2/a log N∞ (M L + 2δ, Si ) ≤ log N∞ (, Si−1 ) + 2K( ) . δ (E+1)M 2/a We have known that log N∞ (δ, S1 ) = 2K( δ ) . Suppose that log N∞ (δ, Si−1 ) = M bi−1 2/a ci−1 K( δ ) , then M bi−1 2/a M log N∞ (M L + 2δ, Si ) ≤ ci−1 K( ) + 2K( )2/a .  δ 51 Take  = δbi−1 , M 2/a log N∞ ((M Lbi−1 + 2)δ, Si ) ≤ (ci−1 + 2)K( ) . δ Furthermore, by scaling the distance, we have M (M Lbi−1 + 2) 2/a log N∞ (δ, Si ) ≤ (ci−1 + 2)K( ) . δ i.e. ci = ci−1 + 2 and bi = M Lbi−1 + 2. Considering that c1 = 2 and b1 = E + 1, we can have that cd = 2d and bd = (E + M M L+1 )(M L)d−1 − 2 L−1 M L−1 . Therefore it can be concluded that M (E + M L+1 M L−1 )(M L) d−1 − 2 M L−1 2/a log N∞ (δ, Sd ) ≤ 2dK( ) . δ 3.5 Consistency When it comes to consistency, the error can be divided into the approximation error and the estimation error, which is expressed by Lemma 10.1 in the book on non-parametric regression [11]. Lemma 3.5.1. Let Fn = Fn (Dn ) be a class of functions f : Rd → R depending on the data Dn = {(X1 , Y1 ), . . . , (Xn , Yn )}. If mn satisfies n 1X mn (·) = arg min |f (Xj ) − Yj |2 . 
f ∈Fn n j=1 52 then Z |mn (x) − m(x)|2 µ(dx) (3.3) n 1X ≤2 sup |f (Xj ) − Yj |2 − E{f (X) − Y )2 } (3.4) f ∈Fn n j=1 Z + inf |f (x) − m(x)|2 µ(dx). (3.5) f ∈Fn As is seen in the decomposition, the estimation error (3.4) refers to the the supremum difference between the error of the trained data and the error expectation over all function space, while the approximation error (3.5) refers to the infimum difference between the true function and the model function space. As model the function space becomes larger, the approximation error is decreasing while the approximation error is increasing. To make the total error as small as possible, I have to make a trade off in the selection of function space Fn . I quote theorem 10.2 in [11] to clarify what is required in the consistency problem. Theorem 3.5.1. Let Fn = Fn (Dn ) be a class of functions f : Rd → R and the estimator mn satisfies n n 1X 1X m̃n ∈ Fn and |m̃n (Xj ) − Yj |2 = min |f (Xj ) − Yj |2 , n f ∈Fn n j=1 j=1 mn (x) = Tβn m̃n (x). (a) If lim βn = ∞, n→∞ 53 Z lim inf |f (x) − m(x)|2 µ(dx) = 0 a.s., n→∞ f ∈Fn ,kf k∞ ≤βn n 1X lim inf |f (Xj ) − Yj,L |2 − E{(f (X) − YL )2 } = 0, n→∞ f ∈T Fn n βn j=1 a.s. for all L > 0, then Z lim |mn (x) − m(x)|2 µ(dx) = 0 a.s.. n→∞ (b) If lim βn = ∞, n→∞ ( Z ) lim E inf |f (x) − m(x)|2 µ(dx) = 0, n→∞ f ∈Fn ,kf k∞ ≤βn   n  1  |f (Xj ) − Yj,L |2 − E{(f (X) − YL )2 } = 0, X lim E inf n→∞ f ∈T Fn n βn j=1  for all L > 0, then Z  lim E |mn (x) − m(x)|2 µ(dx) = 0. n→∞ I quote theorem 9.1 in [11] to bound the estimation error. Theorem 3.5.2. Let G be a set of functions g : Rd → [0, B]. For any n and  > 0, n ( ) 1X P sup | g(Zi ) − E{g(Z)}| >  ≤ 8E{N1 (/8, G, Z1n )} exp(−n2 /128B 2 ). g∈G n i=1 The domain of functions in above theorems are defines on Rd for neural networks model. I checked the proofs of them and make sure that their validity has no issues with the explicit form of its domain. Therefore I can apply these theorems in our FNN model whose domain 54 is defined in H. The consistency of FNN model Fd (Ψ) is given below. Theorem 3.5.3. Let Fn be a class of functions satisfying n o a,2 a,1 Fn = f ∈ Fd (Ψ)|αi (ti , ti−1 ) ∈ CM , βi (ti ) ∈ CM , Ji ≤ kn , n n where Ψ is L-Lipschitz continuous squashing function. Let n 1X mn = arg min |f (Xj ) − Yj |2 . f ∈Fn n j=1 If Mn and kn satisfy 4+ 2d+2 a Mn → 0, βn → ∞, Mn → ∞, n then K (mn (x) − m(x))2 µ(dx) → 0 almost surely with E|Y |2 < ∞ and any compact set R K ⊆ H. Proof. For the approximation error, universal approximation theorem implies limn→∞ Mn → ∞ and limn→∞ kn → ∞. For the estimation error, theorem 3.5.1 implies that I can assume |Y | ≤ L almost surely, for some L, and then it is required to show that n 1X sup E|f (X) − Y |2 − |f (Xj ) − Yj |2 → 0 a.s.. f ∈Fn n j=1 Let H = L2 × [0, 1] and f : H → R. Define Z = (X, Y ), Z1 = (X1 , Y1 ), . . . , Zn = (Xn , Yn ), 55 where Xi ∈ H, Yi ∈ R and Hn = {h : H × R → R : ∃f ∈ Fn s.t. h(x, y) = |f (x(t), td ) − y|2 }. Assume Mn ≥ L so that functions in Hn satisfy 0 ≤ h(x, y) ≤ 2Mn2 + 2L2 ≤ 4Mn2 . Using the boundary of theorem 3.5.2, we have, for arbitrary  > 0, n ( ) 1X P sup | |f (Xi ) − Yi |2 − E{|f (X) − Y |2 }| >  f ∈Fn n i=1 n ( ) 1X =P sup | h(Zi ) − E{h(Z)}| >  h∈Hn n i=1 − n2  2 2 ≤8EN1 ( , Hn , Z1n )e 128(4Mn ) . 8 Next, I bound the covering number and get n 1X |h1 (Zi ) − h2 (Zi )| n i=1 n 1X = ||f1 (Xi ) − Yi |2 − |f2 (Xi ) − Yi |2 | n i=1 n 1 X = |f1 (Xi ) − f2 (Xi )||f1 (Xi ) − Yi + f2 (X2 ) − Yi | n i=1 n 1X ≤4Mn |f1 (Xi ) − f2 (Xi )|. n i=1 56 Thus,   N1 ( , Hn , Z1n ) ≤ N1 ( , Fn , X1n ). 
8 32Mn n ( ) 1X P sup | |f (Xi ) − Yi |2 − E{|f (X) − Y |2 }| >  f ∈Fn n i=1 − n2  2 2 ≤8N1 ( , Fn , X1n )e 128(4Mn ) 32Mn − n2  2 2 ≤8N∞ ( , Fn , X1n )e 128(4Mn ) 32Mn 32Mn2 (E + M n L+1 d−1 − 2 ! Mn L−1 )(Mn L) Mn L−1 2/a n2 ≤8 exp 2dK( ) −  128(4Mn2 )2 Mn L+1 2 d−1 − M 2L−1 2/a ! n1−δ 2 2dKMn4 32Mn (E + Mn L−1 )(Mn L) ≤8 exp −nδ ( − ( n ) ) , Mn4 2048 n  if δ > 0 exists. To satisfy the condition that ∞ n ( ) 1X |f (Xi ) − Yi |2 − E{|f (X) − Y |2 }| >  < ∞, X P sup | f ∈Fn n n=1 i=1 The following conditions are required: n1−δ lim → ∞, n→∞ Mn4 and Mn L+1 2 2dKMn4 32Mn (E + Mn L−1 )(Mn L) d−1 − M 2L−1 2/a ( n ) → 0. n  The latter condition can be simplified as 4+ 2d+2 a Mn →0 n 57 which implies the former one, therefore the former one is not necessary. In terms of consistency, approximation error and estimation error should be bounded together. The approximation error encourages the functional space to be larger, i.e. kn → ∞ and Mn → ∞. The estimation error controls the increasing rate of Mn . The above theorem 4+ 2d+2 a requires that the increasing rate of Mn is slower than n. 3.6 Discussion In this chapter, the consistency of our proposed FNN model is proved, which implies that the error between the FNN model and the true model can be as small as possible if the sample size is large enough. In the future, I will investigate other statistical properties, such as the convergence rate and asymptotic normality of FNN. When it comes to proof of the rate of convergence rate of FNN model, the idea is normally similar with that of consistency. I have to find the convergence rate of estimation error and approximation error and find a trade off. The convergence rate of estimation error is not hard to get given the covering number of the model, while that of approximation error is a mathematical problem, which is still under investigation. 58 Chapter 4 Transfer Learning for Genetic Studies In the previous chapter,a new statistical model FNN is proposed and its consistency is proved. In this chapter, I will introduce transfer learning and study whether transfer learning can be used to improve the performance of genetic analyses. 4.1 Introduction With the vast amounts of genetic data collected from biobank projects and from the Cau- casian population, an interesting scientific question is whether these resources can be used to enhance the genetic analysis in small-scale studies or in minority populations (e.g., African American population). With the increased amount of genetic data becoming publicly avail- able for genetic research, challenges remain in how to efficiently utilize these enriched re- sources in the studies of small-scale studies and in minority populations. A common as- sumption made by existing approaches is that two studies should be similar (e.g., the same population). However, this assumption could fail in reality. The study design and study population could be different between the two studies (e.g., Caucasian vs. African Ameri- can). Transfer learning does not require data from two studies drawn from exactly the same feature space or the same distribution. It learns possibly useful feature from mature studies and applies learned features based on focused problem. Therefore it holds great promise to use the enriched resources from large-scale studies for uncovering novel genetic variants in 59 small-scale studies or in minority populations [30]. Transfer learning has become increasingly popular in AI and is expected to become a key driver of machine learning success in the future. 
Various models have been tested, demonstrating the methodology and strength of transfer learning [18]. Transfer learning is widely used in image classification and natural language processing [13]; it transfers knowledge from a source study/population to a target study/population. While transfer learning has achieved success in areas such as natural language processing, it has rarely been used in genetic data analysis. In this chapter, I integrate the idea of transfer learning into neural networks and use the knowledge learned from the large-scale UK Biobank dataset to facilitate genetic analysis in small-scale studies or in minority populations. By integrating transfer learning into neural networks, I am able to transfer knowledge regarding complex genotype-phenotype relationships (e.g., gene-gene interactions) between two studies.

The remainder of the chapter is organized as follows. The details of our models and two candidate regularization methods are introduced in Section 4.2. Real data analyses are given in Section 4.3, which provides the comparison and selection between the two regularization methods and the performance of transfer learning. A summary and discussion are provided in Section 4.4.

4.2 Method

Transfer learning applies knowledge gained from one research problem to a different but related problem, and has been widely used in text and image recognition. The basic idea of transfer learning is illustrated in Figure 4.1. In this chapter, the model I use to illustrate transfer learning is the vanilla neural networks (NN) model. In this section, I briefly introduce the NN model, and then integrate the idea of transfer learning into NN (TL-NN) to transfer knowledge between two datasets. Finally, the technical details of TL-NN are provided.

Figure 4.1: Transfer Learning

To distinguish the different data types in Chapter 4, I use lower case letters, bold lower case letters, and upper case letters to denote scalars, vectors, and matrices, respectively. I also use lower case letters to denote output functions. In this section, the vanilla neural network is introduced first, the transfer learning model is then constructed based on the neural networks model, and technical details are discussed at the end.

4.2.1 Neural Networks

Suppose our research interest is to find a predictive function $f$ that models the response variable $y$ given the predictor variable $\boldsymbol{x} = (x_1, \ldots, x_q)$, where $q$ is the dimension of the input. The true model is written as
\[
y = f(\boldsymbol{x}) + \epsilon, \tag{4.1}
\]
where $\epsilon$ is the noise in this model. I use the fully connected NN model to demonstrate transfer learning:
\begin{align}
\boldsymbol{h}_1 &= \sigma(W_1 \boldsymbol{x} + \boldsymbol{b}_1), \tag{4.2}\\
\boldsymbol{h}_d &= \sigma(W_d \boldsymbol{h}_{d-1} + \boldsymbol{b}_d), \quad d = 2, \ldots, l-1, \tag{4.3}\\
\hat{y} &= \hat{f}(\boldsymbol{x}) = \boldsymbol{w}_l \boldsymbol{h}_{l-1}, \tag{4.4}
\end{align}
where $\boldsymbol{h}_d$ is the learned feature (the representation that will later be transferred from the source study) and has a dimension of $m_d$, $\sigma$ is a nonlinear activation function, such as the sigmoid function $\sigma(x) = 1/(1+e^{-x})$ used in this chapter, and $l$ is the number of layers of the NN model. $W_d$ and $\boldsymbol{b}_d$ are weights and biases, respectively. I denote all the parameters by $S = \{W_d, \boldsymbol{b}_d, \boldsymbol{w}_l \mid d = 1, \ldots, l-1\}$.

In NN, I can normalize the response variable $y$. Suppose that $\bar{y}$ and $\bar{v}$ are the estimated mean and standard deviation of $y$; the final model is then realized by
\[
\hat{y} = \hat{f}(\boldsymbol{x}) = \bar{v}\, \boldsymbol{w}_l \boldsymbol{h}_{l-1} + \bar{y}. \tag{4.5}
\]
NN can be applied to both regression and classification problems. For simplicity, I focus here on the regression problem, and use the L2 loss and mean squared error (MSE) to estimate parameters and evaluate the model's performance. The model can also be applied to classification problems by using different loss functions (e.g., cross entropy) and measurements (e.g., misclassification error). Based on the L2 loss, the parameters can be estimated as
\[
\hat{S} = \arg\min_{S} \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}(\boldsymbol{x}_i)\big)^2,
\]
where $n$ is the sample size, $\hat{S}$ is the set of all estimated parameters, and $y_i$ and $\boldsymbol{x}_i$ are the response and predictor variables of the $i$-th sample.
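To make the model concrete, here is a minimal sketch of the fully connected NN of equations (4.2)-(4.5) written in PyTorch. The sigmoid activation, the bias-free linear output layer, and the rescaling by $\bar{y}$ and $\bar{v}$ follow the description above; the use of PyTorch, the class name, and the 16-4 hidden architecture (introduced later in Section 4.2.3) are illustrative choices rather than the released implementation.

```python
import torch
import torch.nn as nn

class VanillaNN(nn.Module):
    """Fully connected NN of equations (4.2)-(4.5): sigmoid hidden layers,
    a bias-free linear output rescaled by the response mean and standard deviation."""

    def __init__(self, q, hidden=(16, 4), y_mean=0.0, y_sd=1.0):
        super().__init__()
        dims = (q,) + tuple(hidden)
        # W_d, b_d for the hidden layers (d = 1, ..., l-1)
        self.hidden = nn.ModuleList(
            [nn.Linear(dims[d], dims[d + 1]) for d in range(len(hidden))]
        )
        # w_l for the linear output layer (no bias, as in equation (4.4))
        self.out = nn.Linear(hidden[-1], 1, bias=False)
        self.y_mean, self.y_sd = y_mean, y_sd

    def forward(self, x):
        h = x
        for layer in self.hidden:
            h = torch.sigmoid(layer(h))          # h_d = sigma(W_d h_{d-1} + b_d)
        # Equation (4.5); with y_mean = 0 and y_sd = 1 this reduces to equation (4.4).
        return self.y_sd * self.out(h).squeeze(-1) + self.y_mean

# Example: q = 30 SNPs as input, with the 16-4 architecture used later in this chapter.
model = VanillaNN(q=30, hidden=(16, 4))
loss_fn = nn.MSELoss()                            # L2 loss / MSE
```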
Directly fitting NN to a large number of genetic covariates could lead to overfitting. To avoid overfitting, various regularization methods, such as penalty regularization, dropout, and early stopping, can be used. In this chapter, I impose a parameter norm penalty $p = p(S; \lambda)$ over our function space as a regularization method, where $\lambda$ is a hyperparameter controlling the solution space. In a typical NN, an L2 penalty is usually applied on the weight terms,
\[
p(S; \lambda) = \lambda \Big( \sum_{d=1}^{l-1} \|W_d\|_2^2 + \|\boldsymbol{w}_l\|_2^2 \Big).
\]
In genetic studies, genetic effects are usually smaller than the noise, and therefore heavy penalties are required to avoid overfitting. There are two regularization methods based on the parameter norm penalty: parameter regularization and constraint regularization. Their solutions are
\begin{align}
\hat{S} &= \arg\min_{S} \Big( \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2 + p(S; \lambda) \Big), \tag{4.6}\\
\hat{S} &= \arg\min_{S} \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2 \quad \text{s.t.} \quad p(S; \lambda) \leq 1. \tag{4.7}
\end{align}
The backpropagation method is typically used to obtain the solution. Since $p(S; \lambda)$ is a norm penalty, the larger $\lambda$ is, the smaller the parameter space is. Parameter regularization is a widely used regularization method to avoid overfitting: it adds a penalty term to the objective function, which encourages the weights of the NN to be small in each epoch of backpropagation. Nonetheless, large weights are still possible under parameter regularization. Constraint regularization can be used to avoid large weights by forcing the weights to be small, while the penalty term does not affect the backpropagation. In Section 4.3.2, I compare the performance of parameter regularization and constraint regularization; based on the results, I recommend constraint regularization for genetic studies. The selection of the hyperparameter $\lambda$ is also discussed in the next section.
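To make the distinction concrete, the following is a minimal sketch of the two schemes in equations (4.6) and (4.7), assuming the PyTorch model sketched above. Parameter regularization adds $p(S;\lambda)$ to the objective so that the penalty enters the gradients, whereas the constraint variant shown here simply rescales the weights after an update whenever $p(S;\lambda) > 1$; this projection step is one simple way to respect the constraint, while the realization actually used in this chapter (equation (4.15) below) switches the training objective to the penalty when the constraint is violated. All helper names are illustrative.

```python
import torch

def norm_penalty(model, lam):
    """p(S; lambda) = lambda * sum of squared L2 norms of the weight matrices."""
    return lam * sum((p ** 2).sum() for name, p in model.named_parameters()
                     if "weight" in name)

def parameter_regularized_loss(model, loss, lam):
    # Equation (4.6): the penalty enters the objective and hence the gradients.
    return loss + norm_penalty(model, lam)

def apply_constraint(model, lam):
    # Equation (4.7): after an update, project back into {S : p(S; lambda) <= 1}
    # by shrinking all weights with a common factor; gradients are unaffected.
    with torch.no_grad():
        penalty = norm_penalty(model, lam)
        if penalty > 1.0:
            scale = (1.0 / penalty) ** 0.5   # p is quadratic in the weights
            for name, p in model.named_parameters():
                if "weight" in name:
                    p.mul_(scale)
```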
4.2.2 Transfer Learning

Suppose we have source data (e.g., UK Biobank) in which we model the response variable $y'$ with the predictor variable $\boldsymbol{x}' = (x_1', \ldots, x_q')$, where $\boldsymbol{x}$ and $\boldsymbol{x}'$ share a similar data structure and similar effects. Usually, $\boldsymbol{x}$ and $\boldsymbol{x}'$ are the same properties measured on different subjects. We have the analogous model
\begin{align}
\boldsymbol{h}_1' &= \sigma(W_1' \boldsymbol{x}' + \boldsymbol{b}_1'), \tag{4.8}\\
\boldsymbol{h}_d' &= \sigma(W_d' \boldsymbol{h}_{d-1}' + \boldsymbol{b}_d'), \quad d = 2, \ldots, l-1, \tag{4.9}\\
\hat{y}' &= \hat{f}(\boldsymbol{x}') = \bar{v}'\, \boldsymbol{w}_l' \boldsymbol{h}_{l-1}' + \bar{y}', \tag{4.10}
\end{align}
where $\bar{y}'$ and $\bar{v}'$ are the estimated mean and standard deviation of $y'$. The parameter space of this model is denoted by $S'$, and the set of its solutions is denoted by $\hat{S}'$. Since the solutions are derived from the source study, it is reasonable to suppose that $\boldsymbol{h}_d'$ is a good representation of the features in $\boldsymbol{x}'$. The relation between $\boldsymbol{x}'$ and $\boldsymbol{h}_d'$ is realized through $\{\hat{W}_d', \hat{\boldsymbol{b}}_d' \mid d = 1, \ldots, l-1\}$, which is denoted by $\hat{S}_1'$; the other part, $\{\hat{\boldsymbol{w}}_l'\}$, is denoted by $\hat{S}_2'$. Furthermore, we use $f_1'$ and $f_2'$ to denote these relationships, i.e.,
\begin{align}
\boldsymbol{h}_d' &= f_1'(\boldsymbol{x}' \mid \hat{S}_1'), \tag{4.11}\\
\hat{y}' &= f_2'(\boldsymbol{h}_d' \mid \hat{S}_2'). \tag{4.12}
\end{align}
Similarly, the parameter set $\hat{S}$ from the focused problem can also be divided into $\hat{S}_1$ and $\hat{S}_2$, with the corresponding relationships denoted by $f_1$ and $f_2$. The penalty term $p(S; \lambda)$ can accordingly be written as $p(S_1, S_2; \lambda)$.

Based on the idea of transfer learning, I use $\hat{S}_1'$ as the solution of the focused problem rather than $\hat{S}_1$. The solution of $S_2$ is then derived from the dataset of the focused problem given $S_1 = \hat{S}_1'$. The final solution in transfer learning is
\begin{align}
\tilde{S}_1 &= \hat{S}_1' = \arg\min_{S_1, S_2} \frac{1}{n} \sum_{i=1}^{n} \big(y_i' - f'(\boldsymbol{x}_i')\big)^2, \quad \text{s.t. } p(S_1, S_2; \lambda) \leq 1, \tag{4.13}\\
\tilde{S}_2 &= \arg\min_{S_2} \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i \mid S_1 = \tilde{S}_1)\big)^2, \quad \text{s.t. } p(\tilde{S}_1, S_2; \lambda) \leq 1. \tag{4.14}
\end{align}
The difference between the neural networks and transfer learning methods is illustrated in Figure 4.2.

Figure 4.2: Transfer Learning in neural networks

4.2.3 Technical Issue

The optimization algorithm in this chapter is the Adam algorithm [16]. To satisfy the restriction condition $p(S_1, S_2; \lambda) \leq 1$, the back propagation procedure is realized as
\begin{align}
c(S) &=
\begin{cases}
\frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2, & \text{if } p(S_1, S_2; \lambda) \leq 1,\\[4pt]
p(S_1, S_2; \lambda_1, \lambda_2), & \text{if } p(S_1, S_2; \lambda) > 1,
\end{cases} \tag{4.15}\\
s &\leftarrow s - r_1 \frac{\partial c(S)}{\partial s}, \quad s \in S_1, \tag{4.16}\\
s &\leftarrow s - r_2 \frac{\partial c(S)}{\partial s}, \quad s \in S_2, \tag{4.17}
\end{align}
where $r_1$ and $r_2$ are the learning rates for $S_1$ and $S_2$, respectively. The Adam algorithm is an adaptive optimization method in which the learning rate is determined element-wise. While the default settings of the Adam algorithm work well in most cases, they are not suitable for genetic studies: due to the heavy penalty in the last layer, the solution of $S$ can be too small compared with the default learning rate of the Adam algorithm. Therefore, I set the learning rate based on $\lambda$ while keeping the other parameters at their default values. The selection of this parameter is described in detail in the next section. For transfer learning, I keep the same settings as those used in NN, i.e., $r_1 = 0$ while $r_2$ remains the same. The iteration process stops when the MSE has not decreased for 3000 epochs.

The fitted NN model has 2 hidden layers (i.e., $l = 3$), with 16 and 4 hidden units, respectively. For comparison purposes, I also added a baseline model, which is essentially the mean value of the response variable, $\bar{y}$. The measurement comparing the two models is therefore
\[
m(f) = 1 - \frac{\sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.
\]
The larger the value, the better the fitted model compares to the baseline model. To illustrate the prediction ability, I split the investigated dataset into a training set (80%) and a test set (20%), tune the parameters of our models on the training set, and compare their performance on the test set. To remove the random effect of splitting, I repeat the experiment 100 times with random splits of the sample. The results are summarized by boxplots, in which the left part shows the value of $m(f)$ on the training set and the right part shows that on the test set.
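Before turning to the data analyses, here is a minimal sketch of how the TL-NN fit of equations (4.13)-(4.17) can be realized with the model and the `norm_penalty` helper sketched earlier: the hidden-layer solution $\hat{S}_1'$ from the source study is copied into the target model and frozen (so that effectively $r_1 = 0$), only the output layer $S_2$ is updated with Adam, and the objective switches to the penalty whenever the constraint is violated, as in equation (4.15) but with a single $\lambda$ for simplicity. The function name, the fixed epoch budget, and the omission of the early-stopping rule are simplifications, not the released implementation.

```python
import torch

def transfer_and_fit(source_model, target_model, x, y, lam, lr, epochs=5000):
    """Sketch of the TL-NN fit: freeze S_1 at the source solution, update S_2 only."""
    # S_1 <- S_1' : copy the source solution and freeze the hidden layers (r_1 = 0).
    target_model.load_state_dict(source_model.state_dict())
    for layer in target_model.hidden:
        for p in layer.parameters():
            p.requires_grad = False

    # Only S_2 (the output layer) is updated with Adam, as in equation (4.14).
    optimizer = torch.optim.Adam(target_model.out.parameters(), lr=lr)
    mse = torch.nn.MSELoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        penalty = norm_penalty(target_model, lam)   # helper from the earlier sketch
        # Equation (4.15), single-lambda version: if the constraint is violated,
        # descend on the penalty; otherwise descend on the MSE.
        loss = penalty if penalty > 1.0 else mse(target_model(x), y)
        loss.backward()
        optimizer.step()
    return target_model
```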
4.3 Real Data Analysis

4.3.1 Data Description

Cigarette smoking is one of the leading causes of preventable disease, contributing to 5 million deaths worldwide each year [21]. During the last decade, a great deal of progress has been made in identifying genetic variants associated with smoking. Among those findings, the nAChR subunit genes (e.g., CHRNA5) have been identified and confirmed in several large-scale studies [19]. In this application, I use transfer learning to study the complex relationships of the nAChR subunit genes with nicotine dependence.

The datasets to be analyzed are the large-scale UK Biobank (UKB) dataset and the relatively small-scale dataset from the Study of Addiction: Genetics and Environment (SAGE). UKB is a population-based prospective cohort of nearly 500,000 individuals recruited in the United Kingdom who were aged 40-69 years [2]. UKB contains a wealth of data, including detailed clinical information, genome-wide genotype data, and whole-exome sequencing data. SAGE is one of the most comprehensive studies conducted to date aimed at discovering genetic contributions to substance use disorders. Samples from SAGE were selected from three studies: COGEND, COGA, and FSCD. Multiple phenotypes were measured in the SAGE studies. For the focus of our gene-based analysis, I used cigarettes per day (CPD), which is available in both SAGE and UKB, and only considered genes from two clusters, CHRNA5-CHRNA3-CHRNB4 and CHRNB3-CHRNA6. SAGE comprises 1445 female and 1272 male samples, among which 807 samples are African-American and 1910 samples are Caucasian. Prior to the analysis, I re-assessed the quality of the data (e.g., checking for successful genotype calls, missing rates, deviation from the Hardy-Weinberg equilibrium, and unexpected relationships). After a careful quality assessment, I first use the Irish sample to determine the hyperparameters.

Two transfer learning studies are investigated in this chapter: one is a cross-ethnicity study and the other is a cross-project study. In the first study, I investigate whether the genetic findings from the British population can be transferred to the Black and Irish populations. In the second study, I utilize the data from UK Biobank to help the genetic research in SAGE. To illustrate the effect of the transfer learning strategy, I split the investigated dataset into a training set (80%) and a test set (20%), fit the original NN model and the transfer learning model on the training set, and compare their performance on the test set. To remove the random effect of splitting, I repeat the experiment 100 times with random splits. A validation process is used to determine the value of $\lambda$: I split the training set into subtrain and validation sets with a ratio of 4:1, evaluate a range of possible values of $\lambda$, such as $\{1, 10, 100, 1000\}$, on the subtrain set, and then select the optimal $\lambda$ based on the validation set (a sketch of this validation loop is given at the end of Section 4.3.2.2).

4.3.2 Regularization

In this section, the two regularization methods, parameter regularization and constraint regularization, are compared under different possible values of the hyperparameter $\lambda$. The constraint regularization method, with a suitable hyperparameter, is then used in the following sections.

4.3.2.1 Constraint regularization

The effect of various hyperparameter values for constraint regularization is illustrated here. The candidate values of $\lambda$ are $\{10^{2}, 10^{1.5}, 10^{1}, 10^{0.5}, 10^{0}\}$. As seen from Figure 4.3, the optimal value of the penalty parameter $\lambda$ is between $10^{0.5}$ and $10^{0}$.

Figure 4.3: Constraint Regularization (boxplots of $m(f)$, in %, on the training and test sets for each gene under different values of $\log_{10}\lambda$)

4.3.2.2 Parameter Regularization

The possible values of $\lambda$ for parameter regularization are $\{10^{0}, 10^{-0.5}, 10^{-1}, 10^{-1.5}, 10^{-2}\}$. The corresponding learning rate parameter in the Adam method is $10^{-2.5} \cdot \lambda^{-0.5}$. The optimal value of the penalty parameter $\lambda$ is between $10^{-1.5}$ and $10^{-2}$ according to Figure 4.4.
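The validation loop used to choose $\lambda$ in these comparisons can be sketched as follows, assuming hypothetical helpers `fit` and `m_score` that train a regularized model and compute $m(f)$ on held-out data; the candidate grid shown matches the one listed for constraint regularization above, and all names are illustrative.

```python
import numpy as np

def select_lambda(x, y, candidates=(10**2, 10**1.5, 10**1, 10**0.5, 10**0), seed=0):
    """Grid search for the penalty hyperparameter on a 4:1 subtrain/validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_sub = int(0.8 * len(y))                      # 4 : 1 split of the training set
    sub, val = idx[:n_sub], idx[n_sub:]

    best_lam, best_score = None, -np.inf
    for lam in candidates:
        model = fit(x[sub], y[sub], lam=lam)       # hypothetical training routine
        score = m_score(model, x[val], y[val])     # m(f) on the validation set
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam
```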
4.3.2.3 Comparison

Compared with parameter regularization, constraint regularization allows a higher learning rate, which enables the NN to escape local minima and attain better performance, and its results are more stable across different values of $\lambda$ in the penalty constraint. Therefore, in the following analyses, I use the constraint regularization method with the selected parameter setting.

Figure 4.4: Parameter Regularization (boxplots of $m(f)$, in %, on the training and test sets for each gene under different values of $\log_{10}\lambda$)

4.3.3 Cross-ethnicity

The vast amount of genetic data collected from the Caucasian population provides a great resource for genetic research in other populations, especially minority populations. In the UK Biobank dataset, there are around 300,000 British people, 10,000 Irish people, and 10,000 Black people. In this project, I transfer the knowledge learned from the British population to the Irish and Black populations to enhance the genetic study of these populations. Applying the NN model and the transfer learning model, I obtain the results shown in Figure 4.5 for the Black population; the performance comparison for the Irish population is given in Figure 4.6. As seen in the figures, transfer learning helps improve the performance for CHRNA3, CHRNA5, and CHRNB4, while it has limited gain for CHRNA6 and CHRNB3. Zhang et al. [38] found genetic heterogeneity of CHRNA6 and CHRNB3 in studies of nicotine dependence, while the heterogeneity was not significant in the other three genes, which is consistent with our finding.

Figure 4.5: Transfer the knowledge from the British population to the Black population

Figure 4.6: Transfer the knowledge from the British population to the Irish population

Figure 4.7: Transfer the knowledge from UK Biobank to SAGE

4.3.4 Cross-Projects

In this scenario, I transfer the knowledge from the UK Biobank data to the SAGE data. For this analysis, I focus on the Caucasian population in both datasets. Applying the NN model and the transfer learning model to the datasets, I obtain the results illustrated in Figure 4.7. As we can see from the figure, the NN model performs poorly due to the small sample size (i.e., around 1000) in the SAGE dataset, and the genetic effect can only explain less than 1 percent of the variation. Compared to NN, transfer learning improves the performance by using the information from UK Biobank. The improvement is observed for all genes, as we focus on the Caucasian population in this application and genetic heterogeneity therefore has little influence on the performance.

4.4 Discussion

In this chapter, I have shown the advantages of transfer learning for genetic research. Transfer learning can improve prediction performance across populations, but we need to be cautious about heterogeneity in the data or in the underlying genetic mechanisms (e.g., genetic heterogeneity across populations). While I illustrate the method by transferring knowledge to a small-scale study and to minority populations, transfer learning can potentially be used for other purposes, such as transferring genetic findings between different traits (e.g., smoking and alcohol drinking) and different species (e.g., rats and mice).

Chapter 5

Epilogue

The main goal of this dissertation is to investigate neural network based models and their applications to genetics.
In Chapter 2, I proposed a new neural networks model, FNN, which combines the functional linear model and nonlinear activations in a recursive way. By treating genetic and phenotype data as functional data, the underlying smoothness is utilized to improve the model's performance in simulations and real data analyses. The consistency of the proposed FNN model is proved in Chapter 3 by providing the conditions that bound the approximation error and the estimation error. In Chapter 4, I show how transfer learning can help transfer knowledge between different populations and studies.

In the FNN model discussed in Chapter 2, I mainly focused on continuous phenotypes. In the future, I will extend the FNN model to discrete variables and hence solve the classification problem; this can be done by changing the loss function and the corresponding link function. Another extension of the FNN model is the application to multivariate functions. In the real data analyses, the input function is the SNP sequence, indexed by position on the chromosome, and the output function is the phenotype over age; these functions have only one index. In the simulations, I have also shown that FNN outperformed existing methods when the output function is a bivariate function. Therefore, it will also be interesting to apply FNN to brain surface data.

The consistency of the proposed FNN model is proved in Chapter 3. While consistency has been proved for the vanilla NN model, the result is not directly applicable to the FNN model. The domain of the variables in the NN model is a Euclidean space, whose dimension is finite, whereas the domain of the FNN model contains the integrable functions $L^2([0,1])$, whose dimension is infinite. Therefore, I had to prove the universal approximation theorem and calculate the covering number from scratch. Chapter 3 proves that the estimated FNN model can approximate the underlying model to any desired accuracy if enough samples are provided, but it cannot quantify the speed of the approximation, i.e., the rate of convergence. The difficulty lies in a mathematical problem, namely the distance between the FNN model function space and the space of continuous functions defined on the domain of the FNN model. This is an interesting problem for future research.

Transfer learning is applied to genetic studies under the framework of the vanilla NN model in Chapter 4. An obvious extension is to change the framework to our proposed FNN model, which can be realized by replacing NN with FNN. The real data analyses in this dissertation are limited to human genetic data, where I transfer knowledge from the British population to the Irish and Black populations. The developed model can also be used for cross-species studies, where the understanding of genetic effects in human subjects can benefit from the knowledge learned from animals. Animal studies can be more powerful and can be designed in ways that are not allowed in human studies. I am cooperating with other groups on transfer learning between rats and mice. One of the challenges of cross-species studies is that SNPs from different species are not aligned, and effort is required to find the positions of the corresponding SNPs in other species. Currently, I am working to overcome these problems and to use FNN in cross-species studies.

APPENDICES

APPENDIX A

Solutions of FLM

First, the solution to equation (2.2), whose phenotype is a scalar value, is discussed. In practice, I do not have the true functions but have observations at discrete points $t_j$, $j = 1, \dots$
, p. Using the genetic data as an example, we observe genotypes Gij for i-th samples at locus j. By considering the effects of genotypes as a function, I obtain the following beta-smooth function([7]): Xp ŷi = θ0 + Zi θ + Gij β(tj ). j=1 By using pre-specified basis functions βk (t), k = 1, . . . , K (e.g., B-spline basis functions), PK I can further expand β(t) as β(t) = k=1 wk βk (t) and rewrite FLM as, Xp Xk ŷi = θ0 + Zi θ + Gij ( wk βk (tj )) j=1 k=1 Xk Xp = θ0 + Zi θ + wk ( Gij βk (tj )) k=1 j=1 Xk = θ0 + Zi θ + wk dik k=1 = θ0 + Zi θ + Di w, Pp where dik = j=1 Gij βk (tj ), w = (w1 , . . . , wK )T and Di = (di1 , . . . , diK )T . FLM (equation 78 (2.2)) is then transformed to a linear model: Ŷ = XΘ + Dw, where X = (1n , Z), Z = (Z1T , . . . , ZnT )T , Θ = (θ0 , θ T )T . 1n refers to a column vector with all 1s. The penalized loss function to solve the model is Z R(w, Θ) = (Ŷ − Y )T (Ŷ −Y)+λ (β 00 (t))2 dt = (Ŷ − Y )T (Ŷ − Y ) + λwT P w. By minimizing the loss function, I have the solution: ŵ = (DT (I − X(X T X)−1 X T )D + λP )−1 DT (I − X(X T X)−1 )Y Θ̂ = (X T X)−1 X T (Y − Dŵ). Now I would provide the solution to (2.3) whose phenotype is dispersed in a one dimen- sional space. To our best knowledge, no existing analytical method is available to solve the model described in equation (2.3) without imposing restrictions on the model. I adapt the commonly used restriction ([14]) sij = si0 j , ∀i, i0 . Without loss of generality, I use sj , εij to denote sij and εi (sij ). The model can then be rewritten as, Z Yij = Yi (sj ) = Zi θ + α0 (sj ) + α(sj , t)Gi (t)dt + εij . I further generalize the above model by assuming that the coefficients for covariates θ is 79 a function, Z Yij = Zi θ(sj ) + α0 (sj ) + α(sj , t)Gi (t)dt + εij , where θ(s) = (θ1 (s), . . . , θm (s)). By adding another basis functions αl (s), l = 1, . . . , L, I replace α0 (s) and α(s, t) with PL PK PL l=1 cl αl (s) and k=1 l=1 Wkl αl (s)βk (t), respectively. (d) For the bivariate function α(s, t), I use two basis function systems {βk(d) , d ∈ {0, 1}, kd ∈ {1, . . . , ld }}, l (1) X (1) α0 (s) = b (1) β (1) (s), k k k (1) =1 l(1) l(0) X X (0) (1) α(s, t) = W (0) (1) β (0) (t)β (1) (s). k k k k k (1) =1 k (0) =1 By using the package provided by [14], which is based on the REML method, I can obtain the solution of W , b, and θ. 80 APPENDIX B Functional Neural Networks The technical details of the functional neural networks model will be discussed in this ap- pendix. At first, the explicit form of the proposed model will be given step by step. In practice, integration can be approximated by numeric integration in the form of sum- mation. For simplicity, I assume that all functions is scaled in the interval of [0, 1]. If I choose evenly distributed points t1 , t2 , . . . , tm , a naive integration for X(t) is Z X m X(t)dt = X(tj )/m. j=1 I can rewrite functional notation in a matrix form, i.e. (d) (d) m(d) ,l(d) B (d) = [β (d) (tj )] , k j=1,k (d) =1 (d) (d) n,m(d) X (d) = [Xi (tj )]i=1,j=1 . 81 Therefore, the forward propagation algorithm can be written as the following: D(d) = X (d−1) B (d−1) /m(d−1) , C (1) = D(1) W (1) + 1n,1 b(1) + Zθ, (B.1) C (d) = D(d) W (d) + 1n,1 b(d) , d > 1, X (d) = σ(C (d) B (d)T ). As is seen from the above equations, FNN reduces to a traditional neural networks when each matrix of basis B is a diagonal matrix. In other words, NN can be seen as a special case of FNN. When it comes to the solution of the FNN model, the process of back propagation should be explained. 
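Before turning to back propagation, here is a minimal NumPy sketch of the forward pass in equation (B.1): integrals are replaced by averages over the observation grid, each layer maps basis coefficients through the corresponding basis matrices, and the covariate term $Z\theta$ of the first layer is omitted for brevity. The function name, the array layout, and the exact indexing of the basis matrices are assumptions for illustration rather than the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fnn_forward(X0, B, W, b):
    """Forward pass of equation (B.1), with the covariate term Z*theta omitted.

    X0   : (n, m[0]) input functions observed on a grid of m[0] points.
    B[d] : (m[d], l[d]) basis matrix of layer d evaluated on its own grid.
    W[d] : (l[d], l[d+1]) coefficient matrices; b[d] : (l[d+1],) bias coefficients.
    """
    X = X0
    for d in range(len(W)):
        D = X @ B[d] / B[d].shape[0]        # numeric integration: D = X B / m
        C = D @ W[d] + b[d]                 # C = D W + 1_n b
        if d < len(W) - 1:
            X = sigmoid(C @ B[d + 1].T)     # X^(d) = sigma(C^(d) B^(d)T)
        else:
            X = C @ B[d + 1].T              # last layer is linear: Y-hat = C B^T
    return X
```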
A gradient decent method is used to solve the weight functions. The recursive process stops when O(r) converges. The weights and biases are updated as follows: (d) O(r) = O(α(d) , α0 ), ∂O(r) W (r+1) = W (r) −γ , ∂W (r) ∂O(r) b(r+1) = b(r) − γ , ∂b(r) where γ is determined by using the adadelta method [37]. The deductions of derivatives involve matrices, functions, and trace. Two lemmas are given below to simplify the calculation of derivatives. Lemma B.0.1. For any function σ : R → R applied to matrix elementwisely, F : Rm×n → R that maps matrix to scalar. X, A, B are matrices with proper dimensions and A = σ(B). If ∂F(X) ∂F(X) ∂A = D, then ∂B = D ◦ σ 0 (B). 82 Proof. F(X) F(X) Aij = = Dij σ 0 (Bij ) = (D ◦ σ 0 (B))ij . Bij σ(Aij ) Bij Lemma B.0.2. For any function F : Rm×n → R that maps matrix to scalar. A, C are ∂F(X) ∂F(X) square matrices and Y = ABC. If ∂Y = D , then ∂B = AT DC T . Proof. ∂F(X) X ∂F(X) ∂(ABC)k,l = ( )k,l ∂Bij ∂ABC ∂Bij k,l X = Dkl Aki Cjl k,l ATik Dkl CljT X = k,l = (AT DC T )ij . ∂R(W,b) Let M (d) = . By using B.1 and lemma B, I have ∂C (d) ∂R(W, b) = 11,n M (d) , ∂b(d) (B.2) ∂R(W, b) = (X (d−1) B (d−1) )T M (d) /m(d−1) . ∂W (d) 83 For the last layer, σ (D) is a linear function, i.e. Ŷ = C (D) B (D)T . I then have − yij )2 P ∂R(W, b) ∂ i,j (ŷij = == 2(Ŷ − Y ), ∂ Ŷ ∂ Ŷ M (D) = 2(Ŷ − Y )B (D) . (B.3) For other layers, by using B.1, I have C (d+1) = (X (d) B (d) /m(d) )W (d+1) + 1n,1 b(d+1) = σ (d) (C (d) B (d)T )B (d) W (d+1) /m(d) + 1n,1 c(d+1) . Based on lemma B.0.1 and B, I have ∂R(W, b) = M (d+1) W (d+1)T B (d)T /m(d) , ∂σ (d) (C (d) B (d)T ) ∂R(W, b) = M (d+1) W (d+1)T B (d)T ◦ σ 0 (C (d) B (d)T )/m(d) , ∂(C (d) B (d)T ) M (d) = [(M (d+1) W (d+1)T B (d)T ) ◦ σ 0 (C (d) B (d)T )]B (d) /m(d) . (B.4) The formula equation (B.2), equation (B.3) and equation (B.4) are then used for calcu- lating back propagation on cost function R(W, b). As for the penalty term J(W, b), I have equation (B.5), which is illustrated below, Z (d) (d) (d) l(d) ,l(d) P0 = [ βi (t(d) )βj (t(d) )dt(d) ]i=1,j=1 , (d)00 (d)00 Z (d) l(d) ,l(d) P2 = [ βi (t(d) )βj (t(d) )dt(d) ]i=1,j=1 84  !2 !2   2 (d) ∂ 2 α(d) ∂ 2 α(d) ∂ 2 α0 Z Z Z J (d) =  +  dt(d) dt(d−1) +   dt(d) ∂t(d)2 ∂t(d−1)2 ∂t (d)2 (d−1) (d) (d) (d)0 0 (d−1) (d)  0   (d) (d) =tr P0 W P2 W + P0 W (d) P2 W + tr b(d) P2 b(d) Therefore, the final derivatives for back propagation are: ∂J (d) (d−1) (d) (d−1) (d) =2P0 W (d) P2 + 2P2 W (d) P0 , ∂W (d) (B.5) ∂J (d) (d) =2b(d) P2 . ∂b(d) 85 BIBLIOGRAPHY 86 BIBLIOGRAPHY [1] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge university press, 2009. [2] Clare Bycroft, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T Elliott, Kevin Sharp, Allan Motyer, Damjan Vukcevic, Olivier Delaneau, Jared O’Connell, et al. The uk biobank resource with deep phenotyping and genomic data. Nature, 562(7726):203– 209, 2018. [3] 1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061, 2010. [4] Heather J Cordell. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics, 10(6):392–404, 2009. [5] Murat Dikmen and Catherine M Burns. Autonomous driving in the real world: Ex- periences with tesla autopilot and summon. In Proceedings of the 8th international conference on automotive user interfaces and interactive vehicular applications, pages 225–228, 2016. [6] Gökcen Eraslan, Žiga Avsec, Julien Gagneur, and Fabian J Theis. 
Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20(7):389– 403, 2019. [7] Ruzong Fan, Yifan Wang, James L Mills, Alexander F Wilson, Joan E Bailey-Wilson, and Momiao Xiong. Functional linear models for association analysis of quantitative traits. Genetic epidemiology, 37(7):726–742, 2013. [8] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018. [9] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016. [10] Sara Goodwin, John D McPherson, and W Richard McCombie. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333, 2016. [11] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006. 87 [12] Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward net- works are universal approximators. Neural networks, 2(5):359–366, 1989. [13] Mahbub Hussain, Jordan J Bird, and Diego R Faria. A study on cnn transfer learning for image classification. In UK Workshop on Computational Intelligence, pages 191–202. Springer, 2018. [14] Andrada E Ivanescu, Ana-Maria Staicu, Fabian Scheipl, and Sonja Greven. Penalized function-on-function regression. Computational Statistics, 30(2):539–568, 2015. [15] Adam Kiezun, Kiran Garimella, Ron Do, Nathan O Stitziel, Benjamin M Neale, Paul J McLaren, Namrata Gupta, Pamela Sklar, Patrick F Sullivan, Jennifer L Moran, et al. Exome sequencing and the genetic basis of complex traits. Nature genetics, 44(6):623, 2012. [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [17] Tarun Lalwani, Shashank Bhalotia, Ashish Pal, Vasundhara Rathod, and Shreya Bisen. Implementation of a chatbot system using ai and nlp. International Journal of In- novative Research in Computer Science & Technology (IJIRCST) Volume-6, Issue-3, 2018. [18] Torrey Lisa and S Jude. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends, pages 242–264. IGI Global, 2010. [19] Jason Z Liu, Federica Tozzi, Dawn M Waterworth, Sreekumar G Pillai, Pierandrea Muglia, Lefkos Middleton, Wade Berrettini, Christopher W Knouff, Xin Yuan, Gérard Waeber, et al. Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nature genetics, 42(5):436–440, 2010. [20] Teri A Manolio, Francis S Collins, Nancy J Cox, David B Goldstein, Lucia A Hin- dorff, David J Hunter, Mark I McCarthy, Erin M Ramos, Lon R Cardon, Aravinda Chakravarti, et al. Finding the missing heritability of complex diseases. Nature, 461(7265):747–753, 2009. [21] Colin D Mathers and Dejan Loncar. Projections of global mortality and burden of disease from 2002 to 2030. PLoS medicine, 3(11):e442, 2006. [22] Melinda C Mills and Charles Rahal. A scientometric review of genome-wide association studies. Communications biology, 2(1):1–11, 2019. [23] Kalyanapuram Rangachari Parthasarathy. Probability measures on metric spaces, vol- ume 352. American Mathematical Soc., 2005. 88 [24] James O Ramsay and CJ Dalzell. Some tools for functional data analysis. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):539–561, 1991. [25] JO Ramsay. 
When the data are functions. Psychometrika, 47(4):379–396, 1982. [26] JO Ramsay and BW Silverman. Functional data analysis-methods and case studies, 2002. [27] Fabrice Rossi, Nicolas Delannay, Brieuc Conan-Guez, and Michel Verleysen. Represen- tation of functional data in neural networks. Neurocomputing, 64:183–210, 2005. [28] Andrew J Saykin, Li Shen, Tatiana M Foroud, Steven G Potkin, Shanker Swaminathan, Sungeun Kim, Shannon L Risacher, Kwangsik Nho, Matthew J Huentelman, David W Craig, et al. Alzheimer’s disease neuroimaging initiative biomarkers as quantitative phenotypes: Genetics core aims, progress, and plans. Alzheimer’s & dementia, 6(3):265– 273, 2010. [29] Elias M Stein and Rami Shakarchi. Real analysis: measure theory, integration, and Hilbert spaces. Princeton University Press, 2009. [30] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pages 242–264. IGI global, 2010. [31] Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer, 1996. [32] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017. [33] Olga A Vsevolozhskaya, Dmitri V Zaykin, David A Barondess, Xiaoren Tong, Sneha Jadhav, and Qing Lu. Uncovering local trends in genetic effects of multiple phenotypes via functional linear models. Genetic epidemiology, 40(3):210–221, 2016. [34] Olga A Vsevolozhskaya, Dmitri V Zaykin, Mark C Greenwood, Changshuai Wei, and Qing Lu. Functional analysis of variance for association studies. PLoS One, 9(9):e105074, 2014. [35] Fei-Yue Wang, Jun Jason Zhang, Xinhu Zheng, Xiao Wang, Yong Yuan, Xiaoxiao Dai, Jie Zhang, and Liuqing Yang. Where does alphago go: From church-turing thesis to alphago thesis and beyond. IEEE/CAA Journal of Automatica Sinica, 3(2):113–120, 2016. 89 [36] Paul Yushkevich, Stephen M Pizer, Sarang Joshi, and James Stephen Marron. Intu- itive, localized analysis of shape variability. In Biennial International Conference on Information Processing in Medical Imaging, pages 402–408. Springer, 2001. [37] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012. [38] Xuefen Zhang, Tongtong Lan, Tong Wang, Wei Xue, Xiaoran Tong, Tengfei Ma, Guifen Liu, and Qing Lu. Considering genetic heterogeneity in the association analysis finds genes associated with nicotine dependence. Frontiers in genetics, 10:448, 2019. [39] James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani, and Amalio Telenti. A primer on deep learning in genomics. Nature genetics, 51(1):12–18, 2019. 90