NEURAL NETWORKS MODELS WITH APPLICATIONS TO GENETIC STUDIES By Shan Zhang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics — Doctor of Philosophy 2021

ABSTRACT

NEURAL NETWORKS MODELS WITH APPLICATIONS TO GENETIC STUDIES By Shan Zhang

Artificial intelligence (AI) is a thriving research field with many successful applications in areas such as computer vision and speech recognition. Machine learning methods, such as neural networks (NN), play a central role in modern AI technology. When it comes to genetic studies, however, a vanilla NN model struggles with small genetic effects because it does not account for the underlying genetic structure (e.g., linkage disequilibrium). Furthermore, in emerging research fields (e.g., imaging genetics), researchers need to deal with different types of data. To address these challenges, I propose a functional neural networks (FNN) method that combines the functional linear model with the neural network structure. FNN uses a series of basis functions to model genetic effects within gene sequences and phenotype data with spatial-temporal structure. Through simulations and real data applications, I demonstrate the advantages of FNN for high-dimensional genetic data analysis in terms of robustness and accuracy. The source code of FNN is available at https://github.com/szhang0629/FNN. The conditions for the consistency of the proposed FNN model are also provided.

Transfer learning has been widely used in text and image classification and has demonstrated outstanding performance in applications. Nonetheless, it has rarely been used in genetic research, and its performance on genetic data is still largely unknown. I use vanilla neural networks to investigate whether the knowledge learned from UK Biobank databases or Caucasian samples can be used to facilitate genetic research in small-scale studies or in minority populations. The experiments show that transfer learning is useful for most genes.

I dedicate this dissertation to my parents, Houfa Zhang, Rongqun Shan and my friends, Xiaoxi, Peng, Xiaoran, Yuan, Chang and many others.

ACKNOWLEDGMENTS

Here, I would like to express my deepest gratitude to my advisors Dr. Qing Lu and Dr. Yuehua Cui for their support and guidance toward my studies and research. Dr. Lu and Dr. Cui are extremely kind and knowledgeable. They always provide valuable insights and suggestions on the improvements of my work. I would also like to extend my sincere thanks to my dissertation committee members, Dr. Lyudmila Sakhanenko and Dr. Haolei Weng. Their comments and suggestions have been beneficial to my studies. I am also grateful for the help I received from all the professors in the Department of Statistics and Probability. During my Ph.D. studies, I took most of the courses offered by our department. During my five years at Michigan State University, I made many friends, and because of them I never felt lonely in these years. Many thanks to my senior classmates: Dr. Jingyi Zhang, Dr. Xiaoran Tong and Dr. Liangliang Zhang. They have become successful faculty members and statisticians in big companies. My thanks also go to my friends: Jian Song, Ken Lee, Runze Su, Peide Li, Chang Jiang, Yuan Zhou, Jinghang Lin and Di Wu. Without your help, I could not be who I am now, and I sincerely wish all of you a wonderful future. Last but not least, I would like to express my sincere thanks to my parents for their support and confidence in me.
I would like to treat this dissertation as my unique gift to both of you. iv TABLE OF CONTENTS LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Genetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 2 Functional Neural Networks for High-dimensional Genetic Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Functional Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1.1 FLM for a scalar phenotype . . . . . . . . . . . . . . . . . . 8 2.2.1.2 FLM for complex phenotypes . . . . . . . . . . . . . . . . . 9 2.2.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Functional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Simulation 1: The effect of the phenotype dimension . . . . . . . . . 17 2.3.2 Simulation 2: The effect of the underlying function . . . . . . . . . . 19 2.3.3 Simulation 3: The effect of the input dimension . . . . . . . . . . . . 20 2.4 Real Data Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.1 The role of CHRNA5, race, gender, and age in predicting cigarette smoking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.2 The role of APOE in predicting hippocampus volume change over time 25 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Chapter 3 Consistency of Functional Neural Networks . . . . . . . . . . . . 30 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Universal Approximation Theorem . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.1 Definitions for New Construction . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Step One . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Step Two . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.4 Covering Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.1 Smooth Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2 Deductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Chapter 4 Transfer Learning for Genetic Studies . . . . . . . . . . . . . . . 59 v 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2.2 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 64 4.2.3 Technical Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.3.2.1 Constraint regularization . . . . . . . . . . . . . . . . . . . 69 4.3.2.2 Parameter Regularization . . . . . . . . . . . . . . . . . . . 70 4.3.2.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.3 Cross-ethnicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.4 Cross-Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Chapter 5 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 APPENDIX A Solutions of FLM . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 APPENDIX B Functional Neural Networks . . . . . . . . . . . . . . . . . . . . . 81 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 vi LIST OF FIGURES Figure 2.1: An illustration of neural networks and functional neural networks . . . . 14 Figure 2.2: Accuracy comparison of three methods for three different types of pheno- types (i.e., scalar, vector, and matrix) and three different levels of noise (i.e., σ =0.3, 0.6, and 1.2) . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Figure 2.3: Accuracy comparison of three methods for different underlying functions 21 Figure 2.4: Accuracy comparison of three methods for different numbers of input vari- ables (i.e., 80, 200, and 500) . . . . . . . . . . . . . . . . . . . . . . . . . 22 Figure 2.5: The role of CHRNA5, race, gender, and age in predicting smoking quantity . "NN" and "FNN" refer to the neural networks method and the func- tional neural networks method, while the number "1" and "2" indicate the number of hidden layers. . . . . . . . . . . . . . . . . . . . . . . . . . 24 Figure 2.6: Prediction accuracy of hippocampus volume change over time based on the gene APOE. "NN" and "FNN" refer to the neural networks method and functional neural networks method, respectively. The number "1" or "2" indicates the number of hidden layers. . . . . . . . . . . . . . . . . . 26 Figure 4.1: Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Figure 4.2: Transfer Learning in neural networks . . . . . . . . . . . . . . . . . . . . 66 Figure 4.3: Constraint Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Figure 4.4: Parameter Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Figure 4.5: Transfer the knowledge from the British population to the Black population 72 Figure 4.6: Transfer the knowledge from the British population to the Irish population 72 Figure 4.7: Transfer the knowledge from UK Biobank to SAGE . . . . . . . . . . . . 73 vii Chapter 1 Introduction 1.1 Overview In recent years, we have witnessed the emerging development of artificial intelligence (AI). In 2016, the AI AlphaGO beats professional Go player and dominates the game ([35]). Chatbot has become a basic tool for customer service in many companies ([17]). 
In the field of high technology, Tesla is a leading electric vehicle company in self-driving technology, and its Autopilot technology is based on AI ([5]). With these successful applications, AI has become a thriving research field that attracts researchers from many disciplines.

Although AI technology has achieved impressive performance in gaming, natural language processing, image classification, and many other areas, it has been less widely used in genetic studies, partially due to the high noise and small sample size of genetic data. A large number of samples is normally required to train a good AI model. While it is easy to find millions of pictures of cars on the Internet, it is time consuming and costly to collect genetic data, along with individuals' clinical records and demographic information. Most genetic studies have a sample size of around several thousand, which is underpowered for training a sophisticated AI model. The dimension of genetic data is also huge: the genetic information is stored in the nucleotide sequences of DNA, which contain over one billion nucleotides. Even if we focus on only one gene, it can still contain over ten thousand nucleotides. It is hard to build an AI model with ten thousand variables given a limited sample size. Finally, the genetic information alone can only explain a limited portion of the variance of phenotypes. Complex phenotypes, such as height, are determined by both genetic and environmental factors. An AI model attempts to explain all of the variance with the given genetic data, which can lead to excellent prediction on the training dataset but low performance on the testing dataset. Many regularization methods are available to alleviate this problem, but overfitting remains a challenging problem for AI methods, especially in the high-dimensional setting.

In this chapter, the background of genetic research is introduced in section 1.2. In section 1.3, the vanilla neural networks models are briefly introduced. The objective and organization of the dissertation are given in section 1.4.

1.2 Genetic Data

I start with a brief introduction to genetic data. Although DNA has a huge number of nucleotides, over 99% of them are the same among all human beings. Therefore, I only take nucleotides with variation into consideration; these are called single nucleotide polymorphisms (SNPs). Most SNPs have two alleles. Assuming an additive mode of inheritance, SNPs can be coded as {0, 1, 2} based on the number of minor alleles. In my dissertation, I focus on studying genes, which involve multiple SNPs (e.g., around 30 SNPs). In a gene, linkage disequilibrium (LD) exists among SNPs [7]. Based on LD, I can assume the genetic effect of an SNP is related to that of its nearby SNPs. Under this assumption, I develop a functional neural network method, which is introduced in detail in Chapter 2.

A variety of phenotypes, such as brain volume, weight, or smoking frequency, are used in genetic studies. If age is considered as the index, these phenotypes can be naturally modeled as a function of age. The temporal relationship within the phenotype can be considered by assuming the continuity of the function. This concept is used in the model proposed in this dissertation.

1.3 Neural Networks

The neural networks (NN) model plays a fundamental role in the AI field. A basic NN model f : R^d → R can be realized by

f(x) = \sum_{j=1}^{J} a_j \, \sigma(w_j x + b_j) + c,

where J, a_j, c, w_j and b_j are parameters of the NN model. Among them, w_j is a d-dimensional vector while all other parameters are scalars.
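For illustration only, the following minimal NumPy sketch evaluates this one-hidden-layer model; the function and variable names are my own and are not taken from the dissertation's code.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid squashing function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def nn_forward(x, W, b, a, c):
    """One-hidden-layer NN: f(x) = sum_j a_j * sigma(w_j . x + b_j) + c.

    x : (d,) input vector
    W : (J, d) matrix whose j-th row is the vector w_j
    b : (J,) hidden biases,  a : (J,) output weights,  c : scalar intercept
    """
    hidden = sigmoid(W @ x + b)      # J hidden-unit activations
    return float(a @ hidden + c)

# toy usage with random parameters (illustrative values only)
rng = np.random.default_rng(0)
d, J = 5, 8
x = rng.normal(size=d)
print(nn_forward(x, rng.normal(size=(J, d)), rng.normal(size=J),
                 rng.normal(size=J), 0.0))
```

Training such a model amounts to choosing (W, b, a, c) to minimize a loss, typically by gradient descent.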
Based on the universal approximation theorem, given enough hidden units, the NN model can approximate any continuous function from R^d to R on a compact support. The domain of the NN function is a finite-dimensional space. In this dissertation, a new neural network model defined on a functional space is proposed, and its statistical properties, such as consistency, are developed.

1.4 Organization

The remainder of the dissertation is organized as follows. In Chapter 2, I propose a new neural network model called the functional neural networks model, which can deal with functional data and take the spatial and temporal relationships into consideration. In Chapter 3, I prove the consistency of the proposed model. In Chapter 4, I apply the transfer learning idea to genetic studies. In Chapter 5, I discuss possible extensions of these methods and future work.

Chapter 2 Functional Neural Networks for High-dimensional Genetic Data Analysis

2.1 Introduction

With the recent advancement in high-throughput technologies, genome-wide association studies (GWAS) and sequencing studies have been commonly adopted to uncover new genetic variants predisposing to common complex diseases ([15], [32]). During the past decade, thousands of genetic variants have been identified through GWAS and sequencing studies, some with compelling biological plausibility for a role in disease etiology ([22]). Despite such success, for most complex diseases, identified genetic variants are associated with small effect sizes and only explain a small fraction of total heritability ([20]). Many common diseases are influenced by the interplay of multiple genes and other risk factors (e.g., environmental determinants) in a complex manner ([4]). This complexity, however, has not been taken fully into account by many existing approaches, which often assume genetic variants are related to disease phenotypes in an additive and linear manner.

Machine learning methods, such as the neural network (NN) method, hold great promise for genetic research ([6]). Based on its hierarchical structure, NN learns complicated features from simpler ones, making it capable of capturing non-linear and non-additive genetic effects (e.g., gene-gene interaction effects). Given its exceptional performance, NN has been increasingly used in genomics, especially in regulatory genomics, variant calling, and pathogenicity scores ([39]). Despite its successful applications in these areas, the use of NN in revealing the complex relationships between genetic variants and common diseases is still limited.

While the great potential of NN in genetic data analysis of complex diseases is reasonably expected, the high dimensionality of the genetic data and the complexity of genetic structure bring tremendous analytic challenges. High-throughput technology allows us to simultaneously evaluate the role of thousands or even millions of variants in complex diseases ([10]). Nevertheless, fitting an NN on such a large number of genetic variables could bring a serious overfitting issue. In genetic studies, linkage disequilibrium (LD) exists among neighboring variants, and disease-associated variants are often in an LD block. Considering the underlying genetic structure can help us combat the overfitting issue. Moreover, different types of phenotypes are often collected in a study.
Besides the measurement of a disease phenotype, which is typically a scalar variable, researchers are also interested in studying a vector of variables (e.g., the progression of disease measured over time). With the rapidly evolving technologies and ever-decreasing cost, studies are starting to collect omics and imag- ing data, which are often high-dimensional and stored as matrices or tensors. While omics and imaging data provide us a great opportunity to study complex diseases, few methods are currently available for high-dimensional genetic data analysis of complex phenotypes (e.g., vectors and matrices) with the consideration of non-linear and non-additive effects. To address these emerging challenges and facilitate the high-dimensional genetic data analysis of complex diseases, I propose a functional neural network (FNN) that inherits the strengths from both NN and the functional linear model (FLM). FLM is a popularly used method in functional data analysis (FDA) ([25]; [24]), which deals with data in the form of 6 functions. FLM has been commonly used to analyze data measured over time and has been increasingly used in high-dimensional data analysis, such as genetic data analysis ([7]; [33]) and imaging data analysis ([36]). Specifically, with the high-dimensional genetic variants as the input layer and various types of phenotypes as the output layer, the proposed FNN uses a series of basis functions to obtain a representation of functional data in each layer ([27]), and further builds multiple hidden layers via functional linear models. FNN has a number of attractive features: 1) it has a built-in facility to account for the underlying structures of complex phenotypes and genetic data (e.g., LD), which helps capture disease- related variants and overcome the curse of dimensionality; 2) it uses a functional neural network to model the complex relationship between genetic variants and disease phenotypes; and 3) it can be used to analyze different types of phenotypes (e.g., scalar, vector, and matrix). Through simulations and two real data applications, I show that FNN outperforms conventional methods, such as NN and FLM. I have also shown, at certain conditions, FNN can be simplified to NN. In other words, FNN can be viewed as the generalization of NN to high-dimensional data with complex phenotypes. The remaining chapter is organized as follows: the FNN method is proposed in Section 2. Section 3 uses simulations to compare the performance of FNN with those of FLM and NN. In Section 4, I illustrate the methods via two real data applications. Summary and discussion are provided in Section 5. The technical details of the methods are described in the Appendix. 7 2.2 Method In this section, I introduce NN and two types of FLMs for different types of phenotypes at first. Based on the concepts of FLM and NN, I develop FNN for high-dimensional genetic data analysis. Throughout the Chapter 2, I use lower case letters, bold lower case letters, upper case letters and bold upper case letters to denote scalars, vectors, matrices and functions respec- tively when it comes to Roman alphabet. For example, b is a scalar. b is a vector with bj as the j-th element of b. B is a matrix with Bi as the ith row of matrix B. Bij denotes a scalar whose value is the j-th element of Bi . B is a function. Due to large amount of parameters and functional representations, Greek alphabets are also used to denote values or functions. 
2.2.1 Functional Linear Model

This section introduces functional linear models for a scalar phenotype and for complex phenotypes separately. The details of their solutions are provided in Appendix A.

2.2.1.1 FLM for a scalar phenotype

I briefly introduce FLM in the setting of genetic data analysis with a scalar phenotype ([26], [34]). Let y_i and Z_i = (z_{i1}, ..., z_{iq}) denote a scalar phenotype (e.g., systolic blood pressure) and quantitative/qualitative covariates (e.g., age and gender) for the i-th individual. Let [G_{i1}, G_{i2}, ..., G_{ip}] be the single-nucleotide variants (SNVs) in an SNV set (e.g., a gene), coded additively (i.e., G_{ik} ∈ {0, 1, 2}, referring to the number of minor alleles), and let t_k be the corresponding position scaled into [0, 1]. Given the genotypes and the positions, I model the genetic variant function G_i(t), t ∈ [0, 1], as a linear combination of Dirac delta functions, i.e.,

G_i(t) = \sum_{k=1}^{p} G_{ik} \, \delta_{t_k}(t), \quad (2.1)

where \delta_{t_k}(t) satisfies \int_{-\infty}^{\infty} f(t)\,\delta_{t_k}(t)\,dt = f(t_k). The functional linear model can then be used to model the relationship between the scalar phenotype y_i and the genetic variant function G_i(t) with the consideration of Z_i:

\hat{y}_i = \theta_0 + Z_i \theta + \int G_i(t)\,\beta(t)\,dt, \quad (2.2)

where \beta(t) is the coefficient function measuring the genetic effects in a genetic region. The parameters \theta_0 and \theta are the intercept and the coefficients of the covariates, respectively.

2.2.1.2 FLM for complex phenotypes

FLM can be easily extended to handle complex phenotypes (e.g., disease phenotypes measured over time) by modeling complex phenotypes as functions. For instance, I can write the phenotype as Y_i(s) and use the following model ([14]) to consider the spatial/temporal dependence at location/time s,

\hat{Y}_i(s) = Z_i \theta + \alpha_0(s) + \int \alpha(s, t)\,G_i(t)\,dt, \quad (2.3)

where \alpha(s, t) is a bivariate function and \alpha_0(s) is an intercept function. The observed phenotype is treated as the function realization at discrete points s_{ij}, denoted as

\hat{y}_{ij} = \hat{Y}_i(s_{ij}) = Z_i \theta + \alpha_0(s_{ij}) + \int \alpha(s_{ij}, t)\,G_i(t)\,dt. \quad (2.4)

FLM has many desirable features for high-dimensional genetic data analysis. By considering the effects of SNVs as functions, I can utilize information from adjacent SNVs (i.e., nearby SNVs tend to have similar effects due to LD) and reduce the number of parameters, which helps us capture true signals and overcome the curse of dimensionality. Moreover, FLM has its own unique way of handling measurement errors and missing data, which are quite common in genetic data ([34]). Despite these advantages, FLM is not well suited to modeling complex relationships between SNVs and phenotypes or inter-relationships among SNVs (e.g., interactions). In order to address these issues, I integrate the idea of NN into FLM to improve its capacity for modeling complex non-linear and non-additive genetic effects.
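Because G_i(t) is a sum of Dirac delta functions, the integral in (2.2) reduces to a finite sum over the SNV positions, and expanding β(t) in a basis turns the FLM into an ordinary linear model. The dissertation's actual estimator is given in Appendix A; the sketch below is my own illustrative code (an unpenalized least-squares fit with no smoothness penalty) and is only meant to make this reduction concrete.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_basis=10, degree=3):
    """Evaluate n_basis clamped B-spline basis functions (degree 3 = 4th order) at x."""
    inner = np.linspace(0.0, 1.0, n_basis - degree + 1)
    knots = np.concatenate(([0.0] * degree, inner, [1.0] * degree))
    return np.column_stack(
        [BSpline(knots, np.eye(n_basis)[m], degree)(x) for m in range(n_basis)]
    )

def fit_flm_scalar(y, Z, G, t_k, n_basis=10):
    """Least-squares fit of y ~ theta0 + Z theta + sum_k G_ik * beta(t_k),
    with beta(t) expanded in a B-spline basis (smoothness penalty omitted)."""
    Phi = bspline_basis(t_k, n_basis)                 # (p, n_basis): B_m(t_k)
    X = np.column_stack([np.ones(len(y)), Z, G @ Phi])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, Phi

# toy data: n subjects, q covariates, p SNVs at positions t_k inside (0, 1)
rng = np.random.default_rng(1)
n, q, p = 200, 2, 30
Z = rng.normal(size=(n, q))
G = rng.integers(0, 3, size=(n, p)).astype(float)
t_k = np.sort(rng.uniform(0.05, 0.95, p))
y = G.mean(axis=1) + rng.normal(scale=0.5, size=n)
coef, Phi = fit_flm_scalar(y, Z, G, t_k)
beta_hat = Phi @ coef[1 + q:]          # estimated beta(t) evaluated at the SNV positions
```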
2.2.2 Neural networks

Neural networks can be viewed as multi-stage nonlinear regression models. A D-layer NN model F_N : R^{p+q} → R can be expressed in a recursive manner:

X_i^{(1)} = \sigma(b^{(1)} + Z_i \theta + G_i W^{(1)}),
X_i^{(d)} = \sigma(X_i^{(d-1)} W^{(d)} + b^{(d)}), \quad d = 2, \dots, D-1,
\hat{y}_i = f(X_i^{(D-1)} W^{(D)} + b^{(D)}),

where Z = \{z_{ik}\}_{n \times q}, G = \{G_{ik}\}_{n \times p} = \{G_i(t_k)\}_{n \times p} and \{X^{(d)}, d = 1, ..., D-1\} are the covariates, the genetic variables (e.g., SNVs) and the hidden layers, respectively. \sigma is the activation function (e.g., the sigmoid function) and f is the link function (e.g., the identity function). W^{(d)} and b^{(d)} are the weight matrix and the bias vector for the d-th layer. The details of NN are described in Goodfellow et al. [9].

An L2 penalty is commonly used in NN to control the model complexity. The penalized mean square error loss function is thus defined as

R(W, b) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{d=1}^{D} \left( \|W^{(d)}\|_F^2 + \|b^{(d)}\|_2^2 \right).

The back-propagation and gradient descent algorithms can then be used to solve for (W, b). In this chapter, the learning rate of NN is tuned by using the adaptive optimization method Adadelta ([37]). The number of hidden units and the parameter λ are also tuning parameters in this chapter. To determine the value of λ, I use the hyperparameter optimization algorithm introduced in [9]. Specifically, I choose the value of λ on a logarithmic scale so that the range of values is wide enough, i.e., \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}. The training data is split into a subtrain dataset and a validation dataset in the ratio of 4 : 1. I apply NN to the subtrain dataset with all possible λ values, and choose the best λ value based on the mean square error in the validation dataset. I then use the selected λ value to train the final model on the whole training set and obtain the model parameters. The same procedure can be used to determine the number of hidden units, which is chosen from \{2, 4, 8, 16, 32, 64\}.
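A minimal sketch of this validation-based selection of λ is given below. It is illustrative only: `train_nn` stands in for an arbitrary penalized NN training routine and is a hypothetical helper, not a function from the dissertation or from any particular library.

```python
import numpy as np

def select_lambda(X, y, train_nn,
                  lambdas=(1e-2, 10 ** -1.5, 1e-1, 10 ** -0.5, 1.0), seed=0):
    """Choose the L2 penalty weight by a 4:1 subtrain/validation split.

    train_nn(X, y, lam) is assumed to fit a penalized NN and return a callable
    model such that model(X_new) yields predictions (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_sub = int(0.8 * len(y))                     # 4:1 split
    sub, val = idx[:n_sub], idx[n_sub:]
    val_mse = []
    for lam in lambdas:
        model = train_nn(X[sub], y[sub], lam)
        val_mse.append(np.mean((model(X[val]) - y[val]) ** 2))
    best = lambdas[int(np.argmin(val_mse))]
    # refit on the whole training set with the selected penalty
    return best, train_nn(X, y, best)
```

The same loop, run over {2, 4, 8, 16, 32, 64}, can be used to select the number of hidden units.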
2.2.3 Functional neural networks

Functional neural networks can be viewed as an extension of NN in which the weights and biases are modeled as functions. By treating the weights and biases as functions, FNN can consider LD among SNVs and the spatial-temporal relations of phenotypes, as well as reduce the number of parameters.

In FNN, a genetic variant function G_i(t) based on the SNVs is first obtained as described in (2.1). Given the genetic variant function G_i(t), the first hidden layer X_i^{(1)}(t^{(1)}) can be constructed using equation (2.5). The additional D − 2 hidden layers, X_i^{(d)}(t^{(d)}), are then built recursively with different functional coefficients as shown in (2.6). Depending on the type of phenotype, equation (2.7) or (2.8) can be used to fit a scalar phenotype or a complex phenotype, respectively. The FNN model can thus be expressed as:

X_i^{(1)}(t^{(1)}) = \sigma\left( Z_i \theta + \alpha_0^{(1)}(t^{(1)}) + \int \alpha^{(1)}(t^{(1)}, t)\,G_i(t)\,dt \right), \quad (2.5)

X_i^{(d)}(t^{(d)}) = \sigma\left( \alpha_0^{(d)}(t^{(d)}) + \int \alpha^{(d)}(t^{(d)}, t^{(d-1)})\,X_i^{(d-1)}(t^{(d-1)})\,dt^{(d-1)} \right), \quad 1 < d < D, \quad (2.6)

\hat{y}_i = f\left( a_0^{(D)} + \int \alpha^{D}(t^{(D-1)})\,X_i^{(D-1)}(t^{(D-1)})\,dt^{(D-1)} \right), \quad (2.7)

\hat{y}_{ij} = \hat{Y}_i(s_{ij}) = f\left( \alpha_0^{(D)}(s_{ij}) + \int \alpha^{(D)}(s_{ij}, t^{(D-1)})\,X_i^{(D-1)}(t^{(D-1)})\,dt^{(D-1)} \right), \quad (2.8)

where \sigma and f are the activation function and the link function. In this chapter, I use the sigmoid function as the activation function and the identity function as the link function. \alpha_0^{(1)}(t^{(1)}), \alpha_0^{(d)}(t^{(d)}), a_0^{(D)} and \alpha_0^{(D)}(s_{ij}) are, respectively, the bias functions (or bias) for the input layer, the hidden layers, the output layer with a scalar phenotype and the output layer with a complex phenotype, where t, t^{(d)} and s_{ij} are, respectively, the coordinate systems for the genetic variant function (e.g., locus positions), the d-th layer and the response function (e.g., time). \alpha^{(1)}(t^{(1)}, t), \alpha^{(d)}(t^{(d)}, t^{(d-1)}), \alpha^{D}(t^{(D-1)}) and \alpha^{(D)}(s_{ij}, t^{(D-1)}) are the weight functions for the input layer, the hidden layers, the output layer with a scalar phenotype and the output layer with a complex phenotype, respectively. The explicit forms of the bias and weight functions are written below:

\alpha_0^{(d)}(t^{(d)}) = \sum_{k^{(d)}=1}^{l^{(d)}} b_{k^{(d)}}^{(d)} \, \beta_{k^{(d)}}^{(d)}(t^{(d)}),

\alpha^{(d)}(t^{(d)}, t^{(d-1)}) = \sum_{k^{(d)}=1}^{l^{(d)}} \sum_{k^{(d-1)}=1}^{l^{(d-1)}} W_{k^{(d-1)} k^{(d)}}^{(d)} \, \beta_{k^{(d-1)}}^{(d-1)}(t^{(d-1)}) \, \beta_{k^{(d)}}^{(d)}(t^{(d)}),

where (W, b) = \{W^{(d)}, b^{(d)} \mid d = 1, \dots, D\} are the parameters of interest, and \{\beta_{k^{(d)}}^{(d)}(t^{(d)}) \mid k^{(d)} = 1, \dots, l^{(d)}\} is the predetermined set of basis functions for the d-th layer.

When Y is a scalar, a_0^{(D)} is a scalar and \alpha^{(D)} is a univariate function. If Y is a vector (e.g., measurements over different time points), then \alpha_0^{(D)} is a univariate function and \alpha^{(D)} is a bivariate function. The model can be extended to a variety of phenotypes. For instance, if Y is a matrix, then the weight function can be expressed as a Cartesian product of three basis function systems.

As shown in Figure 1, NN and FNN have different types of weights and biases. While the weights and biases in NN are vectors or scalars, the weights and biases in FNN are functions. Correspondingly, NN uses multiplication by the weight matrix W^{(d)}, while FNN performs integration against the functional coefficient \alpha^{(d)}. In practice, the integration is realized by numerical integration in the form of a summation. In this sense, the weights and biases in NN are independent values, while those in FNN are realizations of a parameterized curve.

Figure 2.1: An illustration of neural networks and functional neural networks

To estimate the parameters in the FNN model, I define the loss function L(y, \hat{y}), the cost function R(\alpha, \alpha_0) and the objective function O(\alpha, \alpha_0) as follows:

L(y, \hat{y}) = (y - \hat{y})^2,
R(\alpha, \alpha_0) = \sum_{i,j} L(y(t_{ij}), \hat{y}(t_{ij})),
O(\alpha, \alpha_0) = R(\alpha, \alpha_0) + \lambda J(\alpha, \alpha_0).

A penalty term J(\alpha, \alpha_0) is added to the error function to address the overfitting issue raised by the high-dimensional genetic data. Instead of imposing a penalty (e.g., the lasso penalty) on the parameters as in NN, I impose a penalty on the second-order derivatives of the hidden functions to control their smoothness:

J(\alpha, \alpha_0) = \sum_{d=1}^{D} J^{(d)}(\alpha^{(d)}, \alpha_0^{(d)}),

J^{(d)}(\alpha^{(d)}, \alpha_0^{(d)}) = \int \int \left[ \left( \frac{\partial^2 \alpha^{(d)}}{\partial t^{(d)\,2}} \right)^2 + \left( \frac{\partial^2 \alpha^{(d)}}{\partial t^{(d-1)\,2}} \right)^2 \right] dt^{(d)}\,dt^{(d-1)},

where λ is a hyper-parameter determined by cross-validation. The penalized loss function can be optimized via a back-propagation algorithm, which is described in Appendix B.
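To make the construction above concrete, the following sketch (my own simplification, not the released FNN implementation) evaluates one hidden functional layer of the form (2.6) on a grid: the coefficient function α^{(d)} is represented through basis matrices, and the integral is approximated by a trapezoidal sum, as described in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def functional_layer(X_prev, t_prev, B_prev, B_cur, W, b):
    """One hidden layer of FNN evaluated on a grid (simplified sketch).

    X_prev : (n, m_prev) previous layer evaluated at grid points t_prev
    B_prev : (m_prev, l_prev) previous-layer basis functions evaluated at t_prev
    B_cur  : (m_cur, l_cur)   current-layer basis functions on the current grid
    W      : (l_cur, l_prev)  coefficients of alpha(t_cur, t_prev)
    b      : (l_cur,)         coefficients of the bias function
    Returns the current layer evaluated on its grid, shape (n, m_cur).
    """
    # alpha(t_cur, t_prev) on the grid: (m_cur, m_prev)
    alpha = B_cur @ W @ B_prev.T
    # trapezoidal approximation of  integral alpha(t_cur, t) * X(t) dt
    integral = np.trapz(alpha[None, :, :] * X_prev[:, None, :], t_prev, axis=2)
    bias = B_cur @ b                     # bias function on the current grid
    return sigmoid(integral + bias)
```

Stacking such layers, with the output layer (2.7) or (2.8) on top and the roughness penalty J added to the squared-error loss, gives the model that is trained by back-propagation.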
2.3 Simulations

Simulations were conducted to compare the performance of FNN with that of FLM and NN. In order to mimic the real structure of genetic data (e.g., allele frequencies and LD), the genetic data were obtained from the real sequencing data located on Chromosome 17: 7344328 - 8344327 from the 1000 Genomes Project [3]. Specifically, I randomly chose a segment of SNVs, 200 training samples, and 50 testing samples. Due to the intensive computation time of analyzing complex phenotypes (e.g., matrices), 100 replicates were simulated to illustrate the performance.

Based on the genetic data, we simulated different types of phenotypes (i.e., scalars, vectors, and matrices), and evaluated both linear and non-linear relationships between genotypes and phenotypes. To reflect real disease scenarios, we added random noise to the simulated data, and compared the performance of the three methods by gradually increasing the noise variance level. A general form of the simulation model is given below,

Y_{ij} = Y_i(s_{ij}) + \varepsilon_{ij}, \quad j = 1, 2, \dots, m,

where Y_i(s) = F(s; G_i(t)) corresponds to a function-to-function mapping L^2([0, 1]) → L^2([0, 1]). It models a functional phenotype Y_i(s) based on the genetic variant function G_i(t), where s is the coordinate system of the phenotype function and s_{ij} is a point in that coordinate space. The noise \varepsilon_{ij} follows an i.i.d. normal distribution, N(0, \sigma^2), and m is the dimension of the phenotype. The explicit expression of the model is given in the following simulations. In all simulations, the function has K points as discrete realizations of the function, and the points are randomly generated from [0, 1]. When m = 1, Y_i(x) is a scalar. For vector types of phenotypes, I set m = 20 (i.e., the phenotype of the i-th individual is a vector of 20 elements). In the matrix setting, I apply an outer product of two functions in the form of (2.9). The discrete observational points are randomly generated from a rectangle in [0, 1] × [0, 1], where m = 400.

For simplicity, I compare NN and FNN with one hidden layer. The number of hidden units in NN, the number of basis functions for FLM and the number of basis functions for the hidden layer in FNN are fixed at 40. The numbers of basis functions of the first layer and the last layer of FNN are 60 and 20, respectively. The basis function system used in FLM and FNN is a fourth-order B-spline basis function system. I use an adaptive method, Adadelta [37], to determine the learning rate. The hyperparameter λ is chosen from \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\} and is determined by the validation process.

2.3.1 Simulation 1: The effect of the phenotype dimension

In this simulation, I evaluated the performance of the three methods under different types of phenotypes (i.e., scalar, vector, and matrix). For each replicate, I randomly chose 200 samples as training data and 50 samples as testing data, each with 200 SNVs obtained from the 1000 Genomes Project. Based on the genotypes, I simulated phenotypes using the following model,

F(x; G(t)) = \sum_{l=1}^{20} \int_0^1 c_l \, G^{e_l}(t)\, B_{1l}(t)\, B_{2l}(x)\,dt, \quad (2.9)

where B_{1l} and B_{2l} are two 4th-order B-spline basis functions with 10 knots randomly generated from [0, 1]. The coefficients c_l and e_l are randomly generated from the uniform distributions U[-2, 2] and U[1/3, 3], respectively.
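For reference, the following sketch generates phenotypes from one plausible reading of model (2.9). It is my own illustrative code: in particular, the integral over t is approximated by an average over the observed SNV positions, and each B_{1l}, B_{2l} is taken to be a single random cubic B-spline basis element, which may differ from the exact discretization and knot construction used in the dissertation.

```python
import numpy as np
from scipy.interpolate import BSpline

def random_basis_element(rng, degree=3):
    """A single 4th-order (cubic) B-spline basis element with random knots in [0, 1]."""
    knots = np.sort(rng.uniform(0.0, 1.0, degree + 2))
    return BSpline.basis_element(knots, extrapolate=False)

def simulate_phenotype(G, t, s_grid, L=20, rng=None):
    """Simulate Y_i(s) on s_grid from genotypes G (n, p) observed at positions t (p,)."""
    rng = np.random.default_rng() if rng is None else rng
    Y = np.zeros((G.shape[0], len(s_grid)))
    for _ in range(L):
        c = rng.uniform(-2.0, 2.0)
        e = rng.uniform(1.0 / 3.0, 3.0)
        B1 = np.nan_to_num(random_basis_element(rng)(t))       # B_1l at the SNV positions
        B2 = np.nan_to_num(random_basis_element(rng)(s_grid))  # B_2l at the phenotype grid
        # crude Riemann approximation of the integral over t
        Y += c * ((G ** e) @ B1 / len(t))[:, None] * B2[None, :]
    return Y

# toy usage: add N(0, sigma^2) noise to the simulated signal
rng = np.random.default_rng(2)
G = rng.integers(0, 3, size=(200, 200)).astype(float)
t, s_grid = np.sort(rng.uniform(0, 1, 200)), np.sort(rng.uniform(0, 1, 20))
Y = simulate_phenotype(G, t, s_grid, rng=rng) + rng.normal(scale=0.3, size=(200, 20))
```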
To compare the performance of the three methods, I computed the mean squared error (MSE) on both training and testing samples for three different types of phenotypes (i.e., scalar, vector, and matrix) and three different levels of noise variance (i.e., σ = 0.3, 0.6, and 1.2). The left and right groups of box plots in each panel of Figure 2 summarize the MSE of the three methods calculated from the training data and the testing data, respectively. Overall, FNN attains performance better than or comparable to FLM and NN in the testing data. NN performs the best in the training data but is subject to low performance in the testing data, which can potentially be explained by overfitting caused by hundreds of genetic variables. The overfitting issue becomes more serious as the noise level increases. On the other hand, FLM provides robustness against the overfitting issue due to its simplicity and fewer parameters. However, the linear structure of FLM fails to capture complex features of the data, leading to low performance in both the training data and the testing data. FNN has a good balance between bias and variance. It is able to capture complex features of the data while remaining robust to the overfitting issue.

Figure 2.2: Accuracy comparison of three methods for three different types of phenotypes (i.e., scalar, vector, and matrix) and three different levels of noise (i.e., σ = 0.3, 0.6, and 1.2)

2.3.2 Simulation 2: The effect of the underlying function

I further evaluated the performance of the three methods under different underlying functions (i.e., polynomial, logistic, and linear functions). I denote the polynomial function by "polyno", which is described in equation (2.9). In addition, we simulated a "logist" function, which performs the logit transformation on the genetic variables,

F(x; G(t)) = \sum_{l=1}^{20} \int_0^1 c_l \left(1 - \exp(e_l G(t))\right) \cos(a_l t + u_l) \cos(b_l x + v_l)\,dt, \quad (2.10)

where a_l, b_l, c_l ~ U[-12, 12], e_l ~ U[0.1, 1], and u_l, v_l ~ U[-π, π]. To evaluate the methods' performance in a linear setting, we also simulated a linear function denoted by "linear",

F(x; G(t)) = \sum_{l=1}^{20} \int_0^1 \left( c_l G(t) \cos(a_l t + u_l) \cos(b_l x + v_l) + d_l G(t) B_{1l}(t) B_{2l}(x) \right) dt, \quad (2.11)

where a_l, b_l, c_l, d_l ~ U[-12, 12], e_l ~ U[0.1, 1], and u_l, v_l ~ U[-π, π]; B_{1l} and B_{2l} are two fourth-order B-spline basis functions with 10 knots generated randomly from [0, 1].

As shown in Figure 3, overall, FNN has performance better than or at least comparable to the other two methods. As expected, FLM performs the best in the linear setting, although the performance of FNN is close to that of FLM. FNN is slightly better than NN and outperforms FLM in the non-linear settings. While NN also outperforms FLM in the non-linear settings, it suffers from the overfitting issue due to the high dimensionality of the data.

Figure 2.3: Accuracy comparison of three methods for different underlying functions

2.3.3 Simulation 3: The effect of the input dimension

In this simulation, I compared the performance of the three methods with an increased dimension of input variables (i.e., p = 80, 200, 500, and m = 20). The simulation setting is the same as the one used in simulation 2 with the polynomial function, except for the dimension of the input variables. As shown in Figure 4, when the number of input variables p and the noise level increase, all three methods have decreased accuracy in terms of testing MSE. Among the three methods, FLM has the most robust performance due to the linear structure it imposes. Nonetheless, such a structure also limits its performance for modeling non-linear effects. While NN is powerful for capturing non-linear effects, it suffers from the overfitting issue, especially with an increasing number of input variables and noise level. Overall, FNN attained the best performance. FNN inherits the advantages of both NN and FLM. It has the capacity to capture non-linearity, like NN, and is generally robust to an increased number of input variables and an increased noise level, like FLM.

Figure 2.4: Accuracy comparison of three methods for different numbers of input variables (i.e., 80, 200, and 500)

2.4 Real Data Applications

2.4.1 The role of CHRNA5, race, gender, and age in predicting cigarette smoking

Cigarette smoking is one of the leading causes of preventable disease, contributing to 5 million deaths worldwide each year ([21]).
During the last decade, a great deal of progress has been made in identifying genetic variants associated with smoking. Among those findings, the CHRNA5 gene has been identified and confirmed in several large-scale studies ([19]). In this application, I evaluate the role of CHRNA5, race, gender, and age in predicting cigarette smoking, measured by smoking frequency.

The genetic dataset to be analyzed is from a large-scale genetic study, the Study of Addiction: Genetics and Environment (SAGE). The participants of SAGE were unrelated individuals selected from three independent studies: COGEND, COGA, and FSCD. SAGE included smoking measurements and personal characteristics (e.g., age). Prior to the analysis, I re-assessed the quality of the genetic data. Markers and samples with low quality and high missingness were removed from the analysis. The final data include 2502 samples, among which 795 are African-American, 1707 are Caucasian, 1479 are female, and 1023 are male.

Since cigarette smoking depends on age, smoking quantity is considered as a function of age, and I model its relationship with CHRNA5, race, gender, and age. To evaluate the prediction performance of the three methods, I randomly split the whole dataset into training (80%) and testing (20%) data. By applying the methods to the training data, I built models and obtained the training MSE. The models were then evaluated on the testing data by calculating the testing MSE. To avoid chance findings due to random splitting, I repeated the procedure 100 times and obtained the average MSE values.

Figure 2.5: The role of CHRNA5, race, gender, and age in predicting smoking quantity. "NN" and "FNN" refer to the neural networks method and the functional neural networks method, while the numbers "1" and "2" indicate the number of hidden layers.

As shown in Figure 5, the FNN method attains higher prediction accuracy than FLM and NN in terms of testing MSE. In this analysis, I evaluated NN and FNN with one and two hidden layers. While adding additional layers could allow us to model more abstract and complex features from the data, the result suggests that the models with one hidden layer are sufficient to model the relationship between CHRNA5, race, gender, age and smoking quantity. Among all five models, the best model is FNN with one hidden layer.

2.4.2 The role of APOE in predicting hippocampus volume change over time

Accurately identifying high-risk Alzheimer's Disease (AD) individuals at an early stage is important for early AD prevention, as treatment prior to the onset of dementia can ensure that intervention occurs before irreversible neuronal death. In this application, I study the role of APOE in predicting hippocampus volume change over time, an important indicator for AD.

Data used in the application were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) ([28]). The ADNI was launched in 2003 as a public-private partnership. The primary goal of ADNI is to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease. It has three phases: ADNI1, ADNI GO, and ADNI2. ADNI includes standardized diagnostic assessments of AD (i.e., Case vs. Control).
DNA samples were obtained from 808 ADNI participants and were sent to Illumina, where non-CLIA whole-genome sequencing (WGS) was performed on each sample. MR image data (e.g., hippocampus volumes) and clinical data (e.g., biospecimens and cognitive tests) were also collected, both at baseline and through follow-up visits.

I downloaded and reformatted the analysis-ready ADNI dataset. Prior to the data analysis, I re-assessed the quality of the sequencing data. Markers and samples with a low proportion of successful genotype calls were removed from the analysis. After a careful quality assessment, I selected all SNVs in the APOE gene and evaluated its role in predicting hippocampus volume change over time (i.e., the phenotype is a longitudinal phenotype).

Figure 2.6: Prediction accuracy of hippocampus volume change over time based on the gene APOE. "NN" and "FNN" refer to the neural networks method and functional neural networks method, respectively. The number "1" or "2" indicates the number of hidden layers.

After removing samples with only one visit, I have 547 samples and 1434 measurements of hippocampus volume over time. The phenotype is the hippocampus volume change from the initial value. The initial year and the current year of investigation are taken as the time points of interest. For FLM and NN, to consider time-varying effects, I model the initial year and the current year as covariates in the models.

Figure 6 shows that FNN and FLM attain comparable performance, and both of them obtain higher accuracy than NN. Among all the models, FNN with 2 hidden layers has the best performance in terms of testing MSE. While there is no obvious evidence of overfitting for NN, the model built by NN could be too complicated to reflect the underlying relationship between APOE and hippocampus volume change over time, leading to sub-optimal performance.

2.5 Discussion

Ongoing GWAS and sequencing studies allow researchers to comprehensively investigate the role of a deep catalog of human genome variations in human complex diseases. Although these studies hold great promise for uncovering novel variants predisposing to human diseases, the high dimensionality of genetic data and the complex relationships between genotypes and disease phenotypes bring tremendous challenges to data analysis. To facilitate the gene discovery process, I develop an FNN model for high-dimensional genetic data analysis of complex phenotypes. FNN uses the hierarchy of the functional neural networks framework to learn complicated features from genetic data, making it feasible to capture non-linear genetic effects. As we observed from simulations and real data analyses, FNN attained performance better than or at least comparable to that of FLM and NN. When the underlying relationship between genotypes and disease phenotypes is nonlinear, both FNN and NN outperform FLM. Even when the underlying relation is linear, FNN has performance similar to FLM and attains better performance than NN.

In addition to its capacity to capture non-linear effects, FNN inherits advantages from FLM, which allow FNN to utilize prior information on the smoothness of effects to improve accuracy and robustness. By modeling the effects of SNVs as a function in the form of a combination of basis functions, FNN is able to take LD information from nearby SNVs into account and to reduce model complexity.
Similarly, by modeling complex phenotypes as functions, FNN reduces the number of parameters, making it applicable to high-dimensional 27 phenotypes (e.g., matrices). Through simulations, the FNN method can have a close perfor- mance to FLM in the linear setting, where NN was subject to an overfitting issue. Even in the nonlinear cases, FNN has demonstrated its robustness against overfitting and attained better or comparable performance than NN. FNN can be applied to different types of phenotypes (e.g., vectors and matrices), which make it useful for emerging genetic research (e.g., imaging genetics research or multi-omics research). Similar to FLM, FNN has a unique advantage for modeling complex phenotypes with temporal and spatial correlations (e.g., imaging or longitudinal phenotypes). Such an advantage has been shown in two real data applications. For instance, in the cigarette smoking application, I have also tried a traditional approach by treating age as a covariate in FNN. The result from such an approach is only slightly better than the result from NN. Therefore, utilizing the functional structure of phenotypes (i.e., considering smoking quantity as a function of age) can improve the performance of FNN significantly. Nonetheless, FNN is different from the classic functional model. Unlike FLM, FNN can handle the situation when the evaluation points are different among phenotypes (e.g., individuals’ phenotype data are collected at different time points or locations), which is very common in real scenarios. Methods have been developed to integrate the spline method into convolutional neural networks for image classification. In [8], B-splines are used for kernel construction in the convolutional layers to make the computation time independent from the kernel size, resulting in great performance in image classification. Different from the previous methods, FNN model is based on vanilla neural networks, which uses basis functions to model temporal and spatial correlations of the phenotype data (e.g., brain volumes measured over time) and LD structure of genetic data. The contribution of this chapter is thus to introduce a new tool, FNN, that addresses the ongoing challenges from high-dimensional genetic data analysis and 28 facilitates the gene discovery with complex phenotypes. FNN inherits advantages from both NN and FLM, which makes it capable of considering complex relationships (e.g., non-linear relationships) between genotypes and phenotypes while not substantially increasing model complexity. Therefore, compared to NN and FLM, it achieves a better bias-variance balance. These features make it ideal for analyzing high-dimensional genetic data that has a large number of SNVs, low signal-to-noise ratio, and complex relationships with phenotypes. In this chapter, I mainly focus on introducing the FNN model and illustrating FNN with simulations and real data applications. Statistical properties of FNN (e.g., consistency and rate of convergence) are important topics that may be worth further investigation in the future work. 29 Chapter 3 Consistency of Functional Neural Networks In the previous chapter, I introduced the FNN model and conducted simulations and real data analysis to evaluate its performance. While simulations and real data applications showed FNN had advantages over existing methods, it is also important to study the statistical property of FNN. In this chapter, the consistency of FNN will be proved. 
In the course of the proof, several existing lemmas and theorems are applied and play significant roles. I quote them and provide citations when I introduce them, before they are applied; their proofs can be found in the cited sources. For my own theorems, the proofs are given directly below the statements. The notation of the proposed model is changed relative to Chapter 2 for consistency with the quoted theorems and for ease of proof.

3.1 Introduction

Consistency is an important property of a statistical model. It determines whether the estimate converges to the true function as the sample size n increases. Since FNN is a new nonlinear model, it is also necessary to check its consistency.

In the proposed FNN, the underlying function can be any continuous function, which forms a large function space. In section 3.3, I prove that for any continuous function, there exists an FNN model that approximates the continuous function to any given precision. However, even if such an FNN model exists, it is uncertain whether it can be learned from the given data. If the space of candidate models is too large, it is hard to find the right one. To measure how large the function space of FNN is, I use the covering number to quantify its size in section 3.4. In section 3.5, I prove that the covering number of the proposed model is not too large, and that the estimated model deviates from the best possible model only at a controlled rate. Combining all these results, I prove the consistency of FNN.

3.2 Model

First, some definitions needed in the proof are introduced.

Definition 3.2.1. A function space B = \{B_i(t) \mid i \in \mathbb{N}^+\} is a basis function system in a compact space C if, for every \varepsilon > 0 and every continuous real-valued function f defined on C, there exists a function g(t) = \sum_{i=1}^{N} a_i B_i(t), a_i \in \mathbb{R}, satisfying |f(t) - g(t)| < \varepsilon on C.

In the following, the concept of smoothness is used to restrict the function space of linear combinations of basis functions. Therefore, the basis functions must be differentiable. In Chapter 2, B-spline basis functions are utilized as a basis function system, and they are differentiable. Moreover, a B-spline basis function can be selected that is differentiable to exactly a given order. By the Stone-Weierstrass theorem, monomial functions also form a basis function system. Other common basis function systems, such as the monomial basis or the Fourier basis, are also differentiable on a compact space. Based on the basis function system, I can define a function-to-function operator.

Definition 3.2.2. Let x(t) be a function in L^2([0, 1]). Define an operator M(\cdot) : L^2([0, 1]) \to L^2([0, 1]) as M(x(t)) = \int x(t)\,\alpha(s, t)\,dt + \beta(s), where \alpha(s, t) and \beta(s) are linear combinations of basis functions. To write this in a recursive pattern, denote x(t) = x_0(t_0) and

y_i(t_i) = M_i \circ x_{i-1}(t_{i-1}) = M_i(x_{i-1}(t_{i-1})) = \int x_{i-1}(t_{i-1})\,\alpha_i(t_i, t_{i-1})\,dt_{i-1} + \beta_i(t_i), \quad (3.1)

where

\alpha_i(t_i, t_{i-1}) = \sum_{j=1}^{J_i} \sum_{k=1}^{J_{i-1}} W_{i,j,k}\, B_{i-1,k}(t_{i-1})\, B_{i,j}(t_i),

\beta_i(t_i) = \sum_{j=1}^{J_i} b_{i,j}\, B_{i,j}(t_i),

and \{B_{i,j}(t) \mid j \in \mathbb{N}^+\} is a predetermined basis function system for each i, and W_{i,j,k} and b_{i,j} are parameters of the operator.

Definition 3.2.3. A function \Psi : \mathbb{R} \to [0, 1] is called a squashing function if it is non-decreasing, \lim_{\lambda \to \infty} \Psi(\lambda) = 1 and \lim_{\lambda \to -\infty} \Psi(\lambda) = 0. If x(t) is a function, \Psi \circ x(t) = \Psi(x(t)). To express this in a recursive way, I have x_i(t_i) = \Psi(y_i(t_i)).
The xi (ti ) and the yi (ti ) refers to the same object as in 3.1, therefore I can construct a 32 model by connecting these operator: Definition 3.2.4. A model F : L2 ([0, 1]) × [0, 1] → R is called FNN if it satisfies the form, F (x0 (t0 ), td ) = yd (td ) = Md ◦ Ψ ◦ Md−1 ◦ Ψ ◦ · · · M2 ◦ Ψ ◦ M1 ◦ x0 (t0 ), (3.2) where d is the number of hidden layers in FNN. The function space of all possible FNN satisfying the above form is denoted by Fd (Ψ). In the following, I use H to denote the model domain L2 ([0, 1]) × [0, 1]. Definition 3.2.5. For functions defined on L2 ([0, 1]), x(t) and y(t) are considered as in the same equivalence class if 01 |x(t) − y(t)|dt = 0. R The distance of elements in space H is defined by Z 1 d(hi , hj ) = |xi (t) − xj (t)|dt + |si − sj |, 0 where hi = (xi (t), si ), hj = (xj (t), sj ) ∈ H. Remark 3.2.1. Normally random variable are defined in a subset of Euclidean space, while the domain of the variable in the proposed FNN model involves a function space L2 ([0, 1]). There comes an issue on whether the definition of random variable is valid. From the section on convergence of sample distribution in [23], random variable is constructed on any separable metric space. Therefore the domain H is meaningful in terms of the space of random variable since L2 ([0, 1]) is separable. 33 3.3 Universal Approximation Theorem Hornik [12] proved that multi-layer forward neural networks are universal approximators in 1989. In this section, I prove that FNN is also a universal approximator based on approxi- mations. To outline the proof, I first construct another function space Gd (Ψ), and then prove that Gd (Ψ) is dense in all real continuous function defined on H following the idea in [12]. Finally, that Fd (Ψ) is dense is proved by showing that any function in Gd (Ψ) can be approximated by functions in Fd (Ψ). 3.3.1 Definitions for New Construction Definition 3.3.1. Let x(t) be a function on L2 ([0, 1]). Define an operator L(·) as L(x(t)) = R x(t)α(s, t)dt + β(s), where α(s, t) and β(s) are step functions. To express in a recursive pattern, I have yi (ti ) = Li ◦ xi−1 (ti−1 ) = Li (xi−1 (ti−1 )) Z = xi−1 (ti−1 )αi (ti , ti−1 )dti−1 + βi (ti ), where Ji Ji−1 X X αi (ti , ti−1 ) = Wi,j,k χE (ti−1 )χEi,j (ti ), i−1,k j=1 k=1 Ji X βi (ti ) = bi,j χEi,j (ti ). j=1 Among them, Wi is a matrix where the j-th row and k-th column element is Wi,j,k . bi is 34 a vector where the element in the j-th position is bi,j . E(·) are sub intervals of [0, 1], where Ei,j ∩ Ei,j 0 = ∅ if j 6= j 0 . χE is a characteristic function satisfying:   if x ∈ E  1  χE (t) =   0  if x ∈ / E. For every set E, denote that aE + b = {ax + b|x ∈ E} Definition 3.3.2. The function space Gd (Ψ) is a set of all possible functions f of the the following form: f (x0 (t0 ), td ) = yd (td ) = Ld ◦ Ψ ◦ Ld−1 ◦ Ψ ◦ . . . L2 ◦ Ψ ◦ L1 ◦ x0 (t0 ). An operator L(·) whose domain is a set of measurable functions is bounded by λ if R |L(x(t)) − L(y(t))| ≤ λ |x(t) − y(t)|dt. 3.3.2 Step One Recall that a family A of real functions defined on a set E is an algebra if A is closed under addition, multiplication, and scalar multiplication. A family A separates points on E if, for every x, y in E, x 6= y, there exists a function f in A such that f (x) 6= f (y). The family A vanishes at no point of E if, for each x in E, there exists f in A such that f (x) 6= 0. Theorem 3.3.1. (Stone-Weierstrass Theorem.) Let A be an algebra of real continuous functions on a compact set K. 
If A separates points on K and A vanishes at no point of K, 35 then the uniform closure B of A consists of all real continuous functions on K. In this chapter, the term C(H) is used to denote all real continuous functions defined on H. Definition 3.3.3. A subset S of a metric space (X, ρ) is ρ - dense in a subset T if, for every  > 0 and every t ∈ T , there is an s ∈ S such that ρ(s, t) < . Definition 3.3.4. A subset A of C is said to be uniformly dense on compacta in C if, for every compact subset K which is subset of the domain of C, A is ρK - dense in C, where ρK (f, g) = supx∈K |f (x) − g(x)|. By the definition, A is uniformly dense on compacta in C has the same meaning with that the uniform closure B of A consists of all real continuous functions on K. In order to apply the Stone-Weierstrass theorem on Gd (Ψ), I have to show that Gd (Ψ) is an algebra. Lemma 3.3.1. Gd (Ψ) is an algebra on H if Ψ is the cosine function. Proof. It is necessary to show that Gd (Ψ) is closed under addition, multiplication, and scalar multiplication on C if Ψ is the cosine function. (i) Gd (Ψ) is closed under the scalar multiplication: ∀f ∈ Gd (Ψ), λf can be represented by f in which Wd and bd are replaced by λWd and λbd without changing any other parameters. (ii) Gd (Ψ) is closed under addition: It is clear that G1 (Ψ) is closed under addition since it is a simple functional linear regression. 36 ∀f, f˜ ∈ Gd (Ψ), I need to prove that f + f˜ ∈ Gd (Ψ). For f˜, the corresponding functions αi (ti , ti−1 ) and βi (ti ) are denoted by α̃i (ti , ti−1 ) and β̃i (ti ). The function which equals to f + f˜ can be constructed with the following coefficient _ _ functions including α i and β i For i = 1, _ α1 (t1 , t0 ) := α1 (2t1 , t0 ) + α̃1 (2t1 − 1, t0 ), _ β1 (t1 ) := β1 (2t1 ) + β̃1 (2t1 − 1). For 1 < i < d, _ αi (ti , ti−1 ) := 2αi (2ti , 2ti−1 ) + 2α̃i (2ti − 1, 2ti−1 − 1), _ βi (ti ) := βi (2ti ) + β̃i (2ti − 1). For i = d, _ αd (td , td−1 ) := αi (td , 2td−1 ) + α̃d (td , 2td−1 − 1), _ βd (td ) := βd (2td ) + β̃d (2td − 1). (iii) Gd (Ψ) is closed under multiplication: ∀f, f˜ ∈ Gd (Ψ), yd−1 (td−1 ), ỹd−1 (td−1 ) are the corresponding object in the previous layer, i.e., Z f (x(t0 ), td ) = αd (td , td−1 )Ψ(yd−1 (td−1 ))dtd−1 + βd (td ) 37 Z f˜(x̃(t0 ), td ) = α̃d (td , td−1 )Ψ(ỹd−1 (td−1 ))dtd−1 + β̃d (td ). For f˜, the corresponding parameters Wi,j,k , bi,j and Ei,j are denoted by W̃i,j,k , b̃i,j and Ẽi,j . Since Gd (Ψ) is closed in addition, it satisfies to check each component of f · f˜ in the following: βd (td ) · β̃d (td ) Jd J˜d X X = bd,j χE (td ) · b̃d,j χẼ (td ) d,j d,j j=1 j=1 Jd J˜d X X = bd,j b̃d,j 0 χE (td ), d,j ∩Ẽd,j 0 j=1 j 0 =1 satisfies the form of βd (td ). Z β̃d (td ) · αd (td , td−1 )Ψ(yd−1 (td−1 ))dtd−1 Z = αd (td td−1 )β̃d (td )Ψ(yd−1 (td−1 ))dtd−1 Z X Jd Jd−1 J˜d X X = Wd,j,k χE (td−1 )χE (td ) · b̃d,j χẼ (td )Ψ(yd−1 (td−1 ))dtd−1 d−1,k d,j d,j j=1 k=1 j=1 Z Jd−1 Jd J˜d X XX = Wd,j,k b̃d,j 0 χE (td−1 )χE (td )Ψ(yd−1 (td−1 ))dtd−1 , d−1,j d,j ∩Ẽd,j 0 k=1 j=1 j 0 =1 R satisfies the form of αd (td , td−1 )Ψ(yd−1 (td−1 ))dtd−1 . R In the same manner, it is easy to show that βd (td ) · α̃d (td , td−1 )Ψ(ỹd−1 (td−1 ))dtd−1 R satisfies the form of α̃d (td , td−1 )Ψ(ỹd−1 (td−1 ))dtd−1 . 
38 By using the step function decomposition, I have, Jd−1 X yd−1 (td−1 ) = ck χE (td−1 ), d−1,k k=1 J˜d−1 X ỹd−1 (td−1 ) = c̃k χẼ (td−1 ), d−1,k k=1 where yd−1 (td−1 ) Z Jd−2 Jd−1 Jd−1 X X X = xd−2 (td−2 ) χE (td−1 )χE (td−2 )dtd−2 + bd−1,j χE (td−1 ) d−1,j d−2,k d−1,j k=1 j=1 j=1 Jd−1 Jd−2 Z X X = ( xd−2 (td−2 )χE (td−2 )dtd−2 + bd−1,j )χE (td−1 ). d−2,k d−1,j j=1 k=1 R Define |E| = χE (s)ds. By changing the parameter Ed−1,j to be independent with Ẽd−1,j without changing its length, the condition |Ed−1,j ||Ed−1,j 0 | = |Ed−1,j ∩Ed−1,j 0 | is satisfied. The transformed yd−1 (td−1 ) with the new Ed−1,j still satisfies the form of Gd−1 (Ψ). I used the transformed yd−1 (td−1 ) in the following proofs without altering its form. Z αd (td , td−1 )Ψ(x(td−1 ))dtd−1 Z X Jd Jd−1 Jd−1 X X = Wd,j,k χE (td−1 )χE (td ) · Ψ( ck χE (td−1 ))dtd−1 d−1,k d,j d−1,k j=1 k=1 k=1 Jd Jd−1 X X = Ψ(ck )Wd,j,k |Ed−1,k |χE (td ), d,j j=1 k=1 39 where Ψ is the cos function that satisfies Ψ(x)Ψ(y) = (Ψ(x + y) + Ψ(x − y))/2. Z Z αd (td , td−1 )Ψ(x(td−1 ))dtd−1 ] · α̃2 (td , td−1 )Ψ(x̃(td−1 ))dtd−1 Jd Jd−1 X X = Ψ(ck )Wd,j,k |Ed−1,k |χE (td )· d,j j=1 k=1 J˜d J˜d X X Ψ(c̃k0 )Wd,j 0 ,k0 |Ed−1,k0 |χE (td ) d,j 0 j 0 =1 k 0 =1 Jd Jd−1 J˜d J˜d−1 X X X X = Wd,j,k Wd,j 0 ,k0 |Ed−1,k ||Ed−1,k0 |Ψ(ck )Ψ(c̃k0 )χE (td ) d,j ∩Ẽd,j 0 j=1 k=1 j 0 =1 k 0 =1 Jd Jd−1 J˜d J˜d−1 X X X X 1 = Wd,j,k Wd,j 0 ,k0 |Ed−1,k ∩ Ed−1,k0 | 0 0 2 j=1 k=1 j =1 k =1 (Ψ(ck + c̃k0 ) + Ψ(ck − c̃k0 ))χE (td ) d,j ∩Ẽd,j 0 Jd Jd−1 J˜d J˜d−1 X X X X 1 = W W 0 0 |E ∩ Ed−1,k0 |Ψ(ck + c̃k0 )χE ∩Ẽ (td ) 0 0 2 d,j,k d,j ,k d−1,k d,j d,j 0 j=1 k=1 j =1 k =1 Jd Jd−1 J˜d J˜d−1 X X X X 1 + Wd,j,k Wd,j 0 ,k0 |Ed−1,k ∩ Ed−1,k0 |Ψ(ck − c̃k0 )χE ∩Ẽ (td ). 0 0 2 d,j d,j 0 j=1 k=1 j =1 k =1 Since x(td−1 ), x̃(td−1 ) ∈ Gd−1 (Ψ) and Gd−1 (Ψ) is closed in addition and scalar multi- plication, (x(td−1 ) + x̃(td−1 )), (x(td−1 ) − x̃(td−1 )) ∈ Gd−1 (Ψ). Z _ αd (td , td−1 )Ψ(x(td−1 ) + x̃(td−1 ))dtd−1 _ ˜ J d Jd−1 Jd−1 X X X _ = Ψ(ck + c̃k0 ) W d,j,k |Ed−1,k ∩ Ẽd−1,k0 |χ_ (td ). E d,j j=1 k=1 k 0 =1 _ _ By choosing E d,j := Ed,j ∩ Ẽd,j 0 , W d,j,K := 21 Wd,j,k Wd,j 0 ,k0 , the first term satisfies 40 R _ _ the form of α d (td , td−1 )Ψ( y d−1 (td−1 ))dtd−1 . The same condition applies to x(td−1 ) − x̃(td−1 ). Since Gd (Ψ) is closed under addition, the whole component belongs to Gd (Ψ). By using the Stone-Weierstrass theorem, if Gd (Ψ) separates points on K for any compact set K ⊆ H and vanishes at no point, I can have the following theorem. Theorem 3.3.2. Gd (Ψ) is uniformly dense on compacta in C(H) if G is the cosine function. Proof. It is easy to show that Gd (Ψ) vanishes at no point by using βd (td ) = 1 while keeping other functions as 0. If t̂d 6= t̃d , we can assume that t̂d < t̃d without loss of generality. The points can be separated by using βd (td ) := χ t̂ +t̃ (td ), while keeping other functions as 0. [ d d ,1] 2 It is sufficient to show that ∀x(t0 ), x̃(t0 ) ∈ C, ∃f ∈ Gd (Ψ) s.t. f (x(t0 ), td ) 6= f (x̃(t0 ), td ) if R1 0 |x(t0 ) − x̃(t0 )| > 0. Choose 1 α1 (t1 , t0 ) := (χx(t )>x̃(t ) − χx(t )x̃(t0 ) 0 0 x(t0 )x̃(t ) x(t0 ) − x̃(t0 )dt0 − x(t ) 1, choose 2 + 2e αi (ti , ti−1 ) := χ (t )χ (t ), e − 1 [0,1] i [0,1] i−1 e+1 βi (ti ) := χ (t ). e − 1 [0,1] i 41 Therefore f (x(t0 ), td ) = 0 and f (x̃(t0 ), td ) = −1, i.e., Gd (Ψ) separates points on compact function space C. The following lemma is quoted from the Lemma A.3 in [12]. Lemma 3.3.2. 
For every squashing function Ψ,  > 0, and M > 0, there is a function σ(x) Pq that satisfies σ(x) = j=1 bj Ψ(aj x + cj ); bj , aj , cj ∈ R, q ∈ N+ such that sup |σ(x) − cos(x)| < . x∈[−M,M ] The above lemma shows that the cosine function can be approximated by any squashing function with some linear transformations. Based on this lemma, the squashing function Ψ is not limited to the cosine function. Theorem 3.3.3. Gd (Ψ) is uniformly dense on compacta in C(H) if Ψ is a squashing func- tion. Proof. For each given f ∈ Gd (Ψ), a common bound B > 1 can be found for each operator Li (·). For any given compact set K ⊆ H, there exists M s.t. the values inside the cosine function are within [−M, M ]. I will compare the squashing function σ(·) defined in the previous lemma and the cosine function. In each step, there exists the difference between the inside functions, which we distinguish _ by ỹi (t) and y i (t). 42 For each δ > 0, choose  = δ(B − 1)/B d , _ |σ(ỹi (ti )) − cos( y i (ti ))| _ ≤|σ(ỹi (ti )) − cos(ỹi (ti ))| + | cos(ỹi (ti )) − cos( y i (ti ))| _ ≤δ(B − 1)/B d + |ỹi (ti )− y i (ti )| _ ≤δ(B − 1)/B d + B max |x̃i−1 (ti−1 )− x i−1 (ti−1 )|. 0≤ti−1 ≤1 _ Suppose that f˜ uses the squashing function σ and f uses the cosine function, we have _ |f˜(x0 (t0 ), td )− f (x0 (t0 ), td )| _ =|Ld (σ(ỹd−1 (td−1 ))) − Ld (cos( y d−1 (td−1 ))|) _ ≤B max |σ(ỹd−1 (td−1 )) − cos( y d−1 (td−1 ))| 0≤td−1 ≤1 _ ≤δ(B − 1)/B d−1 + B max |Ld−1 (σ(ỹd−2 (td−2 ))) − Ld−1 (cos( y d−2 (td−2 ))| 0≤td−2 ≤1 d−1 1/B i ≤ δ. X ≤δ(B − 1) i=1 It satisfies to verify that f˜ ∈ Gd (Ψ). Since aj L˜1 + cj is still an operator of L(·), it is sufficient to show that Pq L̃i+1 ( j=1 bj Ψ(L̃i,j (xi−1 (ti−1 )))) can be rewritten in the form Li+1 (Ψ(Li (xi−1 (ti−1 )))). Let X q αi (ti , ti−1 ) := α̃i (qti + (1 − j), ti−1 ), j=1 X q βi (ti ) := β̃i (qti + (1 − j)), j=1 43 X q αi+1 (ti+1 , ti ) := qbj α̃i+1 (ti+1 , qti + (1 − j)), j=1 βi (ti+1 ) := β̃i+1 (ti+1 ). we have f ∈ Gd (Ψ) that satisfies f = f˜, i.e., _ |f˜(x0 (t0 ), td )− f (x0 (t0 ), td )| ≤ δ. Gd (Ψ) is therefore uniformly dense on compacta in C(H). 3.3.3 Step Two I want to replace the step functions in Gd (Ψ) with the linear combinations of basis functions in Fd (Ψ). With the knowledge of real analysis, every measurable function is almost a con- tinuous function. By the definition of a basis function system, every continuous function can be approximated by linear combinations of basis functions. The conclusions are proceeded by these approximations. I quote the Lusin theorem below. The proof of theorem can be found in book [29]. Theorem 3.3.4. (Lusin Theorem.) Suppose f is measurable and finite valued on E with E of finite measure, for every  > 0, there exists a closed set F with F ⊆ E and m(E − F ) ≤  such that f |F is continuous. Denote G bd (Ψ) is the same as Gd (Ψ) except that the step functions in Gd (Ψ) are replaced by continuous functions. If G bd (Ψ) is ρ - dense in Gd (Ψ), I have the following theorem. 44 Theorem 3.3.5. G bd (Ψ) is uniformly dense on compacta in C(H) if Ψ is a Lipschitz con- tinuous squashing function. Proof. It is sufficient to prove that for every δ > 0, any compact set K ⊆ H and any f ∈ Gd (Ψ), ∃fb ∈ Gbd (Ψ), s.t. |fb(x0 (t0 ), td ) − f (x0 (t0 ), td )| < δ, for all (x0 (t0 ), td ) ∈ K. Since K is compact, there exists E > 1 and |x0 (t0 )| < E. A common bound B > 1 can be found for each operator Li (·) in function f . By the Lusin Theorem, ∀ > 0, there exists Fi, and Fi, 0 s.t. 
α bi (ti , ti−1 )χFi, = αi (ti , ti−1 )χFi, , βbi (ti )χF 0 = βi (ti )χF 0 . i, i, Find a common bound D for αi (ti , ti−1 ) and βi (ti ). The squashing function Ψ is Lipschitz continuous, therefore there exists L > 1 s.t. |Ψ(x) − Ψ(y)| ≤ L|x − y|. And the difference between each functional linear transformation can also be bounded: |Li+1 (xi (ti )) − L bi+1 (b xi (ti ))| ≤|Li+1 (xi (ti )) − Li+1 (b xi (ti ))| + |Li+1 (b xi (ti )) − L bi+1 (b xi (ti ))| Z 1 ≤B |xi (ti ) − xbi (ti )|dti + 4D(E + 1) 0 ≤B max |xi (ti ) − x bi (ti )| + 4D(E + 1), 0≤t ≤1 i 45 Combined the two inequalities and recursive representation of the two functions, |f (x0 (t0 ), td ) − fb(x0 (t0 ), td )| =|Ld (xd−1 (td−1 )) − L bd (bxd−1 (td−1 ))|) ≤4D(E + 1) + BL max |Ld−1 (xd−2 (td−2 )) − L bd−1 ((bxd−2 (td−2 ))| 0≤td−1 ≤1 d−1 (BL)i ≤ 4D(BL)d /(BL − 1). X ≤4D(E + 1) i=0 |fb(x(t)) − f (x(t))| < δ is satisfied if  < δ(BL − 1)/4D(E + 1)(BL)d . Following the same strategy of the last proof, I can show that Fd (Ψ) is ρ - dense in G bd (Ψ) and therefore Fd (Ψ) is a universal approximator. Theorem 3.3.6. Fd (Ψ) is uniformly dense on compacta in C(H) if Ψ is Lipschitz continuous squashing function. Proof. It is sufficient to show that for every δ > 0, any compact set K ⊆ H and any f ∈ Gd (Ψ), fb ∈ G bd (Ψ), ∃f ∈ Fd (Ψ), s.t. |f (x0 (t0 ), td ) − fb(x0 (t0 ), td )| < δ, for all (x0 (t0 ), td ) ∈ K. Since K is compact, there exists E > 1 and |x0 (t0 )| < E. Based on the property of basis function, ∀ > 0, we have αi (ti , ti−1 ) − α bi (ti , ti−1 ) ≤ , βi (ti ) − βbi (ti ) ≤ . 46 A common bound B > 1 can be found for each operator Li (·) in function f . |Mi+1 (xi (ti )) − L bi+1 (b xi (ti ))| ≤|Mi+1 (xi (ti )) − L bi+1 (xi (ti ))| + |L bi+1 (xi (ti )) − L bi+1 (b xi (ti ))| Z ≤(1 + E) + B |xi (ti ) − x bi (ti )|dti ≤(1 + E) + B max |xi (ti ) − x bi (ti )|. 0≤t ≤1 i Suppose that the squashing function Ψ is L-Lipschitz continuous, |f (x0 (t0 ), td ) − fb(x0 (t0 ), td )| =|Md (xd−1 (td−1 )) − L bd (bxd−1 (td−1 ))|) ≤(1 + E) + LB max |Md−1 (xd−2 (td−2 )) − L bd−1 ((b xd−2 (td−2 ))| 0≤td−1 ≤1 d−1 (LB)i ≤ (1 + E)(LB)d /(LB − 1). X ≤(1 + E) i=0 |fe(x(t)) − fb((x(t)))| < δ is satisfied if  < δ(LB − 1)/(1 + E)(LB)d . 47 3.4 Covering Number 3.4.1 Smooth Functions I introduce a class of functions on a bounded set χd in Rd quoted from [31]. A differential operator Dk for any vector k = (k1 , · · · , kd ) is defined as: ∂ k. Dk = k , k ∂x11 · · · ∂xdd Pd where k. = i=1 ki . Then, for a function f : χd → R, |Dk f (x) − Dk f (y)| kf ka = max sup |Dk f (x)| + max sup , k.≤a x k.=a x,y kx − yka−a where the suprema taken over all x, y in the interior of χd with x 6= y. a measures how smooth the function class is. α is the greatest integer smaller than a. a,d Let CM be the set of all continuous functions f : χd → R with kf ka ≤ M . To obtain the covering number of the function space, I quote the theorem 2.7.1 from [31] Theorem 3.4.1. Let χd be a bounded, convex subset of Rd with nonempty interior. There exists a constant K depending only on a and d such that a,d 1 log N (, C1 (χd ), k · k∞ ) ≤ Kλ(χ1d )( )d/a ,  for every  > 0, where λ(χ1d ) is the Lebesgue measure of the set {x : kx − χd k < 1}. It is easy to show that the value of λ(χ1d ) is 3 and 5 + π when χd is [0, 1] and [0, 1] × [0, 1]. 48 Since the two cases are the only cases that are considered in the proof, I have 1 log N (, C1a (χd ), k · k∞ ) ≤ K( )d/a ,  by replacing K with 9K. Since CMa (χ ) = {f |f /M ∈ C a (χ )}, we have d 1 d N (M , CM a (χ ), k · k ) = N (, C a (χ ), k · k ). 
d ∞ 1 d ∞ Finally, the covering number of function space N (M , CM a (χ ), k · k ) satisfies d ∞ a (χ ), k · k ) ≤ K( M d/a log N (, CM d ∞ ) .  To calculate the covering number of functions space within functional neural networks space Fd (Ψ), I have to define the distance of the space in the following: d(f, f˜) = max{sup |αi − α̃i |, sup |βi − β̃i |; i = 1, · · · , d}, where f, f˜ ∈ Fd (Ψ) and their corresponding coefficient functions are αi , βi ) and α̃i , β̃i ). 3.4.2 Deductions Since the domain H of our model is a compact set, the input function x0 (t0 ) should have a common boundary, which is denoted by E. The term Xi is used to denote the set consisting of all possible xi (ti ). I use the term Si to denote the set 49 Z 1  a,2 a/2,1 α(ti , ti−1 )xi (ti−1 )dti−1 + βi (ti )|α(ti , ti−1 ) ∈ CM , βi (ti ) ∈ CM . 0 Lemma 3.4.1. M 2/a log N∞ (M  + (Ei + 1)δ, Si ) ≤ log N∞ (, Xi−1 ) + 2K( ) , δ where the functions in Xi−1 is bounded by Ei . a,2 a/2,1 Proof. For each xi−1 (ti ) ∈ Xi−1 , αi (ti , ti−1 ) ∈ CM , βi (ti ) ∈ CM , we can find their representatives x̃i−1 (ti−1 ), α̃i (ti , ti−1 ), β̃i (ti ) from their corresponding -cover, δ-cover and δ-cover. Z 1 Z 1 αi (ti , ti−1 )xi−1 (ti−1 )dti−1 + βi (ti ) − α̃i (ti , ti−1 )x̃i−1 (ti−1 )dti−1 − β̃i (ti ) 0 0 Z 1 = (αi (ti , ti−1 ) − α̃i (ti , ti−1 ))xi−1 (ti−1 )dti−1 + 0 Z 1 α̃i (ti , ti−1 )(xi−1 (ti−1 ) − x̃i−1 (ti−1 ))dti−1 + βi (ti ) − β̃i (ti ) 0 ≤Ei δ + M  + δ. Therefore, the Cartesian product of the two covering is the (M  + (Ei + 1)δ)-covering of Si , i.e., log N∞ (M  + (Ei + 1)δ, Si ) a,2 a/2,1 ≤ log N∞ (, Xi−1 ) + log N∞ (δ, CM ) + log N∞ (δ, CM ) M 2/a ≤ log N∞ (, Xi−1 ) + 2K( ) . δ 50 E1 = E in the first layer. When the layer i > 1, the value is within the range of a squashing function, therefore Ei = 1. In the first layer, since X0 (t0 ) only consists of identity function without any parameters, (E+1)M 2/a log N∞ (, X0 ) = 0, log N∞ (δ, S1 ) ≤ 2K( δ ) . To illustrate the effect of nonlinear activation, I quote lemma 14.13 in [1]. Lemma 3.4.2. Suppose F is a class of real-valued functions in the vector space from domain X to Rp and σ : R → R is a element-wise function satisfying Lipschitz condition: |σ(x) − σ(y)| ≤ L|x − y|. Let σ(F ) denote the class {σ ◦ f : f ∈ F }, then N∞ (, σ ◦ F, m) ≤ N∞ (/L, F, m). Suppose the squashing function Ψ is L-Lipschitz, I have log N∞ (, Xi ) ≤ log N∞ (/L, Si ). Based on the results of the two lemmas, if i > 1, therefore M 2/a log N∞ (M L + 2δ, Si ) ≤ log N∞ (, Si−1 ) + 2K( ) . δ (E+1)M 2/a We have known that log N∞ (δ, S1 ) = 2K( δ ) . Suppose that log N∞ (δ, Si−1 ) = M bi−1 2/a ci−1 K( δ ) , then M bi−1 2/a M log N∞ (M L + 2δ, Si ) ≤ ci−1 K( ) + 2K( )2/a .  δ 51 Take  = δbi−1 , M 2/a log N∞ ((M Lbi−1 + 2)δ, Si ) ≤ (ci−1 + 2)K( ) . δ Furthermore, by scaling the distance, we have M (M Lbi−1 + 2) 2/a log N∞ (δ, Si ) ≤ (ci−1 + 2)K( ) . δ i.e. ci = ci−1 + 2 and bi = M Lbi−1 + 2. Considering that c1 = 2 and b1 = E + 1, we can have that cd = 2d and bd = (E + M M L+1 )(M L)d−1 − 2 L−1 M L−1 . Therefore it can be concluded that M (E + M L+1 M L−1 )(M L) d−1 − 2 M L−1 2/a log N∞ (δ, Sd ) ≤ 2dK( ) . δ 3.5 Consistency When it comes to consistency, the error can be divided into the approximation error and the estimation error, which is expressed by Lemma 10.1 in the book on non-parametric regression [11]. Lemma 3.5.1. Let Fn = Fn (Dn ) be a class of functions f : Rd → R depending on the data Dn = {(X1 , Y1 ), . . . , (Xn , Yn )}. If mn satisfies n 1X mn (·) = arg min |f (Xj ) − Yj |2 . 
f ∈Fn n j=1 52 then Z |mn (x) − m(x)|2 µ(dx) (3.3) n 1X ≤2 sup |f (Xj ) − Yj |2 − E{f (X) − Y )2 } (3.4) f ∈Fn n j=1 Z + inf |f (x) − m(x)|2 µ(dx). (3.5) f ∈Fn As is seen in the decomposition, the estimation error (3.4) refers to the the supremum difference between the error of the trained data and the error expectation over all function space, while the approximation error (3.5) refers to the infimum difference between the true function and the model function space. As model the function space becomes larger, the approximation error is decreasing while the approximation error is increasing. To make the total error as small as possible, I have to make a trade off in the selection of function space Fn . I quote theorem 10.2 in [11] to clarify what is required in the consistency problem. Theorem 3.5.1. Let Fn = Fn (Dn ) be a class of functions f : Rd → R and the estimator mn satisfies n n 1X 1X m̃n ∈ Fn and |m̃n (Xj ) − Yj |2 = min |f (Xj ) − Yj |2 , n f ∈Fn n j=1 j=1 mn (x) = Tβn m̃n (x). (a) If lim βn = ∞, n→∞ 53 Z lim inf |f (x) − m(x)|2 µ(dx) = 0 a.s., n→∞ f ∈Fn ,kf k∞ ≤βn n 1X lim inf |f (Xj ) − Yj,L |2 − E{(f (X) − YL )2 } = 0, n→∞ f ∈T Fn n βn j=1 a.s. for all L > 0, then Z lim |mn (x) − m(x)|2 µ(dx) = 0 a.s.. n→∞ (b) If lim βn = ∞, n→∞ ( Z ) lim E inf |f (x) − m(x)|2 µ(dx) = 0, n→∞ f ∈Fn ,kf k∞ ≤βn   n  1  |f (Xj ) − Yj,L |2 − E{(f (X) − YL )2 } = 0, X lim E inf n→∞ f ∈T Fn n βn j=1  for all L > 0, then Z  lim E |mn (x) − m(x)|2 µ(dx) = 0. n→∞ I quote theorem 9.1 in [11] to bound the estimation error. Theorem 3.5.2. Let G be a set of functions g : Rd → [0, B]. For any n and  > 0, n ( ) 1X P sup | g(Zi ) − E{g(Z)}| >  ≤ 8E{N1 (/8, G, Z1n )} exp(−n2 /128B 2 ). g∈G n i=1 The domain of functions in above theorems are defines on Rd for neural networks model. I checked the proofs of them and make sure that their validity has no issues with the explicit form of its domain. Therefore I can apply these theorems in our FNN model whose domain 54 is defined in H. The consistency of FNN model Fd (Ψ) is given below. Theorem 3.5.3. Let Fn be a class of functions satisfying n o a,2 a,1 Fn = f ∈ Fd (Ψ)|αi (ti , ti−1 ) ∈ CM , βi (ti ) ∈ CM , Ji ≤ kn , n n where Ψ is L-Lipschitz continuous squashing function. Let n 1X mn = arg min |f (Xj ) − Yj |2 . f ∈Fn n j=1 If Mn and kn satisfy 4+ 2d+2 a Mn → 0, βn → ∞, Mn → ∞, n then K (mn (x) − m(x))2 µ(dx) → 0 almost surely with E|Y |2 < ∞ and any compact set R K ⊆ H. Proof. For the approximation error, universal approximation theorem implies limn→∞ Mn → ∞ and limn→∞ kn → ∞. For the estimation error, theorem 3.5.1 implies that I can assume |Y | ≤ L almost surely, for some L, and then it is required to show that n 1X sup E|f (X) − Y |2 − |f (Xj ) − Yj |2 → 0 a.s.. f ∈Fn n j=1 Let H = L2 × [0, 1] and f : H → R. Define Z = (X, Y ), Z1 = (X1 , Y1 ), . . . , Zn = (Xn , Yn ), 55 where Xi ∈ H, Yi ∈ R and Hn = {h : H × R → R : ∃f ∈ Fn s.t. h(x, y) = |f (x(t), td ) − y|2 }. Assume Mn ≥ L so that functions in Hn satisfy 0 ≤ h(x, y) ≤ 2Mn2 + 2L2 ≤ 4Mn2 . Using the boundary of theorem 3.5.2, we have, for arbitrary  > 0, n ( ) 1X P sup | |f (Xi ) − Yi |2 − E{|f (X) − Y |2 }| >  f ∈Fn n i=1 n ( ) 1X =P sup | h(Zi ) − E{h(Z)}| >  h∈Hn n i=1 − n2  2 2 ≤8EN1 ( , Hn , Z1n )e 128(4Mn ) . 8 Next, I bound the covering number and get n 1X |h1 (Zi ) − h2 (Zi )| n i=1 n 1X = ||f1 (Xi ) − Yi |2 − |f2 (Xi ) − Yi |2 | n i=1 n 1 X = |f1 (Xi ) − f2 (Xi )||f1 (Xi ) − Yi + f2 (X2 ) − Yi | n i=1 n 1X ≤4Mn |f1 (Xi ) − f2 (Xi )|. n i=1 56 Thus,   N1 ( , Hn , Z1n ) ≤ N1 ( , Fn , X1n ). 
8 32Mn n ( ) 1X P sup | |f (Xi ) − Yi |2 − E{|f (X) − Y |2 }| >  f ∈Fn n i=1 − n2  2 2 ≤8N1 ( , Fn , X1n )e 128(4Mn ) 32Mn − n2  2 2 ≤8N∞ ( , Fn , X1n )e 128(4Mn ) 32Mn 32Mn2 (E + M n L+1 d−1 − 2 ! Mn L−1 )(Mn L) Mn L−1 2/a n2 ≤8 exp 2dK( ) −  128(4Mn2 )2 Mn L+1 2 d−1 − M 2L−1 2/a ! n1−δ 2 2dKMn4 32Mn (E + Mn L−1 )(Mn L) ≤8 exp −nδ ( − ( n ) ) , Mn4 2048 n  if δ > 0 exists. To satisfy the condition that ∞ n ( ) 1X |f (Xi ) − Yi |2 − E{|f (X) − Y |2 }| >  < ∞, X P sup | f ∈Fn n n=1 i=1 The following conditions are required: n1−δ lim → ∞, n→∞ Mn4 and Mn L+1 2 2dKMn4 32Mn (E + Mn L−1 )(Mn L) d−1 − M 2L−1 2/a ( n ) → 0. n  The latter condition can be simplified as 4+ 2d+2 a Mn →0 n 57 which implies the former one, therefore the former one is not necessary. In terms of consistency, approximation error and estimation error should be bounded together. The approximation error encourages the functional space to be larger, i.e. kn → ∞ and Mn → ∞. The estimation error controls the increasing rate of Mn . The above theorem 4+ 2d+2 a requires that the increasing rate of Mn is slower than n. 3.6 Discussion In this chapter, the consistency of our proposed FNN model is proved, which implies that the error between the FNN model and the true model can be as small as possible if the sample size is large enough. In the future, I will investigate other statistical properties, such as the convergence rate and asymptotic normality of FNN. When it comes to proof of the rate of convergence rate of FNN model, the idea is normally similar with that of consistency. I have to find the convergence rate of estimation error and approximation error and find a trade off. The convergence rate of estimation error is not hard to get given the covering number of the model, while that of approximation error is a mathematical problem, which is still under investigation. 58 Chapter 4 Transfer Learning for Genetic Studies In the previous chapter,a new statistical model FNN is proposed and its consistency is proved. In this chapter, I will introduce transfer learning and study whether transfer learning can be used to improve the performance of genetic analyses. 4.1 Introduction With the vast amounts of genetic data collected from biobank projects and from the Cau- casian population, an interesting scientific question is whether these resources can be used to enhance the genetic analysis in small-scale studies or in minority populations (e.g., African American population). With the increased amount of genetic data becoming publicly avail- able for genetic research, challenges remain in how to efficiently utilize these enriched re- sources in the studies of small-scale studies and in minority populations. A common as- sumption made by existing approaches is that two studies should be similar (e.g., the same population). However, this assumption could fail in reality. The study design and study population could be different between the two studies (e.g., Caucasian vs. African Ameri- can). Transfer learning does not require data from two studies drawn from exactly the same feature space or the same distribution. It learns possibly useful feature from mature studies and applies learned features based on focused problem. Therefore it holds great promise to use the enriched resources from large-scale studies for uncovering novel genetic variants in 59 small-scale studies or in minority populations [30]. Transfer learning has become increasingly popular in AI and is expected to become a key driver of machine learning success in the future. 
Various models have been tested, demonstrating the methodology and strength of transfer learning [18]. Transfer learning is widely used in image classification and natural language processing [13]; it transfers knowledge from a source study/population to a target study/population. While transfer learning has achieved success in areas such as natural language processing, it has rarely been used in genetic data analysis. In this chapter, I integrate the idea of transfer learning into neural networks and use the knowledge learned from the large-scale UK Biobank dataset to facilitate genetic analysis in small-scale studies or in minority populations. By integrating transfer learning into neural networks, I am able to transfer knowledge regarding complex genotype-phenotype relationships (e.g., gene-gene interactions) between two studies.

The remainder of the chapter is organized as follows. The details of our models and two candidate regularization methods are introduced in Section 4.2. Real data analyses are given in Section 4.3, which provides the comparison and selection between the two regularization methods and the performance of transfer learning. A summary and discussion are provided in Section 4.4.

4.2 Method

Transfer learning applies knowledge gained from one research problem to a different but related problem, and has been widely used in text and image recognition. The basic idea of transfer learning is illustrated in Figure 4.1. In this chapter, the model I use to illustrate transfer learning is the vanilla neural networks (NN) model. In this section, I briefly introduce the NN model, and then integrate the idea of transfer learning into NN (TL-NN) to transfer knowledge between two datasets. Finally, the technical details of TL-NN are provided.

Figure 4.1: Transfer Learning

To distinguish the different data types in Chapter 4, I use lower case letters, bold lower case letters, and upper case letters to denote scalars, vectors, and matrices, respectively. I also use lower case letters to denote output functions. In this section, the vanilla neural network is introduced first, the transfer learning model is then constructed based on the neural networks model, and technical details are discussed at the end.

4.2.1 Neural Networks

Suppose our research interest is to find a predictive function $f$ that models the response variable $y$ given the predictor variable $\boldsymbol{x} = (x_1, \ldots, x_q)$, where $q$ is the dimension of the input. The true model is written as
\[
y = f(\boldsymbol{x}) + \epsilon, \tag{4.1}
\]
where $\epsilon$ is the noise in this model. I use the fully connected NN model to demonstrate transfer learning:
\begin{align}
\boldsymbol{h}_1 &= \sigma(W_1 \boldsymbol{x} + \boldsymbol{b}_1), \tag{4.2}\\
\boldsymbol{h}_d &= \sigma(W_d \boldsymbol{h}_{d-1} + \boldsymbol{b}_d), \quad d = 2, \ldots, l-1, \tag{4.3}\\
\hat{y} &= \hat{f}(\boldsymbol{x}) = \boldsymbol{w}_l \boldsymbol{h}_{l-1}, \tag{4.4}
\end{align}
where $\boldsymbol{h}_d$ is the learned feature (the representation that will later be transferred from the source study) and has a dimension of $m_d$, $\sigma$ is a nonlinear activation function, such as the sigmoid function $\sigma(x) = 1/(1+e^{-x})$ used in this chapter, and $l$ is the number of layers of the NN model. $W_d$ and $\boldsymbol{b}_d$ are weights and biases, respectively. I denote all the parameters by $S = \{W_d, \boldsymbol{b}_d, \boldsymbol{w}_l \mid d = 1, \ldots, l-1\}$.

In NN, I can normalize the response variable $y$. Suppose that $\bar{y}$ and $\bar{v}$ are the estimated mean and standard deviation of $y$; the final model is then realized by
\[
\hat{y} = \hat{f}(\boldsymbol{x}) = \bar{v}\, \boldsymbol{w}_l \boldsymbol{h}_{l-1} + \bar{y}. \tag{4.5}
\]
NN can be applied to both regression and classification problems. For simplicity, I focus here on the regression problem, and use the L2 loss and mean squared error (MSE) to estimate parameters and evaluate the model's performance. The model can also be applied to classification problems by using different loss functions (e.g., cross entropy) and measurements (e.g., misclassification error). Based on the L2 loss, the parameters can be estimated as
\[
\hat{S} = \arg\min_{S} \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}(\boldsymbol{x}_i)\big)^2,
\]
where $n$ is the sample size, $\hat{S}$ is the set of all estimated parameters, and $y_i$ and $\boldsymbol{x}_i$ are the response and predictor variables of the $i$-th sample.
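To make the model concrete, here is a minimal sketch of the fully connected NN of equations (4.2)-(4.5) written in PyTorch. The sigmoid activation, the bias-free linear output layer, and the rescaling by $\bar{y}$ and $\bar{v}$ follow the description above; the use of PyTorch, the class name, and the 16-4 hidden architecture (introduced later in Section 4.2.3) are illustrative choices rather than the released implementation.

```python
import torch
import torch.nn as nn

class VanillaNN(nn.Module):
    """Fully connected NN of equations (4.2)-(4.5): sigmoid hidden layers,
    a bias-free linear output rescaled by the response mean and standard deviation."""

    def __init__(self, q, hidden=(16, 4), y_mean=0.0, y_sd=1.0):
        super().__init__()
        dims = (q,) + tuple(hidden)
        # W_d, b_d for the hidden layers (d = 1, ..., l-1)
        self.hidden = nn.ModuleList(
            [nn.Linear(dims[d], dims[d + 1]) for d in range(len(hidden))]
        )
        # w_l for the linear output layer (no bias, as in equation (4.4))
        self.out = nn.Linear(hidden[-1], 1, bias=False)
        self.y_mean, self.y_sd = y_mean, y_sd

    def forward(self, x):
        h = x
        for layer in self.hidden:
            h = torch.sigmoid(layer(h))          # h_d = sigma(W_d h_{d-1} + b_d)
        # Equation (4.5); with y_mean = 0 and y_sd = 1 this reduces to equation (4.4).
        return self.y_sd * self.out(h).squeeze(-1) + self.y_mean

# Example: q = 30 SNPs as input, with the 16-4 architecture used later in this chapter.
model = VanillaNN(q=30, hidden=(16, 4))
loss_fn = nn.MSELoss()                            # L2 loss / MSE
```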
Directly fitting NN to a large number of genetic covariates could lead to overfitting. To avoid overfitting, various regularization methods, such as penalty regularization, dropout, and early stopping, can be used. In this chapter, I impose a parameter norm penalty $p = p(S; \lambda)$ over our function space as a regularization method, where $\lambda$ is a hyperparameter controlling the solution space. In a typical NN, an L2 penalty is usually applied on the weight terms,
\[
p(S; \lambda) = \lambda \Big( \sum_{d=1}^{l-1} \|W_d\|_2^2 + \|\boldsymbol{w}_l\|_2^2 \Big).
\]
In genetic studies, genetic effects are usually smaller than the noise, and therefore heavy penalties are required to avoid overfitting. There are two regularization methods based on the parameter norm penalty: parameter regularization and constraint regularization. Their solutions are
\begin{align}
\hat{S} &= \arg\min_{S} \Big( \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2 + p(S; \lambda) \Big), \tag{4.6}\\
\hat{S} &= \arg\min_{S} \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2 \quad \text{s.t.} \quad p(S; \lambda) \leq 1. \tag{4.7}
\end{align}
The backpropagation method is typically used to obtain the solution. Since $p(S; \lambda)$ is a norm penalty, the larger $\lambda$ is, the smaller the parameter space is. Parameter regularization is a widely used regularization method to avoid overfitting: it adds a penalty term to the objective function, which encourages the weights of the NN to be small in each epoch of backpropagation. Nonetheless, large weights are still possible under parameter regularization. Constraint regularization can be used to avoid large weights by forcing the weights to be small, while the penalty term does not affect the backpropagation. In Section 4.3.2, I compare the performance of parameter regularization and constraint regularization; based on the results, I recommend constraint regularization for genetic studies. The selection of the hyperparameter $\lambda$ is also discussed in the next section.
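To make the distinction concrete, the following is a minimal sketch of the two schemes in equations (4.6) and (4.7), assuming the PyTorch model sketched above. Parameter regularization adds $p(S;\lambda)$ to the objective so that the penalty enters the gradients, whereas the constraint variant shown here simply rescales the weights after an update whenever $p(S;\lambda) > 1$; this projection step is one simple way to respect the constraint, while the realization actually used in this chapter (equation (4.15) below) switches the training objective to the penalty when the constraint is violated. All helper names are illustrative.

```python
import torch

def norm_penalty(model, lam):
    """p(S; lambda) = lambda * sum of squared L2 norms of the weight matrices."""
    return lam * sum((p ** 2).sum() for name, p in model.named_parameters()
                     if "weight" in name)

def parameter_regularized_loss(model, loss, lam):
    # Equation (4.6): the penalty enters the objective and hence the gradients.
    return loss + norm_penalty(model, lam)

def apply_constraint(model, lam):
    # Equation (4.7): after an update, project back into {S : p(S; lambda) <= 1}
    # by shrinking all weights with a common factor; gradients are unaffected.
    with torch.no_grad():
        penalty = norm_penalty(model, lam)
        if penalty > 1.0:
            scale = (1.0 / penalty) ** 0.5   # p is quadratic in the weights
            for name, p in model.named_parameters():
                if "weight" in name:
                    p.mul_(scale)
```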
4.2.2 Transfer Learning

Suppose we have source data (e.g., UK Biobank) in which we model the response variable $y'$ with the predictor variable $\boldsymbol{x}' = (x_1', \ldots, x_q')$, where $\boldsymbol{x}$ and $\boldsymbol{x}'$ share a similar data structure and similar effects. Usually, $\boldsymbol{x}$ and $\boldsymbol{x}'$ are the same properties measured on different subjects. We have the analogous model
\begin{align}
\boldsymbol{h}_1' &= \sigma(W_1' \boldsymbol{x}' + \boldsymbol{b}_1'), \tag{4.8}\\
\boldsymbol{h}_d' &= \sigma(W_d' \boldsymbol{h}_{d-1}' + \boldsymbol{b}_d'), \quad d = 2, \ldots, l-1, \tag{4.9}\\
\hat{y}' &= \hat{f}(\boldsymbol{x}') = \bar{v}'\, \boldsymbol{w}_l' \boldsymbol{h}_{l-1}' + \bar{y}', \tag{4.10}
\end{align}
where $\bar{y}'$ and $\bar{v}'$ are the estimated mean and standard deviation of $y'$. The parameter space of this model is denoted by $S'$, and the set of its solutions is denoted by $\hat{S}'$. Since the solutions are derived from the source study, it is reasonable to suppose that $\boldsymbol{h}_d'$ is a good representation of the features in $\boldsymbol{x}'$. The relation between $\boldsymbol{x}'$ and $\boldsymbol{h}_d'$ is realized through $\{\hat{W}_d', \hat{\boldsymbol{b}}_d' \mid d = 1, \ldots, l-1\}$, which is denoted by $\hat{S}_1'$; the other part, $\{\hat{\boldsymbol{w}}_l'\}$, is denoted by $\hat{S}_2'$. Furthermore, we use $f_1'$ and $f_2'$ to denote these relationships, i.e.,
\begin{align}
\boldsymbol{h}_d' &= f_1'(\boldsymbol{x}' \mid \hat{S}_1'), \tag{4.11}\\
\hat{y}' &= f_2'(\boldsymbol{h}_d' \mid \hat{S}_2'). \tag{4.12}
\end{align}
Similarly, the parameter set $\hat{S}$ from the focused problem can also be divided into $\hat{S}_1$ and $\hat{S}_2$, with the corresponding relationships denoted by $f_1$ and $f_2$. The penalty term $p(S; \lambda)$ can accordingly be written as $p(S_1, S_2; \lambda)$.

Based on the idea of transfer learning, I use $\hat{S}_1'$ as the solution of the focused problem rather than $\hat{S}_1$. The solution of $S_2$ is then derived from the dataset of the focused problem given $S_1 = \hat{S}_1'$. The final solution in transfer learning is
\begin{align}
\tilde{S}_1 &= \hat{S}_1' = \arg\min_{S_1, S_2} \frac{1}{n} \sum_{i=1}^{n} \big(y_i' - f'(\boldsymbol{x}_i')\big)^2, \quad \text{s.t. } p(S_1, S_2; \lambda) \leq 1, \tag{4.13}\\
\tilde{S}_2 &= \arg\min_{S_2} \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i \mid S_1 = \tilde{S}_1)\big)^2, \quad \text{s.t. } p(\tilde{S}_1, S_2; \lambda) \leq 1. \tag{4.14}
\end{align}
The difference between the neural networks and transfer learning methods is illustrated in Figure 4.2.

Figure 4.2: Transfer Learning in neural networks

4.2.3 Technical Issue

The optimization algorithm in this chapter is the Adam algorithm [16]. To satisfy the restriction condition $p(S_1, S_2; \lambda) \leq 1$, the back propagation procedure is realized as
\begin{align}
c(S) &=
\begin{cases}
\frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2, & \text{if } p(S_1, S_2; \lambda) \leq 1,\\[4pt]
p(S_1, S_2; \lambda_1, \lambda_2), & \text{if } p(S_1, S_2; \lambda) > 1,
\end{cases} \tag{4.15}\\
s &\leftarrow s - r_1 \frac{\partial c(S)}{\partial s}, \quad s \in S_1, \tag{4.16}\\
s &\leftarrow s - r_2 \frac{\partial c(S)}{\partial s}, \quad s \in S_2, \tag{4.17}
\end{align}
where $r_1$ and $r_2$ are the learning rates for $S_1$ and $S_2$, respectively. The Adam algorithm is an adaptive optimization method in which the learning rate is determined element-wise. While the default settings of the Adam algorithm work well in most cases, they are not suitable for genetic studies: due to the heavy penalty in the last layer, the solution of $S$ can be too small compared with the default learning rate of the Adam algorithm. Therefore, I set the learning rate based on $\lambda$ while keeping the other parameters at their default values. The selection of this parameter is described in detail in the next section. For transfer learning, I keep the same settings as those used in NN, i.e., $r_1 = 0$ while $r_2$ remains the same. The iteration process stops when the MSE has not decreased for 3000 epochs.

The fitted NN model has 2 hidden layers (i.e., $l = 3$), with 16 and 4 hidden units, respectively. For comparison purposes, I also added a baseline model, which is essentially the mean value of the response variable, $\bar{y}$. The measurement comparing the two models is therefore
\[
m(f) = 1 - \frac{\sum_{i=1}^{n} \big(y_i - f(\boldsymbol{x}_i)\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.
\]
The larger the value, the better the fitted model compares to the baseline model. To illustrate the prediction ability, I split the investigated dataset into a training set (80%) and a test set (20%), tune the parameters of our models on the training set, and compare their performance on the test set. To remove the random effect of splitting, I repeat the experiment 100 times with random splits of the sample. The results are summarized by boxplots, in which the left part shows the value of $m(f)$ on the training set and the right part shows that on the test set.
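Before turning to the data analyses, here is a minimal sketch of how the TL-NN fit of equations (4.13)-(4.17) can be realized with the model and the `norm_penalty` helper sketched earlier: the hidden-layer solution $\hat{S}_1'$ from the source study is copied into the target model and frozen (so that effectively $r_1 = 0$), only the output layer $S_2$ is updated with Adam, and the objective switches to the penalty whenever the constraint is violated, as in equation (4.15) but with a single $\lambda$ for simplicity. The function name, the fixed epoch budget, and the omission of the early-stopping rule are simplifications, not the released implementation.

```python
import torch

def transfer_and_fit(source_model, target_model, x, y, lam, lr, epochs=5000):
    """Sketch of the TL-NN fit: freeze S_1 at the source solution, update S_2 only."""
    # S_1 <- S_1' : copy the source solution and freeze the hidden layers (r_1 = 0).
    target_model.load_state_dict(source_model.state_dict())
    for layer in target_model.hidden:
        for p in layer.parameters():
            p.requires_grad = False

    # Only S_2 (the output layer) is updated with Adam, as in equation (4.14).
    optimizer = torch.optim.Adam(target_model.out.parameters(), lr=lr)
    mse = torch.nn.MSELoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        penalty = norm_penalty(target_model, lam)   # helper from the earlier sketch
        # Equation (4.15), single-lambda version: if the constraint is violated,
        # descend on the penalty; otherwise descend on the MSE.
        loss = penalty if penalty > 1.0 else mse(target_model(x), y)
        loss.backward()
        optimizer.step()
    return target_model
```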
4.3 Real Data Analysis

4.3.1 Data Description

Cigarette smoking is one of the leading causes of preventable disease, contributing to 5 million deaths worldwide each year [21]. During the last decade, a great deal of progress has been made in identifying genetic variants associated with smoking. Among those findings, the nAChR subunit genes (e.g., CHRNA5) have been identified and confirmed in several large-scale studies [19]. In this application, I use transfer learning to study the complex relationships of the nAChR subunit genes with nicotine dependence.

The datasets to be analyzed are the large-scale UK Biobank (UKB) dataset and the relatively small-scale dataset from the Study of Addiction: Genetics and Environment (SAGE). UKB is a population-based prospective cohort of nearly 500,000 individuals recruited in the United Kingdom who were aged 40-69 years [2]. UKB contains a wealth of data, including detailed clinical information, genome-wide genotype data, and whole-exome sequencing data. SAGE is one of the most comprehensive studies conducted to date aimed at discovering genetic contributions to substance use disorders. Samples from SAGE were selected from three studies: COGEND, COGA, and FSCD. Multiple phenotypes were measured in the SAGE studies. For the focus of our gene-based analysis, I used cigarettes per day (CPD), which is available in both SAGE and UKB, and only considered genes from two clusters, CHRNA5-CHRNA3-CHRNB4 and CHRNB3-CHRNA6. SAGE comprises 1445 female and 1272 male samples, among which 807 samples are African-American and 1910 samples are Caucasian. Prior to the analysis, I re-assessed the quality of the data (e.g., checking for successful genotype calls, missing rates, deviation from the Hardy-Weinberg equilibrium, and unexpected relationships). After a careful quality assessment, I first use the Irish sample to determine the hyperparameters.

Two transfer learning studies are investigated in this chapter: one is a cross-ethnicity study and the other is a cross-project study. In the first study, I investigate whether the genetic findings from the British population can be transferred to the Black and Irish populations. In the second study, I utilize the data from UK Biobank to help the genetic research in SAGE. To illustrate the effect of the transfer learning strategy, I split the investigated dataset into a training set (80%) and a test set (20%), fit the original NN model and the transfer learning model on the training set, and compare their performance on the test set. To remove the random effect of splitting, I repeat the experiment 100 times with random splits. A validation process is used to determine the value of $\lambda$: I split the training set into subtrain and validation sets with a ratio of 4:1, evaluate a range of possible values of $\lambda$, such as $\{1, 10, 100, 1000\}$, on the subtrain set, and then select the optimal $\lambda$ based on the validation set (a sketch of this validation loop is given at the end of Section 4.3.2.2).

4.3.2 Regularization

In this section, the two regularization methods, parameter regularization and constraint regularization, are compared under different possible values of the hyperparameter $\lambda$. The constraint regularization method, with a suitable hyperparameter, is then used in the following sections.

4.3.2.1 Constraint regularization

The effect of various hyperparameter values for constraint regularization is illustrated here. The candidate values of $\lambda$ are $\{10^{2}, 10^{1.5}, 10^{1}, 10^{0.5}, 10^{0}\}$. As seen from Figure 4.3, the optimal value of the penalty parameter $\lambda$ is between $10^{0.5}$ and $10^{0}$.

Figure 4.3: Constraint Regularization (boxplots of $m(f)$, in %, on the training and test sets for each gene under different values of $\log_{10}\lambda$)

4.3.2.2 Parameter Regularization

The possible values of $\lambda$ for parameter regularization are $\{10^{0}, 10^{-0.5}, 10^{-1}, 10^{-1.5}, 10^{-2}\}$. The corresponding learning rate parameter in the Adam method is $10^{-2.5} \cdot \lambda^{-0.5}$. The optimal value of the penalty parameter $\lambda$ is between $10^{-1.5}$ and $10^{-2}$ according to Figure 4.4.
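The validation loop used to choose $\lambda$ in these comparisons can be sketched as follows, assuming hypothetical helpers `fit` and `m_score` that train a regularized model and compute $m(f)$ on held-out data; the candidate grid shown matches the one listed for constraint regularization above, and all names are illustrative.

```python
import numpy as np

def select_lambda(x, y, candidates=(10**2, 10**1.5, 10**1, 10**0.5, 10**0), seed=0):
    """Grid search for the penalty hyperparameter on a 4:1 subtrain/validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_sub = int(0.8 * len(y))                      # 4 : 1 split of the training set
    sub, val = idx[:n_sub], idx[n_sub:]

    best_lam, best_score = None, -np.inf
    for lam in candidates:
        model = fit(x[sub], y[sub], lam=lam)       # hypothetical training routine
        score = m_score(model, x[val], y[val])     # m(f) on the validation set
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam
```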
4.3.2.3 Comparison

Compared with parameter regularization, constraint regularization allows a higher learning rate, which enables the NN to escape local minima and attain better performance, and its results are more stable across different values of $\lambda$ in the penalty constraint. Therefore, in the following analyses, I use the constraint regularization method with the selected parameter setting.

Figure 4.4: Parameter Regularization (boxplots of $m(f)$, in %, on the training and test sets for each gene under different values of $\log_{10}\lambda$)

4.3.3 Cross-ethnicity

The vast amount of genetic data collected from the Caucasian population provides a great resource for genetic research in other populations, especially minority populations. In the UK Biobank dataset, there are around 300,000 British people, 10,000 Irish people, and 10,000 Black people. In this project, I transfer the knowledge learned from the British population to the Irish and Black populations to enhance the genetic study of these populations. Applying the NN model and the transfer learning model, I obtain the results shown in Figure 4.5 for the Black population; the performance comparison for the Irish population is given in Figure 4.6. As seen in the figures, transfer learning helps improve the performance for CHRNA3, CHRNA5, and CHRNB4, while it has limited gain for CHRNA6 and CHRNB3. Zhang et al. [38] found genetic heterogeneity of CHRNA6 and CHRNB3 in studies of nicotine dependence, while the heterogeneity was not significant in the other three genes, which is consistent with our finding.

Figure 4.5: Transfer the knowledge from the British population to the Black population

Figure 4.6: Transfer the knowledge from the British population to the Irish population

Figure 4.7: Transfer the knowledge from UK Biobank to SAGE

4.3.4 Cross-Projects

In this scenario, I transfer the knowledge from the UK Biobank data to the SAGE data. For this analysis, I focus on the Caucasian population in both datasets. Applying the NN model and the transfer learning model to the datasets, I obtain the results illustrated in Figure 4.7. As we can see from the figure, the NN model performs poorly due to the small sample size (i.e., around 1000) in the SAGE dataset, and the genetic effect can only explain less than 1 percent of the variation. Compared to NN, transfer learning improves the performance by using the information from UK Biobank. The improvement is observed for all genes, as we focus on the Caucasian population in this application and genetic heterogeneity therefore has little influence on the performance.

4.4 Discussion

In this chapter, I have shown the advantages of transfer learning for genetic research. Transfer learning can improve prediction performance across populations, but we need to be cautious about heterogeneity in the data or in the underlying genetic mechanisms (e.g., genetic heterogeneity across populations). While I illustrate the method by transferring knowledge to a small-scale study and to minority populations, transfer learning can potentially be used for other purposes, such as transferring genetic findings between different traits (e.g., smoking and alcohol drinking) and different species (e.g., rats and mice).

Chapter 5

Epilogue

The main goal of this dissertation is to investigate neural network based models and their applications to genetics.
In Chapter 2, I proposed a new neural networks model, FNN, which combines the functional linear model and nonlinear activations in a recursive way. By treating genetic and phenotype data as functional data, the underlying smoothness is utilized to improve the model's performance in simulations and real data analyses. The consistency of the proposed FNN model is proved in Chapter 3 by providing the conditions that bound the approximation error and the estimation error. In Chapter 4, I show how transfer learning can help transfer knowledge between different populations and studies.

In the FNN model discussed in Chapter 2, I mainly focused on continuous phenotypes. In the future, I will extend the FNN model to discrete variables and hence solve the classification problem; this can be done by changing the loss function and the corresponding link function. Another extension of the FNN model is the application to multivariate functions. In the real data analyses, the input function is the SNP sequence, indexed by position on the chromosome, and the output function is the phenotype over age; these functions have only one index. In the simulations, I have also shown that FNN outperformed existing methods when the output function is a bivariate function. Therefore, it will also be interesting to apply FNN to brain surface data.

The consistency of the proposed FNN model is proved in Chapter 3. While consistency has been proved for the vanilla NN model, the result is not directly applicable to the FNN model. The domain of the variables in the NN model is a Euclidean space, whose dimension is finite, whereas the domain of the FNN model contains the integrable functions $L^2([0,1])$, whose dimension is infinite. Therefore, I had to prove the universal approximation theorem and calculate the covering number from scratch. Chapter 3 proves that the estimated FNN model can approximate the underlying model to any desired accuracy if enough samples are provided, but it cannot quantify the speed of the approximation, i.e., the rate of convergence. The difficulty lies in a mathematical problem, namely the distance between the FNN model function space and the space of continuous functions defined on the domain of the FNN model. This is an interesting problem for future research.

Transfer learning is applied to genetic studies under the framework of the vanilla NN model in Chapter 4. An obvious extension is to change the framework to our proposed FNN model, which can be realized by replacing NN with FNN. The real data analyses in this dissertation are limited to human genetic data, where I transfer knowledge from the British population to the Irish and Black populations. The developed model can also be used for cross-species studies, where the understanding of genetic effects in human subjects can benefit from the knowledge learned from animals. Animal studies can be more powerful and can be designed in ways that are not allowed in human studies. I am cooperating with other groups on transfer learning between rats and mice. One of the challenges of cross-species studies is that SNPs from different species are not aligned, and effort is required to find the positions of the corresponding SNPs in other species. Currently, I am working to overcome these problems and to use FNN in cross-species studies.

APPENDICES

APPENDIX A

Solutions of FLM

First, the solution to equation (2.2), whose phenotype is a scalar value, is discussed. In practice, I do not have the true functions but have observations at discrete points $t_j$, $j = 1, \dots$
, p. Using the genetic data as an example, we observe genotypes Gij for i-th samples at locus j. By considering the effects of genotypes as a function, I obtain the following beta-smooth function([7]): Xp ŷi = θ0 + Zi θ + Gij β(tj ). j=1 By using pre-specified basis functions βk (t), k = 1, . . . , K (e.g., B-spline basis functions), PK I can further expand β(t) as β(t) = k=1 wk βk (t) and rewrite FLM as, Xp Xk ŷi = θ0 + Zi θ + Gij ( wk βk (tj )) j=1 k=1 Xk Xp = θ0 + Zi θ + wk ( Gij βk (tj )) k=1 j=1 Xk = θ0 + Zi θ + wk dik k=1 = θ0 + Zi θ + Di w, Pp where dik = j=1 Gij βk (tj ), w = (w1 , . . . , wK )T and Di = (di1 , . . . , diK )T . FLM (equation 78 (2.2)) is then transformed to a linear model: Ŷ = XΘ + Dw, where X = (1n , Z), Z = (Z1T , . . . , ZnT )T , Θ = (θ0 , θ T )T . 1n refers to a column vector with all 1s. The penalized loss function to solve the model is Z R(w, Θ) = (Ŷ − Y )T (Ŷ −Y)+λ (β 00 (t))2 dt = (Ŷ − Y )T (Ŷ − Y ) + λwT P w. By minimizing the loss function, I have the solution: ŵ = (DT (I − X(X T X)−1 X T )D + λP )−1 DT (I − X(X T X)−1 )Y Θ̂ = (X T X)−1 X T (Y − Dŵ). Now I would provide the solution to (2.3) whose phenotype is dispersed in a one dimen- sional space. To our best knowledge, no existing analytical method is available to solve the model described in equation (2.3) without imposing restrictions on the model. I adapt the commonly used restriction ([14]) sij = si0 j , ∀i, i0 . Without loss of generality, I use sj , εij to denote sij and εi (sij ). The model can then be rewritten as, Z Yij = Yi (sj ) = Zi θ + α0 (sj ) + α(sj , t)Gi (t)dt + εij . I further generalize the above model by assuming that the coefficients for covariates θ is 79 a function, Z Yij = Zi θ(sj ) + α0 (sj ) + α(sj , t)Gi (t)dt + εij , where θ(s) = (θ1 (s), . . . , θm (s)). By adding another basis functions αl (s), l = 1, . . . , L, I replace α0 (s) and α(s, t) with PL PK PL l=1 cl αl (s) and k=1 l=1 Wkl αl (s)βk (t), respectively. (d) For the bivariate function α(s, t), I use two basis function systems {βk(d) , d ∈ {0, 1}, kd ∈ {1, . . . , ld }}, l (1) X (1) α0 (s) = b (1) β (1) (s), k k k (1) =1 l(1) l(0) X X (0) (1) α(s, t) = W (0) (1) β (0) (t)β (1) (s). k k k k k (1) =1 k (0) =1 By using the package provided by [14], which is based on the REML method, I can obtain the solution of W , b, and θ. 80 APPENDIX B Functional Neural Networks The technical details of the functional neural networks model will be discussed in this ap- pendix. At first, the explicit form of the proposed model will be given step by step. In practice, integration can be approximated by numeric integration in the form of sum- mation. For simplicity, I assume that all functions is scaled in the interval of [0, 1]. If I choose evenly distributed points t1 , t2 , . . . , tm , a naive integration for X(t) is Z X m X(t)dt = X(tj )/m. j=1 I can rewrite functional notation in a matrix form, i.e. (d) (d) m(d) ,l(d) B (d) = [β (d) (tj )] , k j=1,k (d) =1 (d) (d) n,m(d) X (d) = [Xi (tj )]i=1,j=1 . 81 Therefore, the forward propagation algorithm can be written as the following: D(d) = X (d−1) B (d−1) /m(d−1) , C (1) = D(1) W (1) + 1n,1 b(1) + Zθ, (B.1) C (d) = D(d) W (d) + 1n,1 b(d) , d > 1, X (d) = σ(C (d) B (d)T ). As is seen from the above equations, FNN reduces to a traditional neural networks when each matrix of basis B is a diagonal matrix. In other words, NN can be seen as a special case of FNN. When it comes to the solution of the FNN model, the process of back propagation should be explained. 
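Before turning to back propagation, here is a minimal NumPy sketch of the forward pass in equation (B.1): integrals are replaced by averages over the observation grid, each layer maps basis coefficients through the corresponding basis matrices, and the covariate term $Z\theta$ of the first layer is omitted for brevity. The function name, the array layout, and the exact indexing of the basis matrices are assumptions for illustration rather than the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fnn_forward(X0, B, W, b):
    """Forward pass of equation (B.1), with the covariate term Z*theta omitted.

    X0   : (n, m[0]) input functions observed on a grid of m[0] points.
    B[d] : (m[d], l[d]) basis matrix of layer d evaluated on its own grid.
    W[d] : (l[d], l[d+1]) coefficient matrices; b[d] : (l[d+1],) bias coefficients.
    """
    X = X0
    for d in range(len(W)):
        D = X @ B[d] / B[d].shape[0]        # numeric integration: D = X B / m
        C = D @ W[d] + b[d]                 # C = D W + 1_n b
        if d < len(W) - 1:
            X = sigmoid(C @ B[d + 1].T)     # X^(d) = sigma(C^(d) B^(d)T)
        else:
            X = C @ B[d + 1].T              # last layer is linear: Y-hat = C B^T
    return X
```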
A gradient decent method is used to solve the weight functions. The recursive process stops when O(r) converges. The weights and biases are updated as follows: (d) O(r) = O(α(d) , α0 ), ∂O(r) W (r+1) = W (r) −γ , ∂W (r) ∂O(r) b(r+1) = b(r) − γ , ∂b(r) where γ is determined by using the adadelta method [37]. The deductions of derivatives involve matrices, functions, and trace. Two lemmas are given below to simplify the calculation of derivatives. Lemma B.0.1. For any function σ : R → R applied to matrix elementwisely, F : Rm×n → R that maps matrix to scalar. X, A, B are matrices with proper dimensions and A = σ(B). If ∂F(X) ∂F(X) ∂A = D, then ∂B = D ◦ σ 0 (B). 82 Proof. F(X) F(X) Aij = = Dij σ 0 (Bij ) = (D ◦ σ 0 (B))ij . Bij σ(Aij ) Bij Lemma B.0.2. For any function F : Rm×n → R that maps matrix to scalar. A, C are ∂F(X) ∂F(X) square matrices and Y = ABC. If ∂Y = D , then ∂B = AT DC T . Proof. ∂F(X) X ∂F(X) ∂(ABC)k,l = ( )k,l ∂Bij ∂ABC ∂Bij k,l X = Dkl Aki Cjl k,l ATik Dkl CljT X = k,l = (AT DC T )ij . ∂R(W,b) Let M (d) = . By using B.1 and lemma B, I have ∂C (d) ∂R(W, b) = 11,n M (d) , ∂b(d) (B.2) ∂R(W, b) = (X (d−1) B (d−1) )T M (d) /m(d−1) . ∂W (d) 83 For the last layer, σ (D) is a linear function, i.e. Ŷ = C (D) B (D)T . I then have − yij )2 P ∂R(W, b) ∂ i,j (ŷij = == 2(Ŷ − Y ), ∂ Ŷ ∂ Ŷ M (D) = 2(Ŷ − Y )B (D) . (B.3) For other layers, by using B.1, I have C (d+1) = (X (d) B (d) /m(d) )W (d+1) + 1n,1 b(d+1) = σ (d) (C (d) B (d)T )B (d) W (d+1) /m(d) + 1n,1 c(d+1) . Based on lemma B.0.1 and B, I have ∂R(W, b) = M (d+1) W (d+1)T B (d)T /m(d) , ∂σ (d) (C (d) B (d)T ) ∂R(W, b) = M (d+1) W (d+1)T B (d)T ◦ σ 0 (C (d) B (d)T )/m(d) , ∂(C (d) B (d)T ) M (d) = [(M (d+1) W (d+1)T B (d)T ) ◦ σ 0 (C (d) B (d)T )]B (d) /m(d) . (B.4) The formula equation (B.2), equation (B.3) and equation (B.4) are then used for calcu- lating back propagation on cost function R(W, b). As for the penalty term J(W, b), I have equation (B.5), which is illustrated below, Z (d) (d) (d) l(d) ,l(d) P0 = [ βi (t(d) )βj (t(d) )dt(d) ]i=1,j=1 , (d)00 (d)00 Z (d) l(d) ,l(d) P2 = [ βi (t(d) )βj (t(d) )dt(d) ]i=1,j=1 84  !2 !2   2 (d) ∂ 2 α(d) ∂ 2 α(d) ∂ 2 α0 Z Z Z J (d) =  +  dt(d) dt(d−1) +   dt(d) ∂t(d)2 ∂t(d−1)2 ∂t (d)2 (d−1) (d) (d) (d)0 0 (d−1) (d)  0   (d) (d) =tr P0 W P2 W + P0 W (d) P2 W + tr b(d) P2 b(d) Therefore, the final derivatives for back propagation are: ∂J (d) (d−1) (d) (d−1) (d) =2P0 W (d) P2 + 2P2 W (d) P0 , ∂W (d) (B.5) ∂J (d) (d) =2b(d) P2 . ∂b(d) 85 BIBLIOGRAPHY 86 BIBLIOGRAPHY [1] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge university press, 2009. [2] Clare Bycroft, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T Elliott, Kevin Sharp, Allan Motyer, Damjan Vukcevic, Olivier Delaneau, Jared O’Connell, et al. The uk biobank resource with deep phenotyping and genomic data. Nature, 562(7726):203– 209, 2018. [3] 1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061, 2010. [4] Heather J Cordell. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics, 10(6):392–404, 2009. [5] Murat Dikmen and Catherine M Burns. Autonomous driving in the real world: Ex- periences with tesla autopilot and summon. In Proceedings of the 8th international conference on automotive user interfaces and interactive vehicular applications, pages 225–228, 2016. [6] Gökcen Eraslan, Žiga Avsec, Julien Gagneur, and Fabian J Theis. 
Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20(7):389– 403, 2019. [7] Ruzong Fan, Yifan Wang, James L Mills, Alexander F Wilson, Joan E Bailey-Wilson, and Momiao Xiong. Functional linear models for association analysis of quantitative traits. Genetic epidemiology, 37(7):726–742, 2013. [8] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018. [9] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016. [10] Sara Goodwin, John D McPherson, and W Richard McCombie. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333, 2016. [11] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006. 87 [12] Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward net- works are universal approximators. Neural networks, 2(5):359–366, 1989. [13] Mahbub Hussain, Jordan J Bird, and Diego R Faria. A study on cnn transfer learning for image classification. In UK Workshop on Computational Intelligence, pages 191–202. Springer, 2018. [14] Andrada E Ivanescu, Ana-Maria Staicu, Fabian Scheipl, and Sonja Greven. Penalized function-on-function regression. Computational Statistics, 30(2):539–568, 2015. [15] Adam Kiezun, Kiran Garimella, Ron Do, Nathan O Stitziel, Benjamin M Neale, Paul J McLaren, Namrata Gupta, Pamela Sklar, Patrick F Sullivan, Jennifer L Moran, et al. Exome sequencing and the genetic basis of complex traits. Nature genetics, 44(6):623, 2012. [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [17] Tarun Lalwani, Shashank Bhalotia, Ashish Pal, Vasundhara Rathod, and Shreya Bisen. Implementation of a chatbot system using ai and nlp. International Journal of In- novative Research in Computer Science & Technology (IJIRCST) Volume-6, Issue-3, 2018. [18] Torrey Lisa and S Jude. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends, pages 242–264. IGI Global, 2010. [19] Jason Z Liu, Federica Tozzi, Dawn M Waterworth, Sreekumar G Pillai, Pierandrea Muglia, Lefkos Middleton, Wade Berrettini, Christopher W Knouff, Xin Yuan, Gérard Waeber, et al. Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nature genetics, 42(5):436–440, 2010. [20] Teri A Manolio, Francis S Collins, Nancy J Cox, David B Goldstein, Lucia A Hin- dorff, David J Hunter, Mark I McCarthy, Erin M Ramos, Lon R Cardon, Aravinda Chakravarti, et al. Finding the missing heritability of complex diseases. Nature, 461(7265):747–753, 2009. [21] Colin D Mathers and Dejan Loncar. Projections of global mortality and burden of disease from 2002 to 2030. PLoS medicine, 3(11):e442, 2006. [22] Melinda C Mills and Charles Rahal. A scientometric review of genome-wide association studies. Communications biology, 2(1):1–11, 2019. [23] Kalyanapuram Rangachari Parthasarathy. Probability measures on metric spaces, vol- ume 352. American Mathematical Soc., 2005. 88 [24] James O Ramsay and CJ Dalzell. Some tools for functional data analysis. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):539–561, 1991. [25] JO Ramsay. 
When the data are functions. Psychometrika, 47(4):379–396, 1982. [26] JO Ramsay and BW Silverman. Functional data analysis-methods and case studies, 2002. [27] Fabrice Rossi, Nicolas Delannay, Brieuc Conan-Guez, and Michel Verleysen. Represen- tation of functional data in neural networks. Neurocomputing, 64:183–210, 2005. [28] Andrew J Saykin, Li Shen, Tatiana M Foroud, Steven G Potkin, Shanker Swaminathan, Sungeun Kim, Shannon L Risacher, Kwangsik Nho, Matthew J Huentelman, David W Craig, et al. Alzheimer’s disease neuroimaging initiative biomarkers as quantitative phenotypes: Genetics core aims, progress, and plans. Alzheimer’s & dementia, 6(3):265– 273, 2010. [29] Elias M Stein and Rami Shakarchi. Real analysis: measure theory, integration, and Hilbert spaces. Princeton University Press, 2009. [30] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pages 242–264. IGI global, 2010. [31] Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer, 1996. [32] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017. [33] Olga A Vsevolozhskaya, Dmitri V Zaykin, David A Barondess, Xiaoren Tong, Sneha Jadhav, and Qing Lu. Uncovering local trends in genetic effects of multiple phenotypes via functional linear models. Genetic epidemiology, 40(3):210–221, 2016. [34] Olga A Vsevolozhskaya, Dmitri V Zaykin, Mark C Greenwood, Changshuai Wei, and Qing Lu. Functional analysis of variance for association studies. PLoS One, 9(9):e105074, 2014. [35] Fei-Yue Wang, Jun Jason Zhang, Xinhu Zheng, Xiao Wang, Yong Yuan, Xiaoxiao Dai, Jie Zhang, and Liuqing Yang. Where does alphago go: From church-turing thesis to alphago thesis and beyond. IEEE/CAA Journal of Automatica Sinica, 3(2):113–120, 2016. 89 [36] Paul Yushkevich, Stephen M Pizer, Sarang Joshi, and James Stephen Marron. Intu- itive, localized analysis of shape variability. In Biennial International Conference on Information Processing in Medical Imaging, pages 402–408. Springer, 2001. [37] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012. [38] Xuefen Zhang, Tongtong Lan, Tong Wang, Wei Xue, Xiaoran Tong, Tengfei Ma, Guifen Liu, and Qing Lu. Considering genetic heterogeneity in the association analysis finds genes associated with nicotine dependence. Frontiers in genetics, 10:448, 2019. [39] James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani, and Amalio Telenti. A primer on deep learning in genomics. Nature genetics, 51(1):12–18, 2019. 90