A NEURAL NETWORKS BASED METHOD WITH GENETIC DATA ANALYSIS OF COMPLEX DISEASES By Jinghang Lin A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics — Doctor of Philosophy 2021 ABSTRACT A NEURAL NETWORKS BASED METHOD WITH GENETIC DATA ANALYSIS OF COMPLEX DISEASES By Jinghang Lin The genetic etiologies of common diseases are highly complex and heterogeneous. Classic statistical methods, such as linear regression, have successfully identified numerous genetic variants associated with complex diseases. Nonetheless, for most complex diseases, the identified variants only account for a small proportion of heritability. Challenges remain to discover additional variants contributing to complex diseases. In this dissertation, we developed an expectile neural network (ENN) method and applied the method to genetic data analysis. ENN provides a comprehensive view of relationships between genetic variants and disease phenotypes and can be used to discover genetic variants predisposing to sub- populations (e.g., high-risk groups). We integrate the idea of neural networks into ENN, making it capable of capturing non-linear and non-additive genetic effects (e.g., gene-gene interactions). Through simulations, we showed that the proposed method outperformed an existing expectile regression when there exist complex relationships between genetic variants and disease phenotypes. We also applied the proposed method to the genetic data from the Study of Addiction: Genetics and Environment(SAGE), investigating the relationships of candidate genes with smoking quantity. Neural networks have been widely used in ap- plications. However, few studies have been focused on the statistical properties of neural networks. We further investigate the Asymptotic properties of ENN (e.g., consistency). Simulations have been conducted to test the validity of the theory. I dedicate this dissertation to my parents, Xianghua Chen and Xilong Lin for their endless love and support. iii ACKNOWLEDGMENTS There are many people who helped me along the way on this journey. Without their help, I could not complete this dissertation. I want to take a moment to thank them. First of all, I would like to express my sincere gratitude to my advisors Dr. Qing Lu and Dr. Yuehua Cui for their invaluable advice, continuous support, and patience during my PhD study. Their immense knowledge and plentiful experience have encouraged me in all the time of my academic research. They lead me into the area of statistical genetics and train me to be an independent researcher. I would also like to extend my sincere thanks to my dissertation committee members, Dr. Hyokyoung (Grace) Hong and Dr. Haolei Weng. Their comments and suggestions are beneficial to my research. My special thanks to Dr. Guowei Wei for his help in my job search. I am deeply grateful to Dr. Xiaoxi Shen and Dr. Xiaoran Tong for their insight in theory and computational support. During my PhD study, I made a lot of friends. My special thanks to my friends: Steven Gagnon, Tengfei Ma, Peide Li, Zihuan Liu, Dr. Cheuk (Ken) Lee for their constant help in my life and study. I would like to express my sincere gratitude to group members in Dr. Qing Lu’s group: Shan Zhang, Chang Jiang, Yuan Zhou, Tingting Hou, Mingsheng Tang for creating a positive research atmosphere. My special thanks to my girlfriend Dr. Liping Sun for her love and accompany. 
Last but not least, I would like to express my sincere thanks to my parents for their support and endless love for me. I would never make this journey without them. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 A review of basic human genetics . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Statistical learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.1 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.2 Artificial intelligence in healthcare . . . . . . . . . . . . . . . . . . . . 7 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter 2 Expectile Neural Networks for Genetic Data Analysis of Com- plex Diseases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Expectile regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Expectile neural network . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.3 Theoretical result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.1 Simulation I - nonlinear relationship . . . . . . . . . . . . . . . . . . 20 2.4.2 Simulation II - interactions among SNPs . . . . . . . . . . . . . . . . 23 2.4.3 Simulation III - interactions between genes . . . . . . . . . . . . . . . 25 2.5 Real data applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.5.1 The relationship between candidate SNPs with smoking quantities . . 28 2.5.2 Gene-gene interactions between the CHRNA5-CHRNA3-CHRNB4 gene cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Chapter 3 Asymptotic Theory of Expectile Neural Networks . . . . . . . 36 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 Method of sieves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.5 Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.6.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.6.1.1 Simulation results of consistency with τ = 0.5 . . . . . . . . 56 3.6.1.2 Simulation results of consistency with τ = 0.75 . . . . . . . 58 3.6.2 Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 v 3.6.2.1 Simulation result of normality with τ = 0.5 . . . . . . . . . 61 3.6.2.2 Simulation result of normality with τ = 0.75 . . . . . . . . . 63 3.7 Summary and discussion . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 63 Chapter 4 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 66 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Appendix A Technical Details of Chapter 2 . . . . . . . . . . . . . . . . . . . . . 70 Appendix B Technical Details of Chapter 3 . . . . . . . . . . . . . . . . . . . . . . 76 Appendix C Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . 78 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 vi LIST OF TABLES Table 2.1: The accuracy performance of two models built by ENN and ER based on 149 candidate SNPs and 3 covariates . . . . . . . . . . . . . . . . . . . . . 28 Table 2.2: Evaluating a pairwise interaction between CHRNA5 and CHRNA3 by using ENN and ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Table 2.3: Evaluating a pairwise interaction between CHRNA5 and CHRNB4 by using ENN and ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Table 2.4: Evaluating a pairwise interaction between CHRNA3 and CHRNB4 by using ENN and ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Table C.1: Real data application result of CHRNA5 . . . . . . . . . . . . . . . . . . 80 Table C.2: Real data application result of CHRNA3 . . . . . . . . . . . . . . . . . . 80 Table C.3: Real data application result of CHRNB4 . . . . . . . . . . . . . . . . . . . 80 Table C.4: Real data application result of ADNI . . . . . . . . . . . . . . . . . . . . . 82 vii LIST OF FIGURES Figure 1.1: A graphical representation of Chromosome, DNA and gene. Credit to Genetic Alliance UK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Figure 1.2: Similarity between biological and artificial neural networks . . . . . . . . 5 Figure 2.1: Quantiles and expectiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Figure 2.2: A graphical representation of expectile neural network . . . . . . . . . . 15 Figure 2.3: Performance comparison between ENN and ER under various relation- ships between genotypes and phenotypes and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Figure 2.4: Performance comparison between ENN and ER for different types of in- teractions and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) . . . 24 Figure 2.5: An alternative architecture for gene-gene interaction analyses . . . . . . 25 Figure 2.6: Performance comparison between ENN with a fully connected architecture and ENN with a non-fully connected architecture for gene-gene interaction analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 2.7: A comprehesive view of the conditional distribution of smoking quantity for five expectile levels (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) . . . . . . . . . 29 Figure 2.8: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNA3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Figure 2.9: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNB4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Figure 2.10: The conditional distribution of CPD considering the interaction between CHRNB4 and CHRNA3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 
34 Figure 3.1: Comparison between the true function f0 and fitted functions under dif- ferent sample sizes, where f0 is a neural network with one single hidden layer and two hidden units τ = 0.5. . . . . . . . . . . . . . . . . . . . . . 57 Figure 3.2: Comparison between the true function f0 = x3 + 1 and fitted functions under different sample sizes with τ = 0.5. . . . . . . . . . . . . . . . . . . 57 viii Figure 3.3: Comparison between the true function f0 = sin(x) + 2exp((−16)x2 ) and fitted functions under different sample sizes with τ = 0.5. . . . . . . . . . 58 Figure 3.4: Comparison between the true function f0 and fitted functions under dif- ferent sample sizes, where f0 is a neural network with one single hidden layer and two hidden units with τ = 0.75. . . . . . . . . . . . . . . . . . 59 Figure 3.5: Comparison between the true function f0 = x3 + 1 and fitted functions under different sample sizes with τ = 0.75. . . . . . . . . . . . . . . . . . 59 Figure 3.6: Comparison between the true function f0 = sin(x) + 2exp((−16)x2 ) and fitted functions under different sample sizes with τ = 0.75. . . . . . . . . 60 Figure 3.7: Q-Q plot with different sample sizes, where the true function f0 is a neural network with one single hidden layer and two hidden units with τ = 0.5 . 61 Figure 3.8: Q-Q plot with different sample sizes, where the true function is f0 = x3 +1 with τ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Figure 3.9: Q-Q plot with different sample sizes, where the true function is f0 = sin(x) + 2exp((−16)x2 ) with τ = 0.5. . . . . . . . . . . . . . . . . . . . . 62 Figure 3.10: Q-Q plot with different sample sizes, where the true function f0 is a neural network with one single hidden layer and two hidden units with τ = 0.75. 63 Figure 3.11: Q-Q plot with different sample sizes, where the true function is f0 = x3 +1 with τ = 0.75. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Figure 3.12: Q-Q plot with different sample sizes, where the true function is f0 = sin(x) + 2exp((−16)x2 ) with τ = 0.75. . . . . . . . . . . . . . . . . . . . 64 ix Chapter 1 Introduction 1.1 Overview With the development of biotechnology, especially next-generation sequencing technologies (NGS), it is easy to sequence an entire human genome. New technologies arising from the Human Genome Project and HapMap Project have generated a surge of methodological development for unsolved problems in human genetics. To find genetic variations associated with a particular disease, a genome-wide association study (GWAS) that involves rapidly scanning markers across the complete sets of DNA, or genomes, can be adopted[1]. GWAS investigates the entire genome and identify SNPs and other variants in DNA associated with a disease, but they cannot infer which genes are causal. Once new genetic associations are identified, researchers can use the information to understand, treat and prevent the disease[2]. Successful GWAS has been conducted to identify genetic variations that contribute to the risk of type 2 diabetes, Parkinson’s disease, heart disorders, obesity, Crohn’s disease and prostate cancer, as well as genetic variations that influence response to anti-depressant medications[5; 6]. Such research lays the groundwork for personalized medicine. Based on prior knowledge of a gene’s biological function on the trait or disease, candidate genes are most often studied in risk prediction research[3]. 
In risk prediction research, we are interested in developing a new genetic risk prediction model to identify the high-risk 1 individuals for certain diseases. If we could predict high-risk individuals at the early stage, targeted screening and appropriate intervention methods can be used to reduce mortality and morbidity[4]. However, there are tremendous analytic and computational challenges when we implement a risk prediction model. Genetic data is high-dimensional. For example, there are millions of single nucleotide polymorphisms (SNP), and the signal-to-noise ratio of genetic data is quite low, which makes us hard to capture underlying genetic effects. Moreover, the study sample is massive (e.g., a million samples in the UK Biobank), which brings the computational issue. In this chapter, we will first review some basic knowledge of human genetics in section 1.2. In section 1.3, we will briefly introduce the neural network and its application in healthcare. We give the overall organization of this dissertation in section 1.4. 1.2 A review of basic human genetics In the human genome, the genetic material is stored on chromosomes in the nucleus of the cell. There are 23 pairs of chromosomes in the human genome: 22 pairs of them are autosomal and the 23rd pair is the sex chromosomes. For the sex chromosomes, males have one X and Y, while females have two non-identical copies of the X chromosome. Each chromosome is composed of long strands of deoxyribonucleic acid (DNA), which determines how proteins are manufactured in the human body. Genes are segments of DNA that code for specific proteins that function in one or more types of cells in the body. These proteins control how our body grows and works; they are also responsible for many of our characteristics, such as our eye color, blood type or height. Genes are the basic physical units of inheritance, which are passed from parents to offspring 2 and contain the information needed to specify traits. Most parts of DNA are the same in all people, but a small proportion of DNA (less than 1 percent of the total DNA) are different between people. These differences contribute to each person’s unique physical features. An allele is one of two or more versions of a gene. An individual inherits two alleles, one from each parent. Figure 1.1: A graphical representation of Chromosome, DNA and gene. Credit to Genetic Alliance UK SNPs are the most common type of genetic variation among people, which are typically coded as the number of minor frequent alleles (e.g., AA=2, Aa=1, aa=0). A trait is any gene- determined characteristic and is often determined by more than one gene. The genotypes for the traits are often not observable and should be inferred from linked markers. In statistical genetics, we intend to construct a statistical model that connects genotypes and phenotypes[7]. 3 1.3 Statistical learning We give a brief introduction of the statistical learning framework. Suppose X stands for the vector space of input and Y for the vector space of output. In statistical learning, we assume that there is an underlying unknown probability distribution over the product space Z = X × Y . The training set D = {(x1 , y1 ), ..., (xn , yn )} comprise of n samples from the probability distribution. The goal of statistical learning is to find the unknown function f : X → Y from the data D. We start with a set of candidate hypotheses H = {h1 , h2 ..., }, which are likely to represent f . 
The hypothesis space is the space of functions that the algorithm will search through. We want to select a hypothesis f from H. The way we do this is called a learning algorithm. Let L(f (x, y) be the loss function that is a metric of the difference between the predicted value f (x) and the observed value y. The problem of statistical learning is to minimize the expected risk: Z R(f ) = L(f (x, y)dF (x, y). Since the probability distribution F (x, y) is unknown, a proxy measure for the expected risk must be used. We try to minimize empirical risk: n 1X Remp (f ) = L(f (xi , yi ). n i=1 1.3.1 Neural network The basis of the biological neural networks is the nerve cells, which is composed of a cell body, a dendrite and an axon. At the high-level view, incoming stimuli are transmitted to the cell body via dendrites. Outputs generated after operations in the cell body are transmitted to 4 other nerve cells via axons. In the neural network model, it imitates the functioning of the human brain. The biological nervous system in the human body consists of a three-layered structure that includes receiving data, interpreting them, and making decisions. A neuron model is composed of three layers: input layer, hidden layer and output layer. Here we give a graphical representation of similarity between biological and artificial neural networks with one hidden layer in Figure 1.2[58]. x1 , ..., xm are input units, which mimic the dendrites of a neuron. Σ is a computation unit, which is involved in the same role in the cell body. The computation unit is the most important part in neural networks, which is the linear combination of inputs units and bias and then apply the activation function. The number of computation units and the type of activation function are crucial in building models. Figure 1.2: Similarity between biological and artificial neural networks Common activation functions of neural networks used in perceptrons and neural networks 5 are • Rectified Linear Unit (ReLU): σ(x) = x+ = max{x, 0}, • Standard Sigmoid: σ(x) = (1 + e−x )−1 , • Hyperbolic Tangent (Tanh): ex − e−x σ(x) = tanh(x) = x . e + e−x The output layer consists of a single layer, where the generated data are transmitted to the outside world. This is analogous to the axon of a neuron. Neural networks with multiple hidden layers are called deep neural networks. Deep neural networks contain multiple non-linear hidden layers and this enables them learn very compli- cated relationships between inputs and outputs. For most data sets, neural networks with one hidden layer are enough to build a decent model. While various theoretical perspectives have been developed to explain why deep learning is successful, the general consensus of the community is to attribute the success to the joint forces of straightforward neural model- ing, simple learning techniques, the availability of big data and the hardware revolution in high-performance computing[40]. Deep neural networks are a powerful tool in analyzing a large dataset. However, overfitting is a serious issue in deep neural networks. Dropout is a technique to address the overfitting problem. Dropout randomly drops units (along with 6 their connections) from the neural network during training[39]. By reducing the number of parameters, the model performance of a deep neural network can be improved. Deep learning has been implemented in many software frameworks, such as Tensorflow and Pytorch[62; 63]. 
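As a concrete illustration, the sketch below uses PyTorch to fit a one-hidden-layer feed-forward network with a ReLU activation and dropout on synthetic regression data. It is a minimal, hypothetical example: the layer sizes, learning rate, and dropout probability are illustrative choices only, not settings used elsewhere in this dissertation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data: 200 samples, 10 covariates.
X = torch.randn(200, 10)
y = X[:, :3].sum(dim=1, keepdim=True) ** 2 + 0.1 * torch.randn(200, 1)

# One hidden layer with ReLU activation and dropout for regularization.
model = nn.Sequential(
    nn.Linear(10, 16),   # input layer -> hidden layer
    nn.ReLU(),           # activation in the hidden layer
    nn.Dropout(p=0.2),   # randomly drops hidden units during training
    nn.Linear(16, 1),    # hidden layer -> single output unit
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()   # squared loss for mean regression

model.train()
for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()      # back-propagate the gradients
    optimizer.step()

model.eval()             # dropout is switched off at evaluation time
print(loss_fn(model(X), y).item())
```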
Those frameworks offer building blocks for designing, training and validating deep neural networks, through a high-level programming language, like Python. They also provide a clear and concise way to simplify the implementation of complex and large-scale deep learning models by using a collection of pre-built and optimized components. It is worthwhile to mention a well-known result of neural networks: the universal approx- imation theorem. A neural network with one hidden layer could approximate any continuous function[61]. Theorem 1.3.1 (Universal Approximation Theorem). For every continuous function f : [a, b]d → R and for every  > 0, there exists a neural network with one hidden layer ψ(x) such that sup |f (x) − ψ(x)| < . x∈[a,b]d 1.3.2 Artificial intelligence in healthcare Deep learning or AI has been applied to many applications, such as natural language pro- cessing and computer vision. AI also holds great promise for healthcare. With the develop- ment of biotechnology, healthcare data has a large size and complexity that traditional data management tools cannot store or process it efficiently. Many successful AI applications in healthcare have been conducted. For example, AI can be used to optimize the care trajectory of chronic disease patients, suggest precision therapies for complex illnesses, reduce medical errors, and improve subject enrollment into clinical trials[8]. Fakoor et al. showed that how 7 unsupervised feature learning can be used for cancer detection and cancer type analysis from gene expression data[9]. Krittanawong et al. gave a glimpse of AI’s application in cardio- vascular clinical care and discussed its potential role in facilitating precision cardiovascular medicine[12]. Pham et al used a deep learning approach to read medical records, store previ- ous illness history, infer current illness states and predict future medical outcomes[59]. Plis et al. applied deep learning methods to learn physiologically important representations and detect latent relations in neuroimaging data[60]. There is a great promise that the applica- tions of AI can provide substantial improvement in all areas of healthcare from diagnostics to treatments. Although there are many instances in which AI can perform healthcare tasks better than humans, implementation will prevent large-scale automation of healthcare pro- fessional jobs for a considerable period[76]. However, AI will not take over the jobs which require unique human skills such as empathy and persuasion. 1.4 Organization The dissertation is organized as follows. In chapter 2, we develop a neural-network-based method called expectile neural networks. In chapter 3, The asymptotic properties of ENN are discussed. In chapter 4, we summarize this dissertation and discuss some potential future work. 8 Chapter 2 Expectile Neural Networks for Genetic Data Analysis of Complex Diseases 2.1 Overview The genetic etiologies of common diseases are highly complex and heterogeneous. Classic statistical methods, such as linear regression, have successfully identified numerous genetic variants associated with complex diseases. Nonetheless, for most complex diseases, the identified variants only account for a small proportion of heritability. Challenges remain to discover additional variants contributing to complex diseases. Expectile regression is a generalization of linear regression and provides completed information on the conditional distribution of a phenotype of interest. 
While expectile regression has many nice proper- ties and holds great promise for genetic data analyses (e.g., investigating genetic variants predisposing to a high-risk population), it has been rarely used in genetic research. In this chapter, we develop an expectile neural network (ENN) method for genetic data analyses of complex diseases. Similar to expectile regression, ENN provides a comprehensive view of relationships between genetic variants and disease phenotypes and can be used to discover 9 genetic variants predisposing to sub-populations (e.g., high-risk groups). We further inte- grate the idea of neural networks into ENN, making it capable of capturing non-linear and non-additive genetic effects (e.g., gene-gene interactions). Through simulations, we showed that the proposed method outperformed an existing expectile regression when there exist complex relationships between genetic variants and disease phenotypes. We also applied the proposed method to the genetic data from the Study of Addiction: Genetics and Environ- ment(SAGE), investigating the relationships of candidate genes with smoking quantity. 2.2 Introduction Converging evidence suggests that the genetic etiologies of complex diseases are highly het- erogeneous [13; 14] and various genetic factors and environmental determinants could play different roles in subgroups of the population. Linear regression has been commonly used in genetic studies to investigate the effects of genetic variants on the mean of a continuous phenotype. However, if we are interested in a complete view of genetic effects across the entire distribution of phenotypes or are interested in investigating genetic contribution to a sub-population(e.g., a high-risk population), quantile regression and expectile regression are great alternative choices [15; 16]. Quantile regression generalizes median regression and has been widely used in fields such as economics [17], medicine [18; 19] and environmental science [20] to study entire conditional distributions of responses given covariates. While quantile regression has many good properties (e.g., being robust to distribution assumption and outlies), as pointed out by Newey and Powell [16], quantile regression has several lim- itations. First, quantile regression uses the check function with the absolute least error as loss function, which is not continuously differentiable and is computationally difficult for pa- 10 rameter estimation. Second, quantile regression is relatively inefficient for error distributions that are close to Gaussian or have low densities at the corresponding percentile. Third, it is challenging to estimate the density function values of quantile regression. To address these issues, Newey and Powell [16] proposes expectile regression, which uses the sum of asymmetric residual squares as the loss function. Since the loss function is convex and differentiable, expectile regression has a computational advantage over quantile regression. Similar to quantile regression, expectile regression makes no assumption on error distribution (e.g., homoscedasticity) and can be used to study the entire distribution of the responses. Expectile regression can be viewed as a generalization of linear regression. A typical expectile regression assumes a linear relationship between the expectile and the covariates, which may not be suitable for genetic data analysis as genetic variants likely influence phenotypes in a complicated manner (e.g., through interactions) [21]. 
Simply considering linear and additive genetic effects can’t fully take this complexity into account. In this chapter, we integrate the idea of neural networks into expectile regression and de- velop an expectile neural network (ENN) method to model the complex relationship between genotypes and phenotypes. While several methods have been developed to integrate neural networks into quantile regression[22; 23; 24], few studies have been focused on investigating nonlinear expectile regressions, especially using neural networks. Compared to quantile re- gression neural networks(QRNN), ENN has several advantages. The empirical loss function in ENN is differentiable everywhere. Moreover, ENN can detect the heteroscedasticity in the data since ENN is more sensitive to extreme values than QRNN[25; 26; 27; 28; 29]. The rest of the chapter is organized as follows: in Section 2, we review expectile regres- sion and propose an ENN method. We then give an inequality that bounds the integrated squared error of an expectile function estimator in terms of risk functions. The proof of 11 inequality is detailed in the Appendix. Simulations were conducted in Section 3 to evalu- ate the performance of the new method. In Section 4, we applied ENN to the SAGE data, studying genetic contribution to smoking quantity. We provide the summary and concluding remarks in Section 5. 2.3 Method In this section, we briefly introduce expectile regression and then propose an expectile neural network. Suppose we have n samples,{(xi , yi ), i = 1, ..., n}, where xi = (1, xi,1 , ..., xi,p )T and yi denote a p−dimensional covarites and the response for the ith sample, respectively. In this chapter, the covariates are primarily genetic variants, such as single nucleotide poly- morphisms (SNPs), which are typically coded as the number of minor frequent allele (e.g., AA=2, Aa=1, aa=0). The covariates xi can also include personal characteristics (e.g., gen- der) and environmental determinants. The response yi is the set of observable characteristics of an individual in genetics. For example, yi could be the type of diabetes, or the height of an individual. By building models between xi and yi , we tend to explore the relationship of candidate genes and certain disease. 2.3.1 Expectile regression Given the data, linear regression is commonly used to model the relationship between the covariates and the mean response. However, if we want to explore a complete relationship between the covarites and the response (e.g., genetic contribution to a high-risk population), an expectile regression can be used. To simplify the notation, we denote expectile regression 12 as ER. The expectile regression for the τ −expectile can be expressed as, Expectile(τ ) = xT β̂, (2.1) where β̂ is the estimator of coefficients β = (β0 , β1 , ..., βp )T . The expectile is also closely related to two commonly used measures in mathematical finance, value at risk and expected shortfall. The regression parameters, β̂, can be obtained by minimizing an asymmetric L2 loss function, n 1X RLτ (β; τ ) = Lτ (yi , xi T β), 0 < τ < 1, (2.2) n i=1 where Lτ (·) is asymmetric squared loss with convex form  (1 − τ )(yi − xi T β)2 , if yi < xi T β   T L(yi , xi β) = (2.3) τ (yi − xi T β)2 , if yi ≥ xi T β.   Minimizing asymmetically weighted sums of squared errors yields the the expectiles. If we minimize sums of asymmetrically weighted absolute errors, the estimators are quantiles. 
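To make the estimation step concrete, the following self-contained sketch (not the chapter's own implementation) fits a linear τ-expectile regression by iteratively reweighted least squares: residuals above the current fit receive weight τ and residuals below it receive weight 1 − τ, exactly as in the asymmetric squared loss (2.3). The data-generating model and all settings in the usage example are hypothetical.

```python
import numpy as np

def expectile_regression(X, y, tau, n_iter=100, tol=1e-8):
    """Fit a linear tau-expectile regression by iteratively
    reweighted least squares on the asymmetric squared loss (2.3)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])                # add an intercept
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]         # least-squares start
    for _ in range(n_iter):
        resid = y - Xd @ beta
        w = np.where(resid >= 0, tau, 1.0 - tau)         # asymmetric weights
        sw = np.sqrt(w)
        # Weighted least-squares step with the current weights.
        beta_new = np.linalg.lstsq(sw[:, None] * Xd, sw * y, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Toy example with heteroscedastic noise: upper and lower expectiles differ.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(500, 5)).astype(float)      # SNP-like covariates
y = X @ rng.uniform(-1, 1, 5) + (1 + X[:, 0]) * rng.normal(size=500)
for tau in (0.1, 0.5, 0.9):
    print(tau, expectile_regression(X, y, tau)[:3].round(2))
```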
In contrast to the quantiles, expectiles have a more global dependence on the form of the distribution. Shifting mass in the lower tail of a distribution has no impact on the quantiles of the upper tail, but it will affect all expectiles. We cite Figure 2.1 to show the relationship between quantiles and expectiles [54]. For a model with a large p, a penalty term can be added to the risk function to reduce the model complexity,

$$R_{L_\tau}(\beta;\tau) = \frac{1}{n}\sum_{i=1}^{n} L_\tau(y_i - x_i^T\beta) + \lambda\sum_{i=1}^{p}\beta_i^2. \qquad (2.4)$$

τ is a hyperparameter between 0 and 1. By tuning τ, we can obtain different conditional distributions of the responses, similar to quantile regression; quantile regression, however, uses an asymmetric absolute value loss. When τ = 0.5, the corresponding expectile regression degenerates to a standard linear regression. Therefore, expectile regression can also be viewed as a generalization of linear regression: just as quantile regression generalizes median regression, expectile regression generalizes mean regression.

Figure 2.1: Quantiles and expectiles.

2.3.2 Expectile neural network

A typical expectile regression model focuses on linear relationships between covariates and responses. In reality, the underlying relationship could be non-linear and involve complicated interactions among covariates. In order to model complex relationships between covariates and responses, we integrate the idea of neural networks into expectile regression and propose the ENN method. A neural network is a powerful nonlinear approximator: every continuous function can be approximated well by a neural network with one hidden layer [33]. We do not assume a particular functional form for the covariates and use neural networks to approximate the underlying expectile regression function. ENN can be considered as a nonparametric expectile regression, or equivalently as a neural network with an asymmetric L2 loss function. We illustrate ENN with one hidden layer; the method can be easily extended to an expectile regression deep neural network with multiple layers.

Figure 2.2: A graphical representation of expectile neural network

Given the covariates $x_t$, we first build the hidden nodes $h_{q,t}$,

$$h_{q,t} = f^{(1)}\Big(\sum_{p=1}^{P} x_{p,t}\, w_{pq}^{(1)} + b_q^{(1)}\Big), \quad q = 1, \ldots, Q, \; t = 1, \ldots, n, \qquad (2.5)$$

where Q is the number of nodes in the first hidden layer, $w_{pq}^{(1)}$ denotes the weights and $b_q^{(1)}$ denotes the bias; $f^{(1)}$ is the activation function for the hidden layer, which can be a sigmoid function, a hyperbolic tangent function, or a rectified linear unit (ReLU) function. Similar to hidden nodes in neural networks, the hidden nodes in ENN can learn complex features from the covariates x, which makes ENN capable of modelling non-linear and non-additive effects. Based on these hidden nodes, we can model the conditional τ-expectile, $\hat{y}_\tau(t)$,

$$\hat{y}_\tau(t) = f^{(2)}\Big(\sum_{q=1}^{Q} h_{q,t}\, w_q^{(2)} + b^{(2)}\Big), \qquad (2.6)$$

where $f^{(2)}$, $w_q^{(2)}$, and $b^{(2)}$ are the activation function, weights, and bias in the output layer, respectively. $f^{(2)}$ can be an identity function, a sigmoid function, or a ReLU function. To illustrate the structure of ENN, a graphical representation of ENN is given in Figure 2.2. From equations (2.5) and (2.6), we have the ENN model:

$$\hat{y}_\tau(t) = f^{(2)}\Big(\sum_{q=1}^{Q} f^{(1)}\Big(\sum_{p=1}^{P} x_{p,t}\, w_{pq}^{(1)} + b_q^{(1)}\Big) w_q^{(2)} + b^{(2)}\Big). \qquad (2.7)$$

If we choose τ = 0.5 and take both $f^{(1)}$ and $f^{(2)}$ to be the identity function, ENN reduces to linear regression.
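Written out in code, the model in (2.7) is simply a one-hidden-layer feed-forward map. The following sketch is illustrative only (the variable names and sizes are hypothetical): it computes the conditional τ-expectile for a matrix of genotypes, using a sigmoid hidden activation and an identity output activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enn_forward(X, W1, b1, w2, b2, f1=sigmoid, f2=lambda z: z):
    """One-hidden-layer ENN of equation (2.7).

    X  : (n, P) covariate matrix (e.g., SNPs coded 0/1/2)
    W1 : (P, Q) hidden-layer weights w_pq^(1)
    b1 : (Q,)   hidden-layer biases  b_q^(1)
    w2 : (Q,)   output weights       w_q^(2)
    b2 : scalar output bias          b^(2)
    """
    H = f1(X @ W1 + b1)        # hidden nodes h_{q,t}, equation (2.5)
    return f2(H @ w2 + b2)     # conditional tau-expectile, equation (2.6)

# Tiny illustration: 4 SNPs, 3 hidden units.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 4)).astype(float)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
w2, b2 = rng.normal(size=3), 0.0
print(enn_forward(X, W1, b1, w2, b2))
```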
To estimate $w_{pq}^{(1)}$, $b_q^{(1)}$, $w_q^{(2)}$, and $b^{(2)}$, we minimize the empirical risk function

$$R(\tau) = \frac{1}{n}\sum_{i=1}^{n} L_\tau\big(y_i, f(x_i)\big), \qquad (2.8)$$

where

$$L_\tau\big(y_i, f(x_i)\big) = \begin{cases} (1-\tau)\,(y_i - f(x_i))^2, & \text{if } y_i < f(x_i), \\ \tau\,(y_i - f(x_i))^2, & \text{if } y_i \geq f(x_i). \end{cases} \qquad (2.9)$$

The model tends to overfit as the number of covariates increases. To address the overfitting issue, an L2 penalty is added to the risk function,

$$R(\tau) = \frac{1}{n}\sum_{i=1}^{n} L_\tau\big(y_i, f(x_i)\big) + \lambda\Big(\sum_{p=1}^{P}\sum_{q=1}^{Q}\big(w_{pq}^{(1)}\big)^2 + \sum_{q=1}^{Q}\big(w_q^{(2)}\big)^2\Big). \qquad (2.10)$$

The loss function for ENN is differentiable everywhere, which gives us a computational advantage. Even so, a closed-form estimator like that of linear regression is not available because of the indicator function implicit in the asymmetric weights. We can obtain the ENN estimator by using gradient-based optimization algorithms (e.g., the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm). In numerical optimization, the BFGS algorithm is an iterative method for solving unconstrained nonlinear optimization problems [34].

2.3.3 Theoretical result

Intuitively, for a fixed τ, the estimation error of the τ-expectile is related to the risk function. To state the result, some notation is changed. We give one theoretical result showing that the error of the τ-expectile is bounded, above and below, in terms of the excess risk $R_{L_\tau,P}(f) - R^*_{L_\tau,P}$. In ENN, the τ-expectile $f^*_{L_\tau,P}$ can be estimated by minimizing the asymmetric least squares (ALS) loss,

$$R^*_{L_\tau,P} = \inf\Big\{ R_{L_\tau,P}(f) = \int_{X\times Y} L_\tau\big(y, f(x)\big)\, dP(x,y) \;\Big|\; f: X \to \mathbb{R} \text{ measurable} \Big\},$$

where P is the distribution on X × Y and f : X → R is some predictor. The following theorem describes the upper and lower bounds of the error of $f^*_{L_\tau,P}$.

Theorem 2.3.1. Let $L_\tau$ be the ALS loss function and P be the distribution on X × Y. We further assume that $f^*_{L_\tau,P} < \infty$ is the τ-expectile for fixed τ ∈ (0, 1). Then, for an arbitrary neural network function f, we have

$$C_\tau^{-1/2}\big(R_{L_\tau,P}(f) - R^*_{L_\tau,P}\big)^{1/2} \;\leq\; \|f - f^*_{L_\tau,P}\|_{L_2(P_x)} \;\leq\; c_\tau^{-1/2}\big(R_{L_\tau,P}(f) - R^*_{L_\tau,P}\big)^{1/2},$$

where $c_\tau = \min\{\tau, 1-\tau\}$ and $C_\tau = \max\{\tau, 1-\tau\}$.

Proof of this theorem can be found in the appendix of the chapter.

2.4 Simulation

Simulation studies were conducted to compare the performance of ENN and ER under different settings. The genetic data used in the simulation are the real sequencing data from the 1000 Genomes Project, located on Chromosome 17: 7344328-8344327 [30]. In total, 1000 replicates were simulated for each simulation setting. In each replicate, we randomly selected a number of samples and SNPs from the 1000 Genomes Project based on the simulation settings. Given the genotypes, we further simulated the phenotype by using different linear/non-linear functions or by assuming different types of interactions among SNPs or genes. We divided the samples into training, validation, and testing sets with the ratio 3:1:1. ENN and ER were applied to the training set to build models. While a variety of activation functions can be used in ENN, we chose ReLU due to its performance and computational advantage [10]. Since the loss function of ENN is differentiable, we use the quasi-Newton BFGS optimization algorithm to estimate the parameters in ENN. We chose the starting point carefully to avoid local minima. To select a proper starting point, we generated several sets of initial values from U[−1, 1], ran the algorithm for a few steps from each, and chose the set achieving the smallest loss as the initial values.
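The sketch below shows one way this estimation step could look in code: the penalized risk (2.10) with a ReLU hidden layer and an identity output is minimized by SciPy's quasi-Newton BFGS routine, using the multi-start strategy just described (short trial runs from U[−1, 1] starting values, then a full run from the best starting point). The function names, the number of hidden units, and the simulated data are hypothetical; this is a sketch, not the dissertation's own implementation.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(theta, P, Q):
    """Unpack a flat parameter vector into (W1, b1, w2, b2)."""
    W1 = theta[:P * Q].reshape(P, Q)
    b1 = theta[P * Q:P * Q + Q]
    w2 = theta[P * Q + Q:P * Q + 2 * Q]
    b2 = theta[-1]
    return W1, b1, w2, b2

def penalized_risk(theta, X, y, tau, lam, P, Q):
    """Penalized empirical risk (2.10): asymmetric squared loss plus L2 penalty."""
    W1, b1, w2, b2 = unpack(theta, P, Q)
    H = np.maximum(X @ W1 + b1, 0.0)            # ReLU hidden layer
    fhat = H @ w2 + b2                          # identity output layer
    resid = y - fhat
    w = np.where(resid >= 0, tau, 1.0 - tau)    # asymmetric weights
    return np.mean(w * resid ** 2) + lam * (np.sum(W1 ** 2) + np.sum(w2 ** 2))

def fit_enn(X, y, tau, lam=0.1, Q=3, n_starts=10, seed=0):
    """Multi-start quasi-Newton (BFGS) estimation of the ENN parameters."""
    n, P = X.shape
    rng = np.random.default_rng(seed)
    dim = P * Q + 2 * Q + 1
    starts = [rng.uniform(-1, 1, dim) for _ in range(n_starts)]
    # Run a few BFGS steps from each candidate start; keep the best start.
    trials = [(minimize(penalized_risk, t0, args=(X, y, tau, lam, P, Q),
                        method="BFGS", options={"maxiter": 20}).fun, t0)
              for t0 in starts]
    _, best_t0 = min(trials, key=lambda t: t[0])
    # Full BFGS run from the selected starting values until convergence.
    return minimize(penalized_risk, best_t0, args=(X, y, tau, lam, P, Q),
                    method="BFGS")

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 5)).astype(float)
y = np.sin(X @ rng.uniform(-1, 1, 5)) + rng.normal(size=300)
print(fit_enn(X, y, tau=0.75).fun)
```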
Based on the initial values, the quasi- Newton BFGS optimization algorithm is implemented to iteratively estimate the parameters until the convergence criterion is satisfied. The models built on the training set were then applied to the validation set to choose the most parsimonious model with the optimal tuning parameter (i.e., λ). To choose the best λ, we use the grid search with different values of 0,0.1,1,10,100. This final model was then evaluated on the testing set by using the mean squared error (MSE). We chose the number of hidden nodes with smallest MSE value by 19 doing simulation. We simplify those terms: expectile neural network, expectile regression, training data and testing data as ENN, ER, TR, TS in three simulations. 2.4.1 Simulation I - nonlinear relationship In simulation I, we varied the relationships between genotypes and phenotypes. Since the existence of hyperparameter τ , we compared the performances of ENN with ER. If we wanted to compare with other model, we need to fix τ . The existence of τ gived us a complete view of genetic effects across the entire distribution of phenotypes, like quantile regression. If τ is close to 0 or 1, we could investigate genetic contribution to high-risk individuals. Specially, we considered the following four nonlinear functions as true functions to simulate the relationship between genotypes and phenotypes. For comparison purpose, we also include a linear function. We compare ENN with ENN under four different nonlinear functions: hyperbolic function, mixed function, quadratic function, cubic function. 1. linear function: y = α + , α = xT β, 2. Hyperbolic function: |α| y= + , α = xT β, (1 + |α|) 3. Mixed function: y = sin(α) + 2 ∗ exp(−16α2 ) + , α = xT β, 4. Quadratic function: y = α2 + , α = xT β, 20 5. Cubic function: y = α3 + , α = xT β, where x is the vector of SNPs (coded as 0, 1 or 2), β represents the genetic effects generated from the uniform distribution of U (−1, 1), and  ∼ N (0, 1). Totally 1000 replicates were simulated by setting  with different seed. For each replicate, We randomly choose 500 samples and 50 SNPs from the 1000 Genomes Project. For each nonlinear function, we choose five different value τ of 0.1, 0.25, 0.5, 0.75, 0.9 in order to get different expectiles. To have better readability, the columns of validation data are not shown. 21 Figure 2.3: Performance comparison between ENN and ER under various relationships be- tween genotypes and phenotypes and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) 22 The results from the simulation I are summarized in Figure 2.3. ENN outperforms ER in terms of MSE under four different nonlinear relationships, and has comparable performance with ER when the underlying relationship is linear. The pattern is consistent across different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9). While ENN outperforms ER for all four non-linear cases, ENN attains its best performance relative to ER when the underlying relationship is a high-order polynomial function (i.e., a cubic function). From the simulation result, ENN has advantages to explore the underlying nonliner relationship between genetic variants and certain disease. By fixing τ as 0.1 or 0.9, we could apply ENN into real data to identify high-risk individuals. 2.4.2 Simulation II - interactions among SNPs Increasing empirical evidence from model organisms and human studies suggests that in- teractions among loci contribute broadly to complex traits[36; 37; 38]. 
In simulation II, we considered three different interactions scenarios that attempt to mimic simple biological mechanisms. Those three types of interactions included a two-way multiplicative interaction, a two-way threshold interaction, and a three-way interactions [14]. Similar to simulation I, we simulated 1000 replicates for each type of interaction. We use the same structure of ENN like simulatino II. For each replicate, 500 samples and 50 SNPs were chosen from the 1000 Genomes Project. Among the 50 SNPs, we randomly selected 20% of SNPs and simulated different types of interactions among the selected SNPs. Based on the simulated data, we compared MSEs of ENN and ER. For the comparison purpose, we also included a baseline model without any interaction. Only training and testing data are shown. 23 Figure 2.4: Performance comparison between ENN and ER for different types of interactions and different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) The results of the simulation II are summarized in Figure 2.4. Overall, ENN outperforms ER under all three interaction scenarios due to its ability of taking interactions into account. Among all interaction models, ENN attains its best performance relative to ER when there are three-way interactions. ENN also has more advantage over ER at the upper and lower expectiles (e.g., 0.1 and 0.9). When there is no interaction, ENN has comparable performance with ER. 24 2.4.3 Simulation III - interactions between genes Following the identification of several disease-associated polymorphisms by whole genome association analysis, investigating interactions among two or more than two genes is often interested in genetic studies[35]. Detecting gene-gene interaction will allow us to elucidate the biological and biochemical pathways underpinning disease. Figure 2.5: An alternative architecture for gene-gene interaction analyses While a fully connected neural network can be built on all SNPs in the genes of interest, a neural network with a simpler architecture reflecting the underlying genetic data structure can be used to reduce the model’s complexity and improve the model’s performance. In this simulation, We illustrate the idea by modeling interactions between two genes with a non- fully connected architecture. In the non-fully connected architecture, the hidden units are only locally connected to SNPs in one gene (Figure 2.5). By using this simple architecture, we can reduce the number of parameters and build ”gene-specific” hidden units to capture abstract features of a specific gene. To evaluate the performance of such an architecture, we 25 simply simulated four SNPs for each gene, considered a two-way multiplicative interaction between two genes, and compared ENN with the non-fully connected architecture to ENN with a fully connected architecture. Figure 2.6: Performance comparison between ENN with a fully connected architecture and ENN with a non-fully connected architecture for gene-gene interaction analyses Figure 2.6 summarizes the results from simulation III. The results show that ENN with the non-fully connected architecture attains lower MSE than ENN with the fully-connected architecture. As expected, the non-fully connected architecture requires fewer parameters and more reflects the underlying genetic data structure (i.e., genes are separate functional units), and therefore attains better performance than the fully-connected architecture. By reducing the number of parameters, we have more computational advantage. 
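To make the idea concrete, here is a hedged PyTorch sketch of such a gene-specific architecture (the class name, gene sizes, hidden-unit counts, and training settings are illustrative assumptions, not the configuration used in the simulation): each gene's SNPs feed only into that gene's hidden units, the gene-level features are combined in the output layer, and the network is trained with the asymmetric squared loss (2.9).

```python
import torch
from torch import nn

class GeneSpecificENN(nn.Module):
    """Non-fully connected ENN: each gene has its own hidden units
    (as in Figure 2.5); the output layer combines the gene-level features."""

    def __init__(self, snps_per_gene=(4, 4), hidden_per_gene=2):
        super().__init__()
        # One small hidden block per gene, connected only to that gene's SNPs.
        self.gene_blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(p, hidden_per_gene), nn.ReLU())
             for p in snps_per_gene]
        )
        self.sizes = list(snps_per_gene)
        self.out = nn.Linear(hidden_per_gene * len(snps_per_gene), 1)

    def forward(self, x):
        # Split the SNP matrix by gene and pass each piece through its block.
        pieces = torch.split(x, self.sizes, dim=1)
        h = torch.cat([blk(p) for blk, p in zip(self.gene_blocks, pieces)], dim=1)
        return self.out(h)

def expectile_loss(pred, target, tau):
    """Asymmetric squared loss (2.9)."""
    resid = target - pred
    weight = tau * (resid >= 0).float() + (1.0 - tau) * (resid < 0).float()
    return torch.mean(weight * resid ** 2)

# Toy usage: two genes with four SNPs each and a multiplicative interaction.
torch.manual_seed(0)
x = torch.randint(0, 3, (200, 8)).float()
y = (x[:, :4].sum(1) * x[:, 4:].sum(1)).unsqueeze(1) + torch.randn(200, 1)
model = GeneSpecificENN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = expectile_loss(model(x), y, tau=0.75)
    loss.backward()
    opt.step()
print(loss.item())
```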
2.5 Real data applications Tobacco use is the leading cause of preventable disease and death in the United States. In 2019, nearly 34 million adults currently smoked cigarettes. More than 16 million Americans 26 are related to a disease caused by smoking. More than 300 billion a year are spent in direct medical care for adults or in lost productivity due to premature death and exposure to secondhand smoke in United States. More than 7 million deaths per year are caused by tobacco use in the world(https://www.cdc.gov/tobacco/data_statistics/index.htm). Predicting high-risk individuals at early stage so that appropriate prevention methods can be used to reduce mortality and morbidity. In this section, we applied ENN into analyzing two real data set. The first one is to explore genetic effects on nicotine dependence. In the second real data analysis, we take gene-gene interactions into consideration. Since the existence of hyperparameter τ , we choose ER as baseline. Five different τ values 0.1, 0.25, 0.5, 0.75, 0.9 are chosen. We use mean square error(MSE) as metrics to measure the performance of ENN and ER. 27 Table 2.1: The accuracy performance of two models built by ENN and ER based on 149 candidate SNPs and 3 covariates ENN ER τ Train Test Train Test 0.1 409.612 678.331 504.215 694.809 0.25 346.118 579.164 394.836 588.759 0.5 358.783 502.752 342.144 535.925 0.75 344.399 604.969 421.955 613.676 0.9 570.994 809.733 699.654 882.781 2.5.1 The relationship between candidate SNPs with smoking quan- tities We applied both ENN and ER to the genetic data from the Study of Addiction: Genet- ics and Environment(SAGE). The participants of the SAGE are selected from three large and complementary studies: the Family Study of Cocaine Dependence(FSCD), the Collab- orative Study on the Genetics of Alcoholism(COGA), and the Collaborative Genetic Study of Nicotine Dependence(COGEND). In this application, we selected 155 SNPs, which were previously shown to have a potential role in nicotine dependence. After quality control, 149 SNPs remained for the analysis. There are a total of 3897 samples in the SAGE data from different ethnic groups. We only included 3888 Caucasian and African American samples due to the small sample size of other ethnic groups. Our interest is to use ENN and ER to build models on 149 SNPs, 3 covariates (i.e., sex, age, and race), and smoking quantities, which is measured by the largest number of cigarettes smoked in 24 hours. We divided the whole sample into the training, validation and test samples in the ratio of 3:1:1 to build the models, select the turning parameter, and evaluate the models, respectively. Table 2.1 summarizes MSE of the models built by ENN and ER for five expectile levels (i.e., τ = 0.1, 0.25, 0.5, 0.75, and 0.9). For readability, MSE of validation data is omitted. 28 Table 2.1 shows that ENN outperforms ER, indicating the possibility of non-linear or non- additive effects among candidate SNPs and covariates. Figure 2.7: A comprehesive view of the conditional distribution of smoking quantity for five expectile levels (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9) To provide a comprehensive view of the conditional distribution of smoking quantity, we ordered the expectiles estimated from ENN from lowest to highest and plotted their values for all five expectile levels. Figure 2.7 shows that the distributions of estimated expectiles are different across five expectile levels. Under different expectile levels, different expectiles are predicted. 
When τ = 0.5, ENN models the mean response, in which the estimated expectiles are similar for all individuals. Nonetheless, for high expectile levels (e.g., τ = 0.9), the estimated expectiles vary among individuals and high-ranked individuals have much higher expectiles than low-ranked individuals. ENN gives us more information compared to linear regression which only shows predicted value with τ = 0.5. 29 2.5.2 Gene-gene interactions between the CHRNA5-CHRNA3- CHRNB4 gene cluster Based on previous genome-wide association studies, variants in the CHRNA5-CHRNA3- CHRNB4 gene cluster on chromosome 15 that encode the α5, α3 and β4 subunits of the nicotinic acetylcholine receptor (nAChRs) are associated with nicotine dependence (ND) in European Americans (EAs) or others of European origin[31]. In the second data analysis, we focused on the CHRNA5-CHRNA3-CHRNB4 gene cluster, and evaluated potential interac- tions by using ENN and ER. We consider three pairwise interactions between CHRNA5 and CHRNA3, CHRNA5 and CHRNB4, CHRNA3 and CHRNB4. The phenotype of interest in this analysis is the number of cigarettes smoked per day (CPD), which has been popularly used in the genetic study of nicotine dependence. 30 Table 2.2: Evaluating a pairwise interaction between CHRNA5 and CHRNA3 by using ENN and ER ENN ER τ Train Test Train Test 0.1 1.106 2.022 1.183 2.036 0.25 0.994 1.699 1.027 1.737 0.5 0.896 1.266 0.908 1.304 0.75 1.148 1.045 1.136 1.066 0.9 2.015 1.335 2.069 1.357 Table 2.3: Evaluating a pairwise interaction between CHRNA5 and CHRNB4 by using ENN and ER ENN ER τ Train Test Train Test 0.1 1.139 2.020 1.186 2.049 0.25 0.980 1.701 1.029 1.735 0.5 0.901 1.277 0.908 1.305 0.75 1.149 1.047 1.136 1.071 0.9 2.054 1.318 2.070 1.351 Tables 2.2-2.4 summarize MSE of the interaction models built by using ENN and ER for five expectile levels. For all 3 scenarios, expectile neural network outperforms expectile regression in terms of MSE slightly because the signal-to-noise ratio of genetic data is low. To graphically view the conditional distribution of CPD, we ranked the expectiles esti- mated from ENN and plotted the values against the estimated expectiles (Figures 2.8-2.10). Table 2.4: Evaluating a pairwise interaction between CHRNA3 and CHRNB4 by using ENN and ER ENN ER τ Train Test Train Test 0.1 1.133 2.019 1.183 2.035 0.25 0.979 1.683 1.020 1.696 0.5 0.892 1.278 0.896 1.279 0.75 1.150 1.048 1.128 1.081 0.9 2.020 1.342 2.040 1.386 31 Overall, the estimated expectiles tends to be similar when τ = 0.5 (i.e., mean), while they are quite different for high expectile levels (e.g., τ = 0.9). This suggest that the gene-gene interactions may play a more important role in models with high expectiles than the mean models. 32 Figure 2.8: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNA3 2.6 Summary and discussion In this chapter, we develop an ENN method, which inherits advantages from both neural networks and expectile regression. Using the hierarchical structure from neural networks, ENN can learn complex and abstract features from genotypes, making it suitable for modeling the complex relationship between genotypes and phenotype. Similar to ER, ENN can also explore the conditional distribution and provide a comprehensive view of the genotype- phenotype relationship. Through simulations and a real data application, we demonstrate that ENN outperforms ER when there are non-additive and non-linear effects. 
Evidence also suggests that ENN has more advantages than ER when the model involves high-order interaction effects or non-linear effects. This may suggest ENN has improved performance when the underlying genotype-phenotype relationships become more complicated. The real data analysis shows 33 Figure 2.9: The conditional distribution of CPD considering the interaction between CHRNA5 and CHRNB4 Figure 2.10: The conditional distribution of CPD considering the interaction between CHRNB4 and CHRNA3 34 that genetic effects can vary among different expertiles. Compared to the classical linear regression, ENN provides us more information about the genotype-phenotype relationship via the conditional distributions for different expectile levels. While regularization has been incorporated into ENN to avoid overfitting, ENN can still be subject to overfitting when the number of SNPs becomes extremely large (e.g., one million). To deal with such a large number of SNPs, we can model the overall genetic effect as a random effect and extend ENN, which is an interesting topic for future work. 35 Chapter 3 Asymptotic Theory of Expectile Neural Networks In the previous chapter, we focus on introducing the ENN model and providing an inequality that bounds the integrated squared error of an expectile function estimator. Statistical prop- erties of ENN (e.g., consistency) are also important topics that worth further investigation. In this chapter, we study the asymptotic properties of expectile neural networks, including consistency and normality. In ENN, we use the asymmetric square loss as the loss function. When the size of parameters is too large, the standard maximum likelihood procedures may not work. Therefore, we use the sieve method to constrain the parameter space of ENN, and prove the consistency and normality under the nonparametric regression framework. 3.1 Introduction Neural networks have been widely used in industry and academy. However, the theoreti- cal properties of neural networks have not been thoroughly studied. For a typical artificial neural network, we use the squared loss function to estimate parameters. A general result for the asymptotic normality of squared loss function could be find[52]. By the universal approximation theorem, a neural network with one hidden layer can approximate any con- tinuous functions[43]. In this chapter, we use the asymmetric squared loss function, which 36 gives us a comprehensive view of conditional distribution and computation advantage. In statistics, fitting a neural network can be considered as a parametric nonlinear regression problem, r αj σ(γjT xi + γ0,j ), X yi = α0 + j=1 where 1 , . . . , n are i.i.d. random errors with E[] = 0 and E[2 ] = σ 2 < ∞ and σ(z) = 1/(1+ e−z ). However, it is impractical to fix the number of hidden units r,. If we do not fix r, the parameter in unidentifiable. Fukumizu (1996)[55] and Fukumizu et al. (2003) [56] provided an example to illustrate the unidentifiable issue. If the true function is f0 (x) = ασ(γx) with one hidden unit, we fit the model using a neural network with two hidden units. Then, any parameter Θ = [α0 , α1 , . . . , αr , γ0,1 , . . . , γ0,r , γ1T , . . . , γrT ]T in the following set {Θ : γ1 = γ, α1 = α, γ0,1 = γ0,2 = α2 = α0 = 0}∪ {Θ : γ1 = γ2 = γ, γ0,1 = γ0,2 = α0 = 0, α1 + α2 = α} realizes the true function f0 (x). Therefore, when the number of hidden units is unknown, the parameters in this parametric nonlinear regression problem are unidentifiable. 
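The point is easy to verify numerically. The short sketch below (illustrative only; the particular values of α and γ are arbitrary) evaluates the two-hidden-unit network at two parameter settings drawn from the sets above — one with the second unit switched off, one with the weight α split across two identical units — and confirms that both reproduce f0(x) = ασ(γx) exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def net2(x, a0, a1, a2, g01, g02, gamma1, gamma2):
    """Two-hidden-unit network a0 + a1*sigma(gamma1*x + g01) + a2*sigma(gamma2*x + g02)."""
    return a0 + a1 * sigmoid(gamma1 * x + g01) + a2 * sigmoid(gamma2 * x + g02)

# True one-hidden-unit function f0(x) = alpha * sigma(gamma * x).
alpha, gamma = 1.5, 2.0
x = np.linspace(-3, 3, 200)
f0 = alpha * sigmoid(gamma * x)

# Two very different parameter settings that realize the same f0:
fit_a = net2(x, 0.0, alpha, 0.0, 0.0, 0.0, gamma, -7.3)                  # second unit switched off
fit_b = net2(x, 0.0, 0.4 * alpha, 0.6 * alpha, 0.0, 0.0, gamma, gamma)   # weight alpha split over two copies

print(np.max(np.abs(fit_a - f0)))   # ~0 up to floating point
print(np.max(np.abs(fit_b - f0)))   # ~0 up to floating point
```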
To address this issue, we can consider the neural network in the nonparametric setting. We assume that the true nonparametric regression model is as follows: yi = f0 (xi ) + i , where 1 , ..., n are i.i.d random variables defined on (Ω, A, P) with E() = 0 and E(2 ) = σ 2 < ∞. f0 ∈ F is an unknown function, where F is the class of continuous function with compact support. However, if the complexity of F is large, the estimator may be 37 inconsistent[48]. The standard and penalized maximum likelihood procedures may be ineffi- cient, whereas the method of sieves may be able to overcome this difficulty[52]. The method of sieves provides one way to tackle such difficulties by optimizing an empirical criterion over a sequence of approximating parameter spaces (i.e., sieves). The sieves are less complex but are dense in the original space, and the resulting optimization problem becomes well-posed. To address this issue, we constrain the class of F and use method of sieves to prove the normality of ENN. 3.2 Method of sieves Sieve is a sequence of increasing functions that can be used to reduce the number of param- eters. Sieve plays an important role in infinite-dimensional unknown parameter, such as in a nonparametric or semiparametric model. When the method of sieves is implemented, a nonparametric or semiparametric estimation problem is often reduced to a parametric one. However, to obtain the desired theoretical properties of the estimator, it is necessary that the number of parameters increases slowly with the sample size[42]. We consider a sequence of function classes, F1 ⊆ F2 ⊆ · · · ⊆ Fn ⊆ Fn+1 ⊆ · · · ⊆ F, S∞ approximating F in the sense that n=1 Fn is dense in F, that is for each f ∈ F, there exists πn f ∈ Fn such that d(f, πn f ) → 0 as n → ∞, where d(·, ·) is some pseudo-metric defined on F. The method of sieves consists of two key ingredients: a loss function and sieve parameter spaces (a sequence of approximating spaces). Both loss function and the sieve parameter spaces are flexible. Almost all of the classical loss functions, so long as they allow for 38 identification, can be used as loss functions in the method of sieve estimation. Therefore, the main challenge is the choice of sieve parameter spaces. In this chapter, we focus on the sieve of neural networks with one hidden layer and the sigmoid activation function. rn   rn γjT x + γ0,j Rd , αj , γ0,j X X Frn = {α0 + αj σ : γj ∈ ∈ R, |αj | ≤ Vn j=1 j=0 (3.1) Xd for some Vn ≥ 4 and max |γi,j | ≤ Mn for some Mn > 0}, 1≤j≤rn i=0 where rn , Vn , Mn → ∞ as n → ∞. Frn has some important properties. For example, Frn is dense in F and f ∈ Frn has upper bound. When we consider the asymptotic properties of the sieve estimators, we use the pseudo-norm kf k2n = n−1 n 2 P i=1 f (xi ). With some abuse of notation, an approximate sieve estimator fˆn is defined to be Qn (fˆn ) ≤ inf Qn (f ) + Op (ηn ), (3.2) f ∈Fn where ηn → 0 as n → ∞. We refer the reader to Chen for more details in the method of sieves [42]. Since we use the asymmetric loss function, we establish the upper bounds for the empirical risk and the sample complexity based on the covering number and the Vapnik-Chervonenkis dimension [41]. The estimator of expectile neural networks can also be regarded as M-estimator[50]. 3.3 Existence Before we study the consistency and normality of ENN, it is crucial to ask if the sieve esti- mator based on neural networks exists. In this chapter, we focus on Frn as sieve estimator. 39 First, we show that any function in Frn has an upper bound. 
3.3 Existence

Before we study the consistency and normality of ENN, it is crucial to ask whether the sieve estimator based on neural networks exists. In this chapter, we focus on $\mathcal{F}_{r_n}$ as the sieve space. First, we show that any function in $\mathcal{F}_{r_n}$ has an upper bound.

Lemma 3.3.1. For each fixed $n$, $\sup_{f \in \mathcal{F}_{r_n}} \|f\|_\infty \le V_n$.

Proof. For any $f \in \mathcal{F}_{r_n}$ with a fixed $n$ and any $x \in \mathcal{X}$, we have
\[
|f(x)| = \Big| \alpha_0 + \sum_{j=1}^{r_n} \alpha_j \sigma\big(\gamma_j^T x + \gamma_{0,j}\big) \Big| \le |\alpha_0| + \sum_{j=1}^{r_n} |\alpha_j| \sigma\big(\gamma_j^T x + \gamma_{0,j}\big) \le \sum_{j=0}^{r_n} |\alpha_j| \le V_n.
\]
Since the right-hand side does not depend on $x$ and $f$, we have
\[
\sup_{f \in \mathcal{F}_{r_n}} \|f\|_\infty = \sup_{f \in \mathcal{F}_{r_n}} \sup_{x \in \mathcal{X}} |f(x)| \le V_n.
\]

Lemma 3.3.2. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Then for each fixed $n$, $\mathcal{F}_{r_n}$ is a compact set.

The proof of this lemma is in the appendix. This lemma tells us that $\mathcal{F}_{r_n}$ is compact in $C(\mathcal{X})$, the set of all continuous functions on $\mathcal{X}$. We use Theorem 3.3.1 to show the existence of the ENN estimator [47].

Theorem 3.3.1. Let $(\Omega, \mathcal{F}, P)$ be a complete probability space and $(\Theta, \rho)$ be a metric space. Let $\{\Theta_n\}$ be a sequence of compact subsets of $\Theta$ and let $Q_n : \Omega \times \Theta_n \to \mathbb{R}$ be measurable $\mathcal{F} \times \mathcal{B}(\Theta_n)/\mathcal{B}$. Assume that for each $\omega$ in $\Omega$, $Q_n(\omega, \cdot)$ is lower semicontinuous on $\Theta_n$, $n = 1, 2, \dots$. Then for each $n = 1, 2, \dots$, there exists $\hat{\theta}_n : \Omega \to \Theta_n$ measurable $\mathcal{F}/\mathcal{B}(\Theta_n)$ such that for each $\omega$ in $\Omega$, $Q_n(\omega, \hat{\theta}_n(\omega)) = \inf_{\theta \in \Theta_n} Q_n(\omega, \theta)$.

Proof. The empirical criterion of ENN is
\[
Q_n(f) = \frac{1}{n} \sum_{i=1}^{n} \big| \tau - 1_{\{y_i < f(x_i)\}} \big| \big(y_i - f(x_i)\big)^2,
\]
which is continuous, and hence lower semicontinuous, in $f$. Combining this with the compactness of $\mathcal{F}_{r_n}$ (Lemma 3.3.2), the conditions above are satisfied, so the ENN sieve estimator exists.

3.4 Consistency

Definition 3.4.1. Let $\epsilon > 0$ and let $\mathcal{G}$ be a set of functions from $\mathbb{R}^d$ to $\mathbb{R}$. A finite collection of functions $g_1, \dots, g_N : \mathbb{R}^d \to \mathbb{R}$ such that for every $g \in \mathcal{G}$ there is a $j = j(g) \in \{1, \dots, N\}$ with
\[
\|g - g_j\|_\infty = \sup_x |g(x) - g_j(x)| < \epsilon
\]
is called an $\epsilon$-cover of $\mathcal{G}$ with respect to $\|\cdot\|_\infty$.

Definition 3.4.2. Let $\epsilon > 0$, let $\mathcal{G}$ be a set of functions from $\mathbb{R}^d$ to $\mathbb{R}$, and let $N(\epsilon, \mathcal{G}, \|\cdot\|_\infty)$ be the size of the smallest $\epsilon$-cover of $\mathcal{G}$ with respect to $\|\cdot\|_\infty$, with $N(\epsilon, \mathcal{G}, \|\cdot\|_\infty) = \infty$ if no finite $\epsilon$-cover exists. Then $N(\epsilon, \mathcal{G}, \|\cdot\|_\infty)$ is called the $\epsilon$-covering number of $\mathcal{G}$.

To prove the uniform law of large numbers, we need Lemma 3.4.1 [44].

Lemma 3.4.1. For $n \in \mathbb{N}$, let $\mathcal{G}_n$ be a set of functions $g : \mathbb{R}^d \to [0, B]$ and let $\epsilon > 0$. Then
\[
P\Big\{ \sup_{g \in \mathcal{G}_n} \Big| \frac{1}{n} \sum_{i=1}^{n} g(Z_i) - E g(Z) \Big| > \epsilon \Big\} \le 2 N(\epsilon/3, \mathcal{G}_n, \|\cdot\|_\infty)\, e^{-\frac{2n\epsilon^2}{9B^2}}.
\]

Theorem 3.4.1 (Uniform law of large numbers). Let $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$, write $g_1(Z) = (y - f(x))^2 1_{\{y \ge f(x)\}}$ and $g_2(Z) = (y - f(x))^2 1_{\{y < f(x)\}}$ for the two parts of the asymmetric squared loss, and let $\mathcal{G}_{n,1}$ and $\mathcal{G}_{n,2}$ denote the corresponding function classes as $f$ ranges over $\mathcal{F}_{r_n}$. If $[r_n(d+2)+1] \log[r_n(d+2)+1] = o(n)$, then
\[
\sup_{f \in \mathcal{F}_{r_n}} \Big| \frac{1}{n} \sum_{i=1}^{n} \tau\big[g_1(Z_i) - E(g_1(Z_i))\big] + (1-\tau)\big[g_2(Z_i) - E(g_2(Z_i))\big] \Big| \to 0 \quad \text{a.s., } n \to \infty.
\tag{3.3}
\]

Proof. Since the loss function has two parts, the empirical risk can also be split into two parts:
\[
\sup_{f \in \mathcal{F}_{r_n}} \Big| \frac{1}{n} \sum_{i=1}^{n} \tau\big[g_1(Z_i) - E(g_1(Z_i))\big] + (1-\tau)\big[g_2(Z_i) - E(g_2(Z_i))\big] \Big|
\le \sup_{g_1 \in \mathcal{G}_{n,1}} \tau \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| + \sup_{g_2 \in \mathcal{G}_{n,2}} (1-\tau) \Big| \frac{1}{n} \sum_{i=1}^{n} g_2(Z_i) - E(g_2(Z_i)) \Big|.
\tag{3.4}
\]
We focus on the first part, since the result for the second part can be derived in the same manner; that is, we show
\[
\sup_{g_1 \in \mathcal{G}_{n,1}} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| \to 0.
\tag{3.5}
\]
For $B > 0$, let $G(x) = \sup_{g_1 \in \mathcal{G}_{n,1}} |g_1(x)|$ and $\mathcal{G}_B = \{g_1 1_{\{G < B\}} : g_1 \in \mathcal{G}_{n,1}\}$. If $g_1 \in \mathcal{G}_{n,1}$,
\[
\begin{aligned}
\Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big|
&\le \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) 1_{\{G(Z_i) \le B\}} \Big| + \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) 1_{\{G(Z_i) \le B\}} - E\big(g_1(Z) 1_{\{G(Z) \le B\}}\big) \Big| \\
&\quad + \Big| E\big(g_1(Z) 1_{\{G(Z) \le B\}}\big) - E(g_1(Z)) \Big| \\
&\le \frac{1}{n} \sum_{i=1}^{n} G(Z_i) 1_{\{G(Z_i) > B\}} + \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) 1_{\{G(Z_i) \le B\}} - E\big(g_1(Z) 1_{\{G(Z) \le B\}}\big) \Big| + E\big(G(Z) 1_{\{G(Z) > B\}}\big).
\end{aligned}
\tag{3.6}
\]
This implies
\[
\sup_{g_1 \in \mathcal{G}_{n,1}} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big|
\le \sup_{g_1 \in \mathcal{G}_B} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| + \frac{1}{n} \sum_{i=1}^{n} G(Z_i) 1_{\{G(Z_i) > B\}} + E\big(G(Z) 1_{\{G(Z) > B\}}\big).
\tag{3.7}
\]
Based on $E(G(Z)) < \infty$ and the strong law of large numbers,
\[
\frac{1}{n} \sum_{i=1}^{n} G(Z_i) 1_{\{G(Z_i) > B\}} \to E\big(G(Z) 1_{\{G(Z) > B\}}\big) \quad \text{a.s. as } n \to \infty,
\]
and as $B \to \infty$, $E(G(Z) 1_{\{G(Z) > B\}}) \to 0$. Therefore, we only need to show
\[
\sup_{g_1 \in \mathcal{G}_B} \Big| \frac{1}{n} \sum_{i=1}^{n} g_1(Z_i) - E(g_1(Z_i)) \Big| \to 0.
\]
Recall that if $g$ is a function $g : \mathbb{R}^d \to [0, B]$, then by Hoeffding's inequality,
\[
P\Big\{ \Big| \frac{1}{n} \sum_{j=1}^{n} g(Z_j) - E(g(Z)) \Big| > \epsilon \Big\} \le 2 e^{-\frac{2n\epsilon^2}{B^2}}.
\tag{3.8}
\]
By Lemma 3.4.1, we have
\[
P\Big\{ \sup_{g_1 \in \mathcal{G}_{n,1}} \Big| \frac{1}{n} \sum_{j=1}^{n} g_1(Z_j) - E(g_1(Z)) \Big| > \epsilon \Big\} \le 2 N(\epsilon/3, \mathcal{G}_{n,1}, \|\cdot\|_\infty)\, e^{-\frac{2n\epsilon^2}{9B^2}}.
\tag{3.9}
\]
We use the upper bound on the covering number from Theorem 14.5 in Anthony and Bartlett [41],
\[
N(\epsilon/3, \mathcal{F}_{r_n}, \|\cdot\|_\infty) \le \left( \frac{12 e [r_n(d+2)+1] (\tfrac{1}{4} V_n)^2}{\epsilon (\tfrac{1}{4} V_n - 1)} \right)^{r_n(d+2)+1}.
\tag{3.10}
\]
Recall the definition of the covering number: $N(\epsilon/3, \mathcal{F}_{r_n}, \|\cdot\|_\infty) = N$ is the minimum number such that there exist functions $f_1, \dots, f_N$ with the property that for every $f \in \mathcal{F}_{r_n}$ there is a $j = j(f) \in \{1, \dots, N\}$ such that $\sup_x |f(x) - f_j(x)| < \epsilon$. Since $f(x)$ and $f_j(x)$ are close enough, $y - f(x)$ and $y - f_j(x)$ are either both negative or both positive, and in that situation
\[
\begin{aligned}
\sup_x \big| (y - f(x))^2 1_{\{y - f(x) \ge 0\}} - (y - f_j(x))^2 1_{\{y - f_j(x) \ge 0\}} \big|
&\le \sup_x \big| (y - f(x))^2 - (y - f_j(x))^2 \big| \\
&= \sup_x \big| 2y(f_j(x) - f(x)) + (f(x) - f_j(x))(f(x) + f_j(x)) \big| \\
&< 2(M_1 + M_2)\epsilon.
\end{aligned}
\tag{3.11}
\]
Since $y$ is bounded on $\mathcal{G}_B$ and any function in $\mathcal{F}_{r_n}$ is bounded, there exist $M_1$ and $M_2$ such that $|y| < M_1$ and $|f| < M_2$. So $N(\epsilon/3, \mathcal{G}_{n,1}, \|\cdot\|_\infty) \le N(\epsilon/3, \mathcal{F}_{r_n}, \|\cdot\|_\infty)$. If $[r_n(d+2)+1] \log[r_n(d+2)+1] = o(n)$, then
\[
\sum_{n=1}^{\infty} \exp\left\{ [r_n(d+2)+1] \log\left( \frac{12 e [r_n(d+2)+1] (\tfrac{1}{4} V_n)^2}{\epsilon (\tfrac{1}{4} V_n - 1)} \right) \right\} \cdot e^{-\frac{2n\epsilon^2}{9B^2}} < \infty,
\tag{3.12}
\]
and (3.5) follows from the Borel-Cantelli lemma.

Since we have proven the uniform law of large numbers, we use it to show the consistency of the ENN estimator. We rewrite the population loss criterion function as
\[
\bar{Q}_n(f) = \frac{1}{n} \sum_{i=1}^{n} E\Big[ \big| \tau - 1_{\{y_i < f(x_i)\}} \big| \big(y_i - f(x_i)\big)^2 \Big].
\]
The following general result gives sufficient conditions for the consistency of sieve estimators.

Theorem 3.4.2. Suppose that for every $\epsilon > 0$,
\[
P\Big( \omega : \sup_{\theta \in \Theta_n} |Q_n(\omega, \theta) - \bar{Q}(\theta)| > \epsilon \Big) \to 0 \quad \text{as } n \to \infty,
\qquad \text{and} \qquad
\inf_{\theta \in \eta^c(\theta_0, \epsilon)} \bar{Q}(\theta) - \bar{Q}(\theta_0) > 0,
\]
where $\eta^c(\theta_0, \epsilon)$ denotes the complement of the $\epsilon$-neighborhood of $\theta_0$. If $\{\Theta_n\}$ is an increasing sequence and $\cup_n \Theta_n$ is dense in $\Theta$, then $\rho(\hat{\theta}_n, \theta_0) \xrightarrow{P} 0$.

Lemma 3.4.2. Suppose $\frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2$ for $\frac{1}{2} < \tau < 1$. Then
\[
\inf_{f : \|f - f_0\|_n \ge \epsilon} \bar{Q}_n(f) - \bar{Q}_n(f_0) > 0.
\]

Proof. Recall that
\[
\bar{Q}_n(f) = \frac{1}{n} \sum_{i=1}^{n} E\Big[ \big| \tau - 1_{\{y_i < f(x_i)\}} \big| \big(y_i - f(x_i)\big)^2 \Big].
\]
Expanding $\bar{Q}_n(f) - \bar{Q}_n(f_0)$ with $y_i = f_0(x_i) + \epsilon_i$ shows that, for $0 < \tau \le \frac{1}{2}$,
\[
\bar{Q}_n(f) - \bar{Q}_n(f_0) \ge \tau \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > 0.
\tag{3.18}
\]
If $\tau > \frac{1}{2}$,
\[
\begin{aligned}
\bar{Q}_n(f) - \bar{Q}_n(f_0)
&\ge \tau \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 + (1 - 2\tau) \frac{1}{n} \sum_{i=1}^{n} E\big[\epsilon_i^2\big] + (1 - 2\tau) \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 \\
&= (1 - \tau) \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 + (1 - 2\tau)\sigma^2 \\
&> 0, \quad \text{if } \frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2.
\end{aligned}
\tag{3.19}
\]
Therefore,
\[
\inf_{f : \|f - f_0\|_n \ge \epsilon} \bar{Q}_n(f) - \bar{Q}_n(f_0) > 0.
\tag{3.20}
\]

Since the conditions of Corollary 2.6 are satisfied, we have the consistency of the ENN sieve estimator.

Theorem 3.4.3. Under the notation given above, if $\frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2$ for $\frac{1}{2} < \tau < 1$ and $[r_n(d+2)+1] \log[r_n(d+2)+1] = o(n)$, then
\[
\|\hat{f}_n - f_0\|_n \xrightarrow{P} 0.
\]

Proof. By Theorem 3.4.2, Theorem 3.4.1, Lemma 3.4.2, and Lemma 3.3.2, we have $\|\hat{f}_n - f_0\|_n \xrightarrow{P} 0$.

3.5 Normality

We use the following theorem to prove the normality of the ENN sieve estimator [48].

Theorem 3.5.1. Suppose that $\mathcal{F}$ is a $P$-Donsker class of measurable functions and $\hat{f}_n$ is a sequence of random functions taking values in $\mathcal{F}$ such that
\[
\int \big( \hat{f}_n(x) - f_0(x) \big)^2 dP(x) \xrightarrow{P} 0
\]
for some $f_0 \in L_2(P)$. Then
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big( (\hat{f}_n - f_0)(X_i) - P(\hat{f}_n - f_0) \Big) \xrightarrow{P} 0,
\]
and
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(X_i) - P\hat{f}_n \big) \xrightarrow{d} N\big(0,\, P f_0^2 - (P f_0)^2\big).
\]

From Theorem 3.5.1, we need to check two conditions: $\mathcal{F}_{r_n}$ is a $P$-Donsker class, and $\int (\hat{f}_n(x) - f_0(x))^2 dP(x) \xrightarrow{P} 0$. Next, we give the definition of the Donsker class. In short, a class $\mathcal{F}$ of measurable functions $f$ is called Donsker if the sequence of empirical processes $\sqrt{n}(\mathbb{P}_n f - Pf)$ converges in distribution to a tight limit process. We now review the formal definition. Let $(\mathcal{X}, \mathcal{A}, P)$ be a probability space and let $G_P$ be a Gaussian process with zero mean and covariance $E[G_P(f) G_P(g)] = P(fg) - Pf \cdot Pg$.
We say that a class $\mathcal{F} \subset L_2(\mathcal{X}, \mathcal{A}, P)$ is a $G_P$BUC class if and only if the process $G_P(f, \omega)$ can be chosen so that for all $\omega$ the sample functions $f \mapsto G_P(f, \omega)$, $f \in \mathcal{F}$, are bounded and continuous for $\rho_P$.

Definition 3.5.1 (Donsker class). A class $\mathcal{F} \subset L_2(\mathcal{X}, \mathcal{A}, P)$ is called a Donsker class if and only if it is a $G_P$BUC class and there are processes $Y_j(f, \omega)$, $f \in \mathcal{F}$, $\omega \in \Omega$, where the $Y_j$ are independent copies of $G_P$ with $f \mapsto Y_j(f, \omega)$ bounded and $\rho_P$-uniformly continuous on $\mathcal{F}$ for each $j$, such that for every $\epsilon > 0$,
\[
P^*\Big\{ n^{-1/2} \max_{m \le n} \sup_{f \in \mathcal{F}} \Big| \sum_{j=1}^{m} \big( f(X_j) - Pf - Y_j(f) \big) \Big| > \epsilon \Big\} \to 0 \quad \text{as } n \to \infty.
\]

It is not convenient to check whether a class of functions is Donsker directly from the definition. A sufficient condition for a class to be Donsker is that it does not grow too fast, where the growth can be measured by the bracketing integral
\[
J_{[\,]}\big(\delta, \mathcal{F}, L_2(P)\big) = \int_0^{\delta} \sqrt{\log N_{[\,]}\big(\epsilon, \mathcal{F}, L_2(P)\big)}\, d\epsilon,
\]
where $N_{[\,]}(\epsilon, \mathcal{F}, L_2(P))$ is the bracketing number. If this integral is finite, then the class $\mathcal{F}$ is a Donsker class.

Theorem 3.5.2. $\mathcal{F}_{r_n}$ is a Donsker class.

Proof. By using the uniform covering number result for neural networks, we have
\[
N(\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty) \le \left( \frac{4 e [r_n(d+2)+1] (\tfrac{1}{4} V_n)^2}{\epsilon (\tfrac{1}{4} V_n - 1)} \right)^{r_n(d+2)+1}.
\]
By using the relationship between the packing number and the covering number, for small enough $\epsilon$ we have
\[
\begin{aligned}
\log N_{[\,]}\big(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big)
&\le \log\Big( 2 N\big(\epsilon/2, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big) \Big)
\le 2 \log N\big(\epsilon/2, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big) \\
&\le 2 [r_n(d+2)+1] \Big( \log \tilde{A}_{r_n, V_n, d} + \log \frac{1}{\epsilon} \Big),
\end{aligned}
\]
where $\tilde{A}_{r_n, V_n, d} = \frac{2 e V_n^2 [r_n(d+2)+1]}{V_n - 4}$. Letting
\[
A_{r_n, V_n, d} = [r_n(d+2)+1] \log \tilde{A}_{r_n, V_n, d} - [r_n(d+2)+1]
= [r_n(d+2)+1] \Big( \log \frac{2 e V_n^2 [r_n(d+2)+1]}{V_n - 4} - 1 \Big)
= [r_n(d+2)+1] \log \frac{2 V_n^2 [r_n(d+2)+1]}{V_n - 4},
\]
and using $V_n^2 - e V_n + 4e \ge 0$ for all $V_n$, we have
\[
\log \frac{2 V_n^2 [r_n(d+2)+1]}{V_n - 4} \ge \log \frac{V_n^2}{V_n - 4} \ge \log \frac{e(V_n - 4)}{V_n - 4} = 1.
\]
Then
\[
\begin{aligned}
\log N_{[\,]}\big(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big)
&\le 2 [r_n(d+2)+1] \Big( \log \tilde{A}_{r_n, V_n, d} + \log \frac{1}{\epsilon} \Big) \\
&\le 2 \tilde{A}_{r_n, V_n, d} + 2 [r_n(d+2)+1] \Big( \frac{1}{\epsilon} + 1 \Big) \quad (\text{since } \log x \le x - 1 \text{ for all } x > 0) \\
&\le 2 \tilde{A}_{r_n, V_n, d} \Big( 1 + \frac{1}{\epsilon} \Big).
\end{aligned}
\]
Since $\mathcal{F}_{r_n}$ is uniformly bounded by $V_n$, it is clear that $N_{[\,]}(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty) = 1$ for all $\epsilon \ge V_n$. Therefore, for each fixed $n$, we have
\[
\int_0^{\infty} \Big( \log N_{[\,]}\big(2\epsilon, \mathcal{F}_{r_n}, \|\cdot\|_\infty\big) \Big)^{1/2} d\epsilon
\lesssim \int_0^{V_n} \Big( 1 + \frac{1}{\epsilon} \Big)^{1/2} d\epsilon < \infty.
\]
Then $\mathcal{F}_{r_n}$ is a Donsker class. More details about the Donsker class can be found in van der Vaart and Wellner [49].

It is easy to check that the sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$ is a squashing function, since it is nondecreasing with $\lim_{x \to \infty} \sigma(x) = 1$ and $\lim_{x \to -\infty} \sigma(x) = 0$. We use Theorem 3.5.3 to verify that $\int (\hat{f}_n(x) - f_0(x))^2 dP(x) \xrightarrow{P} 0$.

Theorem 3.5.3. Let $\sigma$ be a squashing function. For each probability measure $\mu$ on $\mathbb{R}^d$, each measurable $f : \mathbb{R}^d \to \mathbb{R}$ with $\int |f(x)|^2 \mu(dx) < \infty$, and each $\epsilon > 0$, there exists a neural network $h(x)$ in
\[
\Big\{ h(x) = \sum_{i=1}^{k} c_i \sigma\big(a_i^T x + b_i\big) + c_0 : k \in \mathbb{N},\ a_i \in \mathbb{R}^d,\ b_i, c_i \in \mathbb{R} \Big\}
\]
such that
\[
\int |f(x) - h(x)|^2 \mu(dx) < \epsilon.
\]

Next, we establish the asymptotic normality of ENN. We assume that $f_0 \in \mathcal{F}$, where $\mathcal{F}$ is the class of continuous functions with compact support, and $f_0$ is the function to be estimated.

Theorem 3.5.4. Suppose $\hat{f}_n(x) \in \mathcal{F}$ is a sequence of random functions and $\int |f_0(x)|^2 dP(x) < \infty$. If the conditions for consistency hold, then
\[
\int \big( \hat{f}_n(x) - f_0(x) \big)^2 dP(x) \xrightarrow{P} 0,
\]
and for $f_0 \in L_2(P)$,
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big( (\hat{f}_n - f_0)(X_i) - P(\hat{f}_n - f_0) \Big) \xrightarrow{P} 0,
\qquad
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(X_i) - P\hat{f}_n \big) \xrightarrow{d} N\big(0,\, P f_0^2 - (P f_0)^2\big).
\]

Proof. Taking $\pi_{r_n} f_0 \in \mathcal{F}_{r_n}$, we have
\[
\|\hat{f}_n - f_0\|_2 \le \|\hat{f}_n - \pi_{r_n} f_0\|_2 + \|\pi_{r_n} f_0 - f_0\|_2.
\tag{3.21}
\]
Using the consistency result for ENN, we have $\|\hat{f}_n - \pi_{r_n} f_0\|_2 \xrightarrow{P} 0$. From Theorem 3.5.3, $\|\pi_{r_n} f_0 - f_0\|_2 < \epsilon$. Therefore,
\[
\int \big( \hat{f}_n(x) - f_0(x) \big)^2 dP(x) \xrightarrow{P} 0.
\]
Based on Theorem 3.5.1, we obtain the normality result:
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big( (\hat{f}_n - f_0)(X_i) - P(\hat{f}_n - f_0) \Big) \xrightarrow{P} 0,
\]
and
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(X_i) - P\hat{f}_n \big) \xrightarrow{d} N\big(0,\, P f_0^2 - (P f_0)^2\big).
\]

3.6 Simulation

To validate the theoretical properties of ENN, we ran simulations on the consistency and normality of ENN. We obtained the ENN estimator by using gradient-based optimization algorithms (e.g., the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm). The response was simulated through the following equation:
\[
y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \dots, n,
\tag{3.22}
\]
where $x_1, \dots, x_n \overset{i.i.d.}{\sim} N(0, 1)$ and $\epsilon_1, \dots, \epsilon_n \overset{i.i.d.}{\sim} N(0, 0.1^2)$. For the true function $f_0$, we consider three different nonlinear functions:

1. a neural network with a single hidden layer and two hidden units,
2. a polynomial function: $f_0(x) = x^3 + 1$,
3. a complex nonlinear function: $f_0(x) = \sin(x) + 2\exp(-16x^2)$.

3.6.1 Consistency

In this section, we used simulations to check the validity of the consistency result in Section 3.4. Since $\tau$ lies between 0 and 1, and ENN with $0.5 < \tau < 1$ requires one more condition than ENN with $0 < \tau \le 0.5$, we mainly considered ENN with $\tau = 0.5$ and $\tau = 0.75$. For ENN with $\tau = 0.75$, we made $\sigma^2$ smaller (e.g., $\sigma^2 = 0.01$) to satisfy the condition $\frac{1}{n} \sum_{i=1}^{n} (f_0(x_i) - f(x_i))^2 > \frac{2\tau - 1}{1 - \tau} \sigma^2$ for $\frac{1}{2} < \tau < 1$.

3.6.1.1 Simulation results of consistency with τ = 0.5

We chose five different sample sizes: 50, 100, 200, 500, and 1000. From Figure 3.1 to Figure 3.3, the fitted curve gets closer to the true function as the sample size increases.

Figure 3.1: Comparison between the true function $f_0$ and fitted functions under different sample sizes, where $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.5$.

Figure 3.2: Comparison between the true function $f_0 = x^3 + 1$ and fitted functions under different sample sizes with $\tau = 0.5$.

Figure 3.3: Comparison between the true function $f_0 = \sin(x) + 2\exp(-16x^2)$ and fitted functions under different sample sizes with $\tau = 0.5$.

3.6.1.2 Simulation results of consistency with τ = 0.75

We also chose five different sample sizes: 50, 100, 200, 500, and 1000. From Figure 3.4 to Figure 3.6, the fitted curve gets closer to the true function as the sample size increases. Overall, the simulation results are consistent with the theoretical findings.

Figure 3.4: Comparison between the true function $f_0$ and fitted functions under different sample sizes, where $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.75$.

Figure 3.5: Comparison between the true function $f_0 = x^3 + 1$ and fitted functions under different sample sizes with $\tau = 0.75$.

Figure 3.6: Comparison between the true function $f_0 = \sin(x) + 2\exp(-16x^2)$ and fitted functions under different sample sizes with $\tau = 0.75$.

3.6.2 Normality

In this section, we demonstrate the asymptotic normality derived in Theorem 3.5.4. The same true functions were used, but the random errors were sampled from the standard normal distribution. We used
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(x_i) - f_0(x_i) \big)
\tag{3.23}
\]
as the test statistic to draw the Q-Q plots. We varied the sample size (i.e., 50, 100, 200, 300, 400, and 500) when evaluating the three nonlinear functions.
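The following Python sketch mirrors the structure of these experiments; it is a minimal illustration only, and the number of hidden units, the random restarts, and the use of scipy's BFGS routine are assumptions rather than the exact implementation behind the figures. It simulates data from model (3.22) with the third true function, fits a small ENN by minimizing the asymmetric squared loss, and computes the statistic in (3.23).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enn_predict(theta, x, r):
    # theta packs [alpha_0, alpha_1..r, gamma_{0,1}..gamma_{0,r}, gamma_1..r] for scalar input x
    alpha0, alpha = theta[0], theta[1:r + 1]
    gamma0, gamma = theta[r + 1:2 * r + 1], theta[2 * r + 1:]
    return alpha0 + sigmoid(np.outer(x, gamma) + gamma0) @ alpha

def expectile_loss(theta, x, y, r, tau):
    resid = y - enn_predict(theta, x, r)
    w = np.where(resid < 0, 1.0 - tau, tau)     # |tau - 1{y < f(x)}|
    return np.mean(w * resid ** 2)

def fit_enn(x, y, r=3, tau=0.75, restarts=5):
    best = None
    for _ in range(restarts):                   # BFGS from several random starts
        theta0 = rng.normal(scale=0.5, size=3 * r + 1)
        res = minimize(expectile_loss, theta0, args=(x, y, r, tau), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best.x

f0 = lambda x: np.sin(x) + 2.0 * np.exp(-16.0 * x ** 2)     # third true function
for n in (50, 200, 1000):
    x = rng.normal(size=n)
    y = f0(x) + rng.normal(scale=0.1, size=n)               # model (3.22)
    theta_hat = fit_enn(x, y, tau=0.75)
    err = np.mean((enn_predict(theta_hat, x, 3) - f0(x)) ** 2)
    stat = np.sum(enn_predict(theta_hat, x, 3) - f0(x)) / np.sqrt(n)   # statistic (3.23)
    print(n, round(err, 4), round(stat, 4))

Repeating the last loop over many Monte Carlo replicates (with standard normal errors, as in the normality study) and passing the collected statistics to scipy.stats.probplot would produce Q-Q plots analogous to Figures 3.7-3.12.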
Figure 3.7: Q-Q plot with different sample sizes, where the true function $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.5$.

3.6.2.1 Simulation results of normality with τ = 0.5

From Figure 3.7 to Figure 3.9, the data points fall roughly along a straight line; that is, the test statistic
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( \hat{f}_n(x_i) - f_0(x_i) \big)
\tag{3.24}
\]
fits the normal distribution well.

Figure 3.8: Q-Q plot with different sample sizes, where the true function is $f_0 = x^3 + 1$ with $\tau = 0.5$.

Figure 3.9: Q-Q plot with different sample sizes, where the true function is $f_0 = \sin(x) + 2\exp(-16x^2)$ with $\tau = 0.5$.

3.6.2.2 Simulation results of normality with τ = 0.75

From Figure 3.10 to Figure 3.12, we used the same test statistic and the data points again fall roughly along a straight line. Based on the simulation results, we demonstrated the validity of the normality of ENN.

Figure 3.10: Q-Q plot with different sample sizes, where the true function $f_0$ is a neural network with a single hidden layer and two hidden units, with $\tau = 0.75$.

Figure 3.11: Q-Q plot with different sample sizes, where the true function is $f_0 = x^3 + 1$ with $\tau = 0.75$.

Figure 3.12: Q-Q plot with different sample sizes, where the true function is $f_0 = \sin(x) + 2\exp(-16x^2)$ with $\tau = 0.75$.

3.7 Summary and discussion

In this chapter, we study the consistency and normality of ENN sieve estimators with one hidden layer. To overcome the issue of unidentifiability, we use the method of sieves to narrow down the choice of parameter space. The covering number is used to control the complexity of the rich class $\mathcal{F}_{r_n}$; by establishing an upper bound for the covering number of $\mathcal{F}_{r_n}$, we prove the consistency and normality of ENN. To check the validity of the theoretical results, we also ran simulations based on the conditions of the theorems. If we choose $\tau = 0.5$, then ENN becomes the traditional neural network. The ENN method inherits advantages from both neural networks and expectile regression. Using the hierarchical structure of neural networks, ENN can learn complex and abstract features from covariates, making it suitable for modeling the complex relationship between covariates and the response by tuning the hyperparameter $\tau$.

Although we focus on one-hidden-layer neural network sieve estimators with the sigmoid activation function in this chapter, the results can be extended to other neural networks and activation functions. For instance, they can potentially be extended to other popular activation functions (e.g., the rectified linear unit). Deep neural network structures are commonly used in convolutional neural networks and recurrent neural networks; therefore, it is worthwhile to investigate the asymptotic theory of different neural network architectures.

It may also be worthwhile to take the regularization of neural networks into consideration. Since the number of parameters in deep neural networks is large, overfitting is common in practice. Dropout is a regularization approach for neural networks that randomly drops hidden units during training [53]. In statistics, it is also common to add a penalty term to the loss to avoid overfitting. Establishing the asymptotic theory of neural networks with regularization is crucial when we apply neural networks to real data analysis. We will consider this problem in the future.
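As a small illustration of the penalty-based regularization mentioned above, one could add an $\ell_2$ penalty on all network weights to the ENN objective. The sketch below is a hypothetical variant of ours; the penalty form and the tuning constant lam are assumptions, and no asymptotic theory is claimed for it here.

import numpy as np

def penalized_expectile_loss(theta, x, y, r, tau, lam, predict):
    # Asymmetric squared (expectile) loss plus an l2 penalty on all parameters.
    # `predict(theta, x, r)` can be any one-hidden-layer forward pass, such as the
    # enn_predict function sketched in Section 3.6; `lam` is a hypothetical tuning constant.
    resid = y - predict(theta, x, r)
    w = np.where(resid < 0, 1.0 - tau, tau)
    return np.mean(w * resid ** 2) + lam * np.sum(theta ** 2)

In practice lam would be chosen on a validation set, and extending the sieve theory to such penalized estimators is part of the future work described above.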
Chapter 4

Summary and Discussion

This dissertation focuses mainly on developing a neural-network-based method, ENN, with applications to risk prediction with genetic data. We also study the statistical properties of ENN, including consistency and normality.

In Chapter 2, we develop a neural-network-based method called ENN. To demonstrate the performance of ENN, we run three different simulation settings: nonlinear effects, interactions among SNPs, and interactions between genes. If there are nonlinear or high-order interaction effects in genetic data, ENN outperforms ER. To model more complex relationships between genotypes and phenotypes, we change the architecture of ENN to a non-fully-connected architecture. By tuning the hyperparameter $\tau$, ENN can provide a comprehensive view of the genotype-phenotype relationship at different expectile levels. Different expectile levels could also help us identify high-risk individuals for certain diseases, especially at the low expectile level ($\tau = 0.1$) and the high expectile level ($\tau = 0.9$).

Through two real data applications, we also demonstrate that ENN outperforms ER when the underlying genotype-phenotype relationships become complicated. Genetic effects vary across expectiles, which provides us more information about the genotype-phenotype relationship via the conditional distributions. Studying different expectile levels may also help us predict high-risk individuals, since genetic variations can have large effects on a particular disease.

In Chapter 3, we study the consistency and normality of ENN sieve estimators with one hidden layer. We consider neural networks in a nonparametric regression framework to avoid the issue of unidentifiability. The method of sieves is used to narrow down the choice of parameter space. To measure the complexity of neural networks, we use the covering number. By establishing an upper bound for the covering number of $\mathcal{F}_{r_n}$, we first prove the uniform law of large numbers for ENN. Under some regularity conditions, we also prove the consistency and normality of ENN. Simulations have been conducted to test the validity of the theoretical results.

Most complex diseases are not only explained by genetic effects but can also be influenced by environmental determinants, which can be physical, chemical, or biological factors, behavior patterns, or life events. A small difference in one person's genes can cause them to respond differently from another person to the same environmental exposure. As a result, some people may develop the disease after being exposed to the environment while others may not. Therefore, it is worthwhile to take environmental determinants into consideration. In the future, we could apply ENN to study a disease with a potential gene-environment interaction component. By doing this, we could gain a better understanding of the disease and increase prediction accuracy.

Many researchers focus on improving the prediction accuracy of neural network estimators, while statistical inference based on neural network estimators has not been fully studied. Building on the asymptotic properties of ENN established here, it is worthwhile to investigate statistical inference for ENN. We could also incorporate other machine learning techniques into ENN. For many genetic datasets, Caucasian samples are larger than African samples. Due to the limited African samples, we could first train ENN on Caucasian samples and obtain the estimator, which can then be used to improve prediction accuracy in African samples.
We also apply ENN to real data with transfer learning, which is described in the appendix. By using this technique, we could improve the performance of ENN.

APPENDICES

Appendix A

Technical Details of Chapter 2

Proof of Theorem 2.3.1

Theorem A.0.1. Let $L_\tau : Y \times \mathbb{R} \to [0, \infty)$ be the asymmetric least squares loss function and $Q$ be a distribution on $Y = [-M, M]$. Then the inner $L_\tau$-risk of $Q$ can be defined as
\[
C_{L_\tau, Q}(t) = \int_Y L_\tau(y, t)\, dQ(y), \quad t = f(x_i) \in \mathbb{R},
\]
and the minimal inner $L_\tau$-risk is $C^*_{L_\tau, Q} = \inf_{t \in \mathbb{R}} C_{L_\tau, Q}(t)$.

Lemma A.0.1. Let $L_\tau$ be the asymmetric least squares loss function and $Q$ be a distribution on $\mathbb{R}$ with $C^*_{L_\tau, Q} < \infty$. For a fixed $\tau \in (0, 1)$ and for all $t \in \mathbb{R}$, we have
\[
c_\tau (t - t^*)^2 \le C_{L_\tau, Q}(t) - C^*_{L_\tau, Q} \le C_\tau (t - t^*)^2,
\]
where $c_\tau = \min\{\tau, 1 - \tau\}$, $C_\tau = \max\{\tau, 1 - \tau\}$, and $t^*$ is the $\tau$-expectile.

Proof. Let us fix $\tau \in (0, 1)$. We use the result obtained in Newey and Powell [16]. For a distribution $Q$ on $\mathbb{R}$ satisfying $C^*_{L_\tau, Q} < \infty$, the $\tau$-expectile $t^*$ is the only solution of
\[
\tau \int_{y \ge t^*} (y - t^*)\, dQ(y) = (1 - \tau) \int_{y < t^*} (t^* - y)\, dQ(y).
\tag{A.1}
\]

Appendix B

Technical Details of Chapter 3

Proof of Lemma 3.3.2 (concluding step). For any $\epsilon > 0$, by choosing $\delta = \epsilon / \big( 4 (1 \vee \|x\|_\infty)\, r_n (d+1) \big)$, we observe that when $\|\theta_{1,n} - \theta_{2,n}\|_2 < \delta$, we have $\|H(\theta_{1,n}) - H(\theta_{2,n})\|_n < \epsilon$, which implies that $H$ is a continuous map and hence $\mathcal{F}_{r_n}$ is a compact set for each fixed $n$.

Appendix C

Supplementary Materials

Expectile neural networks with transfer learning

Normally, machine learning models focus on a single, specific task. If we have two related tasks, one task can inherit some information from the other; we call this technique transfer learning. Transfer learning focuses on storing the knowledge gained by solving one problem and applying that knowledge to a different but related problem. It is easier to transfer knowledge if the tasks are closely related. Transfer learning has been applied in a wide range of areas, such as natural language processing (NLP) [69] and medical imaging [66].

Transfer learning can be applied in both classification and regression scenarios. For example, Salaken et al. propose seeded transfer learning in a regression context to improve prediction performance in the target domain [71]. Many approaches can be used to implement transfer learning. Yosinski et al. show how lower layers in neural networks act as conventional computer-vision feature extractors, such as edge detectors, while the final layer works toward task-specific features [65]. Rosenstein et al. use a naive Bayes classification algorithm to detect, perhaps implicitly, when the inductive bias learned from auxiliary tasks will actually hurt performance on the target task [68].

In this appendix, we focus on applying the transfer learning technique to expectile neural networks. We focus on parameter transfer or instance reweighting. This approach works on the assumption that the models for related tasks share some parameters. There are several advantages to doing this. First, if the initial task and the target task are related, we can improve our results. Second, since we inherit information from the initial task, the number of parameters to be estimated for the target task is reduced, which gives us computational advantages, especially for large datasets.

Real data application

In this section, we integrate expectile neural networks and transfer learning to improve prediction performance. To verify whether transfer learning works, we use two real data sets to compare the performance of ENN with transfer learning and ENN without transfer learning.
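To fix ideas before the applications, the following Python sketch implements the parameter-transfer idea in its simplest form: fit an ENN on a source phenotype, keep the input-to-hidden weights, and refit only the output layer on the target phenotype. Everything here is an illustrative assumption (toy genotype data, network size, scipy's BFGS optimizer); the actual steps used with the SAGE and ADNI data are described in the applications below.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden(X, W, b):
    return sigmoid(X @ W + b)                  # shared input-to-hidden layer

def expectile_loss(resid, tau):
    w = np.where(resid < 0, 1.0 - tau, tau)
    return np.mean(w * resid ** 2)

def fit_full(X, y, r, tau):
    # fit all weights on the source task
    d = X.shape[1]
    def unpack(theta):
        W = theta[:d * r].reshape(d, r)
        b = theta[d * r:d * r + r]
        beta = theta[d * r + r:]               # beta[0] is the output bias
        return W, b, beta
    def obj(theta):
        W, b, beta = unpack(theta)
        return expectile_loss(y - (beta[0] + hidden(X, W, b) @ beta[1:]), tau)
    theta0 = rng.normal(scale=0.1, size=d * r + r + r + 1)
    return unpack(minimize(obj, theta0, method="BFGS").x)

def fit_output_layer(H, y, tau):
    # refit only the hidden-to-output weights on the target task
    def obj(beta):
        return expectile_loss(y - (beta[0] + H @ beta[1:]), tau)
    return minimize(obj, rng.normal(scale=0.1, size=H.shape[1] + 1), method="BFGS").x

n, d, r, tau = 300, 10, 4, 0.5
X = rng.binomial(2, 0.3, size=(n, d)).astype(float)        # toy SNP genotypes (0/1/2)
y_source = X @ rng.normal(size=d) + rng.normal(size=n)     # toy source phenotype (e.g., smoking)
y_target = 0.8 * y_source + rng.normal(size=n)             # toy related target phenotype (e.g., drinking)
W, b, _ = fit_full(X, y_source, r, tau)
beta_target = fit_output_layer(hidden(X, W, b), y_target, tau)

Using the source-task estimates only as initial values, as in the algorithm described next, is a gentler alternative to freezing the first layer outright.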
First real data application

Intuitively, participants in this study who have nicotine addiction also tend to be addicted to drinking. We applied ENN to the genetic data from the Study of Addiction: Genetics and Environment (SAGE). The participants of SAGE were selected from three large, complementary studies: the Family Study of Cocaine Dependence (FSCD), the Collaborative Study on the Genetics of Alcoholism (COGA), and the Collaborative Genetic Study of Nicotine Dependence (COGEND).

We choose max cigs as the smoking quantity, measured by the largest number of cigarettes smoked in 24 hours (range 0-240). We choose max drinks as the drinking quantity, measured by the largest number of alcoholic drinks consumed in 24 hours (range 0-258). To obtain better performance, we transfer smoking-related information to drinking-related information, using the following algorithm. First, we choose max cigs as the phenotype and obtain the estimator of the expectile neural network. Second, we use the estimator obtained from the first step as the initial value (the transfer learning part). Third, we choose max drinks as the new phenotype, keep the parameters from the input layer to the hidden layer, and then train the expectile neural network again. Finally, we compare two models: ENN with transfer learning and ENN without transfer learning. We divide the data into three parts: training (60%), validation (20%), and testing (20%). We obtain the following results.

Table C.1: Real data application result of CHRNA5

              ENN.tsf               ENN
  τ       Train      Test       Train      Test
  0.1     551.83     605.79     546.90     672.44
  0.25    325.84     439.18     321.94     473.10
  0.5     282.57     433.058    275.83     444.16
  0.75    304.81     484.60     297.81     487.44
  0.9     347.17     544.24     339.79     549.08

Table C.2: Real data application result of CHRNA3

              ENN.tsf               ENN
  τ       Train      Test       Train      Test
  0.1     554.11     605.10     533.04     753.96
  0.25    325.71     441.47     311.85     517.05
  0.5     281.20     439.40     260.45     491.62
  0.75    304.60     486.80     292.63     502.86
  0.9     350.01     558.95     335.89     573.92

Table C.3: Real data application result of CHRNB4

              ENN.tsf               ENN
  τ       Train      Test       Train      Test
  0.1     558.39     622.18     564.57     673.97
  0.25    327.63     448.63     325.48     473.34
  0.5     283.28     435.11     270.50     453.76
  0.75    306.05     488.15     303.02     489.71
  0.9     349.24     544.85     343.52     553.41

Tables C.1-C.3 summarize the MSE of ENN with transfer learning (ENN.tsf) and ENN without transfer learning for five different expectiles (i.e., 0.1, 0.25, 0.5, 0.75, and 0.9). From these three tables, we see that the expectile neural networks with transfer learning outperform the expectile neural networks without transfer learning.

Second real data application

In this real data application, we apply our method to the Alzheimer's Disease Neuroimaging Initiative (ADNI), which is a multisite study that aims to improve clinical trials for the prevention and treatment of Alzheimer's disease. The APOE allele is the most important genetic risk factor for Alzheimer's disease [67]. We focus our ENN model on the APOE gene. After quality control, 168 SNPs remained for the analysis. We only included 699 Caucasian and African American individuals due to the small sample size of other ethnic groups. To improve the performance of ENN, we also included 3 covariates in the analysis: sex (male = 1, female = 2), age, and education. The hippocampus is the part of the brain associated with memory. Alzheimer's disease usually damages the hippocampus first, leading to memory loss and disorientation. A study shows that hippocampal volume and ratio were reduced by 25% in Alzheimer's disease [72].
The Mini-Mental State Examination (MMSE) is a 30-point questionnaire that is used extensively in clinical and research settings to measure cognitive impairment. For more information, refer to https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?id=phd001525.1. We transfer Hippocampus bl to MMSE. To obtain stable performance, we randomly split the dataset 50 times and average the results. From Table C.4, the expectile neural network with transfer learning outperforms the expectile neural network without transfer learning under different values of τ.

Table C.4: Real data application result of ADNI

              ENN.tsf            ENN
  τ       Train     Test     Train     Test
  0.1     8.21      8.40     8.65      9.50
  0.25    5.00      5.17     5.30      6.78
  0.5     4.11      4.31     4.30      4.82
  0.75    4.67      4.88     4.85      6.86
  0.9     5.87      6.10     5.99      6.69

Summary and discussion

From these two real data applications, transfer learning improves the performance of expectile neural networks. However, in our experience, transfer learning relies heavily on the data. If the data do not fit the model, negative transfer happens, where the transfer of knowledge from the source to the target does not lead to any improvement but rather causes a drop in the overall performance on the target task.

BIBLIOGRAPHY

[1] Genome-Wide Association Studies. National Human Genome Research Institute. https://www.genome.gov/about-genomics/fact-sheets/Genome-Wide-Association-Studies-Fact-Sheet.

[2] Manolio TA. Genome wide association studies and assessment of the risk of disease. The New England Journal of Medicine, 363(2):166-76, 2010.

[3] Kwon JM, Goate AM. The candidate gene approach. Alcohol Research & Health, 24(3):164-8, 2000.

[4] Xuexia Wang, Michael J Oldani, Xingwang Zhao, Xiaohui Huang, Dajun Qian. A Review of Cancer Risk Prediction Models with Genetic Variants. Cancer Inform, 13(2):19-28, 2014.

[5] The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.

[6] Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, Prokunina-Olsson L, Ding CJ, Swift AJ, Narisu N, Hu T, Pruim R, Xiao R, Li XY, Conneely KN, Riebow NL, Sprau AG, Tong M, White PP, Hetrick KN, Barnhart MW, Bark CW, Goldstein JL, Watkins L, Xiang F, Saramies J, Buchanan TA, Watanabe RM, Valle TT, Kinnunen L, Abecasis GR, Pugh EW, Doheny KF, Bergman RN, Tuomilehto J, Collins FS, Boehnke M. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, 316(5829):1341-5, 2007.

[7] Nan M. Laird, Christoph Lange. The Fundamentals of Modern Statistical Genetics. Springer-Verlag, 2011.

[8] Miller DD, Brown EW. Artificial Intelligence in Medical Practice: The Question to the Answer? Am J Med, 131(2):129-133, 2018.

[9] Fakoor R, Ladhak F, Nazi A, Huber M. Using deep learning to enhance cancer diagnosis and classification. Proceedings of the 30th International Conference on Machine Learning, 2013.

[10] Goodfellow I, Bengio Y, Courville A. Deep Learning. The MIT Press, 96-161, 2016.

[11] Le Cun Y, Bengio Y, Hinton G. Deep learning. Nature, 521:436-444, 2015.

[12] Krittanawong C, Zhang H, Wang Z, Aydar M, Kitai T. Artificial Intelligence in Precision Cardiovascular Medicine. J Am Coll Cardiol, 69(21):2657-2664, 2017.

[13] McClellan J, King MC. Genetic heterogeneity in human disease. Cell, 141(2):210-7, 2010.

[14] Marchini J, Donnelly P, Cardon LR.
Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet, 37(4):413-7, 2005. [15] R. Koenker, G.W. Bassett Jr. Regression quantiles. Econometrica, 46(1):33-50, 1978. [16] W. Newey, J. Powell. Asymmetric least squares estimation and testing. Econometrica, 55(4):819-847, 1987. [17] Moshe Buchinsky. Quantile regression, Box-Cox transformation model, and the U.S. wage structure, 1963–1987. Journal of Econometrics, 65(1):109-154, 1995. [18] John Crowley, Marie Hu. Covariance Analysis of Heart Transplant Survival Data. Jour- nal of the American Statistical Association, 72-357, 1977. [19] Stuart R. Lipsitz Garrett M. Fitzmaurice Geert Molenberghs Lue Ping Zhao. Quantile Regression Methods for Longitudinal Data with Drop-outs: Application to CD4 Cell Counts of Patients Infected with the Human Immunodeficiency Virus. Jornal of the Royal Statistical Society: Applied Statistics Series C, 46(4):463-476, 1997. [20] G.R.PandeyaV, T.V.Nguyenb. A comparative study of regression based methods in regional flood frequency analysis. Journal of Hydrology, 225:92-101, 1999. [21] H. J. Cordell. Detecting gene-gene interactions that underlie human diseases, Nat. Rev. Genet. 10:392–404, 2009. [22] A. Cannon. Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. A.J. Stoch Environ Res Risk Assess, 32:3207, 2018. [23] A. Cannon. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Computers & Geosciences, 37:1277-1284, 2011. [24] J. Taylor. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19:299-311, 2000. [25] C. Jiang, M. Jiang, Q. Xu, X. Huang. Expectile regression neural network model with applications. Neurocomputing, 247:73-86, 2017. [26] L. Liao, C. Park, H. Choi. Penalized expectile regression: an alternative to penalized quantile regression. Ann Inst Stat, 71:409–438, 2018. 85 [27] L. Waltrup, F. Sobotka, T. Kneib, G. Kauermann. Expectile and quantile regression- David and Goliath? Statistical Modelling, 15(5): 433–456, 2015. [28] M. Kim, S. Lee. Nonlinear expectile regression with application to Value-at-Risk and expected shortfall estimation. Computational Statistics and Data Analysis, 94:1-19, 2016. [29] Q. Yao, H. Tong. Asymmetric least squares regression estimation: a nonparametric approach. Journal of Nonparametric Statistics, 6:2-3, 1996. [30] Durbin, R., Altshuler, D., Durbin, R. et al. A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073, 2010. [31] Li MD, Xu Q, Lou XY, Payne TJ, Niu T, Ma JZ. Association and interaction anal- ysis of variants in CHRNA5/CHRNA3/CHRNB4 gene cluster with nicotine depen- dence in African and European Americans. Am J Med Genet B Neuropsychiatr Genet, 153B(3):745–756, 2010. [32] M. Farooq, I. Steinwart. Learning rate for kernel-based expectile regression. Mach Learn- ing, 108: 203–227, 2019. [33] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 251-257, 1991. [34] Fletcher, Roger, Practical methods of optimization(2nd ed.), New York: John Wiley & Sons, 1987. [35] Heather J. Cordell, Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet, 10(6):392–404, 2009. [36] Mackay, T.F. Quantitative trait loci in Drosophila. Nat. Rev. Genet, 2:11–20, 2001. [37] Routman EJ, Cheverud JM. 
Gene effects on a quantitative trait: Two-locus epistatic effects measured at microsatellite markers and at estimated QTL. Evolution, 51: 1654–1662, 1997. [38] Zerba, K.E., Ferrell, R.E. & Sing, C.F. Complex adaptive systems and human health: the influence of common genotypes of the apolipoprotein E (ApoE) gene polymorphism and age on the relational order within a field of lipid metabolism traits. Hum. Genet, 107: 466–475, 2000. [39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhut- dinovDropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15: 1929-1958, 2014. 86 [40] Chengxi Ye, Yezhou Yang, Cornelia Fermuller, Yiannis Aloimonos. On the Importance of Consistency in Training Deep Neural Networks, arXiv:1708.00631, 2017. [41] Anthony, M. and Bartlett, P.L., Neural network learning: Theoretical foundations, Cambridge university press, 2009. [42] X Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics, 2007. [43] Kurt Hornik, Maxwell Stinchcombe, Halbert White. Multilayer feedforward networks are universal approximators. Neural newtorks, 2(5):359-366, 1989. [44] László Györfi. A Distribution-Free Theory of Nonparametric Regression. Springer New York, 2006. [45] Jinghang Lin, Xiaoran Tong, Chenxi Li, Qing Lu. Expectile Neural Networks for Genetic Data Analysis of Complex Diseases, arXiv:2010.13898, 2020. [46] Grenander. Abstract Inference. Wily, New York, 1981. [47] White, H. and Wooldridge, J. Some results on sieve estimation with dependent obser- vations. In Nonparametric and Semiparametric Methods in Economics (W. A. Barnett, J. Powell and G. Tauchen, eds.) 459-493. Cambridge University Press New York. 1991. [48] Van der Vaart. Asymptotic Statistics, Cambridge University Press, 1998. [49] Van der Vaart, Jon A. Wellner. Weak convergence and empirical processes. Springer, 1996. [50] Van de Geer. Empirical Processes in M-estimation. Cambridge university press, 2020. [51] Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, Qing Lu. Asymptotic Properties of Neural Network Sieve Estimators, arXiv:1906.00875, 2019. [52] Xiaotong Shen, On Methods of sieves and penalization. The Annals of Statistics, 25(6):2555-2591, 1997. [53] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in neural information processing sys- tems, 2012. [54] Koenker, Roger. Quantile regression. Cambridge University Press, 2005. [55] Kenji Fukumizu. A regularity condition of the information matrix of a multilayer per- ceptron network. Neural networks, 9(5):871–879, 1996. 87 [56] Kenji Fukumizu et al. Likelihood ratio of unidentifiable models and multilayer neural networks. The Annals of Statistics, 31(3):833–851, 2003. [57] Hongtu Zhu and Heping Zhang. Asymptotics for estimation and testing procedures under loss of identifiability. Journal of Multivariate Analysis, 97(1):19–45, 2006. [58] Ergün Akgün, Metin Demir. Modeling Course Achievements of Elementary Education Teacher Candidates with Artificial Neural Networks. International Journal of Assess- ment Tools in Education, 2018. [59] T. Pham, T. Tran, D. Phung, S. Venkatesh. Predicting healthcare trajectories from medical records: a deep learning approach. J Biomed Inform, 69:218-229, 2017. [60] Plis, Sergey M. and Hjelm, Devon R. and Salakhutdinov, Ruslan and Allen, Elena A. and Bockholt, Henry J. and Long, Jeffrey D. and Johnson, Hans J. and Paulsen, Jane S. 
and Turner, Jessica A. and Calhoun, Vince D. Deep learning for neuroimaging: a validation study. Front. Neurosci, 229: 8, 2014. [61] Devroye, Luc and Györfi, László and Lugosi, Gábor. A probabilistic theory of pattern recognition,Springer Science & Business Media, 2013. [62] Martin Abadi and Paul Barham and Jianmin Chen and Zhifeng Chen and Andy Davis and Jeffrey Dean and Matthieu Devin and Sanjay Ghemawat and Geoffrey Irving and Michael Isard and Manjunath Kudlur and Josh Levenberg and Rajat Monga and Sherry Moore and Derek G. Murray and Benoit Steiner and Paul Tucker and Vijay Vasudevan and Pete Warden and Martin Wicke and Yuan Yu and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation, 2016. [63] Adam Paszke and S. Gross and Francisco Massa and A. Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Z. Lin and N. Gimelshein and L. Antiga and Alban Desmaison and Andreas Köpf and Edward Yang and Zach DeVito and Martin Raison and Alykhan Tejani and Sasank Chilamkurthy and Benoit Steiner and Lu Fang and Junjie Bai and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS, 2019. [64] Jindong Wang, Yiqiang Chen, Han Yu, Meiyu Huang, and Qiang Yang. Easy Trans- fer Learning By Exploiting Intra-domain Structures. arXiv preprint arXiv:1904.01376, 2019. [65] J Yosinski, J Clune, Y Bengio, H Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 3320-3328, 2014. [66] Amin Khatami, Morteza Babaie, H.R. Tizhoosh, Abbas Khosravi, Thanh Nguyen, Saeid Nahavandi. A sequential search-space shrinking using CNN transfer learning and a 88 radon projection pool for medical image retrieval. Expert Systems with Applications, 100:224–233, 2018. [67] Liu Y, Tan L, Wang HF, Liu Y, Hao XK, Tan CC, Jiang T, Liu B, Zhang DQ, Yu JT; Alzheimer’s Disease Neuroimaging Initiative. Multiple Effect of APOE Genotype on Clinical and Neuroimaging Biomarkers Across Alzheimer’s Disease Spectrum. Mol Neurobiol, 53(7):4539-47, 2016. [68] M.T. Rosenstein, Z. Marx and L.P. Kaelbling, To Transfer or Not to Transfer. Neural Information Processing Systems, Workshop Inductive Transfer: 10 Years Later, 2005. [69] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020. [70] S. J. Pan, Q. Yang. A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering, 22(10): 1345-1359, 2010. [71] S.M. Salaken, A. Khosravi, T. Nguyen, S. Nahavandi. Seeded transfer learning for re- gression problems with deep learning. Expert Syst. Appl, 115:565-577, 2019. [72] Vijayakumar A . Comparison of hippocampal volume in dementia subtypes. ISRN Ra- diol, 2012. [73] Yoshua Bengio. Deep Learning of Representations for Unsupervised and Transfer Learn- ing. Proceedings of ICML Workshop on Unsupervised and Transfer Learning, JMLR Workshop and Conference Proceedings, 27:17-36, 2012. [74] Ye Jia, Yu Zhang, Ron J Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Igna-cio Lopez Moreno et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Conference on Neural In- formation Processing Systems, 2018. [75] Zhilin Yang, Ruslan Salakhutdinov, William W Cohen. 
Transfer learning for sequence tagging with hierarchical recurrent networks.ICLR, 2017. [76] Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 6(2):94-98, 2019. [77] Chien-Fu Wu. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, 501–513, 1981. [78] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001. 89