HIGH DIMENSIONAL STATISTICAL METHODS FOR GENE-ENVIRONMENT
INTERACTIONS
By
Cen Wu

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
Statistics – Doctor of Philosophy
2013

ABSTRACT
HIGH DIMENSIONAL STATISTICAL METHODS FOR
GENE-ENVIRONMENT INTERACTIONS
By
Cen Wu
The genetic inﬂuences on complex disease traits generally depends on the joint eﬀects of
multiple genetic variants, environment factors, as well as their interplays. Gene×environment
(G×E) interactions play vital roles in determining an individual’s disease risk, but the underlying genetic machinery is poorly understood. Traditional analysis assuming linear relationship between genetic and environmental factors, along with their interactions, is commonly
pursued under the regression-based framework to examine G×E interactions. This assumption, however, could be violated due to nonlinear responses of genetic variants to environmental stimuli. As an extension to our previous work on continuous traits, we proposed a
ﬂexible varying-coeﬃcient model for the detection of nonlinear G×E interaction with binary
disease traits. Varying coeﬃcients were approximated by a non-parametric regression function through which one can assess the nonlinear response of genetic factors to environmental
changes. A group of statistical tests were proposed to elucidate various mechanisms of G×E
interaction. The utility of the proposed method was illustrated via simulation and real data
analysis with application to Type 2 Diabetes.
It has been increasingly recognized the power of genetic variant set based association
analysis over the single variant based approach. We develop a variant set based approach to
examine how variants in a genetic system mediated by a common environment factor to aﬀect
the phenotype response. The problem can be approached from a high dimensional variable
selection perspective. In particular, we can select genetic variants with varying, non-zero

constant and zero coeﬃcients, which are corresponding to cases of G×E interactions, no G×E
interactions and no genetic eﬀects, correspondingly. The procedure was implemented in a
two stage iterative framework via Smoothly Clipped Absolute Deviation (SCAD) penalty.
With proper regularity conditions, we can establish the consistency in variable selection and
eﬀect separation of our two stage iterative estimator, as well as the optimal convergence
rates of the estimates for varying eﬀect. In addition, it can be shown that the estimate of
non-zero constant coeﬃcient enjoys the oracle property. The utility of our procedure will be
demonstrated through extensive simulation study and real data analysis.
Due to the drawback of local quadratic approximations in the aforementioned two-stage
framework, the approach is not eﬃcient in handling cases when the dimension p is very
large. A group coordinate descent (GCD) based approach was proposed within the framework, which is computationally eﬃcient particularly for high dimensional problems where
p > n, because the computational complexity increases only linearly with the number of predictor groups after basis expansion. The advantage of our method is demonstrated through
extensive simulation study and real data analysis.

Copyright by
CEN WU
2013

I dedicate this thesis to my parents, Guizhi Ye and Liyang Wu.

v

ACKNOWLEDGMENTS

First and foremost, I am deeply grateful to my advisor, Dr. Yuehua Cui, for his tremendous
assistance, constant support and extreme patience. He is always approachable and ready to
help in every facet of life. Dr. Cui led me into the area of statistical genetics and inspired
me through numerous discussions. His training in the past ﬁve years helped me not only
grasp the skills needed for the interdisciplinary research in statistics and biology, but also
gradually develop the capability of thinking independently. Without his excellent guidance
and advisement, this thesis would not have been possible.
I would like to thank Dr. Robin Buell, my Quantitative Biology Ph.D program coadvisor, for her valuable suggestions and guidance that signiﬁcantly improve my background
in biology and broaden my training in this interdisciplinary area. My sincere thanks also
goes to my thesis committee members: Dr. Lifeng Wang, Dr. Ping-Shou Zhong and Dr.
Qing Lu. My transition to the research area of penalized regressions cannot be so smooth
without frequent discussions with Dr. Wang. I am also indebted to Dr. Zhong’s help in
picking up skills necessary for the technical proof in this thesis. I appreciate all the insightful
comments they made on this work.
I would also like to extend my sincere gratitude to all the faculty and staﬀ in our department. In particular, my thanks goes to Dr. Dennis Gilliland for his constant support and
encouragement, especially for my job hunting. Statistical consulting is a crucial part of my
Ph.D training, and I am grateful to Dr. Robert J. Tempelman and Dr. Sasha Kravchenko for
accepting me as a consultant at CANR Statistical Consulting Center. I beneﬁted enormously
from our weekly group meetings. I also thank their assistance for my job applications.
My special thanks goes to Qi Yan, now at Minnesota, for her tremendous help, especially
vi

during my preparation for the qualify exams. I am also grateful to the assistance and support
from Drs. Gengxin Li, Shaoyu li, Xiaoqin Tang, Wei Wang and Ming Gu. So lucky to have
Gengxin and Shaoyu as my academic sisters. They oﬀered me generous help and precious
suggestions during each stage of my Ph.D life. I appreciate Tao He, Honglang Wang, Bin Gao
and Ran Cao, who are the current members in Dr. Cui’s group, for creating the stimulating
research atmosphere. The memories of heated discussions we had in our weekly journal clubs
will never fade away. I also want to thank Shaoyu and Tao for providing me a cozy oﬃce
environment.
Last, but deﬁnitely not least, I would like to express my deepest gratitude to my parents,
Guizhi Ye and Liyang Wu, for their endless love and undying support. I was not able to
overcome insurmountable obstacles without them, since they never gave me up under any
circumstances. Besides, I thank Xin Tan and Ling Gong who made the winter in Michigan
no longer tough for me. I spent so many weekends and holidays with them in the last ﬁve
years, which turned driving on I-96 one of the most pleasant things for me.

vii

TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

xi

Chapter 1 Introduction . . . . . . . . . . . . . . . .
1.1 A review of basic genetics . . . . . . . . . . . . .
1.2 Genetic mapping of complex traits . . . . . . . .
1.3 Gene-environment (G×E) interactions . . . . . .
1.3.1 Basics of G×E interactions . . . . . . . .
1.3.2 Challenges and issues in G×E interactions
1.4 Objectives and organization of the dissertation . .

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

1
1
3
4
4
5
7

A Novel Method For Identifying Nonlinear Gene-Environment
Interactions In Case-Control Association Studies . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Statistical method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Estimating β(X) function . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Assessing G×E interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.5.1 False positive control . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.5.2 Power evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9
9
12
15
16
17
18
19
24
39

Chapter 2
2.1
2.2
2.3
2.4
2.5

2.6
2.7

Chapter 3
3.1
3.2

3.3
3.4
3.5
3.6
3.7

High Dimensional Variable Selection In
teractions . . . . . . . . . . . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . .
The proposed variable selection method . . . . . .
3.2.1 The penalized estimation via SCAD . . . .
3.2.2 The computational algorithms . . . . . . .
3.2.3 Selection of tuning parameters . . . . . . .
Asymptotic results . . . . . . . . . . . . . . . . .
Simulation . . . . . . . . . . . . . . . . . . . . . .
Real data analysis . . . . . . . . . . . . . . . . . .
Discussion . . . . . . . . . . . . . . . . . . . . . .
Technical proofs . . . . . . . . . . . . . . . . . . .
3.7.1 Useful notations and lemmas . . . . . . . .
3.7.2 Proofs of Theorem 1. . . . . . . . . . . . .
3.7.3 Proofs of Theorem 2. . . . . . . . . . . . .

viii

Gene-Environment In. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .

44
44
46
46
49
51
52
54
63
64
66
66
67
73

Chapter 4
4.1
4.2

4.3
4.4
4.5

A Group Coordinate Descent Approach For High Dimensional
Variable Selection In Gene-Environment Interactions . . . . . . 76
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Statistical methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.1 Basis expansion and penalized regression . . . . . . . . . . . . . . . . 79
4.2.2 Computational algorithms . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.2.1 Group LASSO and group adaptive LASSO . . . . . . . . . . 82
4.2.2.2 Group TLP and group SCAD . . . . . . . . . . . . . . . . . 84
4.2.2.3 Convergence of the GCD algorithm . . . . . . . . . . . . . . 86
4.2.3 Selection of tuning parameters . . . . . . . . . . . . . . . . . . . . . . 87
Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

Chapter 5

Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 105

BIBLIOGRAPHY

. . . . . . . . . . . . .

ix

. . . . . . . . . . . . . . . . . 108

LIST OF TABLES

Table2.1

List of SNPs with p-value < 5E-06 in the HPFS (Male) data set . .

26

Table2.2

List of SNPs with p-value < 5E-06 in the NHS (Female) data set . .

37

Table3.1

Simulation results of Example 3.1 . . . . . . . . . . . . . . . . . . .

57

Table3.2

Simulation results of Example 3.2, d = 10 . . . . . . . . . . . . . . .

61

Table3.3

Simulation results of Example 3.2, d = 50 . . . . . . . . . . . . . . .

62

Table3.4

List of SNPs with p-value < 0.001 from the Jak-STAT signaling pathway 64

Table4.1

Simulation results of Example 4.1, N (0,1) error . . . . . . . . . . . .

91

Table4.2

Simulation results of Example 4.1, t(3) error . . . . . . . . . . . . .

92

Table4.3

Simulation results of Example 4.2, p = 10, n = 200, N (0,1) error . .

94

Table4.4

Simulation results of Example 4.2, p = 10, n = 200, t(3) error . . .

95

Table4.5

Simulation results of Example 4.2, p = 100, n = 200, N (0,1) error .

96

Table4.6

Simulation results of Example 4.2, p = 100, n = 200, t(3) error . . .

97

Table4.7

Simulation results of Example 4.2, p = 200, n = 200, N (0,1) error .

98

Table4.8

Simulation results of Example 4.2, p = 200, n = 200, t(3) error . . .

99

Table4.9

Simulation results of Example 4.2, p = 400, n = 200, N (0,1) error . 100

Table4.10

Simulation results of Example 4.2, p = 400, n = 200, t(3) error . . . 101

Table4.11

List of SNPs with p-value < 5E-06 on Chromosome 9 . . . . . . . . 103

x

LIST OF FIGURES

Figure 2.1

Figure 2.2

Figure 2.3

Figure 2.4

Figure 2.5

Figure 2.6

Figure 2.7

Figure 2.8

Figure 2.9

Figure 2.10

Figure 2.11

Diﬀerent models of gene-environment interaction: (A) the interaction
of gene and environment in discrete environmental conditions; cases
with (B) no G×E interaction; and (C) linear and (D) non-linear G×E
interactions. AA, Aa and aa represent three diﬀerent genotypes in a
gene, and environment mediator represents a continuous environment
variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

The false positive rate of diﬀerent models at the 0.05 level.(For interpretation of the references to color in this and all other ﬁgures, the
reader is referred to the electronic version of this dissertation.) . . .

19

The power of diﬀerent models under diﬀerent MAFs and sample sizes
when data were generated with the VC model. . . . . . . . . . . . .

21

The power of diﬀerent models under diﬀerent MAFs and sample sizes
when data were generated with the LM model. . . . . . . . . . . . .

22

The power of diﬀerent models under diﬀerent MAFs and sample sizes
when data were generated with the LM-I model. . . . . . . . . . . .

23

The Manhattan plot of -log10(p-values) for testing H0 : β(X) = 0
when ﬁtting the VC model to the male data set. . . . . . . . . . . .

27

The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when
ﬁtting the LM model to the male data set. . . . . . . . . . . . . . .

28

The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0
when ﬁtting the LM-I model to the male data set. . . . . . . . . . .

29

The QQ plot of genome-wide p-values for the male data, when data
are ﬁtted with the VC model . . . . . . . . . . . . . . . . . . . . . .

30

The QQ plot of genome-wide p-values for the male data, when data
are ﬁtted with the LM model . . . . . . . . . . . . . . . . . . . . . .

31

The QQ plot of genome-wide p-values for the male data, when data
are ﬁtted with the LM-I model . . . . . . . . . . . . . . . . . . . . .

31

xi

Figure 2.12

The QQ plot of genome-wide p-values for the female data, when data
are ﬁtted with the VC model . . . . . . . . . . . . . . . . . . . . . .

32

The QQ plot of genome-wide p-values for the female data, when data
are ﬁtted with the LM model . . . . . . . . . . . . . . . . . . . . . .

32

The QQ plot of genome-wide p-values for the female data, when data
are ﬁtted with the LM-I model . . . . . . . . . . . . . . . . . . . . .

33

The Manhattan plot of -log10(p-values) for testing H0 : β(X) = 0
when ﬁtting the VC model to the female data set. . . . . . . . . . .

34

The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when
ﬁtting the LM model to the female data set. . . . . . . . . . . . . .

35

The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0
when ﬁtting the LM-I model to the female data set. . . . . . . . . .

36

The estimated varying-coeﬃcient function and ﬁtted probability of
SNP rs13050325 (upper panel) on chromosome 21 of female population and SNP rs4635456 (lower panel) on chromosome 19 of male
population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

39

Figure 3.1

The selection ratio of Example 3.1 . . . . . . . . . . . . . . . . . . .

55

Figure 3.2

The selection ratio of Example 3.2, d = 10 . . . . . . . . . . . . . .

59

Figure 3.3

The selection ratio of Example 3.2, d = 50 . . . . . . . . . . . . . .

60

Figure 2.13

Figure 2.14

Figure 2.15

Figure 2.16

Figure 2.17

Figure 2.18

xii

Chapter 1
Introduction

1.1

A review of basic genetics

Genes are the basic functional units where the biological characteristics can be passed from
parents to oﬀspring. The genes of each cell are arranged in linear order on chromosomes.
Majority of the multicellular organisms have duplicate copies of each gene, hence they are
diploid. The number of paired chromosomes varies across diﬀerent species. For instance,
Brassica Oleracea has 9 pairs of chromosomes, Zea Mays has 10 pairs, Mus musculus has 20
pairs and human beings have 23 pairs. The entire set of chromosomes form the genome of a
particular organism and organelles.
Each gene resides on certain site, or locus of the chromosomes. For diploid organisms,
the genes corresponding to the locus take two forms (say A and a), called alleles. The
three genetic compositions of the two alleles, AA, Aa, and aa, are genotypes. The pair of
identical alleles (AA or aa) and diﬀerent alleles (Aa) are called homozygous and heterozygous
respectively. One or several loci may determine the observable characteristics or phenotypic
traits, such as eye color, body weight, blood pressure, and so on.
The traits that can be continuously measured are deﬁned as quantitative traits, like the
aforementioned body weight and blood pressure. The investigation of the genetic basis of a
quantitative trait is the major task of quantitative genetics. In classical quantitative genetics,

1

the phenotypic value (P) is due to genetic factors (G) and environment factors (E):

P =G+E

(1.1)

By assuming the independence of the terms in (1.1), the phenotypic variance of a quantitative trait (VP ) can be decomposed into its genetic (VG ) and environmental (VE ) variance
components:
VP = VG + VE

(1.2)

Based on the three modes of gene actions(additive, dominance and epistasis), we can further
partition VG as
VG = VGa + VGd + VGe

(1.3)

where VGa , VGd , VGe are additive, dominance and epistatic(or interaction) genetic variance
respectively. VGd and VGe are referred to as nonadditive genetic variance. In quantitative
genetics, heritability characterizes the relative importance of the role genetic variance plays in
determining the phenotypic variance. The two types of heritability, broad-sense heritability
(H 2 ) and narrow-sense heritability (h2 ) , are deﬁned separately as:

H2 =

VG
VG + VE

(1.4)

h2 =

VGa
VG + VE

(1.5)

and

The degree of overall genetic control over the quantitative trait can be measured by the two
heritability parameters H 2 and h2 [1, 2].

2

1.2

Genetic mapping of complex traits

The past decades witnessed waves of breakthroughs in genetic mapping of complex diseases
(traits). The family-based linkage study prevails as conventional means of locating disease
genes. Transmission disequilibrium test (TDT)[3], the most popular association test in
family based study, was proposed to test the linkage between a genetic marker and disease
susceptibility. However, large pedigrees are needed for ﬁne-mapping in linkage study, so the
utility of the study is conﬁned when such information is not available.
While the linkage study predominated in discovering disease variants with major eﬀects,
the population based association study gained advantage in detecting genes with modest
eﬀects [4], which is of particular signiﬁcance for complex human diseases[5]. The rapid
progress in the high-throughput genotyping technologies made possible the large scale, highly
dense genome-wide association studies (GWAS) for millions of genetic variants, such as
single-nucleotide polymorphism (SNP) across the entire human genome[6].
The GWAS has thus identiﬁed a large number of susceptibility variants associated with
the complex human diseases, including asthma[7], breast cancer [8, 9], coronary heart disease
[10, 11] and Type 2 Diabetes[12, 13]. A detailed list is given in [14]. Despite the huge success
achieved in GWAS, the disease etiology is still not clearly elucidated since a substantial proportion of the heritability remains obscure. Therefore, there are pressing needs to investigate
the part of unexplained heritability.
There are a number of potential sources of missing heritability, such as rare variants,
structural variation, gene-gene (G×G) interactions and gene-environment (G×E) interactions [15]. For example, the ‘Common Disease, Common Variants’ hypothesis is commonly
adopted in GWAS, assuming that majority of the genetic risk of common complex dis-

3

eases can be attributed, at least partially, to common disease variants which is of more
than 5% minor allele frequency (MAF) [16, 17]. The rare (MAF<0.5%) and low frequency
(0.5%<MAF<5%) variants cannot be well captured by GWAS commercial genotyping arrays which are aimed at covering common genetic variants [18]. Detection of such rare and
low frequency variants using next generation sequencing technologies [19, 20, 21, 22]on either
the genome regions of interest or the whole genome will lead to a better interpretation of
heritability.
Compared to the detection of main eﬀects of genetic variants in GWAS, the discovery of
high order eﬀects, like G×G and G×E interactions, requires much higher statistical power
and thus impinges on the eﬀort to better understand the missing heritability not attributable
to the identiﬁed disease variants [15, 23]. Therefore investigations on G×G, G×E interactions will help make the best use of GWAS, improving disease prediction, prevention and
treatment.

1.3
1.3.1

Gene-environment (G×E) interactions
Basics of G×E interactions

G×E interactions can be traced back to [24, 25] and were formulated by Falconer in [26],
which refers to how phenotypes are reactive to diﬀerent genotypes under various environmental conditions. It has been widely acknowledged that not only the genetic and environmental
factors themselves, but also the interactions between them, are involved in the genetic basis
of complex diseases [27]. A growing number of evidence of G×E interactions have been
found in a wide range of complex diseases, such as colorectal cancer[28], chronic beryllium
lung disease[29], skin cancer[30], cardiovascular diseases [31]and psychiatry diseases [32].
4

The environment factor in a G×E interaction study design can be deﬁned either continuously or discretely. For instance, in a study of G×E interaction on skin cancer, the intensity
of sunlight to which the skins are exposed could be deﬁned as an environment condition
with a continuous measure, and the risk of developing skin cancer triggered by a speciﬁc
gene might be quite diﬀerent given various amount of sunlight. For a G×E interaction study
design related to myocardial infarction, drinking status can be deﬁned as an environment
factor with 2 categories, coded as 1 (drinking alcohol) and 0 (not drinking alcohol).
Study designs for G×E interaction mainly fall into two categories: the family-based
study and the population-based association study[27]. In family-based studies, direct estimations on particular G×E interaction are possible if the environmental information is
integrated into designs, such as the pedigree or sib-pair designs. Compared to the association studies in unrelated subjects, the family-based study may demand more eﬀort to
gather information needed in the design. Depending on the timing of information collection
on environmental,dietary and lifestyle variables, the typical case-control association studies
could be further categorized into retrospective (data collection after disease diagnosis) and
prospective studies (data collection at the beginning of study). Refer to [27] for a detailed
discussion on pros and cons for all these designs.

1.3.2

Challenges and issues in G×E interactions

The major challenges in the study of G×E interactions are how to appropriately model and
test G×E eﬀects. The multifactor dimensionality reduction (MDR)-based approach, a data
mining method for identifying interactions among independent variables, was proposed in
[33] to examine gene-environment interactions. While MDR can be considered as a modelfree approach, other statistical methods were developed within the traditional regression
5

framework, such as the parametric [34] and semi-parametric methods [35, 36].
A common issue in current parametric models for G×E interactions, as pointed out in
[37], is that a strong linear assumption on G×E interactions is demanded. Recall that the
linear regression model below is the starting point to examine G×E interactions:

Y = α0 + α1 X + α2 G + α3 GX + ε

(1.6)

Here Y is the continuous response, α0 is the overall mean, α2 , α3 , α4 are the eﬀects of
environmental factor(X), genetic factor(G), and their interactions (G × X) respectively.
The error term ε is distributed with mean 0 and ﬁnite variance σ 2 .
The G×E interaction is modeled as a product term between G and X in (1.6). A
rearrangement of (1.6) results in

Y = α0 + α1 X + (α2 + α3 X)G + ε

(1.7)

(1.7) explicitly conveys the message that the variation in response Y caused by the genetic
eﬀect G is a linear function of environment X. However, the genetic eﬀect may not necessarily take a linear format in practice. Positing such linear form might lead to model
mis-speciﬁcation and inﬂated bias. A varying coeﬃcient model approach, together with a
group of goodness-of-ﬁt tests, were thus proposed in [37] to investigate the non-linear machinery of G×E interactions for continuous phenotypic response. Extensions of [37] to binary
response is urgently needed as majority of complex human diseases are casted in the casecontrol association study framework where the response are of two categories, disease(case)
or no disease(control).

6

The recent success of set-based GWAS, such as in the pathway-based [38] and gene-centric
study [39, 40, 41], in bringing novel interpretations to the disease signals, motivates us to
coin G×E interaction within a set-based association framework. When multiple variants
within a genetic system, such as the pathway or gene set, are involved, we can jointly model
their eﬀect with an environment factor, especially if any non-linear eﬀects are present, by
proposing an additive varying-coeﬃcient model:

Y = α0 (X) + α1 (X)G1 + . . . + αd (X)Gd + ε

(1.8)

where Y is the phenotypic response, d is the total number of SNP variants in a genetic
feature and Gj refers to the jth SNP, and ε is the random error. The model has particular
power to help us understand how multiple genetic variants in a system are mediated by a
common mediator X to aﬀect disease risk. When d is relatively large, the problem can be
approached from a high dimensional variable selection perspective.

1.4

Objectives and organization of the dissertation

Due to the crucial roles G×E interactions played in elucidating the genetic basis of complex
diseases, the objective of this dissertation will be on developing novel statistical methodology
and powerful computational tools to tackle the challenges originated from high dimensional
G×E interaction analysis.
The dissertation is organized as follows. In chapter 2, the dissection of the nonlinear
penetrance eﬀect of the genetic variants with regard to the environmental factor will be
extended from continuous phenotypic response in [37] to case-control association study within

7

the varying coeﬃcient model framework with an application to Type 2 Diabetes. The varying
coeﬃcients are estimated through nonparametric regression spline techniques. A group of
statistical tests was proposed to elucidate the machinery of G×E interactions. In chapter 3,
a set-based method examining G×E interactions will be developed. We propose a penalized
additive varying coeﬃcient model to select genetic variants with varying eﬀects (the presence
of G×E interactions), constant eﬀects (no G×E interactions), and zero eﬀects (no association
with phenotype). The selection consistency of this approach and the oracle property of the
corresponding penalized estimator will be rigorously established. In chapter 4, we further
extend the framework in chapter 3 to the scenario where the number of genetic variants
exceeds the sample size, the so called “large p,small n”, via group coordinate descend (GCD)
algorithms. A thorough investigation on the performance of diﬀerent penalty functions
within this framework will be conducted. Concluding remarks and outline of the future
work will be given in chapter 5.

8

Chapter 2
A Novel Method For Identifying
Nonlinear Gene-Environment
Interactions In Case-Control
Association Studies

2.1

Introduction

It has been increasingly recognized that the predisposition of many complex diseases is not
purely triggered by genetic factors. They are also inﬂuenced by environmental exposures,
due to potential gene-environment interactions. For example, Type 2 Diabetes mellitus is a
typical complex human disease whose incidence is heavily contingent on the environmental
exposures such as behavioral and dietary factors, in addition to genetic susceptibility [42,
43]. Studies on gene×environment (G×E) interactions will shed novel light on the genetic
responses to environment dynamics and how environment changes mediate gene expression
to increase disease risks. Such phenomenon that disease risk or genetic expression varies
under diﬀerent environment conditions is also termed phenotypic plasticity [44].
G×E interactions were historically pursued by evaluating the gene eﬀect under diﬀerent

9

environment conditions. Figure 2.1A shows the case of G×E interaction under two discrete environment conditions, protective and predisposing such as non-smoking and smoking. When environment conditions are measured in a continuous scale, more information is
available to assess the gradient/dynamic change of genetic eﬀect under subtle environment
changes. For example, adult bone mineral density changes with age and vitamin D intake
[45]. Figure 2.1B-D display several scenarios where the environment mediator is measured in
a continuous scale. Example of continuous environment could be age for age related diseases
such as Alzheimer, or body mass index for Type 2 Diabetes or hypertension. In Figure
2.1B, no G×E interaction is observed since the genetic eﬀects of the three genotypes are
parallel to each other. Figure 2.1C shows a typical example of linear G×E interaction, while
Figure 2.1D displays a non-linear G×E interaction pattern assuming the Aa genotype is the
baseline. As will seen in the following section, most current G×E interaction model assumes
the case displayed in Figure 2.1C. Few statistical analysis has considered the case shown in
Figure 2.1D.
In fact, many literature work supports the view of nonlinear G×E interaction. Sparrow et
al. [46] found that mutations in gene HES7 and MESP2 caused congenital scoliosis, and the
risk was highly related to transient hypoxia during mice pregnancy. The rate of risk increase
was non-linearly correlated with increasing hypoxic levels. Laitala et al. [47] reported that
the reaction of personal genetic eﬀects on coﬀee consumption showed a non-linear relationship
with age. Martinez et al. [48] found that women carrying Gln27Glu genotype in ADRB2
gene had higher probability for obese and the obesity rate was nonlinearly correlated with
the amount of carbohydrate intake. Even though these empirical evidences are limited to
small-scale observational studies, they underscore the importance of further exploration on
non-linear G×E interaction when searching for genetic roots of complex diseases.
10

A
Trait/risk measure

Trait/risk measure

B
AA

Aa

aa
Protective

Predisposing

Aa

aa
Environment mediator

Environment condition

D

C
AA

Trait/risk measure

Trait/risk measure

AA

Aa

aa
Environment mediator

AA

Aa
aa

Environment mediator

Figure 2.1: Diﬀerent models of gene-environment interaction: (A) the interaction of gene
and environment in discrete environmental conditions; cases with (B) no G×E interaction;
and (C) linear and (D) non-linear G×E interactions. AA, Aa and aa represent three diﬀerent
genotypes in a gene, and environment mediator represents a continuous environment variable.
Within the statistical framework, G×E interactions in human diseases have been investigated mainly through model-based approaches, ranging from the standard linear model
with interaction in diverse design settings, such as the case-control design, the case only
design and the two-stage screening design, to more sophisticated models, such as proﬁle
likelihood-based semi-parametric models, empirical Bayesian models and Bayesian model
average (reviewed in Mukherjee et al. [48]). However, as pointed out in Ma et al. [37],
the model-based regression framework generally needs strong model assumptions between
genetic eﬀects and environmental inﬂuences, which cannot be directly applied to the above
11

mentioned empirical studies in which non-linear interaction exists.
In this chapter, we extended the varying-coeﬃcient (VC) model proposed in Ma et al.
[37] for continuous quantitative responses to binary disease responses. We ﬁrst laid out
the VC modeling framework for binary responses, with details on parameter estimation
and hypothesis testing. The utility of our approach was demonstrated through extensive
simulations. Finally, we applied our method to two case-control Type 2 Diabetes cohorts
data sets, followed by discussions.

2.2

Statistical method

For a sample of n unrelated individuals collected from a population, let n1 and n2 be the
number of aﬀected (cases) and unaﬀected (controls) individuals, respectively with n = n1 +
n2 . All individuals in the sample could be genotyped either based on candidate genes or on
a whole genome-wide scale. Let Yi = 1 if the ith individual is aﬀected and 0 otherwise. Let
G be the genetic variable which is coded as 0, 1, 2 corresponding to genotype aa, Aa and AA
where allele A is the minor allele. This coding scheme assumes an additive disease model,
although a genetic variant may show dominant or recessive action mode. In reality we can
do a model selection to choose which one is the optimal one using AIC or BIC criterion.
Suppose in addition to the genetic variables, the disease risk is also aﬀected by environmental factors as well as the interaction between them. Let X be the environmental variable
which is measured in a continuous scale. Throughout this chapter, we are only interested
in environment changes that display in a continuous scale (e.g., geographical location or
temporal changes). Traditional analysis for G×E was commonly pursued by discretizing an
environmental variable into diﬀerent groups (e.g., old vs young), as shown in Figure 2.1.

12

However, we can have more information to assess the G×E relationship when a continuously
measured environment factor is treated in a continuous scale. Thus, the purpose of the work
is to model the genetic responses under diﬀerent environmental stimuli, and further assess
in what form genes respond to these changes.
For a continuous phenotype Y , the general form of an additive VC model to investigate
the non-linear G×E interaction between X and G can be expressed as,

Y = α(X) + β(X)G + σ(X)ε

(2.1)

where the error term ε satisﬁes E(ε|X, G) = 0 and V ar(ε|X, G) = 1. Ma et al. [37] evaluated
the performance of the model by assuming α(X) = α0 +α1 X. The key components of the VC
model lie in proper estimation of the smoothing function β(X) and the variance function
σ 2 (X), through which the eﬀect of the genetic variant can be evaluated as a function of
environment exposures. Various tests have been proposed to assess the linear or non-linear
mechanisms via likelihood ratio test. When inhomogeneous variance (i.e., σ(X) varies with
X) and no parametric distribution are assumed for the error term, wild bootstrap is a
common choice to assess the signiﬁcance of the likelihood ratio statistic.
In human genetics, many diseases are displayed as discrete qualitative traits. The focus
of this work is to extend the above model to responses that do not follow continuous distribution. In a generalized linear model set up, the relationship between the mean of a response
variable Y and the independent variables (X, G) under the varying-coeﬃcient model can be
expressed as
E(Y |X, G) = µ = g −1 {α(X) + β(X)G}
where g is a link function. When Y is measured as counts (e.g., tumor numbers), a log link
13

function can be assumed. When Y is a binary variable (i.e., aﬀected vs unaﬀected), then a
logit link function is commonly applied. In the later case, the logit varying-coeﬃcient model
is given by,
logit(p) = α(X) + β(X)G

(2.2)

where p=Pr(Y = 1|X, G). In this work, we allow the intercept function α(X) varies with
X instead of assuming a linear structure, to make it more ﬂexible to capture the underlying
mean function when there is no genetic contribution (i.e., β(X) = 0).
If we allow β(X) = β1 , the logistic VC model is reduced to a logistic linear predictor
model without G×E interaction (denoted as LM). If we allow β(X) = β1 + β2 X, the logistic
VC model is reduced to a logistic linear predictor model with linear G×E interaction (denoted
as LM-I), i.e.,

logit(p) =α(X) + (β1 + β2 X)G

(2.3)

=α(X) + β1 G + β2 XG
One can also put structures on the function of α(X). For example, we can let α(X) =
α0 + α1 X. Such a model like logit(p) = α0 + α1 X + β1 G + β2 XG is often applied in assessing
G×E interactions in a typical logistic regression analysis by testing H0 : β2 = 0. It can also
be seen that this model assumes a linear G×E interaction structure, that is, the function
β(X) is linear in X. Thus, without assuming speciﬁc structure on the linear predictors, the
VC model has much ﬂexibility to capture the underlying interaction mechanism via ﬁtting
β(X) using smoothed nonparametric functions. The VC interaction model can be considered
as a generalization to the linear interaction model.

14

2.3

Estimating β(X) function

The nonparametric estimation of varying coeﬃcients has undergone intensive investigations
in the last two decades and falls generally into three categories: the local kernel polynomial
smoothing, polynomial spline, and smoothing spline (Fan and Zhang [49]; Huang et al. [50]).
Huang et al. [50] approximated the varying-coeﬃcient functions via B-spline basis expansion.
Using the B-spline technique, the authors further established the relevant asymptotic properties of the estimators, such as consistency, convergence rates and asymptotic normality.
In addition, the estimation of B-spline estimators is computationally fast and numerically
stable. These merits are especially important in the context of high-dimensional genetic
data analysis, which make it a natural choice for us to choose when estimating the varyingcoeﬃcient functions α(X) and β(X).
Let h be the degree of B-splines and N be the corresponding interior knots. Further
assume that the knots are equally distributed for the B-spline basis matrix {Bs : 1 ≤ s ≤
(N + h + 1)}. Ideally we can select h and N for α(X) and β(X) separately using the B-spline
technique when ﬁtting each SNP variant. This process involves a search of optimal degree
and knots through a list of possible combinations for both functions. This, however, could
incur heavy computation burden when the estimation is done for each SNP given that the
number of SNP variants to be tested could be huge. Thus, the degree h0 and knots N0
for α(X) are selected ﬁrst by ﬁtting a logistic VC model without the genetic components.
Once the degree and knots for function α(X) are selected, they will be ﬁxed when estimating
degree and knots for function β(X) for each SNP. The selection is done by using the Bayesian

15

Information Criterion (BIC) criteria deﬁned as,

arg min BIC(N, h) = arg min (γ1 ) + (N + h) log(n)/n,
N,h

N,h

where (γ1 ) refers to the log-likelihood function. A grid search for possible combinations of
N and h can be done and the values corresponding to the minimum BIC are the “optimal”
ones.
Once the degree and knots for α(X) are determined, the function α(X) can be estimated
ˆ
by α(X) = γ1 T B1 (X) =
ˆ

N0 +h0 +1
γ1k B1k (x).
ˆ
k=1

The degree h1 and the number of knots

N1 for β(X) are also selected using the same BIC criterion deﬁned above. The estimator for
T
ˆ
β(X) is given by β(X) = γ2 B2 (X) =

N1 +h1 +1
γ2k B2k (x).
k=1

Regular Newton-Raphson or

Fisher scoring algorithm can be applied to estimate the parameters.

2.4

Assessing G×E interaction

Our goal is to assess if a genetic variant is sensitive to environment changes. If it does, then
in what form, linear or nonlinear. For this purpose, we ﬁrst propose to assess if the genetic
eﬀect is a constant by testing

 C
 H : β(·) = β
0

(2.4)

 C
 H : β(·) = β
a
where β is an unknown constant and logit(p) = α(X) + βG is the corresponding reduced
model under the null hypothesis. Under the H0 , the genetic eﬀect is a constant and its
contribution to disease risk has nothing to do with environmental changes. If we fail to
reject the null, then association can be assessed via testing H0 : β = 0 by ﬁtting the reduced
model. Rejecting the null hypothesis leads to the conclusion that the G×E interaction exists.
16

We next test the linear eﬀect of G×E interaction by formulating,

 L
 H : β(·) = β1 + β2 X
0

(2.5)

 L
 H : β(·) = β + β X
1
2
a
where β1 and β2 are unknown constants. Under the H0 , the reduced model is given by
logit(p) = α(X) + β1 G + β2 XG. If we fail to reject the null, then association can be assessed
via testing H0 : β1 = β2 = 0 by ﬁtting the reduced model. If the null is rejected, it indicates
nonlinear G×E interaction eﬀect and next we ﬁt model 2.2 to assess genetic association.
The above tests are sequential. At each step if we fail to reject the null, we stop and ﬁt
L
the null model and assess the genetic eﬀect by a likelihood ratio test. When H0 is rejected, a

nonlinear G×E interaction eﬀect is implied and we allow the data tell the shape of the eﬀect
by ﬁtting the above described nonparametric B-spline functions. The nonlinear eﬀect is then
assessed by testing H0 : β(X) = 0 using a likelihood ratio test which asymptotically follows
a chi-square distribution with the degrees of freedom equal the number of ﬁtted B-spline
coeﬃcients for function β(X).

2.5

Simulation

The statistical behavior of the proposed approach was evaluated through extensive Monte
Carlo simulations. When using B-spline functions to estimate the varying-coeﬃcients, a uniform distribution on X is generally assumed. In real application, the environment measure
(X) may not be uniformly distributed as in the Type 2 Diabetes data analyzed later in the
chapter. Instead, it is often normally distributed. To mimic real situations, we generated
a continuous environment measure X ∗ from a normal distribution, and subsequently trans-

17

∗

¯∗

formed it by X = Φ( XS −X ), to make X ∗ evenly distributed on the B-spline subintervals,
X∗
¯
where Φ(·) is the standard normal cumulative distribution function and X ∗ and SX ∗ are the
sample mean and sample standard deviation of X ∗ , respectively. The B-spline basis matrix
was constructed on the transformed values. For a given minor allele frequency (MAF) pA
and assuming Hardy-Weinberg equilibrium, SNP genotypes AA, Aa and aa were simulated
from a multinomial distribution with frequencies p2 , 2pA (1 − pA ) and (1 − pA )2 for the
A
three genotypes, respectively. We coded the genetic variable Gi as (2, 1, 0) corresponding
to genotypes (AA, Aa, aa).

2.5.1

False positive control

We ﬁst evaluated the false positive control for the VC model at the nominal 0.05 level.
For comparison purpose, we also reported the error rate for the linear predictor model
with and without interaction. Under the null of no genetic eﬀects, the disease phenotypes
were simulated with logit(p) = α0 + α(x) where α(x) was generated via the B-spline basis
function, i.e., α(x) =

4
k=1 γk B(x)

for given spline coeﬃcients γ1 = 6.162, γ2 = 5.948,

γ3 = 3.858, γ4 = 3.640. The spline coeﬃcients were obtained by ﬁtting the real data
(described later) without ﬁtting the genetic eﬀect. We added a constant α0 in order to
control the simulated proportion of case:control ratio to approximately 1:1 (by varying the
size of α0 ). A total of 10,000 simulation replicates were taken under all the combinations of
sample size (n = 500, 1000, 2000) and MAF (pA = 0.1, 0.3, 0.5).
The results were summarized in Figure 2.2. As we can observe, the false positive rates
were estimated sensibly from the simulated data. The VC model slightly overestimated the
false positive rate under low allele frequency (pA =0.1). But the performance improved as
MAF increases for a ﬁxed sample size. In addition, the performance improved as sample
18

size increased under a ﬁxed MAF. In general, there were no signiﬁcant deviations from the
nominal 0.05 level for all the 3 models, except in some cases under low MAF and small
sample size.

0.08

n=500

n=1000

n=2000

0.07
0.06
0.05
0.04
0.03
0.02
0.01
0

VC

LM

LMI

pA=0.1

VC

LM

LMI

pA=0.3

VC

LM

LMI

pA=0.5

Figure 2.2: The false positive rate of diﬀerent models at the 0.05 level.(For interpretation
of the references to color in this and all other ﬁgures, the reader is referred to the electronic
version of this dissertation.)

2.5.2

Power evaluation

For given genetic eﬀects, the disease status was simulated from a Bernoulli trial. The varyingˆ
coeﬃcient function β(·) was estimated through β(x) ≡

N1 +h1 +1
γs B(x).
ˆ
s=1

In a typical

simulation study with VC models, people generally simulate data assuming a nonlinear
function such as a sin or exponential function. As SNPs do not function in such form, we
simulated data according to the ﬁt calculated from the real data to make it more realistic.
Three scenarios were considered. Scenario 1 assumed that the true G×E interaction was
19

nonlinear and the data were generated with the VC model. In scenario 2, we assumed
there was no G×E interaction, while in scenario 3 we assumed linear G×E interaction. The
simulated data were then analyzed using the VC, LM-I and LM models, to compare the
performance under model mis-speciﬁcation.
For a given MAF, the data assuming nonlinear G×E interaction were generated with the
following VC model,
logit(pi ) = α0 + α(Xi ) + β(Xi )Gi
where pi = p(Y = 1|X, G), and α0 was a constant used to control the case:control ratio to
make it close to 1. The varying coeﬃcient functions α(X) and β(X) were computed based
upon the quadratic B-spline basis matrix with α(X) = γ1 B1 (X) and β(X) = γ2 B2 (X),
where γ1 = (7.287, 7.146, 3.917, 3.413)T and γ2 = (0.080, −0.460, −0.201, 0.465)T were obtained from real data ﬁt, namely SNP rs4506565 on chromosome 10 of the Nurses’ Health
Study (NHS) data in GENEVA consortium (described later). The binary responses were
then generated from a Bernoulli trial with case probability pi .
The likelihood ratio test was applied to assess the signiﬁcance of each test illustrated
in previous section. The comparison results were shown in Figure 2.3. As we expected, a
common trend for the three models is that the power increases as MAF and sample size
increase. Under the same sample size or MAF, the VC model always has the best power
among the three, which is not surprising since the phenotypes were generated from a VC
model. In addition, the LM-I model performs better than the LM model since structurally
it is more close to the VC model.

20

1
n=500

n=1000

n=2000

0.8

0.6

0.4

0.2

0
VC

LM
pA=0.1

LMI

VC

LM

LMI

pA=0.3

VC

LM

LMI

pA=0.5

Figure 2.3: The power of diﬀerent models under diﬀerent MAFs and sample sizes when data
were generated with the VC model.
We also simulated data assuming no G×E interaction using the following model,

logit(pi ) = α0 + α(Xi ) + β1 Gi

where α(x) was generated from the B-spline basis function with α(X) = γ0 B0 (X). The
ˆ
spline coeﬃcient vector was given by γ0 = (5.977, 6.011, 3.843, 3.668)T , and the genetic
coeﬃcient was set as β1 = 0.271 (corresponding to an odds ratio of 1.3). These coeﬃcients
were obtained by ﬁtting the real data with a linear predictor without interaction for SNP
rs12255372 on chromosome 10. α0 was used to adjust the case:control ratio as described
before under diﬀerent sample sizes and MAFs. The results shown in Figure 2.4 demonstrate
that the LM model outperforms the other two models in all the scenarios, since data were
analyzed with the true data generating model. As MAF increases, the power diﬀerences
21

among the three models diminishes for larger sample sizes. For example, the power diﬀerence
among the three models is very small under pA = 0.5 and n = 2000.
n=500

1

n=1000

n=2000

0.8
0.6
0.4
0.2
0
VC

LM

LMI

pA=0.1

VC

LM

LMI

VC

pA=0.3

LM

LMI

pA=0.5

Figure 2.4: The power of diﬀerent models under diﬀerent MAFs and sample sizes when data
were generated with the LM model.

The following linear interaction model (LM-I) was assumed to generate the linear G×E
interaction data,
logit(pi ) = α0 + α(Xi ) + β1 Gi + β2 Xi Gi
ˆ
where α(X) = γ0 B0 (X) with spline coeﬃcients γ0 = (6.358, 6.481, 4.232, 4.113)T . The
genetic coeﬃcient β1 =0.226 and interaction coeﬃcient β2 =-0.787. All the coeﬃcients were
obtained by ﬁtting model 2.3 to SNP rs17537178 on chromosome 10. Figure 2.5 shows
that the linear interaction model has the best performance among the three. In addition,
the power of the VC model is more close to the linear interaction model since it is more
structurally close to the linear interaction model. As sample size and MAF increase, the
22

power diﬀerence between the VC and LM-I model vanishes quickly.
n=500

1

n=1000

n=2000

0.8

0.6

0.4

0.2

0
VC

LM
pA=0.1

LMI

VC

LM
pA=0.3

LMI

VC

LM

LMI

pA=0.5

Figure 2.5: The power of diﬀerent models under diﬀerent MAFs and sample sizes when data
were generated with the LM-I model.

In summary, when the true G×E interaction is linear or when there is no interaction at
all, the model assuming linear or constant coeﬃcient outperforms the VC model. However,
the VC model outperforms the other two when the true interaction in nonlinear. In addition,
the LM or LM-I models suﬀer more from power loss when the underly true interaction is
nonlinear in comparison to the case when the underlying truth is linear or no interaction.
This is not surprising since the B-spline estimator is consistent for large samples. Under large
sample sizes, the VC model should perform similar to the LM and LM-I model. However,
one has to be careful in ﬁnite samples. The simulation results suggest that one should assess
the function β(X) ﬁrst before testing β(X) = 0. In practice, one can test if β(X) = β or
β(X) = β1 + β2 X, then ﬁt the appropriate model depending on the test result.

23

2.6

Real data analysis

The fast increase in global prevalence of Type 2 Diabetes draws worldwide attentions for the
disease. About 50 novel loci have been reported in association with Type 2 Diabetes so far
(Perry et al. [52]. However, only a small proportion of disease heritability has been explained
by these loci, leaving the question of how to eﬀectively accounting for gene-environment interaction in the search of T2D susceptibility variants with the hope to capture the missing
heritability. We applied our model to two nested case-control cohort studies of Type 2
Diabetes, the Nurses’ Health Study (NHS) and the Health Professionals Follow-up Study
(HPFS), from the Gene, Environment Association Studies Consortium (GENVEA) (Cornelis
et al. [53]). The two data sets are well-characterized cohorts of genome-wide association studies investigating a set of hypotheses about the dietary and lifestyle factors to the triggering
of a series of diseases, including Type 2 Diabetes, for women and men. Details of the two
cohorts can be found from Colditz et al. [54] and Rimm et al. [55]. The data sets from
the two cohort studies originally contain 3,391 females (NHS) and 2,599 males (HPFS) with
European ancestry. After data cleaning by removing subjects with unmatched phenotypes
and genotypes, excluding SNPs with MAF< 0.05 and deviation from Hardy-Weinberg equilibrium, the ﬁnal data contain 3,391 females (1,646 cases and 1,745 controls) with 635,748
SNPs in the NHS set and 2,570 males (1,300 cases and 1,270 controls) with 636,764 SNPs
in the HPFS set.
Body mass index (BMI), calculated as the quotient between an individual’s mass (kg)
and the square of height (m2 ), is an indicator of human obesity. It is widely recognized that
the risk of Type 2 Diabetes could be potentially inﬂuenced by obesity condition evidenced
by strong association between them for both women and men (Holbrook et al. [56]; Carey

24

et al. [57]; Chan et al. [58]). Therefore, individual’s BMI can be regarded as a type of
environmental condition pivotal in evaluating the incidence of Type 2 Diabetes. Individuals
carrying the same gene may have diﬀerent risks of Type 2 Diabetes under diﬀerent obese
conditions. The phenomenon could be elucidated, at least partially, by the complicated
interaction mechanism between the carrier’s gene and the environment (measured by BMI).
Thus, we can treat the genetic sensitivity to obese as a dynamic process which can be
captured by the proposed VC model, if any.
We analyzed the male and female data sets separately in order to ﬁnd sex-speciﬁc genes
responsible for T2D risk. Figures 2.6–2.8 showed the Manhattan plot of the -log10(p-values)
for the male data. To compare the performance of the three models (VC, LM, LM-I), we
plotted all the signals at each SNP locus. It can be seen that the overall signals for the three
models are quite consistent. The dashed red line corresponds to the genome-wide Bonferroni
threshold (7.9E-8) and the dotted blue line corresponds to the suggestive threshold (5E-6).
Table 2.1 tabulated SNPs that passed both threshold. Seven SNPs passed the Bonferroni
threshold are marked by ∗. Testing constant coeﬃcient showed that the majority of SNPs
has constant coeﬃcients, which indicated they are not sensitive to obese condition. This also
explained why the LM model gives relatively stronger signals than the other two models.
Columns with P CON and P LIN showed the p-values for testing H0 : β(X) = β and
H0 : β(X) = β1 + β2 X. The smaller p-values for testing constant and linear coeﬃcients in
the top panel showed that the eﬀects of those SNPs were neither constant nor linear, thus
the VC model gave the strongest signals evidenced by smaller P VC than P LM and P LMI.
For example, SNP rs4635456 had P CON=9.5E-07 and P LIN=0.0117 which indicated the
coeﬃcient of this SNP is varying over BMI. Thus, ﬁtting a VC model gave the strongest
signal (P VC=3.05E-06 vs P LM=0.6299 and P LMI=1.58E-05).
25

Table 2.1: List of SNPs with p-value < 5E-06 in the HPFS (Male) data set
SNP ID

GeneName

Chr

rs4635456
rs4972250
rs4842244

SEMA6B
Unknown
RXRA

19
2
9

rs2371765
rs7901695
rs7991210
rs12243326
rs4132670
rs12255372
rs4506565
rs11013381
rs6893115

ADAMTS9-AS2
TCF7L2
PCCA
TCF7L2
TCF7L2
TCF7L2
TCF7L2
C10orf67
Unknown

3
10
13
10
10
10
10
10
5

rs699253
SNP ID

PDE4B
GeneName

1
Chr

rs4635456
rs4972250
rs4842244

SEMA6B
Unknown
RXRA

19
2
9

rs2371765
rs7901695
rs7991210
rs12243326
rs4132670
rs12255372
rs4506565
rs11013381
rs6893115

ADAMTS9-AS2 3
TCF7L2
10
PCCA
13
TCF7L2
10
TCF7L2
10
TCF7L2
10
TCF7L2
10
C10orf67
10
Unknown
5

rs699253

PDE4B

1

26

P VC
P CON P LIN
ﬁtted with VC model
3.05E-06 9.49E-07 0.0117
3.99E-06 2.21E-06 1.65E-06
4.18E-06 1.25E-06 2.91E-06
ﬁtted with LM model
6.82E-09
0.2909
1.49E-06
0.8638
2.80E-07
0.2234
1.14E-06
0.6896
1.64E-06
0.8560
1.89E-06
0.7372
2.66E-06
0.8546
7.83E-05
0.8865
9.19E-05
0.8287
ﬁtted with LMI model
3.93E-05
0.0108
0.6792
P LM
P LMI
PI
ﬁtted with VC model
0.6299
1.58E-05
2.91E-06
0.2772
0.1982
0.1516
0.7146
0.0886
0.0299
ﬁtted with LM model
2.38E-10∗ 1.88E-09
0.8140
1.72E-08∗ 1.06E-07
0.5633
1.81E-08∗ 7.07E-08
0.2655
1.87E-08∗ 8.88E-08
0.3570
1.94E-08∗ 1.07E-07
0.4632
2.93E-08∗ 1.49E-07
0.4076
3.31E-08∗ 1.87E-07
0.4967
1.32E-06
7.74E-06
0.7106
1.79E-06
1.11E-05
0.9266
ﬁtted with LMI model
1.5E-04
4.21E-06 0.00125

Figure 2.6: The Manhattan plot of -log10(p-values) for testing H0 : β(X) = 0 when ﬁtting
the VC model to the male data set.
The mid-panel in the table listed SNPs with the strongest signals ﬁtted with the LM
model. The Pconst values for the SNPs were all large (>0.05), which suggests that β(X) was
a constant and there was no G×E interaction for these SNPs. Hence the LM model assuming
no interaction gave the strongest signals. The bottom SNP in the table had the strongest
signal when data were ﬁtted with the LM-I model since we rejected constant coeﬃcient
27

Figure 2.7: The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when ﬁtting the
LM model to the male data set.
(P CON=0.0108) but failed to reject linear coeﬃcient (P LIN=0.6792).
Among the SNPs listed in the table, some have been reported in other studies. For
example, transcription factor 7-like 2 (TCF7L2) is an intensively examined gene associated
with a broad categories of diseases, including Type 2 Diabetes. The causal genetic association
between SNPs of the gene and the Type 2 Diabetes was ﬁrst reported in Grant et al. [59] and
28

Figure 2.8: The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0 when ﬁtting
the LM-I model to the male data set.
was subsequently replicated in many ethnic groups (Jin and Liu [60]). As the SNPs in this
gene are not sensitive to obesity, it is not surprise that they can be identiﬁed in other studies
by using methods assuming a linear relationship. But our method identiﬁed three more that
show nonlinear G×E relationship, even though they did not pass the genome-wide Bonferroni
threshold. We also did QQ plot of the p-values for data ﬁtted with the three models. The
29

p-values are quite uniformly distributed and only a few showing departure from the expected
values (see the QQ plot). This indicates that the models have no serious inﬂation of false
positives and the strong signals are likely to be true.

Figure 2.9: The QQ plot of genome-wide p-values for the male data, when data are ﬁtted
with the VC model

30

Figure 2.10: The QQ plot of genome-wide p-values for the male data, when data are ﬁtted
with the LM model

Figure 2.11: The QQ plot of genome-wide p-values for the male data, when data are ﬁtted
with the LM-I model
31

Figure 2.12: The QQ plot of genome-wide p-values for the female data, when data are ﬁtted
with the VC model

Figure 2.13: The QQ plot of genome-wide p-values for the female data, when data are ﬁtted
with the LM model
32

Figure 2.14: The QQ plot of genome-wide p-values for the female data, when data are ﬁtted
with the LM-I model

33

Figure 2.15: The Manhattan plot of -log10(p-values) for testing H0 : β(X) = 0 when ﬁtting
the VC model to the female data set.

Figures 2.15–2.17 showed the Manhattan plot of the -log10(p-values) for the female data.
Even though no SNPs passed the genome-wide Bonferroni threshold, we did see stronger
signals ﬁtted by the VC model. Those SNPs that passed the suggestive threshold are listed
in Table 2.2. Again, gene TCF7L2 does not show sign of sensitivity to obesity to aﬀect

34

Figure 2.16: The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when ﬁtting the
LM model to the female data set.
T2D risk. Gene GLI2 shows sign of interaction with obese to aﬀect T2D risk. Two SNPs
in gene NRIP1 located on chromosome 21 show sign of nonlinear interaction with obese to
aﬀect T2D risk. In comparison to the male data, it is clear that SNP eﬀects are stronger in
the male population than in the female population. Moreover, the genetic eﬀects in females
are relatively more sensitive to obesity to aﬀect T2D risk. In summary, strong sex-speciﬁc
35

Figure 2.17: The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0 when
ﬁtting the LM-I model to the female data set.
genetic eﬀects were observed, for example, those SNPs on chromosome 2, 3, 4, and 21.
To further demonstrate the utility of the method, we plotted the dynamic eﬀect of SNP
rs13050325 on chromosome 21 from the female data (upper panel) and SNP rs4635456 on
chromosome 19 from the male data (lower panel). The two curves on the left side of Figure
2.18 showed the estimated dynamic genetic eﬀect as a function of BMI ﬁtted with the B36

Table 2.2: List of SNPs with p-value < 5E-06 in the NHS (Female) data set
SNP ID

GeneName

Chr

P VC

P CON P LIN
ﬁtted with VC model
rs13050325 NRIP1
21
3.79E-07 3.77E-06 0.0016
rs2331061 LANCL2
7
8.60E-07 1.30E-06 5.26E-07
rs1466042 GLI2
2
1.10E-06 3.48E-06 0.0389
rs11145373 VPS13A
9
2.63E-06 8.41E-04 2.56E-04
rs3775043 UNC5C
4
2.95E-06 0.0018
6.96E-04
rs12627409 NRIP1
21
4.00E-06 2.38E-05 0.0441
ﬁtted with LM model
rs10519107 RORA
15
4.84E-05
0.8381
rs809736
RORA
15
4.96E-05
0.8145
rs4506565 TCF7L2
10
4.35E-05
0.4953
rs7901695 TCF7L2
10
4.42E-05
0.4895
rs12255372 TCF7L2
10
1.2E-04
0.5576
rs4368343 Unknown
2
1.88E-04
0.7537
ﬁtted with LMI model
rs2677528 GLI2
2
2.63E-06
2.62E-05 0.0732
rs7978946 Unknown
12
3.09E-05
0.0117
0.6078
rs887370
TSHZ2
20
3.63E-05
1.45E-05 0.5868
SNP ID
GeneName Chr P LM
P LMI
PI
ﬁtted with VC model
rs13050325 NRIP1
21
0.0062
1.23E-05
1.0E-04
rs2331061 LANCL2
7
0.0703
0.1456
0.4471
rs1466042 GLI2
2
0.0241
1.61E-06
3.37E-06
rs11145373 VPS13A
9
1.27E-04
6.2E-04
0.7679
rs3775043 UNC5C
4
6.11E-05
2.56E-04
0.4929
rs12627409 NRIP1
21
0.0119
5.60E-06
2.38E-05
ﬁtted with LM model
rs10519107 RORA
15
8.52E-07 3.72E-06
0.3802
rs809736
RORA
15
9.22E-07 5.84E-06
0.8961
rs4506565 TCF7L2
10
1.69E-06 4.66E-06
0.2018
rs7901695 TCF7L2
10
1.75E-06 4.30E-06
0.1729
rs12255372 TCF7L2
10
4.47E-06 9.73E-06
0.1543
rs4368343 Unknown
2
4.75E-06 2.53E-05
0.6320
ﬁtted with LMI model
rs2677528 GLI2
2
0.0064
2.16E-06 1.55E-05
rs7978946 Unknown
12
1.04E-04
3.62E-06 0.0016
rs887370
TSHZ2
20
0.4492
4.46E-06 9.30E-07

37

spline function. We can see clear nonlinear genetic eﬀects over BMI, which indicates nonlinear
interaction between BMI and the variants. The ﬁgures in the right panel show the plot of
ﬁtted probabilities against individual BMI values corresponding to diﬀerent genotypes. We
coded the heterozygote as 0 in our model. This implies that the green curves in the two plots
correspond to the mean ﬁtted probability when G = 0. In general, the risk of T2D increases
as BMI increases. This is consistent with our prior knowledge that the disease prevalence is
strongly associated with body weight (McCarthy[61]).
For SNP rs13050325 on chromosome 21, the allele frequency for the minor allele G is
0.2587. For SNP rs4635456 on chromosome 19, the allele frequency for the minor allele G
is 0.3771. In both cases, the overall trend for T2D risk for the baseline (corresponding to
genotype AG) increased as BMI level increases (green curve). However, individuals carrying
AA genotype had much higher chance to develop T2D than those carrying AG or GG
genotype. Man with genotype AA had the lowest risk of conferring T2D susceptibility when
BMI level was below 28 in male and below 33 in female. After the transition points, the
AA genotype triggers larger eﬀect, resulting in higher risk of T2D. The association signals
for both LM and LM-I model are weaker than the one ﬁtted with the VC model, leading
to potential mis-identiﬁcation of these variants. The results oﬀered personalized preventive
suggestions based upon our ﬁndings ﬁtted with the VC model. For example, man carrying
genotype AA at this SNP locus should pay more attention to control their body weight if
their BMI level is above 28 to avoid the risk of T2D.

38

1
Fitted Probability

1

ˆ
β(X)

0.5
0
−0.5
−1

20

30

40

0.2
0

Fitted Probability

−1

30

20

40

0.8

30

40

50

BMI

1

−0.5

20

AA
AG
GG

0.4

50

0

ˆ
β(X)

0.6

X(BMI)

0.5

−1.5

0.8

AA
AG
GG

0.6
0.4
0.2

X(BMI)

20

30

40

BMI

Figure 2.18: The estimated varying-coeﬃcient function and ﬁtted probability of SNP
rs13050325 (upper panel) on chromosome 21 of female population and SNP rs4635456 (lower
panel) on chromosome 19 of male population

2.7

Discussion

It is broadly recognized that naturally occurring variations in most complex disease traits
have a genetic basis. However, the degree of variability is believed to have a strong environmental component in addition to genetic causes for many disease traits such as obesity
and Type 2 Diabetes (Qi and Cho [62]). Recent eﬀorts on epigenetics study reveals the
importance of epigenetic modiﬁcation on complex diseases (Liu et al. [63]). These epige-

39

netic changes involve major chromatin remodeling processes such as DNA methylation and
histone modiﬁcation. Being the environmentally driven plasticity at the DNA level, these
structural changes at the DNA level reveal the interplay of gene-environment interaction
in the regulation of phenotype, which is increasingly recognized as the epigenetic basis of
many complex diseases (Liu et al. [63]). Large eﬀorts have been devoted to the exploration
of epigenetic mechanisms for a better understanding of the molecular machinery underlying
complex diseases (Feinberg and Irizarry [64]). However, how environment mediates epigenetic changes to aﬀect phenotypic plasticity is still poorly understood, largely due to the
lack of powerful statistical methods to dissect this complicated process.
In this chapter, we proposed a novel statistical method by modeling the genotypic eﬀect
on disease risk as a dynamic function of environment mediators. Our model is built upon
well-studied statistical varying-coeﬃcient model implemented with the nonparametric spline
technique to estimate the varying coeﬃcients. The model extends out previously developed
method on continuous traits to a case-control population-based design. Simulation studies
show dramatically improved power when the underlying genetic penetrance behaves nonlinearly under certain environmental stimulus. Our model can capture the dynamic changes of
the gene functions over environmental changes, hence has particular power to tackle longstanding genetic questions regarding gene action and phenotypic plasticity (Feinberg [44]).
Our simulation studies indicate that model mis-speciﬁcation is an issue in G×E study.
The power to detect genetic signals is heavily dependent upon the models to ﬁt the data.
Simple models are always the ﬁrst choice due to their simplicity to interpret. However, if
they cannot capture the underlying functional mechanism, they suﬀer tremendously from
power loss. For example, if the true genetic eﬀect does vary nonlinear across environmental
changes, ﬁtting a simple linear model would loss power (Figure 2.3). On the other hand,
40

complex models always suﬀer from large degrees of freedom for testing. We proposed a
sequential testing procedure to assess if a simpler model ﬁts the data better. The real data
analysis conﬁrms that this strategy works. For example, when testing constant shows that
there is no G×E interaction, the model with linear predictor and without interaction term
gives the smallest testing p-values (see Table 2.1 and Table 2.2). In real data analysis, one
should always start by assessing constant coeﬃcient ﬁrst, then move to test linear or varying
coeﬃcients.
We applied our model to two Type 2 Diabetes data sets. Cornelis et al. [65] evaluated
seven statistical models to dissect G×E interactions using the same data sets. Both Cornelis et al. [65] and our work treated BMI as the environmental factor. Cornelis et al. [65]
claimed that specifying BMI as a continuous covariate will lead to inﬂated type 1 error,
which has consequence in detecting increased number of false positives as the true signal.
They converted the continuous environment factor BMI into a binary variable prior to further comparisons of all the 7 models. However, this conversion will result in information
loss, which might be the reason that no G×E interaction signals passed the genome-wide
signiﬁcance levels for all the seven models in both data sets in their analysis (Cornelis et al.
[65]). In our approach, we allowed the nonlinear eﬀect of BMI on Type 2 Diabetes (modeled
by function α(X)) rather than treating it as a linear function (i.e., α0 + α1 X). This greatly
alleviated the type 1 error inﬂation compared to a model ﬁtted with a linear function in
BMI (data not shown). In our analysis, several signals reached the genome-wide signiﬁcance
level, which is a piece of convincing evidence for keeping the continuous BMI measure as an
environmental variable.
In the real data analysis, we observed strong sex-speciﬁc variants associated with T2D.
There were not much overlap between genes identiﬁed in both data sets except for SNPs
41

in gene TCF7L2. Identiﬁcation of SNPs in gene TCF7L2 on the pathogenesis of Type 2
Diabetes has been successfully replicated from diﬀerent populations (Grant et al. [59]).
This information indicates the robustness of our model. In addition, we observed stronger
signals in the male data evidenced by seven SNPs from three genes reaching the genomewide signiﬁcance threshold (cutoﬀ=7.9E-8), as shown in Table (2.1). However, we observed
stronger BMI×G interaction to aﬀect T2D in females than in males evidenced by more
nonlinear G×E interaction in the female data set (Table 2.2). We could miss these signals if
we only focused on linear predictor models. In a recent investigation of a Italian population,
Vaccaro et al. [66] found a signiﬁcantly higher average BMI levels in diabetes women. So
possibly certain genes may be sensitive to high BMI level to increase T2D risk. Our model
provides a testable framework to identify the underlying genetic blueprint sensitive to obese
changes to aﬀect T2D risk. The results obtained by our model can be applied to pathway
or gene-set enrichment analysis to identify potential sex-speciﬁc pathways for T2D.
In this chapter, we generalized the VC model for continuous quantitative response to the
case-control binary response. There is several ongoing work worthy of further investigation.
First, the model can be easily extended to other types of phenotype data, such as count data
or survival data by applying diﬀerent link functions. Second, more replication studies are
needed by applying our approach to Type 2 Diabetes of diﬀerent ethnic groups to further
conﬁrm the robustness of the method. Third, it is worth noting that the interesting result
reported by Perry et al. [52] that stratiﬁcation on the Type 2 Diabetes patients based on
BMI might help enrich the signiﬁcance of potential susceptibility loci. We could also try
to carry out analysis to test if this hypothesis leads to any new discoveries based on the
VC model. Finally, our model can easily incorporate population stratiﬁcation (PS) eﬀect by
ﬁrst doing a principle component analysis using software such as EIGENSTRAT (Price et
42

al. [67]), then incorporate those PCs as covariates into the model to account for the eﬀect
of PS.

43

Chapter 3
High Dimensional Variable Selection
In Gene-Environment Interactions

3.1

Introduction

Gene-environment (G×E) interaction has been traditionally examined by assessing genetic
responses corresponding to various environmental stimuli, which provides novel insight in
elucidating the genetic basis of complex diseases, because the disease risk is not only contingent on genetic risk factors, but also on the environmental pressures, as well as the interplay
between them. The environmental pressure could be either discrete or continuous. When
it comes to a G×E interaction study related to asthma, the environmental factor could be
discrete, such as smoking status (smoking v.s. non-smoking). A much more clear picture
on the interaction will be tangible if the environmental factor is evaluated on a continuous scale, since we can trace the varying patterns of genetic eﬀect responsive to changes in
environment.
Conventional statistical modelling of G×E interaction often requires a linear relationship
assumption between genetic and environmental factors, which could be violated in practice,
as pointed out in Ma et al [37] and Wu and Cui [68]. Consequently, a varying coeﬃcient
(VC) model framework, together with a sequence of goodness-of-ﬁt tests, were proposed in

44

[37] and [68] for continuous and binary responses respectively, to track down the dynamic
features of genetic responses to environmental pressures. Because of the particular power and
ﬂexibility of VC models to capture the variations in regression coeﬃcients, the framework
demonstrated signiﬁcant advantage over the conventional methods especially in the presence
of non-linear G×E interaction
Unlike the predominant single genetic variant based approaches dissecting G×E interactions, such as the parametric methods in Guo [34], non-parametric methods in Ma et al
[37] and Wu and Cui [68], and semi-parametric methods in Chatterjee et al [69] and Maity
et al [70], we propose a variant set based framework to investigate how variants in a set
are mediated by a common environment factor to aﬀect the phenotypic response, since it
has been increasing acknowledged the merit of set based association analysis, such as in the
gene-centric analysis in Cui et al [39] and Wu and Cui [40], gene-set analysis in Schaid et
al [71] and Efron and Tibshirani [72], as well as the pathway based analysis in Wang et al
[38]. When the number of variants within the genetic system is large, the problem can be
approached from the entry point of high dimensional variable selection. In particular, we
can select genetic variants with varying, non-zero constant and zero coeﬃcients, which are
corresponding to scenarios of G×E interactions, no G×E interactions and no genetic eﬀects,
respectively. To the best of our knowledge, this is the ﬁrst time that the problem is tackled
from the angle of high dimension variable, on the contrary to the popular single genetic
variant based approaches coined in a hypothesis testing framework.
Through B spline basis expansion, the varying coeﬃcient function can be separated into
constant and varying portions, respectively. Then the distinction of the 3 eﬀects could
be achieved by penalizing the 2 portions in a two-stage iterative framework, as shown in
Tang et al.[73]. Though the asymptotic properties of the two stage estimator with adaptive
45

LASSO penalty were established in [73], the ﬁnite sample performance still has a large
margin to improve. Therefore we proposed a Smoothly Clipped Absolute Deviation (SCAD)
based approach to examine the separation of varying, non-zero constant and zero coeﬃcient
functions. Our approach has signiﬁcantly improved percentages of choosing the exact true
model and reduced error in parameter estimation. Assuming suitable regularity conditions,
we can establish the consistency in variable selection and eﬀect separation of our estimator,
as well as the optimal convergence rates of the estimates for varying eﬀect. Furthermore, it
can be shown that the estimate of non-zero constant coeﬃcient enjoys the oracle property,
that is, the asymptotic distribution of the non-zero constant coeﬃcient function is the same
as that when the true model is known in priori.
In this chapter, we describe the penalized least square estimation procedure via basis
expansion and SCAD penalty, as well as the computational algorithms. Next we present the
theoretical results including consistency in variable selection and oracle property. The merit
of the proposed approaches were demonstrated through extensive simulation study and real
data analysis. Discussions will be given at the end of the chapter. We relegate technical
proofs to the Appendix.

3.2
3.2.1

The proposed variable selection method
The penalized estimation via SCAD

Let (Xi , Yi , Zi ), i = 1, . . . , n be independent and identically distributed (i.i.d.) random
vectors, then the varying coeﬃcient (VC) model, proposed by Hastie and Tibshirani [74],

46

has the form
d

Yi =

βj (Zi )Xij + εi

(3.1)

j=0

where Xij is the jth component of (d+1)-dimensional vector Xi with the ﬁrst component Xi0
being 1, βj (·)’s are unknown varying-coeﬃcient functions, Zi ’s are the scalar index variable,
and εi is the random error such that E(ε|X, Z) = 0 and V ar(ε|X, Z) = σ 2 < ∞ .
The smooth functions {βj (·)}d in (3.1) can be approximated by polynomial splines.
j=0
Without loss of generality, suppose that Z ∈ [0, 1]. Let wk be a partition of the interval
[0,1], with kn uniform interior knots

wk = {0 = wk,0 < wk,1 < . . . < wk,kn < wk,kn +1 = 1}

Let Fn be the collection of functions on [0,1] satisfying (3.1) the function is a polynomial of degree p or less on subintervals Is = [wk,s , wk,s+1 ), s = 0, . . . , Nn − 1 and
INn = [wj,Nn , wj,Nn +1 ). (2) the functions are p − 1 times continuous diﬀerentiable on [0,1].
L

j
¯
¯
Let B(·) = {Bjl (·)}l=1 be a set of normalized B spline basis of Fn . Then for j = 0, . . . , d,

the VC functions can be approximated by basis functions βj (Z) ≈

Lj
¯ ¯
l=1 γjl Bjl (Z),

where

Lj is the number of basis functions in approximating the functions βj (Z). By changing of
equivalent basis, the basis expansion can be reexpressed as
Lj

βj (·) ≈

.
˜T
γjl Bjl (·) = γj1 + Bj (·)γj,∗

l=1

.
T
˜
the spline coeﬃcient vector γj = (γj,1 , γj∗ )T , and Bj (·) = (Bj2 (·), . . . , BjLj (·))T . γj,1 and
γj∗ correspond to the constant and varying part of the coeﬃcient function respectively. We
treat γj∗ as a group. If γj∗ 2 =0, then the jth predictor only has a non-zero constant eﬀect
47

and moreover, if γj,1 =0, then the predictor is redundant.
To carry out variable selection separating the varying, non-zero constant, and zero eﬀects,
we minimize the penalized least square function

Q(γ) =

n

1
n



d

Yi −
i=1

2

L

d

γjl Xij Bjl (Zi ) +
j=0 l=1

j=1

pλ1 ( γj∗ 2 )

d

+
j=1

(3.2)

pλ2 (|γj1 |)I( γj∗ 2 = 0)

where λ1 and λ2 are the penalization parameters, pλ (·) is the SCAD penalty function, deﬁned
as



 λx
if 0 ≤ x ≤ λ




(x2 −2aλx+λ2 )
pλ (x) =
if λ < x ≤ aλ
−
2(a−1)



 (a+1)λ2


if x > aλ
2

(3.3)

To express (3.2) by vectors and matrices, we redeﬁne
d

Q(γ) =(Y − U γ)T (Y − U γ) + n

pλ1 ( γj∗ 2 )
j=1

d

(3.4)

pλ2 (|γj1 |)I( γj∗ 2 = 0)

+n
j=1

T
T
where Y = (Y1 , . . . , Yn )T , γ = (γ0 , . . . , γd )T , Ui = (Xi0 B(Zi )T , . . . , Xid B(Zi )T )T , and
T
T
U = (U1 , . . . , Un )T . The function of the 2nd term of Q(γ) is to separate the varying and

constant eﬀects by penalizing the L2 norm of the varying part of the coeﬃcient functional.
The indicator functions in the 3rd term helps to penalize the variables of the constant eﬀects.
Both γj,1 and γj,∗ will be shrunk to zero if the predictor Xj has no eﬀect.

48

3.2.2

The computational algorithms

The SCAD penalty function is singular at the origin, and do not have continuous 2nd order
derivatives, therefore the regular gradient-based optimization cannot be applied. In this
section, we develop an iterative two-stage algorithm to minimize the penalized loss function
using local quadratic approximation to the SCAD penalty. Following Fan and Li [75], in a
neighbourhood of a given positive x0 ∈ R+ ,
p (x0 ) 2
pλ (x) ≈ pλ (x0 ) + λ
(x − x2 )
0
2x0
where pλ (x) = λ{I(x

(aλ−x)

+
λ) + (a−1)λ I(x > λ)} for a=3.7 and x >0. Here we use a similar

quadratic approximation by substituting x with γj∗ 2 and |γk1 | in LQA, for k = 0, ..., d.
Therefore we have

pλ ( γj∗ 2 ) ≈ pλ (

pλ (
0
γj∗ 2 ) +

0
γj∗ 2 )

0
2 γj∗ 2

0
( γj∗ 2 − γj∗ 2 )
2
2

(3.5)

and
0
pλ (|γj,1 |) ≈ pλ (|γj,1 |) +

0
pλ (|γj,1 |)
0
2|γj,1 |

0
(|γj,1 |2 − |γj,1 |2 )

(3.6)

The sets of predictors with varying, non-zero constant, and zero eﬀects are termed as
V, C and Z respectively. We implement the iterative algorithm in the following two-stage
procedure. At stage 1, using the LQA (3.5) and dropping the irrelevant constant terms, we
minimize
Q1 (γ) = (Y − U γ)T (Y − U γ) +

n T
γ Ωλ1 (γ0 )γ
2

(3.7)

where the initial spline vector γ0 is the unpenalized estimator, Ωλ1 (γ0 )=diag{Ω0 , Ω1 , . . . , Ωd },

49

where Ω0 = 0L , Ωj =

0,

0
0
pT ( γj∗ 2 )
pT ( γj∗ 2 )
λ1
λ1
,...,
0
0
γj∗ 2
γj∗ 2

for j = 1, . . . , d. Hence the estiL

mator can be iteratively obtained as

(m)

ˆ
γVC = U T U +

n
(m−1) −1 T
Ωλ1 (ˆVC
γ
)
U Y
2

(3.8)

Suppose that all the predictors are in V at the beginning. The jth predictor will be moved
to C if γj∗ 2 =0, otherwise it will stay in V.
ˆ VC
At stage 2, using the LQA (3.6) and dropping the irrelevant constant terms, we minimize
the penalized loss only for the predictors in C:

Q2 (γ) = (Y − U γ)T (Y − U γ) +

n T
γ Ωλ2 (ˆVC )γ
γ
2

(3.9)

where Ωλ2 (ˆVC )=diag{Ω0 , Ω1 , . . . , Ωd } with Ω0 = 0L ,
γ
Ωj =

VC
pT (|ˆj,1 |)
λ2 γ
I(
|ˆj,1 |
γ VC

ˆ VC
γj∗ L2 = 0), 0, . . . , 0

. The estimator can be iteratively obtained
L

as
(m)

ˆ
γCZ = U T U +

n
(m−1) −1 T
Ωλ2 (ˆCZ )
γ
U Y
2

(3.10)

If the jth predictor is in C, then it will be moved to Z if |ˆk,1 |=0, otherwise it stays in C.
γ CZ
ˆ
We can obtain the estimator γ at convergence from the iterative procedure between
ˆ
the above two stages, and the estimated coeﬃcient function in (3.1) as βj (z) = B T (z)ˆj .
γ
ˆ
ˆ
βj (z) will be a varying function in z, non-zero constant and zero if γj is in V, C and Z
correspondingly.

50

3.2.3

Selection of tuning parameters

In this section, we choose the tuning parameters N ,p, λ1 and λ2 from a data driven procedure.
N is the number of interior knots uniformly spaced on [0,1], p is the degree of the spline
basis. here p and N control the smoothness of the coeﬃcient functions, while λ1 and λ2
determine the threshold for variable selection.
At the beginning, we use BIC in Schwarz [76] to choose N and p. The range for N is
[max(

1
(2p+3)
0.5n

, 1),

1
(2p+3)
1.5n

], where x denotes the integer part of x. The optimal

pair of N and p can be achieved via a two-dimensional grid search, according to the following
criterion:
BICN,p = log(RSSN,p ) +

(N + p + 1)
log(n)
n

ˆ
ˆ
ˆ
where RSSN,p = (Y − U γ )T (Y − U γ )/n, γ = (ˆ0 , 0T , . . . , 0T )T . Conditional on the
γT
selected N and p, λ1 is the minimizer of

BICλ1 = log(RSSλ ) +
1

dfλ1
n

log(n)

ˆ
ˆ
ˆ
where RSSλ1 = (Y − U γλ1 )T (Y − U γλ1 )/n, γλ1 is the minimizer of (3.7), and dfλ1 is the
eﬀective degree of freedom, deﬁned as the total number of predictors in V and C.
ˆ
Conditional on γλ1 , λ2 is the minimizer of

BICλ2 = log(RSSλ ) +
2

dfλ2
n

log(n)

ˆ
ˆ
ˆ
where RSSλ2 = (Y − U γλ2 )T (Y − U γλ2 )/n, , γλ2 is the minimizer of (3.8), and dfλ2 is
the eﬀective degree of freedom, deﬁned similarly as dfλ1 .

51

3.3

Asymptotic results

Here we establish the asymptotic properties of the penalized least square estimators. Without
loss of generality, we assume there are v varying coeﬃcients as βj (·) ≡ βj (z),j = 1, . . . , v,
(c − v) non-zero constant coeﬃcients as βj (·) ≡ βj > 0, j = v + 1, . . . , c, and (d − c) zero
coeﬃcients as βj (·) ≡ 0, j = (c+1), . . . , d. Our asymptotic results are based on the following
assumptions.
(A1) Let Hr be the collection of all functions on the compact support [0,1] such that
the r1 th order derivatives of the functions are H¨lder of order b with r = r1 + r2 , i.e.,
o
|hr1 (z1 ) − hr1 (z2 )| ≤ C0 |z1 − z2 |r2 where 0 ≤ z1 , z2 ≤ 1 and C0 is a ﬁnite positive constant.
3
Then βj (z) ∈ Hr , j = 0, 1, . . . , v, for some r ≥ 2 .

(A2) The density function of the index variable Z, f (z), is continuous and bounded away
from 0 and inﬁnity on [0, 1], i.e., there exist ﬁnite positive constants C1 and C2 such that
C1 ≤ f (z) ≤ C2 for all z ∈ [0, 1].
(A3) Let λ0 ≤ . . . ≤ λd be the eigenvalues of E[XXT |Z = z]. Then λj (k = 0, . . . , d) are
uniformly bounded away from 0 and inﬁnity in probability. In addition, the random design
vector are bounded in probability.
(A4) For wj , the partition of the compact interval [0,1] deﬁned as {0 = wj,0 < wj,1 <
. . . < wj,kn < wj,kn +1 = 1}, j = 0, . . . , d, there exists ﬁnite positive constant C3 such that
max(wj,k+1 − wj,k , k = 0, . . . , kn )
≤ C3
min(wj,k+1 − wj,k , k = 0, . . . , kn )

(A5) The tuning parameters satisfy

1
2 max{λ , λ }
kn
1 2

1

−1
→ 0 and n 2 kn min{λ1 , λ2 } → ∞.

(A6) maxj {|pλ (|γj∗ |)| : γj∗ = 0} → 0 as n → ∞ and maxj {|pλ (|γj1 |)| : γj1 = 0} → 0
1
2

52

as n → ∞
(A7) lim infn→∞ lim infθ→0+ λ−1 pλ (θ) > 0 and lim infn→∞ lim infθ→0+ λ−1 pλ (θ) > 0
1
2
1
2
The above assumptions are commonly used in literature of polynomial splines and variable
selections. The assumption similar to (A1) could be found in Kim [77] and Tang et al [73].
(A1) guarantees certain degrees of smoothness of the true coeﬃcient function in order to
improve goodness of approximation. (A2) and (A3) are similar to those in Huang et al
[50, 51] and Wang et al [78]. (A4) suggests that the knot sequence is quasi-uniform on [0,1],
by Schumaker [79]. (A5-A7) are conditions on tuning parameters, of which (A5) could be
found in Tang et al [73]; (A6) and (A7) are similar to those in Fan and Li [75] and Wang et
al [78].
1

Theorem 1. Under the assumptions (A1-A7) and suppose kn = Op n 2r+1 , then we
have
ˆ
ˆ
(1) βj (z) are nonzero constant, j = v + 1, . . . , c and βj (z) = 0, j = c + 1, . . . , d, with
probability approaching 1;
−r

ˆ
(2) βj − βj 2 = Op (n 2r+1 ), j = 0, . . . , v.
Denote β ∗ = (βv+1 , . . . , βc )T as the vector of true nonzero constant coeﬃcients.
1

Theorem 2. Under the assumptions (A1-A7) and suppose kn = Op (n 2r+1 ), then with
n → ∞,
√
d
ˆ
n(β ∗ − β ∗ ) − − → N (0, σ 2 Σ−1 )
−−
where Σ is deﬁned in the Appendix.

53

3.4

Simulation

The performance of our proposed approach is demonstrated through extensive simulation
study in this section. We use the percentage of choosing the true model out of total R
replicates, or oracle percentage, to evaluate the accuracy of variable selection by identifying varying, non-zero constant and zero eﬀects. The precision of estimation is assessed by
integrated mean squared error (IMSE).
ˆ(r)
Let βj be the estimator of a nonparametric function βj in the rth (1

r

R) replica-

n

grid
ˆ(r)
tion, and {zm }m=1 be the grid points where βj is evaluated. We use the integrated mean
R
1
r=1 ngrid

1
ˆ
ˆ
squared error (IMSE) of βk (x), deﬁned as IMSE(βj (z))= R

ngrid (r)
ˆ
m=1 {βk (zm )

−

βj (zm )}2 , to evaluate the estimation accuracy of coeﬃcient βj , and the total integrated mean
squared error (IMSE) of all the d coeﬃcients (TIMSE), deﬁned as TIMSE=

d
ˆ
j=1 βj (z),

is

ˆ
used to evaluate the overall estimation accuracy. Note that IMSE(βj ) will be reduced to
ˆ
ˆ
MSE(βj ) when βj is a constant.
Example 3.1. We simulate data from the following VC model
d

Yi = β0 (Zi ) +

βj (Zi )Xij + εi
j=1

where the index variable Zi ∼Uniform(0,1), and the predictors Xi are generated from a
multivariate normal distribution with mean 0 and Cov(Zij , Z

ij

) = 0.5|j−j | for 0 ≤ j, j ≤ d.
j

The performance is evaluated under both d=10 and 50. Xi , j = 0, 1, 2 are of varying
j

eﬀects, Xi , j = 3, 4 are of non-zero constant eﬀects, and the rest variables are redundant.
The random error εi were generated from standard normal distributions and t distribution
with 3 degrees of freedom respectively. The coeﬃcients were set as: β0 (z) = sin(2πz),
β1 (z) = 2 − 3 cos{(6z − 5)π/3}, β2 (z) = 3(2z − 1)3 , β3 (z) = 2, β4 (z) = 2.5, and βj (z) = 0
54

for j = 5, . . . , 10. The results are listed in Figure 3.1 and Table 3.1.

N(0,1) error

t(3) error

1
0.8

0.6

0.6

0.4

0.4

0.2

Selection Ratio

0.8

0.2

0
1

1 2 3 4 5 6 7 8 9 10 11

1

0.8
Selection Ratio

ALASSO
SCAD

0.6

0.4

0.4

0.2

ALASSO
SCAD

0.8

0.6

1 2 3 4 5 6 7 8 9 10 11

0.2

0

10

20

30

0

40

Predictors

10

20

30

Predictors

Figure 3.1: The selection ratio of Example 3.1

55

40

Figure 3.1 shows the selection ratio for predictors under diﬀerent groups of d and error
distributions. The height of bars on the top panel of Figure 3.1 denote the selection ratio for
true positives for the ﬁrst 5 predictors, and false positives for the rest predictors. Under both
standard normal and and t(3) error, compared with the method based on adaptive LASSO,
our method is capable of correctly identifying signiﬁcant eﬀect with high percentages, and
has kept a very small percentages of choosing false positive predictors. In addition, our
method has relative stable performance in terms of correct selection ratio when dimension
grows.
The oracle percentage and parameter estimation results are summarized in Table 3.1. In
Tang et al [73], MSE was computed for constant coeﬃcients only when the corresponding
predictor is chosen with non-zero constant eﬀect. To reﬂect the overall estimation precision,
we compute IMSEs for all predictors, including β4 and β5 . When βj (j = 4, 5) is selected
as non-zero constant, IMSE reduces to MSE. The IMSEs will be calculated if βj (j = 4, 5)
incorrectly identiﬁed as varying. In all the pairs of d and error distribution type, the SCAD
approach demonstrates superior performance over the adaptive LASSO approach.
Example 3.2. Now we consider the simulation in genetics settings from the following VC
model
d

Yi = β0 (Zi ) +

βj (Zi )Xij + εi
j=1

where the SNP Xi was coded with 3 categories (1,0,-1) for genotypes (AA,Aa,aa) respectively. We simulate the SNP genotype data based on the pairwise linkage disequilibrium(LD)
structure. Suppose the two risk alleles A and B of two adjacent SNPs have the minor allele
frequencies (MAFs) qA and qB , respectively, with LD denoted as δ. Then the frequencies
of four haplotypes can be expressed as qab = (1 − qA )(1 − qB ) + δ, qAb = qA (1 − qB ) − δ,

56

Table 3.1: Simulation results of Example 3.1

d=10

N (0,1) error
t(3) error
SCAD ALASSO Oracle SCAD ALASSO
Oracle Perc. 0.972
0.82
1
0.92
0.315
IMSE
β0 (u)
0.0214
0.0243
0.0216 0.0398
0.0448
β1 (u)
0.0902
0.0930
0.0951 0.1166
0.1254
β2 (u)
0.0365
0.1018
0.0431 0.0764
0.2211
0.6248
β3 (u)
0.0122
0.2405
0.0032 0.0753
β4 (u)
0.0045
0.0405
0.0031 0.0183
0.1713
TIMSE
0.1648
0.5075
0.1661 0.3282
1.3000

d=50 Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

Oracle
1
0.1929
0.3392
0.5859
0.1775
0.1100
0.4017

0.945

0.635

1

0.8

0.012

1

0.0221
0.0878
0.0404
0.0478
0.0101
0.2086

0.0230
0.0896
0.0551
0.0776
0.0165
0.2966

0.0219
0.0927
0.0428
0.0027
0.0029
0.1631

0.0431
0.1230
0.1042
0.1727
0.0239
0.5146

0.0612
0.1477
0.0969
0.0771
0.0608
2.4926

0.0426
0.1253
0.0751
0.0105
0.0083
0.2619

qaB = (1 − qA )qB − δ, and qAB = pA pB + δ. With the Hardy-Weinberg equilibrium assumption, the SNP genotype at locus A can be simulated assuming a multi-nomial distribution
with frequencies p2 ,2pA (1 − pA ) and (1 − pA )2 for genotypes (AA,Aa,aa) correspondingly.
A
We can subsequently generate the SNP2 genotypes conditional on SNP1 can be simulated
based on the conditional probability matrix in Cui et al. [39]. The non-zero coeﬃcients of
the model are the same as those in Example 1. The simulation with sample size 500 were
performed 500 replicates.
Figure 3.2 show the selection ratio when d=10, under diﬀerent combinations of MAF and
error distributions. The height of bars is deﬁned similarly as that in Example 1. When the
random error is standard normal, our approach has higher proportions to choose true positive
SNPs, especially those with no eﬀect on G×E interactions, and lower proportions to choose
57

false positive SNPs. When MAF increases, both approaches lead to higher selection ratios
for true positive SNPs and lower selection ratios for false positive SNPs. A similar pattern
can be observed when the error distribution follows a t(3) distribution. The performance of
both approaches is better when the error is normal. An analogous conclusion can be reached
in Figure 3.3 when d=50.
Table 3.2 presents the oracle proportions and estimation results for d=10. Under the
standard normal error, we observe the superior performance of our approach over the adaptive LASSO based approach in terms of both oracle percentage and estimation precision.
The estimation accuracy of our approach is pretty close to that of the true model. The
accuracy of all the 3 methods improves as MAF increases from 0.1 to 0.5. The performance
of our method under t(3) error are still comparable to that under the standard normal error,
while the ALASSO method did much worse for the t(3) error. A similar pattern can be
observed for the high dimensional case (d=50) in Table 3.3. Our approach is still powerful
in high dimensional scenario, especially under the N (0,1) random error, while the ALASSO
approach barely select the correct model.

58

p=0.1

p=0.3

p=0.5

Selection Ratio,t(3) error

Selection Ratio,N(0,1) error

1
0.8

0.8

ALASSO
SCAD 0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0
1

2

4

6

8 10

2

4

6

8 10

2

0.8

0.8
0.6
0.4
0.2

2

4

6

8 10

0.4

0.2

8 10

0.6

0.4

6

0.8

0.6

4

0.2

0

2

4

6

8 10

2

4

6

8 10

SNP

Figure 3.2: The selection ratio of Example 3.2, d = 10

59

p=0.1
Selection Ratio,N(0,1) error

1

p=0.5
1

ALASSO
SCAD

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0
1
Selection Ratio,t(3) error

p=0.3
1

10 20 30 40

0
1

10 20 30 40

0.8

0
1

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

10 20 30 40

0.2

0

10 20 30 40

0

10 20 30 40

0

10 20 30 40

SNP

Figure 3.3: The selection ratio of Example 3.2, d = 50

60

Table 3.2: Simulation results of Example 3.2, d = 10

pA=0.1

pA=0.3

pA=0.5

N (0,1) error
t(3) error
SCAD ALASSO Oracle SCAD ALASSO
Oracle Perc. 0.976
0.784
1
0.72
0.268
IMSE
β0 (u)
0.0863
0.1250
0.0891 0.3078
1.6608
β1 (u)
0.1611
0.1601
0.1667 0.3285
0.3947
β2 (u)
0.1264
0.1358
0.1238 0.4890
1.2776
2.8155
β3 (u)
0.0270
0.1183
0.0192 1.3307
β4 (u)
0.0191
0.0433
0.0174 0.2943
2.1633
TIMSE
0.4205
0.6106
0.4162 2.9342
9.2044
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

Oracle
1
0.2247
0.3557
0.2932
0.0643
0.0475
0.9855

0.992

0.84

1

0.91

0.33

1

0.0268
0.1071
0.0561
0.0086
0.0066
0.2007

0.0297
0.1074
0.0551
0.0271
0.0118
0.2404

0.0273
0.1174
0.0637
0.0084
0.0065
0.2233

0.0607
0.1600
0.1360
0.1111
0.0443
0.5311

0.0975
0.2065
0.1373
0.1216
0.1125
1.3069

0.0601
0.1746
0.1320
0.0237
0.0222
0.4126

0.98

0.846

1

0.894

0.34

1

0.0213
0.1044
0.0497
0.0077
0.0063
0.1895

0.0214
0.1043
0.0507
0.0210
0.0103
0.2177

0.0214
0.1106
0.0604
0.0077
0.0063
0.2065

0.0431
0.1581
0.1101
0.0439
0.0240
0.4072

0.0485
0.1721
0.3270
0.7984
0.3082
1.8768

0.0451
0.1725
0.1170
0.0192
0.0135
0.3673

61

Table 3.3: Simulation results of Example 3.2, d = 50

pA=0.1

pA=0.3

pA=0.5

N (0,1) error
t(3) error
SCAD ALASSO Oracle SCAD ALASSO
Oracle Perc. 0.908
0.542
1
0.435
0.025
IMSE
β0 (u)
0.1929
0.9911
0.0884 0.5687
1.2335
β1 (u)
0.2064
0.1988
0.1684 0.3851
0.3484
β2 (u)
0.5235
0.8382
0.1218 0.6934
0.4432
0.7892
β3 (u)
2.0918
2.0345
0.0196 2.4522
β4 (u)
0.3475
0.4798
0.0158 0.5996
0.4671
TIMSE
3.3644
4.7239
0.4140 5.7021
8.9145
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

Oracle
1
0.2209
0.3340
0.2614
0.0484
0.0445
0.9092

0.986

0.642

1

0.745

0.06

1

0.0289
0.1107
0.0817
0.1083
0.0229
0.3526

0.0732
0.1124
0.1834
0.4072
0.0748
0.9334

0.0278
0.1137
0.0646
0.0075
0.0068
0.2204

0.0860
0.1858
0.2205
0.3865
0.0840
1.2288

0.1970
0.1974
0.1768
0.2018
0.1099
3.3013

0.0599
0.1742
0.1301
0.0254
0.0220
0.4117

0.988

0.706

1

0.8

0.07

1

0.0215
0.1048
0.0608
0.0470
0.0120
0.2461

0.0232
0.1073
0.1269
0.2846
0.0444
0.6426

0.0216
0.1123
0.0579
0.0078
0.0053
0.2050

0.0450
0.1551
0.1754
0.1681
0.0480
0.6492

0.0560
0.1716
0.1525
0.1501
0.0889
2.8755

0.0434
0.1608
0.1085
0.0167
0.0190
0.3484

62

3.5

Real data analysis

We applied the method to a real dataset from a study conducted at Department of Obstetrics
and Gynecology at Sotero del Rio Hospital in Puente Alto, Chile. The initial objective of
the study was to pinpoint genetic variants associated with a binary response indicating large
for gestational age (LGA) or small for gestational age (SGA) depending on new born babies’
weight and mother’s gestational age. After data cleaning by removing SNPs with MAF less
than 0.05 or deviation from Hardy-Weinberg equilibrium, the dataset contains 1536 new
born babies with 189 genes covered by 660 single nucleotide polymorphisms (SNPs).
Mother’s body mass index (MBMI), deﬁned as mother’s body mass (kg) divided by the
square of their height (m2 ), is a measure for mothers’ body shape and obesity condition.
The environment factor for a baby inside mother’s body is deﬁned through the mother, such
as mother’s obesity condition (MBMI) or age. Due to the complicated interaction between
fetus’s genes and mother’s obesity level, the birth weight might be diﬀerent for a fetus with
the same gene but under diﬀerent environment conditions. The phenomenon of regular
variation in birth weight could be explained by corresponding genetic variants and how they
respond to diﬀerent MBMI.
We applied both methods to the Janus kinase/signal transducers and activators of transcription (JAK/STAT) signaling pathway, which has 68 SNPs covering 24 genes in our real
data. JAK/STAT signaling pathway is the main signaling mechanism for a broad range of
cytokines and growth factors in mammals [80]. Our method select the model

Y = β0 (z) + βk Xk + ε

where Xk corresponds to SNP 2069762, and βk is a constant, while the ALASOO method
63

only identiﬁes the varying intercept. SNP 2069762 is of non-zero constant eﬀect, therefore
this one is a genetic risk factor associated with birthright but not sensitive to MBMI to
inﬂuence birth weight.
To further validate our result, we conducted the single SNP based analysis in Ma et al
[37] and tabulate SNPs with p-value less than 0.001 when ﬁtting the candidate models (LM,
LMI and VC). The p-values for the overall genetic association test with the LM, LMI and
VC model are denoted as P CON, P LIN and P VC. It follows from the test on constant
coeﬃcient that SNP 2069885 does not vary across MBMI (PP CON¡ 0.05). Consequently,
the p-value obtained from test with LM model is less than those obtained from VC and LM
model.
Table 3.4: List of SNPs with p-value < 0.001 from the Jak-STAT signaling pathway
SNP ID

GeneName

Location

P VC

P CON

P LIN

P LM

P LMI

PI

LM model
2069885

3.6

IL9

exon

0.0014

0.0913

-

7.32E-05 8.93E-05

0.0875

Discussion

The signiﬁcance of G×E interactions in complex traits has stimulated waves of discussion.
A diversity of statistical models have been proposed to assess the gene eﬀect under diﬀerent
environmental exposures, as reviewed in Cornelis et al [65]. The success of genetic variant
set based association analysis, as shown in Wang et al [38], Cui et al [39], Wu and Cui
[40] and Schaid et al [71], motivates us to propose a high dimensional variable selection
approach to understand the mechanism of G×E interactions associated with complex traits.
We adopt a penalized regression method within the VC model framework to investigate how
64

multiple variants within a genetic system, like the pathway, were mediated by a common
environmental factor to inﬂuence the phenotypic response.
Variable selection and parameter estimation can be achieved simultaneously within the
framework. The varying coeﬃcient function are divided into 2 parts after B spline basis
expansion, for the non-zero constant and varying eﬀect. We can determine if a particular
genetic variant is sensitive to environmental stimuli by examining the status of the coeﬃcient function. Speciﬁcally, the presence of G×E interactions, no G×E interactions and no
association with the phenotype are corresponding to varying, non zero constant and zero
eﬀects of the coeﬃcients. A two-stage iterative procedure was developed in Tang et al [73]
to distinguish diﬀerent eﬀects. Here, we adopted the framework and carried out the separation with SCAD penalty. Asymptotic properties of the two-stage estimator were established
under suitable regularity conditions.
A comprehensive comparison between our method and that in Tang et al [73] was conducted in the simulation session, in terms of two criteria, the percentage of choosing the
exact true model and the precision in parameter estimation. The estimation accuracy was
calculated as IMSE for varying coeﬃcients and MSE for constant coeﬃcients. In Tang et
al [73], for the predictors with non-zero constant coeﬃcients, MSE was calculated when the
predictor is corrected identiﬁed. However, this won’t reveal the error caused by failure to
classify the coeﬃcient as non-zero constant. Instead, we suggest calculating IMSE for all
the predictors, since IMSE reduces to MSE when the coeﬃcient is a constant. A much more
accurate assessment on assessing the performance of the model can thus be achieved. The
advantage of our approach has been endorsed by the extensive simulation study and real
data analysis.
Both algorithms are based on local quadratic approximations (LQA) to the penalty func65

tions, which suﬀers from the eﬃciency loss caused by repeated factorizations of large matrices. LQA limits the power of the framework to dissect G×E interactions when the dimension
is large, especially in cases where p > n. We will integrate group coordinate descent (GCD)
approach into the current framework and demonstrate the merit of the new scheme in next
chapter.

3.7
3.7.1

Technical proofs
Useful notations and lemmas

For convenience, the following notations are adopted :
¯
¯ ¯
¯
¯
Y = E(Y |X, T ), γ = (U T U )−1 U T Y , β = B γ
T
T
T
T
T
T
γ(v) = (γ0 , . . . , γv )T , γ(c) = (γv+1 , . . . , γc )T , γ(d) = (γv+1,1 , . . . , γd,1 )T ,

γ(v) = (˜0 , . . . , γv )T , γ(c) = (˜v+1 , . . . , γc )T , γ(d) = (γv+1,1 , . . . , γd,1 )T ,
˜
γT
˜T
˜
γT
˜T
˜
Gn = (B(z1 ), . . . , B(zn ))(B(z1 ), . . . , B(zn )T , ε = (ε1 , . . . , εn )T
Φn = n−1

n
T
i=1 U(v)i U(v)i ,

Ψn = n−1

n
T
i=1 U(v)i U(c)i ,

Λi = U(c)i − ΨT Φ−1 U(c)i
n n

First we provide several lemmas to facilitate the proofs of Theorems 1 and 2.
Lemma A.1. Under assumptions (A1-A3), there exists ﬁnite positive constants C1 and
C2 such that all the eigenvalues of (kn /n)Gn fall between C1 and C2 , and therefore, Gn is
invertible.
Lemma A.2. Under assumptions (A1-A3), for some ﬁnite constant C0 , there exists
γ = (˜0 , . . . , γd )T satisfying
˜
γT
˜T
(1) γj∗ L2 > C0 , j = 0, . . . , v; γj1 = βj , γj∗ L2 = 0, j = v + 1, . . . , c; γj = 0,
˜
˜
˜
˜
j = c + 1, . . . , d
−r
(2) supt∈[0,1] |βj (z) − B(z)T γj | = Op (kn ), j = 0, . . . , d, where γj = (˜j,1 , γj∗ )T
˜
˜
γ
˜T

66

−r
(3) sup(t,x)∈[0,1]×Rd+1 |X T β(z) − U (X) γ | = Op (kn )
˜

3.7.2

Proofs of Theorem 1.

(A) Proof of Theorem 1(1) (Part 1)
ˆ
Here we ﬁrst show βj (z) is constant for j = v + 1, . . . , d in probability, which amounts to
demonstrating γj∗ j = 0, j = v + 1, . . . , d with probability tending to 1, as n → ∞. For
ˆ vc
n

Yi − UiT γ

Q1 (γ) =

2

d

+n

i=1

j=1

pλ1 ( γj∗ )

(3.11)

1
ˆ
˜
Let αn = n− 2 kn + an and γ vc = γ + αn δ. We want to show that for any given ε > 0, there

exists a large constant C such that

P inf δ =C Q1 (ˆ vc ) ≥ Q1 (˜ ) ≥ 1 − ε
γ
γ

(3.12)

This suggests that with probability at least 1 − ε there exists a local minimum in the ball
ˆ
˜
{˜ + αn δ : δ ≤ C}. Hence, there exists a local minimizer such that γ vc − γ = Op (αn ).
γ
A direct computation yields

Dn (δ) = Q1 (ˆ vc ) − Q1 (˜ )
γ
γ
n

n
T
εi + X1 r(zi )

= −2αn δ

UiT

2
+ αn δ 2

i=1

i=1

d

+n
j=1

UiT Ui

pλ1 ( γj∗ ) − pλ1 ( γj∗ )
ˆ vc
˜

≡ ∆1 + ∆2 + ∆3

67

(3.13)

where rj (z) = B(z)T γj − βj (z), j = 1, . . . , d and r(z) = (r1 (z), . . . , rd (z))T . By the fact
˜
E(εi |U , zi ) = 0, we obtain that
1
√
n

n

εi UiT δ = Op ( δ )

(3.14)

i=1

Recall Lemma A.1, then
1
n

n
T
−r
Xi r(zi )U δ = Op (kn δ )

(3.15)

i=1

Therefore
√
−r
−r
∆1 = Op ( nαn δ ) + Op (nkn αn δ ) = Op (nkn αn ) δ
2
We can also show that ∆2 = Op (nαn ) δ 2 . Then, by choosing a suﬃciently large C, ∆1 is

dominated by ∆2 uniformly in δ = C. It follows from Taylor expansion that
d

∆3 ≤ n

αn pλ1 ( γj∗ )
˜
j=1

γj∗
˜
γj∗
˜

2
δj∗ + αn pλ2 ( γj∗ ) δj∗ 2 (1 + op (1))
˜

√
2
≤ n dαn fn δ + bn αn δ 2
where fn = maxj {|˜j∗ | : γj∗ = 0}. With assumption (A6), we can prove that ∆2 dominates
γ
˜
∆3 uniformly in δ

= C. Therefore, (3.12) holds for suﬃciently large C, and we have

ˆ
˜
γ vc − γ = Op (αn ).
ˆ
In order to prove βj (z) = 0 for j = v+1, . . . , d in probability, it is suﬃcient to demonstrate
ˆ vc
that γj∗ = 0,j = v + 1, . . . , d. It follows from the deﬁnition that when max(λ1 , λ2 ) → 0,
an = 0 for large n. Then we need to show that with probability approaching 1 as n → ∞,

68

1
1
ˆ
ˆ
˜
for any γ vc satisfying γ vc − γ = Op (n− 2 kn ) and some small εn = Cn− 2 kn , we have

∂Q1 (γ)
< 0,
∂γj,∗

for

− εn < γj,∗ < 0,

j = v + 1, . . . , d
(3.16)

> 0,

for 0 < γj,∗ < εn ,

j = v + 1, . . . , d

where γj,∗ denotes the individual component of γj∗ . It’s not hard to show that
∂Q1 (ˆ vc )
γ
= −2
vc
∂ˆj,∗
γ

n

ˆ
Uij Yi − UiT γ vc + npλ (|ˆj,∗ |)sgn(ˆj,∗ )
γ
γ
1

i=1
n

n

Uij UiT [˜ − γ vc ]
γ ˆ

T
Uij [εi + Xi r(zi )] − 2

= −2

i=1

i=1

(3.17)

+ npλ (|ˆj,∗ |)sgn(ˆj,∗ )
γ
γ vc
1

−r+1/2

= nλ1 Op (λ−1 n 2r+1 ) + λ−1 pλ (|ˆj,∗ |)sgn(ˆj,∗ )
γ
γ vc
1
1

−r+1/2

−1
By assumption (A5), λ1 n 2r+1 → 0. Then it follows from assumption (A7) that the sign

ˆ
of the derivative is completely determined by that of γj,∗ . Therefore, γ vc , the minimizer of
ˆ vc
ˆ vc
Q1 , is achieved at γj∗ = 0, j = v + 1, . . . , d. This completes the proof of Theorem 1(1), part
1.

(B) Proof of Theorem 1 (2)
1

Next we establish the consistency of the varying coeﬃcient estimator. Let αn = n− 2 kn +an ,
T T
ˆ
˜
ˆ
˜
γ(v) = γ(v) + αn δv , γ(d) = γ(d) + αn δd , and δ = (δv , δd )T
n
T
T
Yi − U(v)i γ(v) − U(d)i γ(d)

Q2 (γ(v) , γ(d) ) =
i=1

2

d

+n
j=v+1

69

pλ2 (|γj,1 |)

(3.18)

We need to show that for any given ε > 0, there exists a large constant Cε such that

ˆ
˜
P inf δ =C Q2 (ˆ(v) , γ(d) ) ≥ Q2 (˜(v) , γ(d) ) ≥ 1 − ε
γ
γ

(3.19)

which implies that with probability at least 1 − ε there exists a local minimum in the ball
{˜(v) + αn δv : δv ≤ C} and {˜(d) + αn δd : δd ≤ C} respectively. Therefore, there exists
γ
γ
ˆ
˜
ˆ
˜
local minimizers such that γ(v) − γ(v) = Op (αn ) and γ(d) − γ(d) = Op (αn ). We have

ˆ
˜
Dn (δv , δd ) = Q2 (ˆ(v) , γ(d) ) − Q2 (˜(v) , γ(d) )
γ
γ
n
T
εi + X1 R(Zi )

= −2αn
i=1
n
2
+ αn

T
U(v)i δ(v)

T
T
U(v)i δ(v) + U(d)i δ(d)

2
T
+ U(d)i δ(d)

d

+n
j=v+1

i=1

γ
γ
pλ2 (|ˆj,1 |) − pλ2 (|˜j,1 )|

≡ ∆1 + ∆2 + ∆3
(3.20)
where r(z) = (r1 (z), . . . , rd (z))T and rj (z) = B(z)T γj − βj (z), j = 1, . . . , d.
˜
Since E(εi |U(v) , U(d) , zi ) = 0, we have
1
√
n

n
T
T
εi [U(v)i δ(v) + U(d)i δ(d) ] = Op ( δ )

(3.21)

i=1

With Lemma A.1 we can show

1
n

n
T
T
T
−r
Xi r(zi ) U(v)i δ(v) + U(d)i δ(d) = Op kn δ
i=1

70

(3.22)

Combine the above two equations, we can obtain that

1

−r
−r
∆1 = Op (n 2 αn δ ) + Op (nkn αn δ ) = Op (nkn αn ) δ

2
Since ∆2 = Op (nαn ) δ 2 , it’s easy to show that by choosing a suﬃciently large C, ∆1 is

dominated by ∆2 uniformly in δ = C. By Taylor expansion,
d
2
2
αn pλ2 (|˜j,1 |)sgn(˜j,1 )|δdj | + αn pλ2 (|˜j,1 |)δdj (1 + o(1))
γ
γ
γ

∆3 ≤ n
j=v+1

1
2
≤ (p − v) 2 nαn fn δ + bn αn δ 2

where fn = maxj {|˜j,1 | : γj,1 = 0}. Recall (A6), then it follows that, by choosing an enough
γ
˜
large C, ∆2 dominates ∆1 uniformly in δ = C. Consequently (3.19) holds for suﬃciently
ˆ
˜
ˆ
˜
large C, and we have γv − γv = Op (αn ) and γd − γd = Op (αn ). By the deﬁnition of
ˆ cz
˜
γ cz , we have γ(d) − γ(d) = Op (αn ). Then for j = 0, . . . , d

ˆ
βj (zi ) − βj (z) 2 =

1
0
1

≤
0

=

ˆ
βj (z) − βj (z)

2

dt

ˆ cz
B(z)T γj (z) − B(z)T γj + rj (z)
˜

2

dt

1
2 cz
˜
rj (z)2 dt
(ˆj − γj )T Gn (ˆj − γj ) + 2
γ
γ cz ˜
n
0

= ∆1 + ∆2
1

−1 2
Recall Lemma A.1, A.2 and kn = Op n 2r+1 , we can demonstrate that ∆1 = Op kn αn ,
−2r
∆2 = Op kn . ∆1 is dominated by ∆2 , thus we ﬁnish the proof of Theorem 1(b).

(C) Proof of Theorem 1(1) (Part 2)
71

ˆ
ˆ cz
To show βj (z) = 0 for j = c + 1, . . . , d, it is suﬃcient to demonstrate that γj,1 = 0,
since the constancy of βj (z), j = v + 1, . . . , d was already established in (A). It follows
from the deﬁnition that when max(λ1 , λ2 ) → 0, an = 0 for large n. Then we need to
ˆ
ˆ
prove that with probability approaching 1 as n → ∞, for any γ(v) and γ(d) satisfying
1

1

ˆ
˜
ˆ
˜
γ(v) − γ(v) = Op (n− 2 kn ), γ(d) − γ(d) = Op (n− 2 kn ) respectively, as well as some small
1

εn = Cn− 2 kn , we have
∂Q2 (γ(v) , γ(d) )
∂γj,1

< 0,

for

− εn < γj,1 < 0,

j = c + 1, . . . , d
(3.23)

> 0,

for 0 < γj,1 < εn ,

j = c + 1, . . . , d

We can prove that
ˆ
∂Q2 (ˆ(v) , γ(d) )
γ
∂ˆj,1
γ

n
T ˆ
T ˆ
U(d)ij Yi − U(v)i γ(v) − U(d)i γ(d) + npλ (|ˆj,1 |)sgn(ˆj,1 )
γ
γ

= −2
i=1
n

n
T
U(d)ij εi + Xi r(zi )

= −2
i=1
n

T γ
ˆ
U(d)ij U(v)i [˜v − γv ]

−2
i=1

(3.24)

T γ
ˆ
U(d)ij U(d)i [˜d − γd ] + npλ (|ˆj,1 |)sgn(ˆj,1 )
γ
γ

−2
i=1

−r+1/2

= nλ2 Op

λ−1 n 2r+1
2

+ λ−1 pλ (|ˆj,1 |)sgn(ˆj,1 )
γ
γ
2

−r+1/2

By assumption (A5), λ−1 n 2r+1 → 0. Then it follows from assumption (A7) that the sign
2
ˆ
of the derivative is completely determined by that of γj,1 . Therefore, γ cz , the minimizer of
ˆ
ˆ cz
Q2 , is achieved at γj,1 = 0, j = c + 1, . . . , d. This completes the proof of Theorem 1(1).

72

3.7.3

Proofs of Theorem 2.

ˆ
In Theorem 1, we showed that both γj∗ = 0, j = v + 1, . . . , c and γj = 0, j = c + 1, . . . , d,
ˆ
hold in probability. Then Q2 reduces to
n
T
Yi − U(v)i γ(v)

Q2 (γ(v) , γ(d) ) =

2
T
− U(c)i γ(c)

i=1

c

+n
j=v+1

pλ2 (|γj,1 |)
(3.25)

≡ Q2 (γ(v) , γ(c) )
ˆ
Since (ˆ(v) , γ(c) ) is the minimal value of Q2 (γ(v) , γ(c) ), we obtain
γ

ˆ
∂Q2 (ˆ(v) , γ(c) )
γ
ˆ
∂ γ(v)
ˆ
∂Q2 (ˆ(v) , γ(c) )
γ
ˆ
∂ γ(c)

n
T ˆ
T ˆ
U(v)i Yi − U(v)i γ(v) − U(d)i γ(d) = 0

= −2

(3.26)

i=1
n

= −2

T ˆ
T ˆ
U(c)i Yi − U(v)i γ(v) − U(c)i γ(c)
i=1
c

(3.27)
pλ2 (|ˆj,1 |)sgn(ˆj,1 ) = 0
γ
γ

+n
j=v+1

By applying Taylor expansion on pλ2 (|ˆj,1 |) in (3.27), we have
γ

pλ2 (|ˆj,1 |) = pλ2 (|γj,1 |) + pλ2 (|γj,1 |)(ˆj,1 − γj,1 )[1 + op (1)]
γ
γ

By the fact that pλ2 (|ˆj,1 |) = 0 as λ2 → 0, and pλ2 (|γj,1 |) = op (1) from the assumption, it
γ
follows that

c
γ
γ
j=v+1 pλ2 (|ˆj,1 |)sgn(ˆj,1 )

= op (ˆj,1 − γj,1 ) = op (ˆ(c) − γ(c) ). Consequently,
γ
γ

we have
1
n

n
T ˆ
T ˆ
U(c)i Yi − U(v)i γ(v) − U(c)i γ(c) + op (ˆ(c) − γ(c) ) = 0
γ
i=1

73

Following similar lines of arguments in Theorem 1, we can show

1
n

n
T
T
T
ˆ
ˆ
γ
U(c)i εi + Xi r(zi ) + U(v)i (γ(v) − γ(v) ) + U(c)i (γ(c) − γ(c) ) + op (ˆ(c) − γ(c) ) = 0
i=1

(3.28)
Meanwhile, a straightforward calculation yields

1
n

n
T
T
T
ˆ
ˆ
U(v)i εi + Xi r(ui ) + U(v)i (γ(v) − γ(v) ) + U(c)i (γ(c) − γ(c) ) = 0

(3.29)

i=1

Recall the deﬁnition of Φn and Ψn , (3.29) is equivalent to

ˆ
γ(v) − γ(v) = Φ−1
n

1
n

n
T
ˆ
U(v)i εi + Xi r(zi ) + Ψn [γ(c) − γ(c) ]

(3.30)

i=1

Plugging (3.30) into (3.28) results in
1
n

n

U(c)i
i=1

1
=
n

1
T
T
εi + Xi r(zi ) − U(v)i Φ−1
n
n

n

U(c)i U(c)i − ΨT Φ−1 U(v)i
n n

T

n
T
U(v)i εi + Xi r(zi )
i=1

(3.31)

(ˆ(c) − γ(c) ) + op (ˆ(c) − γ(c) )
γ
γ

i=1

Together with the facts that

1
n



n

T
T
ΨT Φ−1 U(v)i εi + Xi r(zi ) − U(v)i Φ−1
n n
n
i=1

1
n



n

T
U(v)k [εk + Xk r(tk )] = 0
j=1

and
1
n

n
T
ΨT Φ−1 U(v)i U(c)i − ΨT Φ−1 U(v)i
n n
n n
i=1

74

T

=0

and recall the deﬁnition of Λi , a direct computation from (3.31) leads to
1
n

n

Λi ΛT + op (1)
i
i=1

√

1
ˆ
n(γ(c) − γ(c) ) = √
n

n

1
Λ i εi + √
n

i=1
n

1
+√
n

i=1

n
T
Λi Xi r(zi )
i=1

1
T
Λi U(v)i Φ−1
n
n

n
T
U(v)k εk + Xk r(tk )
j=1

= ∆1 + ∆2 + ∆3
It follows from law of large numbers that

1
n

n

p

Λi ΛT − − → N (0, Σ)
i −−
i=1

T
where Σ = E U(c) U(c) − E E(ΨT |T )E(Φn |T )−1 E(Ψn |T ) . Consequently,
n

d

∆2 − − → N (0, σ 2 Σ)
−−

follows from central limit theorem. Because Xi is bounded and r(z) = op (1), we have
∆2 = op (1).Besides,

n
T
i=1 Λi U(v)i =0

implies that ∆3 = 0. Therefore, by Slutsky theorem,

we complete the proof of Theorem 2.

75

Chapter 4
A Group Coordinate Descent
Approach For High Dimensional
Variable Selection In
Gene-Environment Interactions

4.1

Introduction

High dimensional data arises in a diversity of scientiﬁc areas, especially in the study of human
genetics as tons of data covering the entire human genome are brought by the advancement
of high-throughput genotyping technologies. Gene-environment (G×E) interaction draws
our interest due to the crucial roles it plays in elucidating the etiology of complex disease
and tracking down the disease variants. The risk of complex diseases is triggered not merely
by genetic factors, but also by the environmental exposures, as well as their interactions.
The varying coeﬃcient (VC) model framework, initially proposed in Hastie and Tibshirani [74], lends us signiﬁcant ﬂexibility in investigating genetic responses to environmental
stimuli and how gene expressions are mediated by environmental inﬂuences to increase disease predispositions, for both continuous phenotype response in Ma et al [37] and binary

76

disease response in Wu and Cui [68]. The merit of the framework is especially prominent
when the penetrance eﬀect of genetic variants are non-linear, as the VC model is powerful
to capture the dynamic ﬂuctuations of regression coeﬃcients.
Current methodologies on G×E interactions are mainly developed within single variantbased framework, using either parametric methods as reviewed in Mukherjee et al [81],
semi-parametric methods as in N. Chatterjee et al [35] and Maity et al [36], non-parametric
methods as in Ma et al [37] and Wu and Cui [68], or data mining approaches such as multifactor dimension reduction (MDR) in Hahn et al [33]. Accumulation of evidence showing
the advantage of set based association analysis, such as in Neal and Sham [82], Cui et al
[39], Wu and Cui [40] and Wang et al [38], motivated us to consider the joint modeling of
a number of variants (p) within the genetic system given a common environment mediator.
When the dimension p is large, the problem can be approached from a high dimensional
variable selection perspective, within the VC model framework.
In recent years, much progress has been made on penalized regression methods for VC
models. Wang et al [78] developed SCAD penalty based method for longitudinal response.
Wang and Xia [83] considered variable selection for VC model via local constant kernel
estimation with LASSO penalty, while the penalization method in Leng [84] was proposed in
the framework of smoothing spline ANOVA models with component selection and smoothing
operator (COSSO). The number of candidate predictors for selection in those models is ﬁnite
and less than the sample size. A diversity of penalized group coordinate approaches have been
developed for high dimensional case where the dimension of model p signiﬁcantly exceeds
the sample size n. See Yuan and Lin [85] for group LASSO approach, Wei et al [86] for group
adaptive LASSO approach and Breheny and Huang [87] for group SCAD approach.
In this chapter, we develop a general framework, based on the group coordinate descent
77

(GCD) algorithm, to carry out variable selection for set-based G×E interactions in VC
models. GCD was generalized to group case from coordinate-wise descent algorithms which
are demonstrated to be eﬀective in ﬁtting penalized models such as in Friedman et al [88],
Wu and Lange [89] and Friedman et al [90]. Our framework is implemented through a
two-stage iterative procedure to separate the varying, non-zero constant and zero eﬀect
of the predictors. It is computationally eﬃcient especially for high dimensional problems
where p > n, since the computational complexity increases only linearly with the number of
predictor groups.
The rest of the chapter is organized as follows. First, we describe the B spline basis
expansion to the VC model. The computation algorithm via GCD with both convex and
non-convex penalties will be proposed and the convergence of the algorithm will be discussed.
Selection of the tuning parameters are examined subsequently. We demonstrate the utility
of our approach through extensive simulation studies and real data analysis. Discussions
and concluding remarks are given at the end of this chapter.

4.2

Statistical methods

Let (Xi , Yi , Zi ), i = 1, . . . , n be random vectors which are independent and identically distributed (i.i.d.). Consider a varying coeﬃcient (VC) model with p predictors
p

Yi =

βj (Zi )Xij + εi

(4.1)

j=0

where βj (·) is the smooth varying-coeﬃcient function, Xij is the jth component of (p+1)dimensional vector Xi with the ﬁrst component Xi0 being 1, Zi ’s are the scalar index variable,

78

and εi is the model error.
The varying coeﬃcient model helps us gain particular power to evaluate the nonlinear
responses of genetic variants to environmental incentives in gene-environment (G×E) interactions [37, 68].We are interested in dissecting the penetrance of multiple genetic variants
within a system, such as gene set or pathway, under various environmental stimuli, especially when those variants are mediated by a common mediator Z to aﬀect phenotype. The
eﬀect of the variants can be determined by investigating the status of the coeﬃcient function
βj (·) in (4.1). If βj (·) is a varying function of the environmental factor Z, then the G×E
interaction exists. The nonzero constancy of βj (·) indicates that the G×E interaction is not
present. The genetic variant is not associated with the response (phenotype) if βj (·) = 0. In
such a set-based G×E interaction study design, the total number of variants in the system
(p) can far exceed the sample size (n).

4.2.1

Basis expansion and penalized regression

Supposed that the coeﬃcient function βj (z) (j = 0, . . . , p) in (4.1) can be approximated by
basis expansion such that
L

βj (Z) ≈

γjl Bjl (Z)
l=1

where L is the number of basis functions to approximate the coeﬃcient function, B(·) =
{Bjl (·)}L is a set of normalized B spline basis, and γj is the corresponding spline coeﬃcient
l=1
vector. It follows from change of basis [79] that the above basis expansion is equivalent to
L

βj (·) ≈

.
˜T
γjl Bjl (·) = γj1 + Bj (·)γj∗

l=1

79

.
T
˜
where the spline coeﬃcient vector γj = (γj,1 , γj∗ )T , and Bj (·) = (Bk2 (·), . . . , BjL (·))T . γj,1
and γj∗ correspond to the constant and varying part of the coeﬃcient functional respectively.
1
T
Denote γj 2 = γj Rj γj , where Rj = n
j

n
i=1

T T
Bj (Zi )Xij Xij Bj (Zi ) . Note that Rj is a

L × L positive deﬁnite matrix.If γj∗ j =0, then the jth predictor only has a constant eﬀect.
Furthermore, if γj,1 =0, then the predictor is not associated with the response. Therefore
γj∗ can be treated as a group.
To separate the varying, constant and zero eﬀect in the procedure of simultaneous variable
selection and parameter estimation, we minimized the penalized loss function

1
Q(γ) =
n

n



p

Yi −
i=1
p

+
j=1

2

L

p

γjl Xij Bjl (Zi ) +
j=1

j=0 l=1

pλ1 ( γj∗ j )
(4.2)

pλ2 (|γj1 |)I( γj∗ j = 0)

where λ1 and λ2 are the penalization parameters. The penalty function pλ (k=1,2) in
k
(4.2) can be concave, such as in LASSO [91] or adaptive LASSO [92], or non-cave, such as
the smoothly clipped absolute deviation (SCAD) penalty function [75]. Tang et al [73] ﬁrst
proposed the framework and established the asymptotic results for adaptive LASSO penalty
function, while Wu et al [93] demonstrated improved ﬁnite sample performance of SCAD
penalty over adaptive LASSO as well as the corresponding oracle property. Both approaches
are based on local quadratic approximations (LQA) to the penalty function pλ (k=1,2).
k

However, LQA leads to a ridge-type solution dependent on repeated large matrix inversions,
which renders the algorithm not eﬃcient for large scale regression problems. Furthermore,
as pointed out in Breheny and Huang [87], the quadratic approximation cannot beneﬁt from
the sparsity since this approach will not yield naturally sparse solutions.

80

Zou and Li [94] developed local linear approximation (LLA) to the non-convex penalty
function and demonstrated its advantage over LQA. Breheny and Huang [95] further enhanced the LLA by showing penalized models with non-convex penalty can be ﬁtted eﬀectively by coordinate descent method. The main idea of GCD for (4.2) is to minimize the
penalized loss function Q with respect to an individual predictor group after basis expansion
at each step, and then cycle through all the predictor groups till convergence.

4.2.2

Computational algorithms

In this section, we extend the two-step iterative framework in [73, 93] with GCD approach.
(4.2) can be rewritten using matrix notations as
p

Q(γ) =(Y

− W γ)T (Y

− W γ) + n

pλ1 ( γj∗ j )
j=1

(4.3)

p

pλ2 (|γj1 |)I( γj∗ j = 0)

+n
j=1

T
T
where Y = (Y1 , . . . , Yn )T , Wi = (Xi0 B(Zi )T , . . . , Xip B(Zi )T )T , W = (W1 , . . . , Wn )T and
T
T
γ = (γ0 , . . . , γp )T .

We denote the subsets of predictors of varying, non-zero constant and zero eﬀects by V,
C, and Z respectively. The iterative algorithm is carried out in the following two steps.
At step 1, to separate the varying and nonzero constant eﬀects, we minimize the following
penalized loss:
p
T

Q1 (γ) = (Y − W γ) (Y − W γ) + n
j=1

pλ1 ( γj∗ j )

(4.4)

to obtain γ VC . All the predictors are assumed to be in the subset of varying eﬀects V initially.
81

The penalization is conducted in a group manner. The jth predictor will be moved from
subset V to C if γj∗ j =0, or it will stay in V.
ˆ VC
At step 2, the penalized criterion
p

Q2 (γ) = (Y

− W γ)T (Y

− W γ) + n
j=1

pλ2 (|γj1 |)I( γj∗ j = 0)

(4.5)

is minimized only with regard to the predictors in set C. If Xj is in C, then it will be moved
to Z if |ˆj,1 |=0, otherwise it will stay in C.
γ CZ
The above two steps will be iterated till convergence and we can obtain the estimator
ˆ
γ at convergence. The coeﬃcient function βj (z) (j = 0, . . . , p) in (4.1) can be estimated as
ˆ
ˆ
βj (z) = B T (z)ˆj . βj (z) will be a varying function in z, non-zero constant and zero if γj is
γ ˆ
in V, C and Z respectively.

4.2.2.1

Group LASSO and group adaptive LASSO

In this section, we investigate the two-step iterative algorithm based on GCD approach
using the convex penalty functions, such as LASSO and adaptive LASSO. Yuan and Lin [85]
extended LASSO to the selection of variables group-wisely. By setting pλi (x) = λi

Dij x

for i = 1, 2 in (4.3), where Dij is the corresponding dimension of predictor group j. Let
1

1

−
2
γj = Rj γj and Wij = Rj 2 Wij . Then (4.4) and (4.5) can be reexpressed as

Q1 (γ) = Y − W γ

T

p

Y − Wγ + n

λ1

D1j γj∗ 2

j=1

and
Q2 (γ) = Y − W γ

T

p

Y − Wγ + n

λ1
j=1

82

D2j |γj1 |I( γj∗ 2 = 0)

respectively. From now on, we drop

from the formula for simpliﬁcation of notation. The

dimension of predictor group, Dij , was included in the optimization, as in [85], to guarantee
that the amount of penalization is consistent with the group size, so the small groups won’t
be dominated by large groups. However, note that the number of basis functions is the same
for all the predictors in approximating the coeﬃcient function, and the size of all the groups
in Q2 is 1, Dij can be dropped from both Q1 and Q2 .
By setting the partial derivative of Q1 with respective to γj∗ to zero, we have

ˆ VC
γj∗ =

1−

λ1
S
S1j 2 + 1j

(4.6)

T
T
T
T
T
where S1j = Wj∗ Y − W γ−j∗ with γ−j∗ = γ0 , . . . , γj−1 , (γj1 , 0L−1 )T , γj+1 , . . . , γp

T

,

Wj∗ is the part of Wj corresponding to γj∗ and (x)+ = xI{x > 0}. Similarly, setting the
partial derivative of Q2 with respective to γj1 for those j such that γj∗ 2 = 0 will lead to

ˆ CZ
γj1 =

1−

λ2
S
S2j 2 + 2j

T
where S2j = Wj1 Y − W γ−j1 with γ−j1 =

(4.7)

T
T
T
T
γ0 , . . . , γj−1 , (0, γj∗ )T , γj+1 , . . . , γp

T

,

and Wj1 is the part of Wj corresponding to γj1 .
The group adaptive LASSO estimator could be obtained by simply replacing λ1 and λ2
with λ1 S1j −1 and λ2 S2j −1 , in (4.6) and (4.7) respectively. (4.6) and (4.7) are essentially
2
2
the multivariate version of the soft-thresholding operator in [96]. They will be updated solely
ˆ VC
for predictors with varying and constant eﬀect, respectively. All the rest components of γj
ˆ CZ
and γj will remain as the updates in previous iteration. Both solutions have closed forms
and are exempt from any sort of approximations. Furthermore, the GCD algorithm is

83

piecewise linear and therefore exceptionally well suited for high dimensional scenarios.

4.2.2.2

Group TLP and group SCAD

In group settings, the closed form solutions (4.6) and (4.7) exist not only for convex penalties
such as LASSO and adaptive LASSO, but also for non-convex penalties like truncated L1 function penalty, TLP, in Shen et al [97], and smoothly clipped absolute deviation penalty,
SCAD, as in Fan and Li [75].
The TLP penalty function is given as

pτ,λ (x) = min(

|x|
, 1)λ
τ

(4.8)

where positive tuning parameters λ and τ control adaptive model selection and the degree
of sparsity, correspondingly. As pointed out in Shen et al [97], with properly tuned τ , the
approximation error of TLP to L0 function reduces to 0, and low resolution coeﬃcients
can be taken good care of. By resorting to diﬀerence convex (DC) approach, Shen et al
[97] turned non-convex optimization problems into its convex counterpart, thus signiﬁcantly
improved the computational eﬃciency and stability.
Following the similar lines of derivations in Shen et al [97] and Xue and Qu [98], the closed
form solutions for (4.4) and (4.5) with group TLP penalty can be obtained by replacing λ1
λ
and λ2 in (4.6) and (4.7) with τ 1 I
1

λ

ˆ
γj∗ 2 ≤ τ1 and τ 2 I
2

ˆ
γj1 2 ≤ τ2 , respectively, where

ˆ
ˆ
γj∗ and γj1 are taken as their most recent updates.
Before ﬁnding the solutions to (4.4) and (4.5) under group SCAD penalty, we brieﬂy

84

revisit the SCAD penalty function, deﬁned in Fan and Li [75] as


 λx
if 0 ≤ x ≤ λ




(x2 −2aλx+λ2 )
pλ,a (x) =
if λ < x ≤ aλ
−
2(a−1)



 (a+1)λ2


if x > aλ
2

(4.9)

where λ > 0 and a > 2, and its derivative


 λ
if 0 ≤ x ≤ λ




(x−aλ)
pλ,a (x) =
if λ < x ≤ aλ
 − a−1




 0
if x > aλ

(4.10)

The spirit of SCAD penalty could be explicitly conveyed by (4.10). SCAD starts its penalization with the same rate as LASSO till the point x = λ. Then the rate continuously reduces
to 0 and will stay as 0 from the point x = aλ. Therefore SCAD is capable of correcting
the bias introduced by LASSO while still performing variable selection. The group SCAD
estimator for λ > 0 and ai > 2 (i=1,2) can be derived as

ˆ VC
γj∗ =

and

ˆ CZ
γj1 =











λ

1− S 1
S1j
1j 2 +
a1 −1

 a1 −2





 S
 1j











if S1j 2 ≤ 2λ

a1 λ1
1 − (a −1) S
S1j if 2λ < S1j 2 ≤ a1 λ
1
1j 2 +

if S1j 2 > a1 λ

λ

1− S 1
S2j
2j 2 +
a2 −1

 a2 −2





 S
 2j

(4.11)

if S2j 2 ≤ 2λ

a2 λ1
1 − (a −1) S
S2j if 2λ < S2j 2 ≤ a2 λ
2
2j 2 +

if S2j 2 > a2 λ
85

(4.12)

with notations deﬁned the same as in (4.6) and (4.7), correspondingly. (4.11) and (4.12) are
of the soft-thresholding property as a1 −→ ∞ and a2 −→ ∞ respectively.

4.2.2.3

Convergence of the GCD algorithm

In the previous two sections, we obtained the closed form updates for all the 4 penalty
functions (LASSO, adaptive LASSO, TLP and SCAD) within the group settings. At each
cycle of the GCD algorithm, the minimization of Q with respect to γj (j = 1, . . . , p) is
achieved by iterating the minimization of Q1 with respect to γj∗ , and Q2 with respect to
γj1 correspondingly, where Q, Q1 and Q2 are deﬁned in (4.3), (4.4) and (4.5) respectively.
Hence, the two-stage iterative algorithm is of the descent property. Meanwhile, the value of
Q at each cycle is nonnegative. So we have the following proposition:
ˆ
Proposition 1. Let γ (k) be the estimated spline coeﬃcient vector at the convergence of
kth iteration. Then for the group penalties of LASSO, Adaptive LASSO, TLP and SCAD,
GCD has the property such that

Q(ˆ (k) ) ≤ Q(ˆ (k−1) )
γ
γ

ˆ
γ
Moreover, given an initial value γ (0) , the sequence {Q(ˆ (k) ), k ≥ 1} converges to a local
minimal of Q.
It follows from the above proposition that the two-stage iterative algorithm always converges. The penalized loss function Q is not convex in general, otherwise the convergence
to the global optimum will be guaranteed by Proposition 1. Given that dimension p is ﬁxed
and p < n, the asymptotic properties of the two-stage estimator were established in Tang et
al [73] with adaptive LASSO penalty, and Wu and Cui [93] with SCAD penalty. The global

86

optimality of the two stage estimator for large p small n is beyond the scope of this chapter
and will not be discussed here.

4.2.3

Selection of tuning parameters

Here we propose a data-driven procedure to choose the proper tuning parameters. For group
LASSO and group adaptive LASSO, there are 4 tuning parameters, O, L, λ1 and λ2 , where
the degree of the B spline basis O and the number of interior knots L uniformly spaced on
[0,1] govern the smoothness of the varying coeﬃcient functions, while λ1 and λ2 control the
sparsity of the estimator.
At the beginning, we apply BIC in Schwarz [79] to choose O and L. The range for L can
1

1

be determined by [max( 0.5n (2O+3) , 1), 1.5n (2O+3) ], with x denoting the integer part
of x. The optimal combination of O and L can be reached by searching the corresponding
two-dimensional grid, according to the following criterion

ˆ
ˆ
BICO,L = log (Y − W γ )T (Y − W γ ) +

(O + L + 1)
log(n)
n

ˆ
for γ = (ˆ0 , 0T , . . . , 0T )T . Conditional on the chosen N and p, we can determine optimal
γT
λ1 by minimizing

ˆ
BICλ1 = log Y − W γλ1

T

ˆ
Y − W γλ 1 +

dfλ1
n

log(n)

(4.13)

ˆ
where γλ1 is the minimizer of (4.4) given λ1 , and dfλ1 is the eﬀective degree of freedom,
deﬁned as the total number of varying and non-zero constant predictors. Once we selected

87

N , p and λ1 , the optimal λ2 can be determined by minimizing

ˆ
BICλ2 = log Y − W γλ2

T

ˆ
Y − W γλ 2 +

dfλ2
n

log(n)

(4.14)

ˆ
where γλ2 is the minimizer of (4.5) given λ2 , and dfλ2 is deﬁned similarly as dfλ2 .
We need to choose additional tuning parameters τ1 and τ2 for group TLP algorithm.
Slight modiﬁcations can be made to (4.13) so we can carry out a two-dimensional search for
the best pair of λ1 and τ1 . Optimal λ2 and τ2 can be taken similarly.
Note that in the group SCAD algorithm, we also have two more tuning parameters a1 and
a2 than those in group LASSO and group adaptive LASSO. A search for the best combination
of λi and ai (i=1,2) over a two dimensional grid will be computationally intensive. It was
pointed out in Fan and Li [75] that as a function of a in (4.9), the Bayesian risks are not
sensitive to the choice of a. Here a is ﬁxed as 2.2 in the subsequent analysis. We can tune
λ1 and λ2 according to (4.13) and (4.14) correspondingly.

4.3

Simulation study

In this section, we carry out Monte Carlo simulations to assess the ﬁnite sample performance
of group LASSO, group adaptive LASSO, group TLP and group SCAD in our framework
through the GCD algorithm. The accuracy of variable selection and successful separation of
the varying, non-zero constant and zero eﬀects can be assessed by the percentage of choosing
the correct model out of the total R replicates. Integrated mean squared error (IMSE), is
adopted for evaluation of the estimation precision.
ˆ(r)
Let βj be the estimator of the coeﬃcient function βj in the rth (1 ≤ r ≤ R) replication,

88

n

grid
ˆ(r)
and {zk }k=1 be the grid points where βj is evaluated. The integrated mean squared error

ˆ
(IMSE) of βj (z) is deﬁned as

1
ˆ
IMSE(βj (z)) =
R

R

ngrid

1
ngrid

r=1

(r)

ˆ
{βj (zk ) − βj (zk )}2

(4.15)

k=1

(4.15) can be used to measure the estimation precision for the jth predictor. The overall
estimation accuracy is reﬂected by the total integrated mean squared error (TIMSE) of all
the coeﬃcients, deﬁned as TIMSE=

p
ˆ
j=0 βj (z).

ˆ
ˆ
Note that IMSE(βj ) reduces to MSE(βj )

ˆ
when βj is a constant.
We rewrite model (4.1) here as
p

Yi =

βj (Zi )Xij + εi
j=0

(i = 1, . . . , n) for describing our simulation schemes. The predictors were generated from
both continuous and discrete distributions. Two types of distribution for εi , a standard
normal distribution and a t distribution with 3 degrees of freedom, were taken. Without loss
of generality, assume the ﬁrst 3 coeﬃcients are varying, next two are constant and all the
rest coeﬃcients are redundant. The coeﬃcients were set as: β0 (z) = 5 + 3 sin(2πz), β1 (z) =
2−3 cos{(6z −5)π/3}, β2 (z) = 7−7z, β3 (z) = 4.5, β4 (z) = 3, and βj (z) = 0 for j = 5, . . . , p.
All the simulations were carried out with sample size n = 200, p =10,100,200,400 and a total
of R =200 replicates. All the approaches use LASSO as the initial estimate.
Example 4.1. We simulate the responses from model (4.1), where the index variable
Xi ∼ Uniform(0,1), and the predictors were generated from a multivariate normal distribution with mean vector 0 and Cov(Xij , X

ij

) = 0.5|j−j | for 0 ≤ j, j ≤ p. The IMSE of the

89

estimator obtained from all the 4 approaches and the true model were calculated. Under
the standard normal error, as p increases, group LASSO has a systematic less satisfactory
performance than the others which have relative stable and comparable performances, with
group TLP gradually establishing its advantage over the rest in terms of both oracle percentage and estimation precision, as shown in Table 4.1. A pretty much similar trend can be
observed in Table 4.2 when the random error was generated from t(3) distribution, though
the performance of all the procedures are slightly worse than that under the standard normal
error.

90

Table 4.1: Simulation results of Example 4.1, N (0,1) error

p=10

p=100

p=200

p=400

gLASSO
Oracle Perc.
0.615
IMSE
β0 (u)
0.1393
β1 (u)
0.3199
β2 (u)
0.4777
β3 (u)
0.1771
β4 (u)
0.0784
TIMSE
1.2022
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

N (0,1) error
gALASSO gSCAD
0.955
0.97

gTLP
1

Oracle
1

0.1380
0.1678
0.1539
0.0175
0.0231
0.5019

0.1388
0.1427
0.1213
0.0122
0.0103
0.4339

0.1492
0.1641
0.0666
0.0101
0.0081
0.3980

0.1492
0.1669
0.0674
0.0101
0.0081
0.4017

0.515

0.995

0.995

1

1

0.1650
0.2302
0.7810
0.0792
0.0711
1.3308

0.1657
0.1541
0.6678
0.0140
0.0216
1.0232

0.1648
0.1477
0.6515
0.0176
0.0128
0.9945

0.1587
0.1607
0.0713
0.0096
0.0082
0.4085

0.1506
0.1374
0.0855
0.0093
0.0081
0.3909

0.355

0.985

1

1

1

0.1771
0.2259
0.8094
0.0812
0.0804
1.3825

0.1761
0.1599
0.6731
0.0193
0.0271
1.0554

0.1781
0.1595
0.6484
0.0202
0.0144
1.0206

0.1612
0.1720
0.0764
0.0087
0.0081
0.4265

0.1523
0.1480
0.0845
0.0089
0.0085
0.4022

0.27

0.99

1

1

1

0.2104
0.2514
0.8967
0.0705
0.0749
1.5188

0.2102
0.1744
0.7930
0.0256
0.0257
1.2290

0.2099
0.1697
0.7840
0.0273
0.0124
1.2033

0.1654
0.1729
0.0746
0.0108
0.0072
0.4310

0.1654
0.1748
0.0756
0.0108
0.0072
0.4339

91

Table 4.2: Simulation results of Example 4.1, t(3) error

p=10

p=100

p=200

p=400

gLASSO
Oracle Perc.
0.29
IMSE
β0 (u)
0.1929
β1 (u)
0.3392
β2 (u)
0.5859
β3 (u)
0.1775
β4 (u)
0.1100
TIMSE
1.4520
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

t(3) error
gALASSO gSCAD
0.83
0.945

gTLP
0.97

Oracle
1

0.1921
0.2198
0.3676
0.0457
0.0463
0.8856

0.1913
0.2050
0.3358
0.0400
0.0358
0.8170

0.2063
0.2343
0.1445
0.0333
0.0233
0.6526

0.2066
0.2386
0.1464
0.0294
0.0211
0.6421

0.01

0.74

0.965

0.98

1

0.2557
0.2834
0.9739
0.1044
0.1036
1.8882

0.2537
0.2173
0.8729
0.0538
0.0549
1.5170

0.2551
0.2220
0.8721
0.0510
0.0402
1.4614

0.2109
0.2370
0.1386
0.0273
0.0260
0.7184

0.2002
0.2134
0.1544
0.0293
0.0230
0.6204

0.005

0.67

0.915

0.955

1

0.3046
0.2857
1.1327
0.0833
0.1156
2.2144

0.2963
0.2077
1.0282
0.0484
0.0501
1.8028

0.3026
0.2103
1.0327
0.0523
0.0340
1.6943

0.2004
0.2405
0.1441
0.0256
0.0237
0.7474

0.1999
0.2515
0.1577
0.0261
0.0212
0.6564

0

0.57

0.905

0.96

1

0.3481
0.2902
1.2438
0.0886
0.1170
2.3943

0.3428
0.2219
1.1509
0.0513
0.0482
1.9293

0.3398
0.2279
1.1543
0.0491
0.0346
1.8288

0.2097
0.2522
0.1555
0.0282
0.0227
0.7677

0.1955
0.2433
0.1369
0.0239
0.0223
0.6220

92

Example 4.2. Now the simulation in a genetic setup is considered. The responses were
generated from model (4.1) with SNP Xj (j = 0, . . . , p) coded as 3 categories (1,0,-1) for
genotypes (AA,Aa,aa) respectively.
The LD based simulation scheme was adopted to generate the genotype data. Let pA
and pB be the minor allele frequencies (MAFs) of two risk alleles A and B for two adjacent
SNPs, with linkage disequilibrium denoted as δ. The frequencies of four haplotypes are
pab = (1−pA )(1−pB )+δ, pAb = pA (1−pB )−δ, paB = (1−pA )pB −δ, and pAB = pA pB +δ.
Assuming the Hardy-Weinberg equilibrium, we can simulate the SNP genotype at locus A
from a multi-nomial distribution with frequencies p2 ,2pA (1−pA ) and (1−pA )2 for genotypes
A
(AA,Aa,aa) respectively. SNP genotype at locus B conditional on that at locus A can be
simulated based on the conditional probability matrix in Cui et al. (2008)[39].
Table ??–?? summarize the estimation results in example 2. For all the four approaches,
as the minor allele frequency (MAF) pA goes up from 0.1 to 0.5, the proportion of choosing the correct model generally increases, and the estimation error reduces. Group LASSO
consistently performed worse than its counterparts in terms of the two criteria. Under the
standard normal error, when p =10, the diﬀerence in performance among the 3 diﬀerent
MAFs, especially between MAF 0.3 and 0.5, is not signiﬁcant for all the four methods.
However, as p increases, dramatic diﬀerences are observed. Starting from p=100, all the approaches can barely choose the exact true model as pA=0.1, and the corresponding TIMSEs
are quite large. Given MAF 0.3, the performance of all the approaches except group TLP
in very high dimensions,such as p=200 and 400, fall way behind that given MAF 0.5. The
performance of group TLP is relative stable compared to the other 3 methods, especially
when the dimension is extremely high, as p=400.
We observe similar patterns when the procedures are assessed under t error with 3 de93

grees of freedom. In general, the performances of all the procedures under standard normal
error are superior. Group TLP outperforms the others in all the scenarios.

Table 4.3: Simulation results of Example 4.2, p = 10, n = 200, N (0,1) error

pA=0.1

pA=0.3

pA=0.5

Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

gLASSO
0.44

N (0,1) error
gALASSO gSCAD
0.795
0.935

gTLP
0.96

Oracle
1

0.7266
1.1823
1.8650
0.4954
0.2002
4.5192

0.7252
1.5831
1.3619
0.2529
0.2767
4.2036

0.7574
0.6841
0.7610
0.1010
0.0650
2.3704

0.3981
0.4124
0.3528
0.0615
0.0415
1.2934

0.3895
0.4861
0.3777
0.0573
0.0416
1.3521

0.615

0.98

0.99

0.985

1

0.2152
0.4975
1.0013
0.4007
0.1559
2.2862

0.2131
0.2814
0.3706
0.0467
0.0665
0.9808

0.2122
0.2064
0.2660
0.0268
0.0254
0.7424

0.1785
0.2204
0.1237
0.0245
0.0195
0.5699

0.1783
0.2277
0.1259
0.0237
0.0195
0.5751

0.73

0.97

0.99

0.99

1

0.2152
0.4975
1.0013
0.4007
0.1559
2.2862

0.2131
0.2814
0.3706
0.0467
0.0665
0.9808

0.2122
0.2064
0.2660
0.0268
0.0254
0.7424

0.1785
0.2204
0.1237
0.0245
0.0195
0.5699

0.1783
0.2277
0.1259
0.0237
0.0195
0.5751

94

Table 4.4: Simulation results of Example 4.2, p = 10, n = 200, t(3) error

t(3) error
pA=0.1

pA=0.3

pA=0.5

gLASSO
Oracle Perc.
0.33
IMSE
β0 (u)
1.6609
β1 (u)
1.4901
β2 (u)
1.9043
β3 (u)
0.5425
β4 (u)
0.4238
TIMSE
6.1169
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

gALASSO
0.54

gSCAD
0.705

gTLP
0.79

Oracle
1

1.6954
2.3887
2.2271
0.6404
0.6674
7.6370

1.7427
1.7687
1.6040
0.3444
0.2963
5.8026

1.0722
0.9836
0.9273
0.2340
0.1649
3.5783

0.9165
1.0244
0.9033
0.1560
0.1261
3.1263

0.25

0.895

0.935

0.955

1

0.3432
0.5782
1.1033
0.3668
0.2144
2.6777

0.3406
0.4111
0.6824
0.0949
0.1224
1.6856

0.3432
0.4861
0.7795
0.0828
0.0648
1.7235

0.2534
0.3823
0.3340
0.0825
0.0592
1.1496

0.2491
0.3835
0.3098
0.0706
0.0522
1.0652

0.23

0.845

0.96

0.965

1

0.3432
0.5782
1.1033
0.3668
0.2144
2.6777

0.3406
0.4111
0.6824
0.0949
0.1224
1.6856

0.3432
0.4861
0.7795
0.0828
0.0648
1.7235

0.2534
0.3823
0.3340
0.0825
0.0592
1.1496

0.2491
0.3835
0.3098
0.0706
0.0522
1.0652

95

Table 4.5: Simulation results of Example 4.2, p = 100, n = 200, N (0,1) error

pA=0.1

pA=0.3

pA=0.5

gLASSO
Oracle Perc.
0
IMSE
β0 (u)
5.9486
β1 (u)
2.1664
β2 (u)
2.5871
β3 (u)
7.4511
β4 (u)
4.1242
TIMSE
11.978
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

N (0,1) error
gALASSO gSCAD
0
0.005

gTLP
0.015

Oracle
1

5.9577
4.2821
2.6322
0.5889
0.6425
14.4393

5.9433
5.0012
2.6478
3.4381
2.0292
14.2111

4.8575
1.0952
0.9247
0.0965
0.0693
8.7643

0.4180
0.5239
0.5210
0.0624
0.0508
1.5761

0.01

0.565

0.82

0.99

1

1.0107
0.5794
1.1485
0.2246
0.1854
3.2486

1.0204
0.4570
0.9477
0.0596
0.1118
2.6329

1.0199
0.5103
1.5749
0.0674
0.0457
3.2207

0.1690
0.2204
0.1567
0.0233
0.0153
0.5847

0.1666
0.2202
0.1237
0.0232
0.0153
0.5489

0.55

1

1

1

1

0.1593
0.3242
0.9660
0.1357
0.1328
1.7231

0.1588
0.2071
0.8267
0.0330
0.0588
1.2843

0.1606
0.2637
1.1570
0.0360
0.0198
1.6370

0.1561
0.1953
0.1088
0.0197
0.0142
0.4942

0.1561
0.2017
0.1145
0.0197
0.0143
0.5063

96

Table 4.6: Simulation results of Example 4.2, p = 100, n = 200, t(3) error

t(3) error
pA=0.1

pA=0.3

pA=0.5

gLASSO
Oracle Perc.
0
IMSE
β0 (u)
6.4423
β1 (u)
2.2632
β2 (u)
2.6220
β3 (u)
0.6904
β4 (u)
0.6988
TIMSE
13.0771
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

gALASSO
0

gSCAD
0

gTLP
0

Oracle
1

6.4941
3.5782
3.3151
1.0206
1.5284
16.6350

6.4759
4.2976
3.2927
0.3533
0.6974
15.6337

5.3221
1.9710
1.9261
0.2403
0.1799
12.0716

0.8603
0.9942
0.9143
0.1557
0.1369
3.0614

0

0.065

0.27

0.86

1

1.9184
0.6734
1.3878
0.2408
0.2472
5.1465

1.9154
0.6233
1.4093
0.1414
0.1723
5.2540

1.9258
0.8530
2.1560
0.1447
0.0857
5.9097

0.3391
0.4326
0.5030
0.0698
0.0474
2.0213

0.2511
0.3638
0.3218
0.0650
0.0439
1.0456

0.035

0.725

0.935

0.985

1

0.2407
0.3715
1.3128
0.1391
0.1893
2.5354

0.2443
0.2827
1.2550
0.1130
0.1120
2.1107

0.2407
0.3359
1.4448
0.1130
0.0783
2.2240

0.2057
0.3258
0.2307
0.0507
0.0361
0.9181

0.2088
0.3332
0.2403
0.0523
0.0346
0.8692

97

Table 4.7: Simulation results of Example 4.2, p = 200, n = 200, N (0,1) error

pA=0.1

pA=0.3

pA=0.5

gLASSO
Oracle Perc.
0
IMSE
β0 (u)
6.4035
β1 (u)
2.1184
β2 (u)
2.3654
β3 (u)
0.7732
β4 (u)
0.4501
TIMSE
12.2412
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

N (0,1) error
gALASSO gSCAD
0
0

gTLP
0

Oracle
1

6.4500
4.3209
2.6648
0.4838
0.5315
14.8546

6.4518
4.9639
2.8046
0.3425
0.1985
14.7897

5.0816
1.1121
0.8625
0.0938
0.5025
8.9683

0.3961
0.4109
0.3500
0.0586
0.0426
1.2582

0

0.075

0.345

0.98

1

1.8631
0.6342
1.3076
0.2365
0.2280
4.4824

1.8954
5.5156
1.2343
0.0549
0.1338
4.0975

1.8833
0.8991
2.3473
0.0628
0.0575
5.2837

0.1987
0.2307
0.1570
0.0238
0.0192
0.6483

0.1703
0.2113
0.1374
0.0230
0.0193
0.5613

0.44

0.985

1

1

1

0.1588
0.2979
1.0129
0.1180
0.1310
1.7286

0.1594
0.2000
0.9003
0.0343
0.0676
1.3615

0.1583
0.2639
1.2411
0.0433
0.0216
1.7282

0.1525
0.2003
0.1141
0.0178
0.0158
0.5006

0.1525
0.2064
0.1157
0.0178
0.0159
0.5083

98

Table 4.8: Simulation results of Example 4.2, p = 200, n = 200, t(3) error

t(3) error
pA=0.1

pA=0.3

pA=0.5

gLASSO
Oracle Perc.
0
IMSE
β0 (u)
7.8770
β1 (u)
2.237
β2 (u)
2.542
β3 (u)
0.8399
β4 (u)
0.7117
TIMSE
14.7035
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

gALASSO
0

gSCAD
0

gTLP
0

Oracle
1

7.8787
3.5923
3.4581
0.8507
1.4850
18.344

7.8986
4.4315
3.4735
0.5609
0.5707
17.356

5.628
2.0438
2.2584
0.2691
0.1665
12.8456

0.8906
1.0194
0.8541
0.1614
0.1282
3.0536

0

0

0.015

0.705

1

3.0900
0.7765
1.4769
0.2164
0.3116
6.5667

3.1150
0.7070
1.6380
0.1412
0.2076
6.8139

3.1170
1.0440
2.6080
0.1220
0.1040
7.2260

0.7914
0.5144
0.5512
0.0745
0.0536
2.5457

0.2705
0.4026
0.3126
0.0715
0.0480
1.1053

0

0.695

0.915

0.96

1

0.2872
0.4168
1.4403
0.1521
0.2092
2.9052

0.2928
0.3450
1.4106
0.0960
0.1303
2.5833

0.3075
0.3626
1.5519
0.1116
0.0670
2.4056

0.2178
0.3954
0.2830
0.0578
0.0557
1.0997

0.2089
0.3450
0.3179
0.0553
0.0461
0.9732

99

Table 4.9: Simulation results of Example 4.2, p = 400, n = 200, N (0,1) error

pA=0.1

pA=0.3

pA=0.5

gLASSO
Oracle Perc.
0
IMSE
β0 (u)
6.9606
β1 (u)
2.1276
β2 (u)
2.4348
β3 (u)
0.6920
β4 (u)
0.5384
TIMSE
12.8890
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

N (0,1) error
gALASSO gSCAD
0
0

gTLP
0

Oracle
1

6.8309
4.0791
2.8612
0.6841
0.9365
15.8640

6.8046
4.8938
2.8865
0.4083
0.2659
15.3025

5.1098
1.2151
1.1964
0.1016
0.0713
9.3504

0.3685
0.4558
0.3100
0.0510
0.0392
1.2246

0

0

0.095

0.88

1

2.783
0.675
1.422
0.236
0.278
5.713

2.792
0.572
1.467
0.081
1.763
5.596

2.793
0.810
2.753
0.090
0.090
6.643

0.421
0.313
0.364
0.030
0.021
1.258

0.181
0.243
0.135
0.023
0.017
0.600

0.29

0.985

1

0.995

1

0.1817
0.2926
1.1169
0.1110
0.1392
1.8554

0.1825
0.2092
1.0209
0.0401
0.0676
1.5204

0.1831
0.2928
1.3567
0.0426
0.0250
1.9003

0.1558
0.2169
0.1325
0.0205
0.0152
0.5409

0.1551
0.2232
0.1162
0.0204
0.0152
0.5300

100

Table 4.10: Simulation results of Example 4.2, p = 400, n = 200, t(3) error

pA=0.1

pA=0.3

pA=0.5

gLASSO
Oracle Perc.
0
IMSE
β0 (u)
8.3537
β1 (u)
2.1540
β2 (u)
2.5676
β3 (u)
0.8431
β4 (u)
0.7683
TIMSE
15.1398
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE
Oracle Perc.
IMSE
β0 (u)
β1 (u)
β2 (u)
β3 (u)
β4 (u)
TIMSE

t(3) error
gALASSO gSCAD
0
0

gTLP
0

Oracle
1

8.1152
3.5756
3.3866
0.6975
1.3577
18.0344

8.2099
4.5341
3.4152
0.4919
0.6356
17.5993

6.0230
2.1497
2.5080
0.2443
0.1642
13.2892

0.9324
0.9725
0.9093
0.1773
0.1549
3.1463

0

0

0

0.425

1

4.127
0.812
1.562
0.264
0.359
7.973

4.114
0.681
1.815
0.187
0.244
8.427

4.123
0.940
2.985
0.162
0.143
8.711

1.968
0.701
0.823
0.114
0.076
5.132

0.279
0.427
0.305
0.071
0.057
1.146

0

0.61

0.935

0.97

1

0.3306
0.4178
1.4987
0.1294
0.2179
3.084

0.3289
0.3265
1.4649
0.0916
0.1178
2.5707

0.3317
0.3698
1.6647
0.0888
0.0518
2.5263

0.1991
0.3515
0.3334
0.0485
0.0457
1.1916

0.2016
0.3564
0.2950
0.0468
0.2438
0.9435

101

4.4

Real data analysis

We used the genome-wide association data from Genetic Analysis Workshop (GAW) 18 for
142 unrelated subjects to illustrate the utility of our approach. It is widely recognized that
individual’s blood pressure is related to age. The aim of our study is to track down the
genetic variants that can interpret the variation in Diastolic Blood Pressure (DBP) caused
by non-linear penetrance eﬀect over time (age). The environment condition is deﬁned as
age, and the blood pressure, measured as DBP, might not be the same for an individual with
the same gene but of diﬀerent environmental exposures. This is triggered by the complex
interplay between the age and gene eﬀects.
The dataset was cleaned by removing SNPs with MAF less than 0.05 or departure from
Hardy-Weinberg equilibrium. Subjects with missing DBP or age, as well as with more than
1/3 missing genotypes, were excluded from the dataset before ﬁnal analysis. We take a subset
of 250 SNPs from chromosome 9 and imputed the missing value before the ﬁnal analysis.
We applied our approach to the data and select the model

Y = β0 (z) + βj1 Xj1 + βj2 Xj2 + ε

where Xj1 and Xj2 correspond to SNP rs723877 and rs10972462, respectively. Both coeﬃcients βj1 and βj1 are varying. Thus we identiﬁed two genetic risk variants that are sensitive
to age to aﬀect blood pressure.
A cross validation examination, the single SNP based analysis in Ma et al [37], was
performed for SNPs on chromosome 9. From the over all association test corresponding to
LM, LMI and VC models, we calculated p-values denoted as P CON, P LIN and P VC. SNPs
with at least one of the three p-values less than 5E-06 were tabulated in Table 4.11. For SNP
102

rs10972462, testing on the constant coeﬃcients implies that the coeﬃcient function of this
SNP does change across age, as P CON¡ 0.05. Further test on linear structure indicates that
rs10972462 has a linear interaction with age (P LIN¿ 0.05). Hence it makes sense that the
p-value obtained from the test corresponding to LMI model is smaller than its counterpart
with LM and VC model. A similar observation can be concluded for SNP rs723877.
Table 4.11: List of SNPs with p-value < 5E-06 on Chromosome 9
SNP ID

GeneName

Location

P VC

P CON

P LIN

P LM

P LMI

PI

2.83E-05
3.01E-05

2.24E-06
1.24E-06

0.00357
0.00175

LMI model
rs723877
rs10972462

4.5

UNC13B
UNC13B

Chr9
Chr9

6.52E-06
2.13E-06

0.00649
0.00208

0.12311
0.07052

Discussion

A growing number of pieces of evidence have demonstrated the importance of G×E interactions in complex traits, as the responses of genetic factors to environmental exposures play
a crucial role in aﬀecting disease risks and variations of quantitative traits. Many statistical
methods have been developed to explore G×E interactions. The linear assumption of gene
eﬀect under environmental stimuli on which these methods rely is often violated in practice.
Ma et al [37] and Wu and Cui [68] relaxed the linear assumption to allow for a non-linear
genetic penetrance eﬀect for continuous and binary disease phenotype, respectively. The true
eﬀect of G×E interactions is captured by the model itself, through a sequence of likelihood
ratio tests.
A common feature of these methods, including Ma et al [37] and Wu and Cui [68],
and those reviewed in Cornelis et al [65], is that the identiﬁcation of the presence of G×E
interactions is casted in a single genetic variant based hypothesis testing framework. To
103

our best knowledge, Wu and Cui [93] for the ﬁrst approached the problem from a high
dimensional variable selection perspective. Speciﬁcally, within the VC model framework, the
presence of G×E interactions, no G×E interactions and no association with the phenotype
can be determined by separating the VC coeﬃcient functions after B spline basis expansion
approximations as varying, non-zero constant and zero, correspondingly.
The eﬀect separation and variable selection in Wu and Cui [93] is attained by penalized
estimation on the model parameters so some varying coeﬃcients are continuously shrunk to
a non-zero constant or zero. The procedure is dependent on local quadratic approximations
(LQA) to SCAD penalty function. However, LQA leads to loss in eﬃciency and accuracy,
especially when dimension p is large, due to frequent factorizations of large matrices. The
local linear approximation (LLA) to penalty functions proposed in Zou and Li [94] improved
LQA but then was shown outperformed by coordinate descent method (CD), as in Breheny
and Huang [87]. In this work, we integrate the group version of CD, or GCD, in the two-stage
iterative procedure to exploit G×E interactions in a high dimensional setting.
The most prominent character of GCD is the penalized objective function is optimized
over one individual predictor group at a time, so the computational complexity only increase
linearly with the dimension p. This attribute ensures the superior performance of our framework. We examined both convex (LASSO, adaptive LASSO) and non-convex (TLP, SCAD)
penalty functions. Extensive simulation results manifest that all the penalty except LASSO
perform satisfactorily, and TLP excels in all the scenarios. As p grows from low to very
high dimensions, the performance of approaches with all penalty functions remain relative
stable, though slight drops was observed. The phenomenon indicates the advantage of our
framework from another perspective.

104

Chapter 5
Concluding Remarks
It has been widely recognized that the naturally occurring variations in most complex disease
traits are not merely explained by genetic factors, but also can be understood from the
mechanism of genetic responses to environmental mediators. G×E interactions, the genetic
control of the pattern to environmental stimuli, shed novel light on examining the trait
variations. This dissertation focuses mainly on developing powerful statistical methods to
tackle the challenges originated from G×E interactions.
In chapter 2, we developed a new statistical approach to extend the varying coeﬃcient
model for continuous quantitative response in our previous work to the binary disease response. The varying coeﬃcients were estimated by the nonparametric B spline method due
to its computational expediency and nice asymptotic properties. Our scheme has particular
advantage in hunting down the ﬂuctuation in gene functions across environmental changes.
A signiﬁcant boost in power were indicated in the simulation study when underlying penetrance eﬀect of genetic variants is non-linear.
The simulation results also show that when the model for underlying mechanism of G×E
interactions is misspeciﬁed, VC model may not have the higher power than LM and LMI
models, since it is much more complex and needs large degrees of freedom for hypothesis
testing. To determine which model ﬁts the data more appropriately, we developed a sequential testing scheme. Our scheme is applied to two Type 2 Diabetes studies by ﬁrst evaluating

105

constant coeﬃcients, then linear and varying coeﬃcients. The novel disease signals captured
by VC model in our framework could be missed if our focus is restricted to linear model
only.
A broad spectrum of available methodologies in exploring G×E interactions are coined
within single genetic variant based hypothesis testing framework. Set based association
study, such as the gene-centric, gene-set and pathway based analysis, has continued to soar
in popularity as its advantage has increasingly been acknowledged. In chapter 3, we proposed
a variant set based framework to examine how variants in a genetic system are mediated
by a common environment factor to inﬂuence quantitative phenotypic response. We tackle
the issue from a high dimensional variable selection standpoint. Speciﬁcally, we can identify
the sensitivity of genetic variants to environment stimuli, which is tantamount to determine
the coeﬃcient function as varying, non-zero constant and zero, corresponding to cases of
existence of G×E interactions, no G×E interactions and no genetic eﬀects.
The procedure was implemented in a two stage iterative framework. We established the
selection consistency and oracle properties of the penalized estimator with SCAD penalty,
and demonstrated dramatic improvement in ﬁnite sample performance over the adaptive
LASSO in simulation study, in terms of oracle percentage and estimation accuracy. The
application of our approach to the JAK/STAT signaling pathway in LGA/SGA study correctly select the risk SNP without G×E eﬀect. This framework is critically dependent
on local quadratic approximations (LQA). Because of repeated factorizations of matrices,
signiﬁcance loss in computational eﬃciency and estimation precision will be caused when
dimension is large, especially in scenarios of p > n.
To overcome this issue, we developed the group coordinate descent (GCD) approach
within the two-stage iterative framework in chapter 4. After basis expansion, the penalized
106

loss function is minimized with regard to single predictor group at a time, and all the
predictor groups are cycled until convergence of the algorithm. Therefore, the computational
complexity only grows linearly with dimension p. Through extensive simulation study with
diﬀerent penalty functions (LASSO, adaptive LASSO, TLP and SCAD), we demonstrated
the merit of this framework. Our approach yields high oracle percentages and estimation
precision even when dimension p is much larger than sample size n.
The main objective of this dissertation is to develop novel statistical methodology for the
elucidation of complicated machinery in G×E interactions. By implementing diﬀerent link
functions, our framework on investigating the non-linear G×E interactions can be readily
extended to diﬀerent types of phenotypic responses such as count or survival outcomes.
We can also try to test if the novel hypothesis in Perry et al [52] that the signiﬁcance of
potential risk loci could be enhanced by the stratiﬁcation of patients with Type 2 Diabetes
based on BMI will result in any new ﬁndings within our framework. To the best of our
knowledge, this dissertation ﬁrst proposed the scheme of approaching G×E interactions
from a high dimensional variable selection perspective. Generalizations of our framework to
binary disease response in case control association study and survival response are currently
undergoing. The selection consistency and oracle property when dimension of predictors p
exceeds the sample size n for the procedure will also be examined. It is worth noting that
to explore G×E interactions in ultra-high dimensional feature space, we can integrate the
sure independence screen (SIS) procedure in Fan and Lv [99] into our framework. Relevant
investigations will be carried out in the near future.

107

BIBLIOGRAPHY

108

BIBLIOGRAPHY
[1]

M. Lynch and B. Walsh. Genetics and Analysis of Quantitative Traits. Sinauer Associates,Sunderland, MA, 1998.

[2]

R. Wu, C.X. Ma and G. Casella. Statistical Genetics of Quantitative traits– Linkage,
Maps and QTL. Springer, 2007.

[3]

R.S. Spielman, R.E. McGinnis, W.J. Ewens. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am
J Hum Genet. 52 (3): 506-516, 1993.

[4]

N. Risch and K. Merikangas. The future of genetics studies of complex human diseases.
Science 273: 1516–1517, 1996.

[5]

J. Altmuller, L.J. Palmer, G. Fischer, H. Scherb and M. Wjst. Genome-wide scans
of complex human diseases: true linkage is hard to ﬁnd. Am J Hum Genet. 69(3):
936–950, 2001.

[6]

International HapMap Consortium. The haplotype map of the human genome. Nature
437: 1299-1320, 2005.

[7]

M.F. Moﬀatt, M. Kabesch, L. Liang, A.L. Dixon, D. Strachan, et al. Genetic variants
regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature
448(7152):470–473, 2007.

[8]

D.F. Easton, K.A. Pooley, A.M. Dunning, P.D.P. Pharoah, et al. Genome-wide association study identiﬁes novel breast cancer susceptibility loci. Nature 447: 1087–1093,
2007.

[9]

D.J. Hunter, P. Kraft, K.B. Jacobs, D.G. Cox, et al. A genome-wide association
study identiﬁes alleles in FGFR2 associated with risk of sporadic postmenopausal
breast cancer. Nat Genet. 39(7): 870–874, 2007.

[10]

A. Helgadottir, G. Thorleifsson, A. Manolescu, S. Gretarsdottir, et al. A Common
Variant on Chromosome 9p21 Aﬀects the Risk of Myocardial Infarction. Science
316(5830): 1491–1493, 2007.

109

[11]

R. McPherson, A. Pertsemlidis, N. Kavaslar, A. Stewart, et al. A common allele on
chromosome 9 associated with coronary heart disease. Science 316(5830): 1488–1491,
2007.

[12]

R. Sladek, G. Rocheleau, J.Rung, C. Dina, et al. A genome-wide association study
identiﬁes novel risk loci for type 2 diabetes. Nature 445: 881–885, 2007.

[13]

L.J. Scott, K.L. Mohlke, L.L. Bonnycastle, C.J. Willer, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science
316(5829): 1341–1345, 2007.

[14]

M.I. McCarthy, G.R. Abecasis, L.R. Cardon, D.B. Goldstein et al. Genome-wide
association studies for complex traits: consensus, uncertainty and challenges. Nat.
Rev. Genet. 9(5): 356–369, 2008.

[15]

T.A. Manolio, F.S. Collins, N.J. Cox, D.B. Goldstein, et al. Finding the missing
heritability of complex diseases. Nature 461(7265): 747–753,2009.

[16]

D.E. Reich and E.S. Lander. On the allelic spectrum of human disease. Trends Genet.
17(9): 502–510, 2001.

[17]

J.K. Pritchard and N.J. Cox. The allelic architecture of human disease genes: common
disease-common variant...or not? Hum. Mol. Genet. 11(20): 2417–2423, 2002.

[18]

E. Zeggini, W. Rayner, A.P. Morris, A.T. Hattersley, et al. An evaluation of HapMap
sample size and tagging SNP performance in large-scale empirical and simulated data
sets. Nat. Genet. 37: 1320–1322, 2005.

[19]

E.R. Mardis. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum.
Genet. 9:387–402, 2008.

[20]

E.R. Mardis. The impact of next-generation sequencing technology on genetics.
Trends Genet. 24(3): 133–141, 2008.

[21]

J. Shendure and H. L. Lee. Next-generation DNA sequencing. Nature Biotech. 26:
1135–1145, 2008.

[22]

M.L. Metzker. Sequencing technologies - the next generation. Nat Rev Genet.
11(1):31–46 ,2009

110

[23]

M.I. McCarthy and J.N. Hirschhorn. Genome-wide association studies: potential next
steps on a genetic journey. Hum. Mol. Genet. 17(R2): R156–65, 2008.

[24]

J.B.S. Haldane. Heredity and Politics. New York, NY: W W Norton, 1938.

[25]

J.B.S. Haldane. The interaction of nature and nurture. Ann. Eugen. 13: 197–205,
1947.

[26]

D.S. Falconer. The problem of environment and selection. Amer. Natural. 86: 293–
298, 1952.

[27]

D.J. Hunter. Gene-environment interactions in human diseases. Nat. Rev. Genet. 6(4):
287–298, 2005.

[28]

C.M. Ulrich, E. Kampman, J. Bigler, C. Chen, et al. Colorectal adenomas and the
C677T MTHFR polymorphism: evidence for gene-environment interaction? Cancer
Epidemiol. Biomarkers Prev. 8(8): 659–668, 1999.

[29]

L.A. Mai. Genetic and exposure risks for chronic beryllium disease. Clin. Chest Med.
23: 827–839, 2002.

[30]

J.L. Ree. The genetics of sun sensivity in humances. Am. J. Hum. Genet. 75: 739–
751, 2004.

[31]

L.G. Costa and D.L. Eaton. Gene-Environment Interactions: Fundamentals of Ecogenetics. Hoboken, NJ: John Wiley & Sons, 2006.

[32]

A. Caspi and T.E. Moﬃtt. Gene-environment interactions in psychiatry: joining forces
with neuroscience. Nat. Rev. Neurosci. 7(7): 583–590, 2006.

[33]

L.W. Hahn, M.D. Ritchie and J.H. Moore. Multifactor dimensionality reduction
software for detecting gene-gene and gene-environment interactions. Bioinformatics
19(3): 376–382, 2003.

[34]

S.W. Guo. Gene-environment interaction and the mapping of complex traits: some
statistical models and their implications. Hum. Hered. 50(5): 286–303, 2000.

[35]

N. Chatterjee and R.J. Carroll. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92: 399–
418, 2005.

111

[36]

A. Maity, R.J. Carroll, E. Mammen and N. Chatterjee. Testing in semiparametric
models with interaction, with applications to gene-environment interactions. J. R.
Statist. Soc. B 71: 75–96, 2009.

[37]

S.J. Ma, L.J. Yang, R. Romero and Y.H. Cui. Varying coeﬃcient models for geneenvironment interaction: a non-linear look. Bioinformatics 27(15): 2119–2126, 2011.

[38]

K. Wang, M.Y. Li and H. Hakonarson. Analysing biological pathways in genome-wide
association studies. Nat. Rev. Genet. 11, 843–854, 2010.

[39]

Y.H. Cui, G.L. Kang, K.L. Sun, R. Romero, M. Qian, and W.J. Fu. Gene-centric
genomewide association study via entropy. Genetics 179: 637–650, 2008.

[40]

C. Wu and Y.H. Cui. Boosting signals in gene-based association studies via eﬃcient
SNP selection. Brief. Bioinform. in press, 2013.

[41]

J.Z. Liu, A.F. McRae, D.R. Nyholt, S.E. Medland, et al. A versatile gene-based test
for genome-wide association studies. Am. J. Hum. Genet. 87(1):139–145, 2010.

[42]

P. Zimmet, K. G. M. M. Alberti and J. Shaw. Global and societal implications of the
diabetes epidemic. Nature 414: 782–787, 2001.

[43]

C.J. Patel, R. Chen, K. Kodama, J.P. Ioannidis and A.J. Butte. Systematic identiﬁcation of interaction eﬀects between genome- and environment-wide associations in
type 2 diabetes mellitus. Hum. Genet. 132: 495–508, 2013.

[44]

A.P. Feinberg. Phenotypic plasticity and the epigenetics of human disease. Nature
447: 433–440, 2004.

[45]

M. Peacock, C.H. Turner, M.J. Econs and T. Foroud. Genetics of osteoporosis. Endocr. Rev. 23:303-326. 2002

[46]

D.B. Sparrow, G. Chapman, A.J.Smith, M.Z. Matter, J.A. Major, et al. A mechanism
for gene-environment interaction in the etiology of congenital scoliosis. Cell 149: 295306, 2012.

[47]

V.S. Laitala, J. Kaprio and K. Silventoinen. Genetics of coﬀee consumption and its
stability. Addiction 103: 2054–2061, 2008.

112

[48]

J.A. Martinez, M.S. Corbalan, A. Sanchez-Villegas, L. Forga, et al. Obesity risk
is associated with carbohydrate intake in women carrying the Gln27Glu beta2adrenoceptor polymorphism. J. Nutr. 133: 2549–2554, 2003.

[48]

B. Mukherjee, J. Ahn, S.B. Gruber and N. Chatterjee. Testing gene-environment
interaction in large-scale case-control association studies: possible choices and comparisons. Am. J. Epidemiol. 175: 177–190, 2012.

[49]

J.Q.Fan and W. Zhang. Statistical methods with varying coeﬃcient models. Stat.
Interface 1: 179–195, 2008.

[50]

J.H. Huang, C. Wu and L. Zhou. Polynomial spline estimation and inference for
varying coeﬃcient models with longitudinal data. Statistica Sinica 14: 763–788, 2004.

[51]

J.Z. Huang, C.O. Wu and L. Zhou. Varying-coeﬃcient models and basis function
approximation for the analysis of repeated measurements. Biometrika 89: 111–128,
2002.

[52]

J.R.B. Perry, B.F. Voight, L. Yengo, N. Amin, J. Dupuis, et al. Stratifying type
2 diabetes cases by BMI identiﬁes genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases. PLoS Genet. 8(5):e1002741.
doi:10.1371/journal.pgen.1002741. 2012.

[53]

M.C. Cornelis, A. Agrawal, J.W. Cole, N.N. Hansel, K.C. Barnes, et al. The Gene,
Environment Association Studies consortium (GENEVA): maximizing the knowledge
obtained from GWAS by collaboration across studies of multiple conditions. Genet.
Epidemiol. 34: 364–372, 2010.

[54]

G.A. Colditz and S.E. Hankinson. The Nurse’s Health Study: lifestyle and health
among women. Nat. Rev. Cancer 5: 388–396, 2005.

[55]

E.B. Rimm, E.L. Giovannucci, W.C. Willett, G.A. Colditz, A. Ascherio, B. Rosner
and M.J. Stampfer. Prospective study of alcohol consumption and risk of coronary
disease in men. Lancet 338: 464–468, 1991.

[56]

T.L. Holbrook, E. Barrett-Connor and D.L. Wingard. The association of lifetime
weight and weight control patterns with diabetes among men and women in an adult
community. Int. J. Obes. 13: 723–729, 1989.

[57]

V.J. Carey, E.E. Walters, G.A. Colditz, C.G. Solomon et al. Body fat distribution
and risk of non-insulin-dependent diabetes mellitus in women. The Nurses’ Health
Study. Am. J. Epidemiol. 145: 614–619, 1997.
113

[58]

J.M. Chan, E.B. Rimm, G.A. Colditz, M.J. Stampfer, W.C. Willett. Obesity, fat
distribution, and weight gain as risk factors for clinical diabetes in men. Diabetes
Care 17: 961–969, 1994.

[59]

S.F. Grant, G. Thorleifsson, I. Reynisdottir, R. Benediktsson, A. Manolescu, et al.
Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes.
Nat. Genet. 38: 320–323, 2006.

[60]

T. Jin and L. Liu. The Wnt signaling pathway eﬀector TCF7L2 and type 2 diabetes
mellitus. Mol. Endocrinol. 22: 2383–2392, 2008.

[61]

M. I. McCarthy. Genomics, Type 2 Diabetes, and Obesity.N. Engl. J. Med. 363:
2339–2350, 2010.

[62]

L. Qi and Y.A. Cho. Gene-environment interaction and obesity. Nutr. Rev. 66: 684–
694, 2008.

[63]

L. Liu, Y. Li, T.O. Tollefsbol. Gene-environment interactions and epigenetic basis of
human diseases. Curr. Issues Mol. Biol. 10: 25–36, 2008.

[64]

A.P. Feinberg and R.A. Irizarry. Stochastic epigenetic variation as a driving force of
development, evolutionary adaptation, and disease. Proc. Natl. Acad. Sci. USA 107:
1757–1764, 2010.

[65]

M.C. Cornelis, E.J. Tchetgen,L.M. Liang, L. Qi,N. Chatterjee, F.B. Hu, and P. Kraft.
Gene-environment interactions in genome-wide association studies: A comparative
study of tests applied to empirical studies of type 2 diabetes. Am. J. Epidemiol.
175(3):191–202, 2011.

[66]

O. Vaccaro, M. Boemi, F. Cavalot, P. De Feo, R. Miccoli, et al. The clinical reality of
guidelines for primary prevention of cardiovascular disease in type 2 diabetes in Italy.
Atherosclerosis 198: 396–402, 2008.

[67]

A.L. Price, N.J. Patterson, R.M. Plenge, M.E. Weinblatt, et al. Principal components
analysis corrects for stratiﬁcation in genome-wide association. Nat. Genet. 38: 904–
909, 2006.

[68]

C. Wu and Y.H. Cui. A novel method for identifying nonlinear gene-environment
interactions in case-control association studies. (Under review) 2013.

114

[69]

N. Chatterjee and R.J. Carroll. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92:399–
418, 2005.

[70]

A. Maity, R.J. Carrol, E. Mammen and N. Chatterjee. Testing in semiparametric
models with interaction, with applications to gene-environment interactions. J. R.
Statist. Soc. B 71: 75–96, 2009.

[71]

D.J. Schaid, J.P. Sinnwell, G.D. Jenkins, et al. Using the gene ontology to scan multilevel gene sets for associations in genome wide association studies. Genet. Epidemiol.
36: 3-16, 2012.

[72]

B. Efron, R. Tibshirani. On testing the signiﬁcance of sets of genes. Ann. Appl. Stat.
1: 107-129, 2007.

[73]

Y.L. Tang, H.X. Wang, Z.Y. Zhu and X.Y. Song. A uniﬁed variable selection approach
for varying coeﬃcient models. Statist. Sinica 22: 601–628, 2012.

[74]

T. Hastie and R. Tibshirani. Varying-coeﬃcient models. J. R. Statist. Soc. B 55:
757–796, 1993.

[75]

J.Q. Fan and R.Z. Li. Variable selection via nonconcave penzlied likelihood and its
oracle properties. J. Amer. Stat. Assoc. 96: 1348–1360, 2001.

[76]

G. Schwarz. Estimating the dimension of a model. Ann. Stat. 6: 461–464, 1978.

[77]

M.O. Kim. Quantile regression with varying coeﬃcients. Ann. Stat. 35: 92-108, 2007.

[78]

L.F. Wang, H.Z. Li and J.Z. Huang. Variable selection in nonparametric varyingcoeﬃcient models for analysis of repeated measurements. J. Amer. Stat. Assoc. 103:
1556–1569, 2008.

[79]

L.L. Schumaker. Spline Functions: basic theory. Wiley, New York. 1981.

[80]

J. S. Rawlings, K. M. Rosler and D.A. Harrison. The JAK/STAT signaling pathway.
J. Cell Sci. 117: 1281–1283, 2004.

[81]

B. Mukherjee, J. Ahn, S.B. Gruber, and N. Chatterjee. Testing gene-environment
interaction in large-scale case-control association studies: possible choices and comparisons. Am. J. Epidemiol. 175: 177–190, 2012.

115

[82]

B.M. Neale and P.C.Sham. The future of association studies: Gene-based analysis
and replication. Am. J. Hum. Genet. 75: 353-362, 2004.

[83]

H.S. Wang and Y.C. Xia. Shrinkage Estimation of the Varying Coeﬃcient Model. J.
Amer. Stat. Assoc. 104: 747–757, 2009.

[84]

C.L. Leng. A simple approach for varying-coeﬃcient model selection. J. Stat. Plan.
Inf., 139: 2138–2146, 2138

[85]

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped
variables. J. R. Statist. Soc. B 68: 49–67, 2006.

[86]

F.R. Wei, J. Huang and H.Z. Li. Variable selection and estimation in high-dimensional
varying coeﬃcient models. Stat. Sinica 21: 1515–1540, 2011.

[87]

P. Breheny and J. Huang. Group descent algorithms for nonconvex penalized linear
and losistic regression models with grouped predictors. (Under review). 2013

[88]

J. Friedman, T. Hastie, H. H¨ﬂing and R. Tibshirani. Pathwise Coordinate Optimizao
tion. Ann. Appl. Stat., 1: 302–332, 2007.

[89]

T.T. Wu and K. Lange. Coordinate Descent Algorithms for Lasso Penalized Regression. Ann. Appl. Stat. 1: 224–244. 2008.

[90]

J. Friedman, T. Hastie and R. Tibshirani. Regularization Paths for Generalized Linear
Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22. 2010

[91]

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc
B. 58(1): 267–288, 1996.

[92]

H. Zou. The adaptive lasso and its oracle properties. J. Amer. Stat. Assoc. 101(476):
1418–1429, 2006.

[93]

C. Wu and Y.H. Cui. High dimensional variable selection in gene-environment interactions. (Manuscript) 2013.

[94]

H. Zou and R.Z. Li. One-step Sparse Estimates in Nonconcave Penalized Likelihood
Models.Ann. Stat. 36(4): 1509–1533, 2008.

116

[95]

P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized
regression, with applications to biological feature selection. Ann. Appl. Stat. 5(1):
232–253, 2011.

[96]

D. L. Donoho and I. M. Johnstone. Ideal denoising in an orthonormal basis chosen
from a library of bases. Biometrika 81: 425–455, 1994.

[97]

X. T. Shen, W. Pan and Y.Z. Zhu. Likelihood-based selection and sharp parameter
estimation. J. Amer. Stat. Assoc. 107: 223–232, 2012.

[98]

L. Xue and A. Qu. Variable selection in high-dimensional varying-coeﬃcient models
with global optimality. J. Mach. Learn. Res. 13: 1973–1998, 2012.

[99]

J.Q. Fan and J.C. Lv. Sure independence screening for ultra-high dimensional feature
space. (with discussion) J. R. Statist. Soc. B, 70: 849–911, 2008.

117