VARIABLE SELECTION IN VARYING MULTI-INDEX COEFFICIENT
MODELS WITH APPLICATIONS TO GENE-ENVIRONMENTAL
INTERACTIONS
By
Shunjie Guan

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Statistics—Doctor of Philosophy
2017

ABSTRACT
VARIABLE SELECTION IN VARYING MULTI-INDEX COEFFICIENT MODELS
WITH APPLICATIONS TO GENE-ENVIRONMENTAL INTERACTIONS
By
Shunjie Guan
Variable selection is an important topic in modern statistics literature. And varying
multi-index coefficient model(VMICM) is a promising tool to study the synergistic
interaction effects between genes and multiple environmental exposures. In this dissertation,
we proposed a variable selection approach for VMICM, we also generalized such approach
to generalized and quantile regression settings. Their theoretical properties, simulation
performance and application in genetic research were studied.
Complicated diseases have both environmental and genetic risk factors, and large amount
of research have been devoted to identify gene-environment (G×E) interaction. Defined as
different effect of a genotype on disease risk in persons with different environmental exposures
(Ottman (1996)), we can view environmental exposures as the modulating factors in the
effect of a gene. Based on this idea, we derived a three stage variable selection approach
to estimate different effects of gene variables: varying, constant and zero which respectively
correspond to nonlinear G×E effect, no G×E effect and no genetic effect. For multiple
environmental exposure variables, we also select and estimate important environmental
variables that contribute to the synergistic interaction effect. We theoretically evaluated
the oracle property of the three step estimation method. We conducted simulation studies
to further evaluate the finite sample performance of the method, considering both continuous
and discrete predictors. Application to a real data set demonstrated the utility of the method.
In Chapter 3, we generalized such variable selection approach to binary response

setting. Instead of minimizing penalized squared error loss, we chose to maximize penalized
log-likelihood function. We also theoretically evaluated the oracle property of the proposed
selection approach in binary response setting. We demonstrated the performance of the
model via simulation. At last, we applied our model to a Type II diabetes data set.
Compared to conditional mean regression, conditional quantile regression could provide
a more comprehensive understanding of the distribution of the response variable at different
quantile. Even if the center of distribution is our only interest, median regression (special
case of quantile regression) could offer a more robust estimator. Hence, we extended our
three stage variable selection approach to a quantile regression setting in Chapter 4. We
demonstrated the finite sample performance of the model via extensive simulation. And we
applied our model to a birth weight data set.

I delicate this dissertation to my parents, Bishan Lin, Ruixiong Guan and my girlfriend
Lingjie Zhou.

iv

ACKNOWLEDGMENTS

Here, I would like to offer my sincere gratitude to my advisor Dr. Yuehua Cui for his
patient, support and guidance during my study and research. Dr. Cui is kind, knowledgable
and humble, whenever I am stuck with a problem, he could always offer very useful idea
and insights. In the genetic journal club hosted by Dr. Cui, I learned a lot from other’s
presentation and eventually, I was able to give a talk on my own. I really own my completion
of this dissertation to Dr. Cui’s guidance.
I also have to thank the members of my PhD committee, Dr. Pingshou Zhong, Dr.
Hyokyoung G. Hong, and Dr Qing Lu. Their comments and suggestions are very helpful in
my graduate study.
I appreciate the help I got from all the professors in the Department of Statistics and
Probability. Especially, Dr. Taps Maiti, I took several of his classes, unlike other courses
that focus on theory and proof, his course in linear modelling focused on interpretation and
how to explain statistical concepts to non-statisticians. It is only when I began to look for
a job in the past months that I realize its importance. I really own my success in finding a
job after graduation to his insistence in interpreting statistics results.
My thanks also goes to other senior students in Dr. Cui’s group, including Dr. Bin Gao,
Dr. Honglang Wang, Dr. Tao He, Jingyi Zhang and post-doctoral researcher Xu Liu. In
just a few years, many of them become successful researchers in a university or statisticians
in big companies.
Last but not least, I thank my parents and my girlfriend for their support and confidence
in me. They are the beacon for my journey, offered me hope and a sense of direction.

v

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chapter 1 Introduction . . . . .
1.1 Overview . . . . . . . . . . . .
1.2 Gene-Environment Interaction
1.3 Non-Parametric Models . . .
1.4 Variable Selection . . . . . . .
1.5 Quantile Regression . . . . . .
1.6 Objective and Organization .

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

ix

.
.
.
.
.
.
.

1
1
2
3
5
8
9

Variable Selection with Varying Multi-Index Coefficients
Model for G×E Interaction . . . . . . . . . . . . . . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Variable Selection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.1 Model Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.2 Estimation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.3 Iterative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.4 Selection of tuning parameters . . . . . . . . . . . . . . . . . . . . . .
2.2.5 Selection of the order h and the number of internal knots K . . . . .
2.2.6 Selection of Initial values for β . . . . . . . . . . . . . . . . . . . . .
Theoretical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.1 Simulation Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.2 The Continuous Cases . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.3 For discrete G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

10
10
13
13
14
16
18
20
20
21
22
23
23
25
29
32

Chapter 2
2.1
2.2

2.3
2.4

2.5
2.6

Chapter 3 Variable Selection for Generalized VMICM . .
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Variable Selection for gVMICM . . . . . . . . . . . . . . .
3.2.1 Model Setup . . . . . . . . . . . . . . . . . . . . . .
3.2.2 Estimation Method . . . . . . . . . . . . . . . . . .
3.2.3 Computational algorithm . . . . . . . . . . . . . . .
3.2.4 Selection of Parameters . . . . . . . . . . . . . . . .
3.2.4.1 Selection of tuning parameters λ1 , λ2 , λ3 .
3.2.4.2 Selection of order h and number of interior
3.2.5 Choosing the initial values . . . . . . . . . . . . . .
3.3 Theoretical Properties . . . . . . . . . . . . . . . . . . . .
vi

. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
knots K
. . . . .
. . . . .

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

35
35
38
38
38
40
42
42
44
44
45

3.4

.
.
.
.
.

46
47
49
54
56

. .
. .
. .
. .
. .
. .
. .
. .
K
. .
. .
. .
. .
. .
. .
. .

59
59
61
62
62
64
66
66
67
67
68
69
69
71
75
79

Chapter 5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . .
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81
81
82

3.5
3.6

Simulation . . . . . . . .
3.4.1 For continuous G
3.4.2 For discrete G . .
Real Data Application .
Discussion . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

Chapter 4 Variable Selection for Quantile VMICM . . . . . . . . . . . .
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Variable selection for quantile regression with VMICM . . . . . . . . . .
4.2.1 Model setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.2 Estimation method . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.3 Estimation algorithm . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.4 Selection of parameters . . . . . . . . . . . . . . . . . . . . . . . .
4.2.4.1 Selection of the tuning parameters λ1 , λ2 , λ3 . . . . . . .
4.2.4.2 Selection of the order h and the number of interior knots
4.2.5 Selection of the initial values . . . . . . . . . . . . . . . . . . . . .
4.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Simulation Setting . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.2 The continuous case . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.3 The discrete case . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4 Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

APPENDICES . . . . . . . . . . . . .
Appendix A Real Data Results . . .
A.1 Real data results of gVMICM .
Appendix B Algorithm . . . . . . . .
B.1 Algorithm for VMICM . . . . .
B.2 Algorithm for model (2.6) . . .
B.3 Algorithm for gVMICM . . . .
B.4 Algorithm for quantile VMICM
B.5 Algorithm for model (4.6) . . .
Appendix C Proof of Theorems . . .
C.1 Proof of Theorem 2.3.1 . . . . .
C.2 Proof of Theorem 2.3.2 . . . . .
C.3 Proof of Theorem 3.3.1 . . . . .
C.4 Proof of Theorem 3.3.2 . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

. 84
. 85
. 85
. 87
. 87
. 90
. 90
. 92
. 93
. 94
. 94
. 98
. 99
. 103

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

vii

LIST OF TABLES

Table 1:

Selection and prediction accuracy of mk (·) for continuous G . . . . . . . .

24

Table 2:

Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 25

Table 3:

Selection and prediction accuracy of mk (·) for discrete G. . . . . . . . . .

27

Table 4:

Prediction accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0)
2

29

Table 5:

The estimated loading parameters . . . . . . . . . . . . . . . . . . . . . .

31

Table 6:

Selection and prediction accuracy of mk (·) for continuous G . . . . . . . .

48

Table 7:

Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 49

Table 8:

Setup for mk (u) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

50

Table 9:

Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . .

51

2

2

Table 10: Estimation accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 53
2
Table 11: The estimated effect of β for SNP rs6537663. . . . . . . . . . . . . . . . .

56

Table 12: Selection and estimation accuracy of mk (·) for continuous G . . . . . . .

70

Table 13: Selection and estimation accuracy of β . . . . . . . . . . . . . . . . . . . .

71

Table 14: Setup for mk (u) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

72

Table 15: Selection and estimation accuracy for mk (u) with discrete G . . . . . . .

73

Table 16: Selection and estimation accuracy for β with discrete G . . . . . . . . . .

75

Table 17: Effect of SNPs in gene ST3GAL1 . . . . . . . . . . . . . . . . . . . . . . .

77

Table 18: Estimated Loading Parameter for Gene ST 3GAL1 . . . . . . . . . . . . .

78

Table 19: List of SNPs with a varying effect. . . . . . . . . . . . . . . . . . . . . . .

85

Table 20: List of SNPs with a constant effect. . . . . . . . . . . . . . . . . . . . . .

86

viii

LIST OF FIGURES

Figure 1:

Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . .

28

Figure 2:

Plot of the varying coefficient effect for gene expression P GGT 1B.

. . .

31

Figure 3:

Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . .

52

Figure 4:

Plot of effects on a log odds scale for SNP rs6537663 . . . . . . . . . . .

56

Figure 5:

Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . .

74

Figure 6:

Plot of interaction effect effects . . . . . . . . . . . . . . . . . . . . . . .

79

ix

Chapter 1
Introduction

1.1

Overview

In this dissertation, we studied how genetic and environmental factors interact to affect
a disease outcome by developing novel statistical methods. Ever since Gregor Mendel’s
famous experiments with his pea plants in the nineteenth century, researchers have been
fascinated with the role of genetics played in our lives. With decades of genetics research,
we knew more than ever the effect of various genes. For example, we knew mutations in
the CFTR gene cause cystic fibrosis, mutations in PAH gene cause phenylketonuria. In fact,
scientists have identified more than 10, 000 human disorders that are caused by mutations
in single genes. However, complex diseases such as type II diabetes have various risk factors:
environmental risk factors such as exercise level, body mass index, genetic risk factors such
as mutations in gene TCF7L2 and ABCC8. Hence, it is of great interest to study how
gene and environment interact and affect various disease traits. In this chapter, we first
provided some background information and a brief review of traditional statistical models
used to study gene-environment(G×E) interaction in section 1.2. To address the constrain
of traditional G×E interaction models, we discussed non-parametric models in section 1.3.
In section 1.4, we discussed the recent advance in variable selection via penalized regression
and how we can apply variable selection in our model to select significant risk factors. We

1

offered a brief review of quantile regression and its benefit in section 1.5. At last, the goal
and organization of this dissertation is offered in section 1.6

1.2

Gene-Environment Interaction

In recent years, more and more research suggested that gene-environment (G×E) interaction
plays an important role in complex traits such as type II diabetes and birth weight. G×E
interaction was defined by Ottman (1996) as “a different effect of a genotype on disease
risk in persons with different environmental exposures”. Traditionally, G×E interaction was
investigated via linear model:

Y = β0 + β G ∗ G + β E ∗ E k + β G×E ∗ G ∗ E k +

(1.1)

where G represents genetic factors; E k represents an environmental factor; β G , β E
represents the genetic effects and environmental effect respectively; and β G×E represents
the G×E effect between genetic factors G and environmental factor E k . However, such
model has several drawbacks. Firstly, it assumes the G×E interaction is linear, which is
often violated in many cases. Secondly, model (1.1) is only computational feasible with
a single environmental factor. With the introduction of multiple environmental factors,
model dimension will increase dramatically, resulting in biased and unstable estimation.
Nevertheless, many epidemiological studies revealed that disease risks can be modified by
simultaneously exposure to several environmental factors (Carpenter et al. (2002); Sexton
and Hattis (2007)). These limitations led to the implementation of several non-parametric
models in the G×E interaction studies.

2

1.3

Non-Parametric Models

In parametric modelling, such as linear model (1.1) or generalized linear model, we
assumed we knew the model structure in advance and we estimated the model parameters
based on the assumed structure and the data set. However, more often than we would
prefer, such assumptions cannot be justified. Which was particularly problematic since
all parameter estimation and model interpretation were made based on those assumptions.
This predicament led us to consider non-parametric modelling. Non-parametric models
make little to no assumptions, it lets the data decide the functional relationships between
the response variable and the predictors. Due to its flexibility, non-parametric models
could be applied to most data set. For example, to address the limitation of linearity in
model (1.1), Ma et al. (2011) proposed to use varying coefficient(VC) model to detect
non-linear gene-environmental interaction. Proposed by Hastie and Tibshirani (1993), a
varying coefficient model could be of the form
p

Y = β0 (X) +

βk (X)Gk +

(1.2)

k=1

where Y is the response; Gk , k = 1, . . . , p is the kth genetic factor; and X is a single
environmental factor and

is the random error. Representing the gene effect of Gk , βk (X)

is a smooth non-linear non-parametric functions indexed by X. Under such structure, βk (X)
is allowed to vary as a function of the environmental factor X. It could either be a linear or
a non-linear function. This is the reason why model (1.2) could be used to detect non-linear
gene-environmental interaction between several genetic variants and a single environmental
variant.
While VC model (1.2) alleviates the linearity constrain of model (1.1), it can only be
3

used to model G×E interaction with a single environmental factor. Similar to linear models,
it is computational infeasible to model how a mixture of environmental factors interact with
genetic variants due to dramatically increasing model dimension. To address this issue, Liu
et al. (2016) proposed we could implement varying multi-index coefficient model (VMICM)
p

Y = m0 (Xβ) +

mk (Xβ)Gk +

(1.3)

k=1

where Y is the continuous response variable; G = (G1 , . . . , Gp ) is a p dimensional
matrix representing genetic factors; X is an environmental factor matrix of dimension q;
β is the loading parameter for environment covariates X; mk (u) is a smooth non-linear
non-parametric function indexed by Xβ, and it represents the gene effect of Gk . One of the
main advantage of model (1.3) is it considers the interaction between genetic variants G and
a mixture of environmental variants X without increasing model dimension dramatically.
We could interpret mk (Xβ) as the gene effect of Gk modulated by its index Xβ. And βd
could be interpreted as the strength of interaction between the d − th environmental factors
Xd and G. We could also observe VC model (1.2) is a special case of VMICM (1.3) when
q = 1. Due to its unique ability to accommodate non-linear G×E interaction with a mixture
of environmental variants, we decide to implement VMICM in this dissertation.
Estimation for non-parametric model such as (1.2) and (1.3) could be roughly grouped
into three categories: kernel smoothing (Fan and Zhang (1999); Xia and Li (1999) ; Cai et
al. (2000)), spline-based methods (Huant et al. (2004); Hoover et al. (1998); Chiang et
al. (2001)) and wavelet estimation (Zhou and You (2004)). In this dissertation, we adopted
the idea of B-spline approximation to estimate non-parametric function mk (u) for several
reasons. Firstly, by some transformation of the B-spline basis function, we would be able to

4

separate the constant effect of Gk from its varying effect. This would be discussed in detail
in the following chapters. Secondly, the computation algorithm of B-spline approximation
is more efficient compared to kernel based methods. It is essential since we are working
with high dimension genetics data. Further, it’s easier to implement variable selection via
penalized regression in a B-spline approximation setting.

1.4

Variable Selection

To select significant gene environmental combo from a large number of variants, we need to
implement some dimension reduction technique, via either variable selection or hypothesis
testing. In this dissertation, we focused on variable selection for its efficient algorithm and
unique feature of simultaneous model selection and estimation. Traditional model selection
techniques include backward/forward selection or information criterion based technique such
as AIC or BIC. However, with the rise of big data, such methods are no longer feasible for
several reasons: exponentially increasing computation time and unstable estimation due to
increasing collinearity. Recently, variable selection via penalized regression has been gaining
popularity. Its idea is to add a penalty term to the loss or lieklihood function. It could be
of the form
p

ˆ = arg min Q(β) = arg min ||Y − Xβ||2 + n
β
pλ (|βj |)
2
β
β
j=1

(1.4)

or
p

ˆ = arg max Q(β) = arg max l(β, X, Y ) − n
β
pλ (|βj |)
β
β
j=1

5

(1.5)

where Y is the dependent variable and X is the predictor; ||Y − Xβ||22 is the squared error
loss function; l(β, X, Y ) is the log-likelihood function; and pλ (|βj |) is the penalty function
for the j-th coordinate of β. By adding an appropriate penalty term to the optimization
ˆ could be shrunk to 0, therefore,
function, some covariates of the penalized estimator β
achieving simultaneous model selection and estimation.
With different choices of penalty function pλ (·), the penalized estimator of β could possess
different properties. Fan and Li (2001) advocated three properties that a penalized estimator
should possess:
(1) Sparsity: The penalized estimator should automatically set small coefficients to zero,
therefore achieving model selection.
(2) Approximately Unbiasedness: The penalized estimator should be approximated
unbiased, especially when the true coefficient is large.
(3) Continuity: The penalized estimator should be continuous in the data, therefore,
reducing instability in model prediction.
To possess all three properties under squared error loss setting, Antoniadis and Fan
(2001) deduced that the penalty function pλ (t) should satisfy: (1) mint≥0 {t + pλ (t)} > 0;
(2) pλ (t) = 0 for large t; (3) arg mint≥0 {t + pλ (t)} = 0. Here, we present several popular
choice of penalty functions:
(1) Ridge Regression: pλ (βj ) = λ|βj |2 (Hoerl and Kennard (1970));
(2) LASSO: pλ (βj ) = λ|βj | (Tibshirani (1996));
(3) Adaptive Lasso: pλ (βj ) = wj λ|βj | (Zou (2006));
6

(aλ−β )+

j
(4) SCAD: pλ (βj ) = λ{I(βj ≤ λ) + (a−1)λ
I(βj > λ)} for some a > 2 (Fan and Li

(2001));
βj
(5) MCP: pλ (βj ) = λ 0 (1 − τsλ )+ ds for some τ > 0 (Zhang (2010)).

Among those penalty functions, adaptive LASSO, SCAD and MCP all possess the
sparsity, approximately unbiasedness and continuity property.

In this dissertation, we

focused on MCP as our penalty function.
To solve the optimization problem (1.4) or (1.5), there are several algorithms available.
Least-angle regression (LARS) (Efron et al. (2004)) could be used to calculate the entire
solution path of the LASSO problem very efficiently. Fan and Li (2001) proposed local
quadratic approximation (LQA) algorithm to solve the non-concave penalized likelihood
problem. It can be implemented with a number of penalty functions. Their idea was to locally
approximate the optimization function with a quadratic function. Therefore, transforming
problem (1.4) or (1.5) to a least square problem with a closed form solution. Even though
its efficient was surpassed by recent algorithms, the idea of LQA is still of great importance.
Building on the idea of LQA, Zou and Li (2008) proposed local linear approximation (LLA)
algorithm. Their idea was to locally approximate the non-concave penalty function linearly.
Transforming the non-concave penalty to a LASSO penalty, which could be solved using
LARS. Coordinate descent algorithm (Friedman et al. (2007), Friedman et al. (2010)) is
an even more efficient algorithm. With small modification, it could be implemented to a
wide range of penalized optimization problems. In this dissertation, we adopted the idea of
coordinate descent and LQA to solve our optimization problem.
Besides the usual penalized regression that select individual parameters, there is a group
analog, call grouped penalized regression (Yuan and Lin (2006)). Instead of penalizing
7

individual parameter, it penalizes the L2 norm of a group of parameters. Assume β is
divided into q groups β = (β 1 , . . . , β q ), and the objective function could be of the form
q

ˆ = arg min Q(β) = arg min ||Y − Xβ||2 + n
pλ (||β j ||2 ) .
β
2
β
β
j=1

(1.6)

The end result is that we select non-zero parameters as a group, either a group of parameters
being all zero or none of them being zero. The grouped penalized regression technique is
particularly useful in this dissertation.

1.5

Quantile Regression

Another objective of this dissertation was to extend the proposed variable selection for
VMICM to quantile regression setting. Quantile regression is a very important alternative
to the conventional conditional mean regression. It differs from mean regression in the loss
function. Instead of trying to minimize the squared error loss for linear model, quantile
regression tries to minimize the quantile loss function
n

ρτ (Yi − X i β) and ρτ (u) = u{τ − I(u < 0)}.

(1.7)

i=1

Quantile regression also possess several advantages. First, modelling a dataset at several
different quantiles offers a far more comprehensive view of the distribution of the response
variable.

In many scenarios, the effect of gene Gk varies at different quantile of the

distribution. Further, even when we are only interested in the center of the distribution,
median regression (quantile regression with τ = 0.5) can provides more robust estimator,
therefore, insensitive to outliers.
8

1.6

Objective and Organization

To select non-linear gene-environment interaction, we proposed a three stages iterative
variable selection approach for Varying Multi-Index Coefficient Model and its generalization.
We also extended such approach to a quantile regression setting. One of our goal is to
classify genetic variants into three categories: varying, constant and zero. Varying effect
gene is the gene that interact with environmental factors. Its effect on the response varies
as environmental factors changes. Constant effect gene is the gene that only has a constant
effect, not being modulated by environmental factors. Zero effect gene does not have an
effect at all. By selecting non-zero loading parameters β in model (1.3), another goal of the
proposed approach is to select environmental variants that interact with genetic variants.
The rest of the dissertation is organized as follow. In chapter 2, we presented the variable
selection approach for VMICM, its estimation method, and theoretical properties. We
conducted extensive simulation studies to evaluate the finite sample performance of the
proposed method. The utility of our model was demonstrated with a real data analysis. In
chapter 3, we generalized our selection approach to a binary response generalized regression
setting. We extended the model to a quantile regression setting in chapter 4, followed by
conclusion and further works in chapter 5. At last, all the proofs and details of the algorithms
were rendered in the Appendices.

9

Chapter 2
Variable Selection with Varying
Multi-Index Coefficients Model for
G×E Interaction

2.1

Introduction

Gene-environment (G×E) interaction study has been gaining popularity. As discussed in
chapter 1, varying multi-index coefficient model (VMICM) enjoys the unique ability to model
non-linear interaction between genetic variants and a mixture of environmental variants.
In this chapter, we propose a variable selection model for VMICM. Consider the varying
multi-index coefficient model of the form:
p

Y =

mk (Xβ)Gk +

(2.1)

k=0

where Y = (Y1 , Y2 , . . . , Yn )T is a continuous response variable that measures certain
phenotypic trait of interest; X is a q-dimensional environmental exposure variables; G is
a p + 1 dimensional genetic variables; mk (·), k = 0, 1, . . . , p is the unknown non-parametric
function and β is a vector of unknown loading parameter of dimension q. One of the main

10

advantage of VMICM is that it models the effects of G on Y as functions of X without
suffering the curse of dimensionality. One can interpret mk (Xβ) as the effect of Gk on Y ,
modified by multiple X variables though the index Xβ. In addition, VMICM is flexible
enough to cover a wide range of models. For instance, if q = 1 and β = 1 then it becomes
an additive varying coefficient model, and if p = 1 and G = 1 then it becomes a standard
additive single index model.
Variable selection has been a popular statistical strategy to solve large p small n
problem in a regression setup. In the past, researchers often opted for forward/backward
selection, and information based criteria such as AIC and BIC for variable selection.
Recently, variable selection via penalized regression is gaining more popularity and wider
acceptance as it features simultaneous selection and estimation of parameters. Its idea is to
add a penalty function to the loss function or log-likelihood function. Bridge regression
(Frank and Friedman (1993)), least absolute shrinkage and selection operator (LASSO)
(Tibshirani (1996)) and its extensions (adaptive-LASSO Zou (2006)), smoothly clipped
absolute deviation (SCAD) (Fan and Li (2001)) and minimax concave penalty (MCP)
(Zhang (2010)) are a few examples. To evaluate different penalized functions, Fan and
Li (2001) proposed three important criteria: sparsity, unbiasedness, and continuity. They
also showed that SCAD penalty possess oracle property, meaning that penalized regression
featuring SCAD works as well as if the correct sub-model was known in advance. Adaptive
LASSO (Zou (2006)), SCAD (Fan and Li (2001)) and MCP (Zhang (2010)) all possess oracle
property. However, for adaptive LASSO, determining weights for parameters might become
problematic when p > n. In the current work, we adopt MCP penalty function for its oracle
property and fast algorithm.
Considering the nonlinear structure about the unknown non-parametric functions mk (·)
11

and its unknown parameter β. We propose a three stage iterative variable selection strategy.
Specifically, our goal is: (1) to classify the non-parametric functions mk (·), k = 1, . . . , p into
three categories: varying, constant and zero; (2) to select zero and non-zero components of
loading parameters β; and (3) to estimate mk (·), k = 0, 1, . . . , p and β. Our approach is
motivated by the practical need to separate three different mechanisms in G×E interaction.
The zero function of mk (·) indicates no genetic effect at all; the constant function of mk (·)
indicates no G×E effect; while the varying function of mk (·) indicates the existence of G×E
effect. As shown in Liu et al. (2016), the VMICM model has the advantage to capture
the joint interaction of genes with multiple exposures as a whole. Novel insights about the
underlying genetic mechanism can be revealed by the proposed model. In addition to the
selection of the coefficient functions, we can also select important loading parameters inside
each index coefficient function, to further quantify the relative importance of individual
exposure variables.
Feng and Xue (2013) proposed a variable selection approach based on VMICM by
applying a group SCAD penalty on B-spline coefficients γ and loading parameters β. They
focused on either zero or nonzero coefficient functions mk (·). We are particularly interested
in the constant coefficient since it corresponds to no G×E effect and has important practical
implications. Tang et al. (2012) proposed a two step variable selection approach based on
an additive varying-coefficient model. They classified the non-parametric function into three
categories: varying, constant or zero. However, their model is a special case of our VMICM
model with the dimension of the X variable being one. No variable selection approach on
VMICM has been proposed to classify unknown non-parametric functions mk (·) into three
categories (varying, constant or zero), while at the same time selecting non-zero loading
parameter β. Following their previous work, we use B-spline basis functions to approximate
12

unknown non-parametric functions mk (·), then using penalized regression to classify mk (·)
into varying, constant or zero. In addition, we select non-zero β via first order approximation
and penalized regression. We show that under mild regulatory conditions, our estimators
possess the oracle property, indicating that our penalized estimator works as well as if the
correct sub-model is known in advance.
The rest of the chapter is organized as follows. Section 2.2 introduces our proposed
variable selection approach, including estimation method, iteration approach, and how to
select various tuning parameters. Method on how to select initial values for β is also
discussed. In Section 2.3, we evaluate the theoretical properties of our approach. In Section
2.4, we perform simulations to evaluate the performance of our approach in finite samples,
followed by a real data application in Section 2.5 and a discussion.

2.2

Variable Selection Method

Throughout the chapter, superscript T is used to denote matrix transpose, ||.||p is used
to denote Lp norm, and log(a) is used to denote natural logarithm of a. For the sake of
simplicity, we use constant and non-zero constant interchangeably.

2.2.1

Model Setup

The varying multi-index coefficient model is set up as follows:
p

Y =

mk (Xβ)Gk +
k=0

13

where Y n×1 = (Y1 , Y2 , . . . , Yn )T is a continuous response variable; n is the sample size.
mk (·), k = 0, 1, . . . , p are p + 1 unknown continuous functions; X n×q = (X 1 , X 2 , . . . , X q )
are continuous loading covariates; β q×1 = (β1 , . . . , βq )T are the loading parameters;
Gn×(p+1) = (G0 , G1 , . . . , Gp ), G0 = (1, . . . , 1)T and Gk = (G1k , G2k , . . . , Gnk )T is a
continuous or discrete vector of length n for k = 1, 2, . . . p. In the model, mk (Xβ) is the
effect of Gk on Y for k = 0 and m0 (Xβ) is the intercept function which models the marginal
effect of multiple X variables on Y . The error term

is an unknown random error with

mean 0 and variance σ 2 . We further assumed i and j are independent ∀ 1 ≤ i, j ≤ n and
i = j.

2.2.2

Estimation Method

Our goal was to select and estimate unknown functions {mk (·)}k=0,1,...,p and unknown
loading parameter β = (β1 , . . . , βq )T . For identifiability purpose, we assume ||β||2 = 1
and β1 > 0, and mk (·) cannot has the form of mj (u) = αT uβ T u + γ T u + c.
We approximated the unknown function {mk (u)}k=0,1,...,p using B-spline basis functions.
Without loss of generality, we assumed u ∈ [0, 1]. Let K be the number of internal knots, h be
the degree of the B-spline basis function. So h = 1 represents linear splines, h = 2 represents
quadratic splines. Denote u1 , u2 , ..., uK be internal knots satisfying 0 = u0 =< u1 < u2 <
.... < uK < uK+1 = 1. Denote Int to be left closed, right opened interval [ut−1 , ut ) for
1 ≤ t ≤ K; InK+1 to be closed interval [uK , uK+1 ]. Denote F to be a collection of
functions f defined on [0, 1] satisfying: (i) the restriction of f to Int is a polynomial of
degree h or less for 1 ≤ j ≤ K + 1; (ii) f is h − 1 times continuous differentiable on [0, 1].
Let L = K + h + 1, by (Schumaker (2007)), we have normalized B-spline basis function
B(u) = (B 1 (u), B 2 (u), . . . , B L (u)) for F . And there exists a linear transformation matrix
14

¯
Π, such that ΠB(u) = (1, B(u))
= (1, B 2 (u), B 3 (u), . . . , B L (u)) = B(u) where each
¯
component of B(u)
is a function of u. Then for 0 ≤ k ≤ p, we can estimate mk (u) by

¯
mk (u) ≈ (1, B2 (u), · · · , BL (u)) ∗ (γk1 , γk2 , · · · , γkL )T = B(u)γ k = γk1 + B(u)γ
k∗ (2.2)

where γ k∗ = (γk2 , γk3 , . . . , γkL )T and γ k = (γk1 , γ Tk∗ )T . With B-spline approximation, (2.1)
can be rewritten as
p

¯
{γk1 + B(Xβ)γ
k∗ }Gk + .

Y =

(2.3)

k=0

Thus, the original estimation problem can be transformed to estimate {γk1 , γ k∗ }k=0,1,...,p
and β. Note: the transformation Π enable us to separate the constant effect of Gk on Y
from its jointly effect with X on Y . That is: (1) if ||γ k∗ ||2 = (

L
2 1/2
l=2 γkl )

= 0, then there

exists interaction between Gk and multiple X; (2) if ||γ k∗ ||2 = 0 and |γk1 | = 0, then Gk
has a constant effect on Y , i.e., no G×E interaction effect; and (3) if further ||γ k∗ ||2 = 0
and |γk1 | = 0 then Gk has no effect on Y at all.
To select and estimate the parameters {γ k }k=0,1,...,p and β, we adopted the penalized
regression idea and minimized the following objective function:
p

n

g(Yi −

Q(β, γ) =
i=1

p

¯
[γk1 + B(Xβ)γ
k∗ ]Gik ) + n
k=0

k=1

p

q

pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) + n

+n

pλ1 (||γ k∗ ||2 )

k=1

(2.4)

pλ3 (|βd |)
d=2

where g(·) is a loss function; pλ1 (·), pλ2 (·), pλ3 (·) are penalty function of the corresponding
parameters; and I(·) is an indicator function.
Remarks: (1) From the construction of the penalty function, we penalized γk1 only if
15

||γ k∗ ||2 = 0. If ||γ k∗ ||2 = 0, it implies that the function is varying and no need to penalize
the constant part.
(2) No penalty was applied to the intercept function m0 (·) as both γ 0∗ and γ01 was not
involved in the penalty term. There is no practical motivation of penalizing the marginal
intercept function.
(3) No penalty was applied to the first loading parameter β1 in β due to the constraint.
For the penalty function, we used MCP penalty proposed by C. Zhang (Zhang (2010)),
p(x, λ) = λ 0x (1 − τsλ )+ ds with regularization parameters τ > 0 and λ > 0. For the loss
function, we set g(·) to be squared error loss, (2.4) can be further rewritten as :

p

n

Yi −

Q(β, γ) =
i=1

¯
[γk1 + B(Xβ)γ
k∗ ]Gik

2

p

k=0

k=1

p

q

pλ2 (|γk1 |)I(||γ k∗ ||1 = 0) + n

+n

pλ1 (||γ k∗ ||1 )

+n

k=1

(2.5)

pλ3 (|βd |)
d=2

where pλ1 (·), pλ2 (·), pλ3 (·) are the MCP penalty function defined above.

2.2.3

Iterative Approach

Our modeling purpose was to separate the mk (·) function into three different categories:
varying, constat or zero, denoted by V, C and Z respectively. Following the ideas by Feng
and Xue (2013) and Tang et al. (2012), we adopted a three step iterative approach to obtain
our penalized estimator.
ˆ (0) , we obtained our 1st step estimation
Step 1 : For given initial values of β, denoted by β

16

(1) (1)T T
ˆ (1) = {ˆ
ˆ k∗ }k=0,1,...,p by following a group penalized regression,
of γ, denoted by γ
γk1 , γ

ˆ (0) )
ˆ (1) = arg min Q1 (γ|λ1 , β
γ
γ

where

ˆ (0) ) =
Q1 (γ|λ1 , β

p

n

{Yi −
i=1

p

ˆ (0) )γ k∗ ]Gik }2 + n
¯
[γk1 + B(X
β

k=0

pλ1 (||γ k∗ ||2 ).
k=1

Note that instead of penalizing each coordinate of γ k∗ = (γk2 , . . . , γkL )T separately, we
penalized the L2 norm of γ k∗ because we want to assess the presence of joint varying effect of
X and Gk on Y . No penalty was applied to γ 0∗ (the B-spline coefficients for the intercept
function m0 (·)). Step 1 separates mk (·), k = 1, . . . , p into two categories: varying(V) or
(1)

(1)

non-varying(NV), and mk (·) ∈ V if ||ˆ
γ k∗ ||2 > 0 and mk (·) ∈ N V if ||ˆ
γ k∗ ||2 = 0.
(1)

ˆ k∗ = 0. From the non-varying
Step 2 : The aim of this step was to select γk1 given γ
functions obtained from step 1, we would like to further select the variables with constant
effects, and classify the non-parametric functions into constant(C) and zero (0). We penalized
(1)

γk1 only when ||ˆ
γ k∗ ||2 = 0, k = 1, 2, . . . , p, and no penalty is applied to γ01 .
(2)

(2)

(2)

ˆ (2) = {(ˆ
ˆ k∗ )k∈V , (ˆ
We obtained our step 2 estimator γ
γk1 , γ
γk1 )k∈N V } via penalized
regression
ˆ (2) = arg min Q2 (γ|λ2 , β (0) , γ
ˆ (1) )
γ
γ

17

where
n

Q2 (γ|λ2

ˆ (1) )
, β (0) , γ

(2)
(0) )γ (2) ]G −
¯
[γk1 + B(Xβ
k
k∗

{Yi −

=
i=1

k∈V

(2)

γk1 Gk }2
k∈N V

p
(2)

(1)

pλ2 (|γk1 |)I(||ˆ
γ k∗ ||2 = 0).

+n
k=1

After Step 1 and 2, we had obtained the estimator of the B-spline coefficients γ and
classify mk (·) k = 1, . . . , p into V, C or 0. The next step is to select loading parameter β
ˆ (2) .
given γ
ˆ via penalized regression
Step 3 : We obtained β

ˆ = arg min Q3 (β|λ3 , γ
ˆ (2) )
β
||β ||2 =1
where
n

Q3 (β|λ3

ˆ (2) )
,γ

=

p

(2)
(Yi −
[ˆ
γk1
i=1
k=0

q
(2)
¯
+ B(Xβ)ˆ
γ k∗ ]Gk )2

pλ3 (|βd |).

+n
d=2

ˆ (0) by β,
ˆ and iterate step 1 to 3 until convergence. The algorithms used
Then we replaced β
in each step would be discussed in Appendices.

2.2.4

Selection of tuning parameters

We proposed the following three step tuning parameters selection process based on Bayesian
Information Criterion (BIC) (Schwarz (1978)).

18

Step 1: We took λ1 as the minimizer of
p

n

log(n)
(λ )
(λ )
ˆ (0) )ˆ
¯
∗ dfλ1
[ˆ
γk11 + B(X
β
γ k∗1 ]Gk )2 +
n

(Yi −

BIC(λ1 ) = log
i=1

k=0

(λ ) (λ1 )
ˆ (0) ) defined above; β
ˆ (0) is
ˆ k∗ }k=0,1,...,p are the minimizers of Q1 (γ|λ1 , β
where {ˆ
γk11 , γ

chosen as the estimator from previous iteration; and dfλ1 is defined as the total number of
non zero coefficients if λ1 is the penalized parameter.
Step 2: We took λ2 as the minimizer of
p

n

log(n)
(λ )
(λ )
ˆ (0) )ˆ
¯
∗ dfλ2
[ˆ
γk12 + B(X
β
γ k∗2 ]Gk )2 +
n

(Yi −

BIC(λ2 ) = log
i=1

k=0

(λ ) (λ2 )
ˆ (0) ) defined above; β
ˆ (0) is
ˆ k∗ }k=0,1,...,p are the minimizers of Q2 (γ|λ2 , β
where {ˆ
γk12 , γ

chosen as the estimator from previous iteration; and dfλ2 is defined as the total number of
non zero coefficients if λ2 is the penalized parameter.
Step 3: We took λ3 as the minimizer of
p

n

(Yi −

BIC(λ3 ) = log
i=1

logn
(λ )
(λ )
ˆ (λ3 ) )T γ
¯
ˆ k∗2 ]Gk )2 +
[ˆ
γk12 + B(X
β
∗ dfλ3
n

k=0

(λ ) (λ2 )
ˆ (λ3 ) is the minimizer of Q3 (β|λ3 , γ
ˆ (2) ) defined above; {ˆ
ˆ k∗ }k=0,1,...,p are
where β
γk12 , γ

minimizer of the B-spline coefficient from Step 2; and dfλ3 is defined as the total number of
non zero β if λ3 is the penalized parameter. We searched the optimal of λ1 , λ2 , λ3 over a
grid of 100 exponentially decreasing values with the minimum to be 1E-3, and the maximum
of λ1 , λ2 , λ3 were set to be the minimum value such that all of the penalized estimators are
0.

19

2.2.5

Selection of the order h and the number of internal knots K

Since h is the order of the B-spline basis function, higher degree corresponds to more
complicated interactions and is less interpretable in practice. For instance, h = 2 implies
quadratic splines while h = 3 implies cubic splines. Hence, we searched optimal h over
1

the set h ∈ {2, 3, 4}. As for the selection of K, since only when K = Op (n 2r+1 ) (n is
the number of samples and r is the smoothness of our nonparametric function mk (·) and
r > 2), our selection approach possesses oracle properties. So we can search optimal K in
1

the neighborhood of n 2r+1 , which is denoted by K . In our simulation, K = {2, 3, 4, 5}.
In theory, we can select the optimal order and the interior knots for all the nonparametric
functions. However, this is practically infeasible due to the large search space and the
computational cost. Thus, we assumed that all the nonparametric functions share common
h and K, and fit the following intercept only model to select the optimal h and K,

Y = m0 (Xβ) + .

(2.6)

We searched the optimal K and h over a grid K ∈ K and h ∈ {2, 3, 4}. The optimal K and
h is defined as the K, h that minimize log

n
ˆ
i=1 (Yi

− Yi )2 +

log(n)
n (K

+ h + 1), where Yˆi is

the prediction of the ith subject under model (2.6).

2.2.6

Selection of Initial values for β

The algorithm described above needs a reasonable initial value for β (denoted by β initial ) to
start the iteration. Based on the optimal K and h selected, we fit the intercept only model
(2.6) via B-spline approximation and Newton-Raplson algorithm. β initial was set to be the
estimator for β in (2.6).
20

2.3

Theoretical Properties

Let β 0 and m0k (·), k = 0, 1, . . . , p be the true value of β and mk (·), respectively, and denote
γ 0 be the true value of the B-spline coefficient γ. Without loss of generality, we assumed
βl0 = 0 for l = 1, . . . s, βl0 = 0 for l = s + 1, . . . q; m0k (·) is varying for k = 0, 1, . . . , v, m0k (·) is
non-zero constant for k = v + 1, . . . , c and m0k (·) is zero for k = c + 1, . . . , p. The following
theorems established the consistency of the penalized least square estimators.
Theorem 2.3.1. Assume the regulatory conditions (A1)-(A7) in Appendices hold and the
number of knots K = Op (n1/(2r+1) ). Then
ˆ − β 0 || = Op (n−r/(2r+1) + an );
(i) ||β
(ii) ||m
ˆ k (·) − m0k (·)|| = Op (n−r/(2r+1) + an ), k = 1, . . . , q
where

0 |), p (|β 0 |), γ 0 = 0, γ 0 = 0, β 0 = 0}
an = max{pλ (||γ 0k∗ ||2 ), pλ (|γj1
j1
λ3
l
k∗
l
1
2
k,j,l

and k, j = 1, . . . , p, l = 2, . . . , q, and r is defined in Appendices.
Theorem 2.3.2. Assume the regularity conditions (A1)-(A7) the Appendices hold and the
number of knots K = Op (n1/(2r+1) ). Let

λmax = max{λ1 , λ2 , λ3 }, λmin = min{λ1 , λ2 , λ3 }.

Suppose λmax → 0 and nr/(2r+1) λmin → ∞ as n → ∞.Then with probability approaching
ˆ and m
1, β
ˆ k (·) must satisfy:
(i) βˆj = 0 for j = s + 1, . . . , q
(ii) m
ˆ k (·) = ck for k = v + 1, . . . , c where ck is some non-zero constant
21

(iii) m
ˆ k (·) = 0 for k = c + 1, . . . , p
Theorem 2.3.1 and 2.3.2 show that our proposed variable selection approach is consistent
and possesses oracle property.

2.4

Simulation

We conducted extensive simulation to evaluate the performance of our proposed approach.
The performance is measured in several ways: (1) classification accuracy for function m(·),
denoted as oracle percentage; (2) IMSE of the estimated m(·) function; (3) selection accuracy
of β; and (4) estimation accuracy of β(MSE). Denote R as the total number of simulations.
Oracle percentage of m(·) is defined as the percentage of correct classification out of
a total of R simulations. For example, if mk (·) ∈ V , and out of R simulations, mk (·) is
g
× 100%.
classified as varying for g times, then the oracle percentage of mk (·) is R

IMSE of mk (·) is defined as

1
IM SE =
R

R

[
r=1

1
ngrid

ngrid
(r)

(r)

¯ j )ˆ
(ˆ
γk1 + B(u
γ k∗ − mk (uj ))2 ]
j=1

where ngrid is the number of points that we want to estimate the MSE of the predicted
(r)

(r)

ˆ k∗ and γˆk1 are the estimators of the B-spline coefficients for the rth simulation
function; γ
ˆ (r) is the estimator of the loading parameter β
using the proposed estimation approach; β
for the rth simulation; and uj is taken at the j/ngrid × 100% quantile among the range of
ˆ
Xβ

(r)

. For our simulations, ngrid is set to be 100.

Oracle percentage of β is defined as the percentage of correct selection of β out of R
simulations. For example, if βd = 0 and out of R simulations, βd is selected to be non-zero

22

g
for g times, then the oracle percentage of βd is R
× 100%.
1
MSE of βd is calculated as R

R
2
ˆ(r)
r=1 (βd − βd )

(r)
where βˆd is the estimator for βd in the

rth simulation.

2.4.1

Simulation Setting

The simulation data was generated according to the following model,
p

Y = m0 (Xβ) +

mk (Xβ)Gk +
k=1

where X was generated from a U nif (0, 1) distribution; β = (β1 , β2 , . . . , βq )T ; β1 = β2 = √1 ;
2
and the rest βj s are zeros. We evaluated the performance of the proposed approach with
both continuous and discrete predictors G. For continuous G variables, they can be gene
expressions. For discrete G variables, they can be SNP variants. In either case, the dimension
of G can be large.

2.4.2

The Continuous Cases

In the continuous case, the non-parametric functions mk (u) are defined as follows: m0 (u) =
2sin(2πu), m1 (u) = 2cos(πu) + 2 and m2 (u) = sin(2πu) + cos(πu) + 1 are varying functions;
m3 (u) = 2 and m4 (u) = 2.5 are non-zero constants; mk (u) = 0 for k = 5, . . . , p are zeros.
The number of loading parameters is set as q = 5 and β1 = β2 = √1 , β3 = β4 = β5 = 0. G
2

was generated randomly from a N (0, 1) distribution,

∼ N (0, 1). We ran 1000 simulations

(R = 1000) to evaluate the performance of the proposed model under p = 50, 100.
Table 1 demonstrates the selection and estimation accuracy for non-parametric functions
with continuous G. The left and right penal corresponds to the case where p = 50 and
23

100 respectively. For all the cases, the selection accuracy (oracle %) is very closed to 100%.
IMSE for varying functions (m0 (·), m1 (·) and m2 (·)) is in the order of -2, and the IMSE
for constant functions (m3 (·) and m4 (·)) is in the order of -3. All of the model IMSE and
oracle IMSE are in the same order. Overall, the differences in oracle percentage and IMSE
are negligible between the case p = 50 and the case p = 100. These observations suggest
that our proposed variable selection approach possesses reasonable selection and prediction
accuracy for non-parametric function mk (·).
Table 1: Selection and prediction accuracy of mk (·) for continuous G
p = 50

p = 100

Oracle %

IMSE
Model
Oracle

Oracle %

IMSE
Model
Oracle

m0 (·)
m1 (·)
m2 (·)
m3 (·)
m4 (·)
Zero

100.0%
99.6%
99.9%
100.0%
100.0%
99.7%

3.87E-02 4.27E-02
1.58E-02 2.42E-02
2.33E-02 2.58E-02
2.09E-03 2.11E-03
2.04E-03 2.06E-03
1.94E-05
0

100.0%
99.9%
99.9%
100.0%
100.0%
99.9%

3.77E-02 4.51E-02
1.57E-02 3.14E-02
2.26E-02 2.96E-02
1.90E-03 1.97E-03
2.07E-03 2.12E-03
1.12E-05
0

m0 (·)
m1 (·)
m2 (·)
n = 1000
m3 (·)
m4 (·)
Zero

100.0%
100.0%
100.0%
100.0%
100.0%
99.8%

3.23E-02
7.17E-03
1.46E-02
1.02E-03
1.09E-03
8.50E-06

100.0%
100.0%
100.0%
100.0%
100.0%
99.9%

3.31E-02
7.07E-03
1.46E-02
9.60E-04
1.06E-03
3.46E-06

n = 500

3.40E-02
1.21E-02
1.59E-02
1.02E-03
1.09E-03
0

3.47E-02
1.17E-02
1.64E-02
9.55E-04
1.07E-03
0

Table 2 presents the selection and prediction accuracy for loading parameter β. The left
and the right penal correspond to the case where p = 50 and 100 respectively. It shows that
the selection accuracy for all β is reasonably good (> 98% in all cases). For most of the
β, the MSE is in the order of -4 or lower, except for β2 , which was -3 for both p = 50 and
p = 100 when n = 500. The order of the model estimation for β is at least the same as that
of the oracle model if not lower. And we did not observe a difference in performance between
24

the case p = 50 and the case p = 100. These results indicate that our model possesses good
selection and prediction accuracy for loading parameters β.
Table 2: Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0)
2
p = 50

p = 100

Oracle %

MSE
Model
Oracle

Oracle %

MSE
Model
Oracle

β1
β2
β3
β4
β5

100.0%
100.0%
98.1%
98.8%
98.6%

1.15E-04 1.07E-04
8.04E-03 4.12E-03
9.98E-05
0
2.99E-05
0
1.00E-04
0

100.0%
100.0%
98.2%
99.1%
98.5%

1.17E-04 1.30E-04
2.26E-03 7.62E-03
3.64E-05
0
3.13E-05
0
7.73E-05
0

β1
β2
n = 1000 β3
β4
β5

100.0%
100.0%
98.9%
99.4%
99.1%

5.30E-05 5.52E-05
5.34E-05 1.86E-03
9.36E-06
0
6.30E-06
0
7.17E-06
0

100.0%
100.0%
98.8%
99.5%
99.0%

5.00E-05 5.49E-05
5.04E-05 1.79E-03
1.16E-05
0
5.49E-06
0
6.93E-06
0

n = 500

2.4.3

For discrete G

We further evaluated how our proposed model performs with the discrete G.

In this

simulation, each G variable was simulated from a multinomial distributions with minor
allele frequency (MAF) denoted as Pa . The G variable took values 0, 1, and 2 corresponding
to the genotype aa, Aa, and AA where a is the minor allele. The frequencies corresponding
to the three genotypes (aa, Aa, and AA) are Pa2 , 2Pa (1 − Pa ) and (1 − Pa )2 . We varied
the MAF for different G variables to evaluate the impact of the MAF on the selection
performance. Specifically, for k = 1, 2, 7, Pa = 0.5; for k = 3, 4, 8, Pa = 0.3; for k = 5, 6, 9,
Pa = 0.1, and for k = 10, 11, . . . , p, Pa ∼ U nif (0.05, 0.5). For the non-parametric functions,
we set m0 (u) = 2sin(2πu), m1 (u) = m3 (u) = m5 (u) = 2cos(πu) + 2, m2 (u) = m4 (u) =
m6 (u) = sin(2πu) + cos(πu) + 1; m7 (u) = m8 (u) = m9 (u) = 2; and mk (u) = 0 for
25

k = 10, 11, . . . , p. Under the setup, we have both varying and constant effect with different
minor allele frequencies. For the zero effects, the MAF for Gk ranged uniformly from 0.05
to 0.5. X was generated from U nif (0, 1) and

was generated from N (0, 1). Finally, Y was

generated according to model (2.1). We evaluated the performance of our proposed model
via 1000 simulations under p = 50, 100 and n = 500, 1000.
Table 3 and figure 1 present the selection and estimation accuracy of non-parametric
function mk (·) for discrete G. We observed that the oracle percentage is very high (> 99%)
for all cases, indicating our proposed model can correctly select the coefficient functions with
high accuracy. The IMSE for varying functions was of the order −1 or −2, while the IMSE
for constant functions was of the order −2 or−3. Moreover, the IMSE of the proposed model
was in the same order of the IMSE of the oracle model. With the decrease of minor allele
frequency Pa of Gk (from 0.5 to 0.1), we observed an increase in both model IMSE and
oracle IMSE. This is consistent with our expectation since SNP with lower MAF provides
less information. We did not observe a difference in oracle percentage and IMSE between
the case p = 50 and p = 100. This suggests that our model performs reasonably well in
both cases. The case n = 1000 performed slightly better than the case n = 500, and it is
consistent with the asymptotic property of the model. Overall, the proposed model performs
reasonably well in the selection and estimation of non-parametric functions.

26

Table 3: Selection and prediction accuracy of mk (·) for discrete G.

Oracle %

p = 50
IMSE
Model
Oracle

Oracle %

p = 100
IMSE
Model
Oracle

m0 (·)
m1 (·)
m2 (·)
m3 (·)
m4 (·)
m5 (·)
m6 (·)
m7 (·)
m8 (·)
m9 (·)
Zero

100.0%
99.7%
99.7%
99.7%
99.6%
99.6%
99.6%
100.0%
100.0%
100.0%
99.6%

4.67E-02
3.71E-02
4.56E-02
4.39E-02
5.13E-02
1.15E-01
1.29E-01
4.53E-03
5.34E-03
1.27E-02
1.82E-04

4.34E-02
3.54E-02
4.26E-02
4.19E-02
4.93E-02
1.53E-01
1.24E-01
4.50E-03
5.30E-03
1.26E-02
0

100.0%
99.6%
99.4%
99.4%
99.4%
99.2%
99.3%
99.9%
99.8%
99.9%
99.6%

5.06E-02 5.22E-02
4.02E-02 4.53E-02
4.92E-02 4.77E-02
4.95E-02 5.44E-02
5.68E-02 5.45E-02
1.33E-01 2.02E-01
1.33E-01 1.30E-01
4.74E-03 4.56E-03
6.26E-03 6.03E-03
1.21E-02 1.24E-02
1.28E-04
0

m0 (·)
m1 (·)
m2 (·)
m3 (·)
m4 (·)
n = 1000 m5 (·)
m6 (·)
m7 (·)
m8 (·)
m9 (·)
Zero

100.0%
99.9%
99.9%
99.9%
99.9%
99.9%
99.9%
100.0%
100.0%
100.0%
99.8%

3.11E-02
1.44E-02
2.16E-02
1.69E-02
2.35E-02
4.10E-02
4.62E-02
2.29E-03
2.67E-03
5.61E-03
3.10E-05

3.31E-02
1.96E-02
2.37E-02
2.19E-02
2.50E-02
4.55E-02
5.01E-02
2.34E-03
2.66E-03
5.69E-03
0

100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
99.9%

3.11E-02
1.44E-02
2.13E-02
1.74E-02
2.40E-02
4.25E-02
4.65E-02
1.89E-03
2.62E-03
6.03E-03
1.15E-05

n = 500

27

3.20E-02
1.93E-02
2.34E-02
2.20E-02
2.56E-02
4.74E-02
4.85E-02
1.89E-03
2.63E-03
6.08E-03
0

Model IMSE for varying effect Gk

1.0

Oracle percentage for varying effect Gk

Orable IMSE for varying effect Gk

0.15

n=500 p=50
n=500 p=100
n=1000 p=50
n=1000 p=100

0.4

0.5

0.05
0.1

0.3

0.4

Oracle percentage for constant effect Gk

Model IMSE for constant effect Gk

0.5

0.1

0.2

0.4

0.5

0.5

n=500 p=50
n=500 p=100
n=1000 p=50
n=1000 p=100

orable IMSE

0.004
0.002
0.000

0.002
0.000
0.3
Minor Allele Freq.

0.4

0.010

0.010
IMSE

0.006
0.004

0.1

0.3

Orable IMSE for constant effect Gk

0.008

0.8
0.6
0.4
0.0

n=500 p=50
n=500 p=100
n=1000 p=50
n=1000 p=100

0.2

Minor Allele Freq.

n=500 p=50
n=500 p=100
n=1000 p=50
n=1000 p=100

0.012

1.0

Minor Allele Freq.

0.2

Oracle Percentage

0.2

0.012

0.3
Minor Allele Freq.

0.008

0.2

0.006

0.1

0.00

0.0

n=500 p=50
n=500 p=100
n=1000 p=50
n=1000 p=100

0.00

0.2

0.05

0.10

orable IMSE

0.10
IMSE

0.6
0.4

Oracle Percentage

0.8

0.15

n=500 p=50
n=500 p=100
n=1000 p=50
n=1000 p=100

0.1

0.2

0.3
Minor Allele Freq.

0.4

0.5

0.1

0.2

0.3

0.4

0.5

Minor Allele Freq.

Figure 1: Selection and estimation accuracy of mk (·) for discrete G

Table 4 presents the selection and estimation result of the loading parameters β. The
left and right panel represent the case where p = 50 and p = 100, respectively. We observed
that the oracle percentage in all the cases is above 97% and the MSE for the estimation
of β is of the order −3 or lower in the proposed and oracle model. With the increase of
model dimension p (from 50 to 100), we observed a slight increase model MSE. This is
expected since model performance usually decreases as complexity increases. Further, the
case where n = 1000 performed slightly better than the case n = 500, and it is consistent
with the asymptotic theory of the model. Overall, the proposed variable selection approach

28

can correctly select and estimate the loading parameters with high accuracy.
Table 4: Prediction accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0)
2
p = 50

p = 100

Oracle %

MSE
Model
Oracle

Oracle %

MSE
Model
Oracle

β1
β2
β3
β4
β5

100.0%
99.8%
97.4%
99.1%
97.3%

2.86E-04 1.04E-04
2.84E-03 1.04E-04
9.38E-05
0
4.08E-05
0
7.04E-05
0

100.0%
99.7%
97.0%
98.6%
98.2%

6.36E-04 2.13E-04
6.89E-03 6.13E-03
2.15E-04
0
2.21E-04
0
1.68E-04
0

β1
β2
n = 1000 β3
β4
β5

100.0%
100.0%
99.3%
99.3%
99.2%

5.45E-05 5.03E-05
1.92E-03 2.09E-03
5.19E-06
0
4.21E-06
0
7.51E-06
0

100.0%
100.0%
99.0%
98.9%
99.5%

5.20E-05 6.03E-05
5.20E-05 1.74E-03
1.10E-05
0
9.69E-06
0
2.64E-06
0

n = 500

To summarize, the proposed method possesses reasonable selection and estimation
accuracy for non-parametric functions mk (·) and their loading parameters β in all cases.
With the increase of the sample size n, we observed a small increase in oracle percentage for
both non-parametric functions mk (·) and their loading parameters β. We also observed a
decrease in IMSE of mk (·) and MSE of β. These coincide with the asymptotic theory of our
model.

2.5

Real Data Application

We demonstrated the utility of the model with a human liver cohort (HLC) data set. The
data set is consisted of genotype (SNPs), gene expressions, and phenotypes (activity of several
liver enzymes). The data set can be downloaded from www.synapse.org using synapse ID:
syn4499. For more details regarding the data set, please refer to Schadt et al. (2008) and
29

Yang et al. (2010). In the HLC data set, the phenotypes are enzyme activity measurements
of Cytochrom P450. There are a total of nine P450 enzymes (CYP1A2, 2A6, 2B6, 2C8,
2C9, 2C19, 2D6, 2E1, and 3A4). However, only CYP2E1 passed Shapiro-Wilk normality
test (p-value > 0.1) after log transformation. Hence, we focused the analysis on CYP2E1
activity (Y ). For the environmental variable (X), we set X 1 = Age, X 2 = Aldehyde
Oxydase, and X 3 = Liver Triglyceride, then transform X i , i = 1, 2, 3 to range [0,1] with
X −min

(X )

ij
j=1,2,...,n ij
Xij = max
. Where Xij denotes the jth observation of X i ,
(X
j=1,2,...,n ij )−minj=1,2,...,n (Xij )

j = 1, . . . n. In this analysis, we focused on gene expressions and treated them as the G
variable. After data cleaning, we had n = 394 (sample size) and N = 19, 172 (number of
gene expressions).
We

implemented

N-Glycan

the

biosynthesis.

proposed
Based

selection
on

approach
the

to

KEGG

pathway
pathway

hsa00510
database

(http://www.genome.jp/kegg/pathway.html), pathway hsa00510 is mapped to 44 gene
expressions in our data. Due to model constrain of VMICM, the first loading parameter
must be a non-zero positive number (β1 > 0). Hence, we fit the proposed model three times
with age, aldehyde oxydase, and liver triglyceride being the first loading covariate. Gene
expression P GGT 1B was selected as varying coefficient predictor in all three models. Gene
expression B4GALT 3 and B3GN T 3 were selected as constant coefficient predictors in two
models (age and aldehyde oxydase being the first loading covariate).

30

Effect of PGGT1B

0.4
0.2
−0.2

0.0

Log of CYP2E1 Activity

7.4
7.3
7.2
7.1

−0.4

6.9

7.0

Log of CYP2E1 Activity

7.5

0.6

7.6

0.8

Environmental Effect

0.0

0.2

0.4

0.6

0.8

1.0

1.2

0.0

0.2

0.4

Xβ

0.6

0.8

1.0

1.2

Xβ

Figure 2: Plot of the varying coefficient effect for gene expression P GGT 1B.

Table 5: The estimated loading parameters
Aldehyde Oxydase

Age

Liver Triglyceride

0.879

0.477

0

Figure 2 presents the plot of the marginal environmental effect and the effect of gene
P GGT 1B on log of CYP2E1 activity with aldehyde oxydase being the first loading covariate.
With the increase of loading index Xβ, marginal environmental effect (m0 (·)) first increased
from 6.9 to 7.6, then it fluctuated around 7.5 before decreasing sharply to 7.0. For gene
expression P GGT 1B, its effect on the log of CY P 2E1 activity first decreased sharply from
0.8 to -0.2, then it fluctuated around -0.1 for the remainder of the index. This suggests that
the G×E effect of gene expression P GGT 1B mainly exists in the lower quantile of aldehyde
oxydase and age. With aldehyde oxydase being the first loading covariate, the constant
effects of gene expression B4GALT 3 and B3GN T 3 were -0.0618 and -0.0624 respectively.
These suggest that the effect of gene expression B4GALT 3 and B3GN T 3 on the log of
31

CY P 2E1 activity does not interact with environmental factors.
Table 5 presents the selection result of environmental factors with aldehyde oxydase
being the first loading covariate. The model selected aldehyde oxydase and age as significant
environmental factors, their estimation were 0.879 and 0.477 respectively. Base on their
value, aldehyde oxydase have a stronger effect than age.
This result demonstrates the unique ability of the proposed method to capture the
non-linear interaction between a mixture of environmental variants and genetic variables.
However, further biological investigation is needed to confirm this finding.

2.6

Discussion

VMICM is a promising candidate to model non-linear interaction between multiple genes
and multiple environments as a whole. It combines multiple exposure variables X into a
single index Xβ, hence can reduce model dimension and alleviate the curse of dimensionality.
In this paper, we developed a three stage variable selection approach for VMICM model.
Our goal was to identify varying, constant and zero effect that interacted with a gene. In
the meantime, we also selected important exposure variables. Rather than modeling the
G×E effect for each X variable separately, our approach can model the joint effect of the
environmental factors (X) as a whole, then identify how different genes (G) interacted with
the environmental mixture to affect the phenotype Y . Biologically speaking, our approach is
attractive since it offers an alternative strategy to look for G×E interaction, and our model
is flexible to detect any non-linear interactions. In addition, we theoretically evaluated
the selection consistency of the variable selection. Both simulation and real data analysis
demonstrated the utility of the proposed method.

32

In our model setup, the covariates X were assumed to be continuous. This is due to
the fact that the index Xβ has to be continuous in order to model the nonlinear function.
In real applications, environmental variables can be discrete such as smoking status, gender
and ethnicity. To accommodate the presence of discrete factors, the model VMICM could
be generalized to partial-linear VMICM: Y = Zα +

p
k=0 mk (Xβ)Gk

+

where Z is

the discrete covariates of dimension r and α is its effect on the response. Our variable
selection approach can be modified slightly to perform selection of non-parametric function
and the parametric component simultaneously. More specifically, the objected function (2.5)
is modified as

p

n
i=1

p
2
¯
[γk1 + B(Xβ)γ
k∗ ]Gik } + n

{Yi − Z i α −

Q(β, γ) =

k=0

k=1
q

p

r
j=1

pλ3 (|βd |).

pλ2 (|γk1 |)I(||γ k∗ ||1 = 0) + n

pλz (|αj |) + n

+n

pλ1 (||γ k∗ ||1 )

d=2

k=1

And the objective function in step 1 is modified as

ˆ (0) ) =
Q1 (γ|λz , λ1 , β

p

n

{Yi − Z i α −
i=1

ˆ (0) )γ ]Gik }2
¯
[γk1 + B(X
β
k∗

k=0
p

r

pλ1 (||γ k∗ ||2 ) + n

+n

pλz (|αj |).
j=1

k=1

In this paper, we discussed the variable selection approach for VMICM with a continuous
response variable. However, many response variables are measured on a binary scale such
as the presence of a certain disease in humans. It is natural to extend the current selection
approach to a generalized VIMCM framework, which will be investigated in chapter 3.

33

In our model formulation, we assumed different index coefficients share the common
loading parameters, i.e., β 0 = β 1 = · · · = β p = β. From a practical point of view, allowing
different loading parameters makes perfect sense. However, such a model imposes theoretical
challenges when evaluating the theoretical properties such as the selection consistency. This
is because that the loading coefficients for the kth index coefficients are not identifiable
when mk (u) ∈
/ V . When a coefficient function does not vary, β k does not exists. Thus, the
selection consistency for β k does not exists. For this reason, we imposed the same loading
parameters for all the index coefficient functions.
In addition to the application to G×E study, our model has many applications in other
fields where a potential nonlinear varying effect exists. Our method enriches the catalog of
variable selection. It contributes to the methodology development of variable selection in
theory and to the application of G×E study in practice.

34

Chapter 3
Variable Selection for Generalized
VMICM

3.1

Introduction

There has been a growing interest in identifying gene-environment (G×E) interaction in
scientific literatures. Ottman(1996) defined gene-environment interaction as “a different
effect of a genotype on disease risk in persons with different environmental exposures”.
Traditionally, G×E interactions were investigated based on a single environmental
factor, since introduction of several environmental factors will increase model dimension
exponentially, which could potentially lead to biased estimation and large standard error
(curse of dimensionality). However, more and more epidemiological studies revealed that
disease risk can be modified by simultaneously exposure to several environmental factors
(Carpenter et al. (2002), Sexton and Hattis (2007)). Further, little was known about how
multiple environmental factors as a whole could interact with genetic factors to affect the
response variable. Any investigation in this area could shed some light into the disease
etiology and offer prospects for future disease prevention.
In chapter 2, we proposed to use varying multi-index coefficient model of the form Y =
p
k=0 mk (Xβ)Gk

+

to model the non-linear gene-environmental interaction. However,

35

VMICM can only model continuous phenotype. In this chapter, we generalized VMICM to
model data with binary response variables. Consider the generalized varying multi-index
coefficient model (gVMICM)

P (Y = 1|X, G)
=
log
P (Y = 0|X, G)

p

mk (Xβ)Gk

(3.1)

k=0

where mk (·), k = 0, 1, . . . , p represents the gene effect of Gk , and is modeled as a smoothed
non-linear non-parametric function modulated by the loading index Xβ.

Its unique

structure allows us to capture the interaction effect between genetic factors and a mixture of
environmental factors. Further, model (3.1) is very flexible to cover a wide range of models.
For instance, if q = 1 and β = 1 then it reduces to a generalized varying coefficient mode; if
p = 1 and G = 1 then it becomes a standard generalized single index model.
Variable selection has been a popular topic in modern statistics literature. In the past,
people often implemented hypothesis testing, forward/backward selection combined with
AIC or BIC. Nevertheless, with the rise of big data, we were able to collect hundreds
of thousands of variables at the same time. Traditional methods are no longer feasible
because of exponentially increasing computation time and unstable estimation due to
increasing collinearity. Recently, variable selection via penalized regression has become fairly
popular. The idea is to add a penalty term to the optimization function. With different
penalty functions, the penalized estimator could possess different properties. Fan and Li
(2001) proposed three important criteria for penalized estimator: sparsity, unbiasedness
and continuity. They also characterized oracle property meaning the model performs as
well as if the true sub-model is known in advance. It has become the standard for new
penalized estimator. A few examples of the penalized function would be Bridge regression

36

(Frank and Friedman (1993)), least absolute shrinkage and selection operator (LASSO)
(Tibshirani (1996)), its extension adaptive LASSO (Zou (2006)), smoothly clipped absolute
deviation (SCAD) (Fan and Li (2001)), and minimax concave penalty (MCP) (Zhang (2010)).
Although LASSO enjoys simple formulation and efficient algorithm (LARS), it does not
possess oracle property. On the other side, Adaptive LASSO, SCAD, MCP all posses oracle
property and we decided to implement MCP in our model.
Due to the complexity of model (3.1), variable selection presents unique challenge,
specifically, the nonlinear and non-parametric structure of function mk (·) and its unknown
loading parameter β. In chapter 2, we proposed a three stage variable selection approach for
VMICM. The model classified the non-parametric gene effect into varying, constant and zero.
It also selected non-zero loading parameters. In this chapter, we extended such approach to
gVMICM. In stead of penalizing squared error loss for VMICM, we decided to implement
the penalized log-likelihood method in this model.
The rest of this chapter is organized as follows: section 3.2 introduced the proposed
variable selection method, formulation of the penalized log-likelihood, three step iterative
optimization approach, and selection of tuning parameters. In section 3.3, we discussed the
asymptotic properties of the proposed method. And we evaluated the performance of our
method via several simulations in section 3.4. Utility of our model was demonstrated with
a Type II diabetes data set in section 3.5, followed by discussion and future work.

37

3.2
3.2.1

Variable Selection for gVMICM
Model Setup

Consider the following gVMICM model

P (Y = 1|X, G)
)=
log(
P (Y = 0|X, G)

p

mk (Xβ)Gk
k=0

where Y n×1 = (Y1 , Y2 , . . . , Yn )T is a binary response variable measured over n subjects;
mk (·), k = 0, 1, . . . , p are p+1 unknown smooth non-linear functions; X = (X 1 , . . . , X q ) is a
matrix of dimension n×q representing continuous environmental variables; β = (β1 , . . . , βq )T
is the loading parameter of dimension q; Gn×(p+1) = (G0 , G1 , . . . , Gp ), G0 = (1, . . . , 1)T
and Gk = (G1k , G2k , . . . , Gnk )T is the k − th genetic factor. For k = 0, mk (Xβ) is the
effect of Gk on the response on a log odda scale; m0 (Xβ) is the intercept term measuring
the marginal environmental effect.
For ease of notation, we denoted µ =

p
k=0 mk (Xβ)Gk .

Thus, model (3.1) can be

rewritten as
log(

3.2.2

P (Y = 1|X, G)
) = µ.
P (Y = 0|X, G)

(3.2)

Estimation Method

Our goal was to select and estimate unknown functions {mk (·)}k=0,1,...,p and their unknown
loading parameter β = (β1 , . . . , βq )T . For the sake of identifiability, we assumed ||β||2 = 1
and β1 > 0, and mk (·) cannot has the form of mj (u) = αT uβ T u + γ T u + c.
We first approximated the non-parametric function mk (u) with B-spline basis functions.
Without loss of generality, we assumed u ∈ [0, 1], and denoted K to be the number of internal
38

knots and h to be the degree of B-spline basis function. So h = 1 represents linear splines;
h = 2 represents quadratic splines; and h = 3 represents cubic splines. From standard
B-spline theory, we denoted u1 , u2 , ..., uK to be the interior knots satisfying 0 = u0 =< u1 <
u2 < .... < uK < uK+1 = 1. Let Int be left closed, right opened interval [ut−1 , ut ) for
1 ≤ t ≤ K, and InK+1 be closed interval [uK , uK+1 ]. Let F to be a collection of functions
f defined on [0, 1] satisfying: (i) the restriction of f to Int is a polynomial of degree h
or less for 1 ≤ j ≤ K + 1; (ii) f is h − 1 times continuously differentiable on [0, 1]. Let
L = K + h + 1, then by Schumaker (1981)(Schumaker (2007)), we normalized B-spline basis
function B(u) = (B 1 (u), B 2 (u), . . . , B L (u)) for F . And there exists a linear transformation
¯
matrix Π, such that ΠB(u) = (1, B(u))
= (1, B 2 (u), B 3 (u), . . . , B L (u)) = B(u) where
¯
each component of B(u)
is a function of u. Hence, for 0 ≤ k ≤ p, we approximated mk (u)
by

¯
mk (u) ≈ (1, B2 (u), · · · , BL (u)) ∗ (γk1 , γk2 , · · · , γkL )T = B(u)γ k = γk1 + B(u)γ
k∗ (3.3)

where γ k∗ = (γk2 , γk3 , . . . , γkL )T and γ k = (γk1 , γ Tk∗ )T . We approximated µ with µB :
p

µB

=

¯
[γk1 + B(Xβ)γ
k∗ ]Gk

(3.4)

k=0

where γk1 , k ≥ 1 represents the main genetic effect for the kth variant and γ k∗ , k ≥ 1
represents the G×E interaction effect between the kth variant and a mixture of the
environmental variables.
With this new representation, the selection of the non-parametric function mk (·) was
transformed to the selection of its B-splines coefficients γ = {γk1 , γ k∗ }k=0,1,...,p . The
transformation Π allows us to separate constant effect of Gk from its varying effect. More
39

specifically, (1) if ||γ k∗ ||2 = 0, then there exists G×E effect; (2) if ||γ k∗ ||2 = 0 and |γk1 | = 0,
then there only exists main genetic effect; and (3) if ||γ k∗ ||2 = 0 and |γk1 | = 0, then Gk has
no effect at all. This transformation is the key step to separate different genetic effects.
Given the binary response, the log-likelihood function is defined as:
n

B

µ
(Yi µB
i − log(1 + e i ))

l(γ, β) =
i=1

B
where µB
i is the ith subject of µ . We defined the following penalized log-likelihood objective

function,
n

(Yi µB
i

M (β, γ) =

B
− log(1 + eµi )) − n

i=1

p

pλ1 (||γ k∗ ||2 )
k=1

p

−n

(3.5)

q

pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) − n
k=1

pλ3 (|βd |)
d=2

where pλ1 (·), pλ2 (·), pλ3 (·) are the penalty functions for γ k∗ , γk1 and β, respectively; I(·) is
an indicator function. The construction of the penalty functions in (3.5) implies that there
is a natural order when selecting the effect of Gk . First, the model selects Gk with varying
effect. If Gk does not have a varying effect, the model penalizes the constant effects of Gk .
We did not penalize β1 due to the model constraint. For the penalty function, we adopted the
MCP penalty function proposed by Zhang (Zhang (2010)), i.e., p(x, λ) = λ 0x (1 − τsλ )+ ds
with regularization parameters τ > 0 and λ > 0.

3.2.3

Computational algorithm

To optimize function (3.5), we followed the idea proposed in chapter 2 and adopted a 3 step
interactive approach.
40

ˆ (0) , we obtained the 1st step estimation
Step 1 : Given an initial value for β, denoted as β
(1)

(1)T

ˆ (1) = {ˆ
ˆ k∗ }Tk=0,1,...,p by following a group penalized regression, i.e.,
of γ, denoted as γ
γk1 , γ
ˆ (0) )
ˆ (1) = arg max M1 (γ|λ1 , β
γ
γ

where
ˆ (0) ) =
M1 (γ|λ1 , β

n

(Yi µB
i

p

B
− log(1 + eµi )) − n

i=1

pλ1 (||γ k∗ ||2 ).
k=1

Step 1 classified mk (·), k = 1, . . . , p into two categories: varying(V) or non-varying(NV).
(1)

(1)

That is, mk (·) ∈ V if ||ˆ
γ k∗ ||2 > 0 and mk (·) ∈ N V if ||ˆ
γ k∗ ||2 = 0.
(1)

ˆ k∗ = 0. We further selected
Step 2 : In this step, our interest was to select γk1 given γ
constant coefficient from the non-varying coefficient obtained in step 1. We penalized γk1
(1)

only when ||ˆ
γ k∗ ||2 = 0, k = 1, 2, . . . , p, and no penalty was applied to γ01 . We excluded
(1)

(2)

ˆ k∗ = 0. We obtained our step 2 estimator
γ k∗ from the model if ||ˆ
γ k∗ ||2 = 0 in step 1, i.e. γ
(2)

(2)

(2)

ˆ (2) = {(ˆ
ˆ k∗ )k∈V , (ˆ
γ
γk1 , γ
γk1 )k∈N V } via penalized regression

ˆ (2) = arg max M2 (γ|λ2 , β (0) , γ
ˆ (1) )
γ
γ

where
n

M2 (γ|λ2

ˆ (1) )
, β (0) , γ

=

(2)
(Yi µB
i

B (2)
− log(1 + eµi
)) − n

i=1

µB
i

(2)

is the ith element of µB

p
(2)

(1)

pλ2 (|γk1 |)I(||ˆ
γ k∗ ||2 = 0).
k=1

(2)

with µB

(2)

=

(2)
k∈V [γk1

(0) )γ (2) ]G +
¯
+ B(Xβ
k
k∗

(2)

k∈N V

γk1 Gk .

ˆ (0)
After step 2, we obtained our estimators of the B-splines coefficients γ for given β

41

and classified mk (·) k = 1, . . . , p into varying, constant or zero effects. The next step is to
ˆ (2) .
update and select the loading parameter β given γ
ˆ via penalized regression
Step 3 : We obtained β

ˆ = arg max M3 (β|λ3 , γ
ˆ (2) )
β
||β ||2 =1
where
n

M3 (β|λ3

ˆ (2) )
,γ

=

(3)
(Yi µB
i

B (3)
− log(1 + eµi
)) − n

i=1

µB
i

(3)

is the ith element of µB

(3)

q

pλ3 (|βd |).
d=2

with µB

(3)

=

(2)
p
γk1
k=0 [ˆ

(2)
¯
+ B(Xβ)ˆ
γ k∗ ]Gk .

ˆ we set β
ˆ (0) = β,
ˆ then iterate step 1 through 3 until
Once we have the updated β,
convergence.
Remark: For step 1 and 2, we implemented the block coordinate descent algorithm
for group penalty. For step 3, we developed our algorithm based on the idea of the local
quadratic approximation (LQA) proposed by Fan and Li(2001). For more detail, please refer
to the Appendices. Next we discuss details about selecting the tuning parameters λ1 , λ2 , λ3 ,
order h and the number of interior knots K for the B-spline approximation, as well as a
reasonable initial value for β.

3.2.4

Selection of Parameters

3.2.4.1

Selection of tuning parameters λ1 , λ2 , λ3

We proposed to use Bayesian Information Criterion (BIC) (Schwarz (1978)) to select the
tuning parameters.

42

Step 1: We selected λ1 as the minimizer of

(1)
1

ˆ
BIC(λ1 ) = −2l(ˆ
γλ , β

(0)

) + log(n) ∗ dfλ1

(1)
ˆ (0) ) defined above; β
ˆ (0) is chosen as the estimator
ˆ λ is the minimizer of M1 (γ|λ1 , β
where γ
1

from the previous iteration; and dfλ1 is defined as the total number of non-zero coefficients
if λ1 is the penalized parameter.
Step 2: We selected λ2 as the minimizer of

(2)
2

ˆ
BIC(λ2 ) = −2l(ˆ
γλ , β

(0)

) + log(n) ∗ dfλ2

(2)
ˆ (0) is chosen as the
ˆ λ is the minimizer of M2 (γ|λ2 , β (0) , γ
ˆ (1) ) defined above; β
where γ
2

estimator from the previous iteration; and dfλ2 is defined as the total number of non-zero
coefficients if λ2 is the penalized parameter.
Step 3: We selected λ3 as the minimizer of

ˆ ) + log(n) ∗ dfλ
BIC(λ3 ) = −2l(ˆ
γ (2) , β
λ3
3

ˆ λ is the minimizer of M3 (β|λ3 , γ
ˆ (2) ) defined above; γ
ˆ (2) is the minimizer of the
where β
3
B-spline coefficient from step 2; and dfλ3 is defined as the total number of non-zero β if λ3
is the penalized parameter.
The penalized parameter λ1 , λ2 , λ3 were chosen over a grid of exponentially decreasing
values with the minimum to be 1E-3. The maximum of λ1 , λ2 , λ3 were set to be the minimum
values such that all of the penalized estimators are zeros. The number of grid to be searched
was set as 100.
43

3.2.4.2

Selection of order h and number of interior knots K

In theory, it might be appealing to allow different K and h for different non-parametric
functions mk (·). In practice, it would be computationally infeasible. For computational
purpose, we let all mk (·) share the same K and h. Since h is the order of the B-spline
basis function, higher degree implies more complicated interactions between environmental
factors and genetic predictors, thus, more difficult to interpret. For this reason, we searched
1

optimal h over the set h ∈ {2, 3, 4}. For the interior knots K, only when K = Op (n 2r+1 ) (n
is the sample size and r is the smoothness of the nonparametric function mk (·) and r > 2),
our selection approach possesses oracle properties. Thus, we searched optimal K in the
1

neighborhood of n 2r+1 , denoted by K . In our simulation, we chosed K = {2, 3, 4, 5}.
The knots K and order h were then selected by fitting the following intercept only model
with the B-spline approximation and Newton-Raphson algorithm.

log(

P (Y = 1|X, G)
) = m0 (Xβ).
P (Y = 0|X, G)

(3.6)

ˆ and
ˆ 0∗ ) and the loading parameters as β,
Denote the estimated spline coefficients as (ˆ
γ01 , γ
ˆ = γˆ01 + B(X
ˆ γ 0∗ . We searched the optimal K and h over a grid K ∈ K
¯ β)ˆ
let m
ˆ 0 (X β)
and h ∈ {2, 3, 4}. Optimal K and h are defined as the K and h that minimized log Y ∗
ˆ)
log(n)
ˆ 0 (X β
ˆ − log(1 + em
m
ˆ 0 (X β)
) + n (K + h + 1).

3.2.5

Choosing the initial values

To start the iterative optimization process described above, a reasonably good initial value of
β, β initial is essential. In many literature, β initial is set to be (1, 0, . . . , 0)T or ( √1q , . . . , √1q )T .
However, neither works well in our simulations. Hence, we set the initial value of β as the
44

byproduct of selecting the optimal K and h by fitting the intercept only model in (3.6).

3.3

Theoretical Properties

Let β 0 and m0k (·), k = 0, 1, . . . , p be the true value of β and mk (·), respectively. Denote γ 0
to be the true value of the B-spline coefficients γ. Without loss of generality, we assumed
βl0 = 0 for l = 1, . . . s, βl0 = 0 for l = s + 1, . . . q; m0k (·) has varying coefficients for
k = 0, 1, . . . , v, m0k (·) has non-zero constant coefficients for k = v + 1, . . . , c and m0k (·) is
zero for k = c + 1, . . . , p. The following theorems give the consistency of the penalized
log-likelihood estimators.
Theorem 3.3.1. Assume the regulatory conditions (A1)-(A7) in Appendices hold and the
number of knots K = Op (n1/(2r+1) ). Then
ˆ − β 0 || = Op (n−r/(2r+1) + an );
(i) ||β
(ii) ||m
ˆ k (·) − m0k (·)|| = Op (n−r/(2r+1) + an ), k = 1, . . . , q
where

an = max{pλ (||γ 0k∗ ||2 ), pλ (|γ 0j1 |), pλ (|βl0 |), γ 0k∗ = 0, γj1 = 0, βl0 = 0}.
1
2
3
k,j,l

k, j = 1, . . . , p, l = 2, . . . , q. r is defined in condition (A2) in Appendices. pλ (·) denotes the
first order derivative of the penalty function pλ (·).
Theorem 3.3.2. Assume the regularity conditions(A1)-(A7) in Appendices hold and the
number of knots K = Op (n1/(2r+1) ). Let

λmax = max{λ1 , λ2 , λ3 }, λmin = min{λ1 , λ2 , λ3 }.
45

Suppose λmax → 0 and nr/(2r+1) λmin → ∞ when n → ∞.Then with probability approaching
ˆ and m
1, β
ˆ k (·) must satisfy:
(i) βˆj = 0 for j = s + 1, . . . , q
(ii) m
ˆ k (·) = ck for k = v + 1, . . . , c where ck is some non-zero constant
(iii) m
ˆ k (·) = 0 for k = c + 1, . . . , p
Theorem 3.3.1 and 3.3.2 show that our penalized likelihood estimators are consistent and
possess oracle properties.

3.4

Simulation

We evaluated the finite sample performance of our model via simulations. Its performance
was evaluated in several ways: (1) classification accuracy of the m(·) denoted as oracle
percentage; (2) IMSE of the estimated m-functions; (3) selection accuracy of β; and (4)
estimation accuracy of β (MSE). For all cases, we ran a total of R simulations.
The oracle percentage of m(·) is defined as the percentage of correct classification for
varying, constant and zero effects. For instance, if mk (·) is a varying function and mk (·) is
g
classified as varying for g times, then the oracle percentage of mk (·) is calculated as R
×100%.

IMSE of mk (·) is defined as

1
R

R

[
r=1

1
ngrid

ngrid
(r)
(r)
¯ j )ˆ
(ˆ
γk1 + B(u
γ k∗ − mk (uj ))2 ]
j=1

(r)

(r)

ˆ k∗ and γˆk1 are the estimators of the B-spline
where ngrid is the number of grid points; γ
ˆ (r) is the
coefficients for the rth simulation using the proposed estimation approach; β
estimator of the loading parameter β for the rth simulation; and uj is taken at the
46

ˆ
j/ngrid × 100% quantile among the range of X β

(r)

. In the simulations, ngrid was set

to be 100.
The oracle percentage of β is defined as the percentage of correct selection of β out of R
simulations. For example, if βd = 0, and βd is selected as non-zero for g times, then the oracle
g
1
× 100%. MSE of βd is calculated as R
percentage of βd is calculated as R

R
2
ˆ(r)
r=1 (βd − βd ) ,

(r)
where βˆd is the estimator for βd in the rth simulation.

The simulation data were generated according to model (3.1). The index matrix X was
generated from a U nif (0, 1) distribution. For the loading parameter β, β1 = β2 = √1 and
2
the rest βj s are zeros. We evaluated the performance of the proposed approach with both
continuous and discrete predictors G. The continuous G can be gene expressions and the
discrete G can be SNP variants.

3.4.1

For continuous G

In the continuous case, the non-parametric functions mk (u) are defined as follows: m0 (u) =
2sin(2πu), m1 (u) = 2cos(πu) + 2 and m2 (u) = sin(2πu) + cos(πu) + 1 are varying coefficient
functions. m3 (u) = 2 and m4 (u) = 2.5 are non-zero constant coefficients. mk (u) = 0 for
k = 5, . . . , p are zeros. G ∼ N (0, 1) . We conducted 1000 simulations (R = 1000) to evaluate
the performance of the proposed model under p = 50, 100, q = 5 and n = 1000, 2000.
Table 6 demonstrates the selection and estimation accuracy for mk (·) with continuous
predictors. The left and the right penal corresponds to the case where p = 50 and 100,
respectively. The upper and lower panel corresponds to the case where n = 1000 and 2000,
respectively. In all cases, the selection accuracy was closed to 100% for varying, constant
and zero effect coefficients. The IMSE of the proposed model was in the order of −1 or −2

47

for varying and constant effect predictors. With the increase of model dimension p (from 50
to 100), we observed a small increase in model IMSE. With the increase of the sample size
n (from 1000 to 2000), there were decreases in both model IMSE and oracle IMSE, which
is consistent with the asymptotic property of the proposed model. These suggest that the
proposed variable selection approach performs reasonably well in selection and estimation
accuracy for the non-parametric functions mk (·).
Table 6: Selection and prediction accuracy of mk (·) for continuous G
p = 50
Oracle %

p = 100

IMSE
Model
Oracle

Oracle %

IMSE
Model
Oracle

n = 1000

m0 (.)
m1 (.)
m2 (.)
m3 (.)
m4 (.)
Zero

100.0%
99.2%
99.4%
100.0%
99.9%
99.1%

1.50E-01
1.44E-01
1.34E-01
3.95E-02
5.58E-02
1.03E-03

1.47E-01
2.14E-01
1.61E-01
3.85E-02
5.53E-02
0

100.0%
99.7%
99.7%
100.0%
100.0%
99.0%

1.48E-01
1.48E-01
1.37E-01
4.19E-02
5.93E-02
1.16E-03

1.31E-01
1.66E-01
1.43E-01
3.33E-02
4.86E-02
0

n = 2000

m0 (.)
m1 (.)
m2 (.)
m3 (.)
m4 (.)
Zero

100.0%
100.0%
100.0%
100.0%
100.0%
99.4%

6.85E-02
5.17E-02
5.46E-02
1.40E-02
1.79E-02
3.33E-04

6.88E-02
6.00E-02
5.99E-02
1.46E-02
1.93E-02
0

100.0%
100.0%
100.0%
100.0%
100.0%
99.4%

7.00E-02
5.43E-02
5.40E-02
1.46E-02
1.89E-02
3.14E-04

7.05E-02
6.34E-02
5.88E-02
1.48E-02
1.87E-02
0

Table 7 presents the selection and prediction accuracy for loading parameter β. The left
and right penal corresponds to the case where p = 50 and 100, respectively. The upper and
lower panel corresponds to the case where n = 1000 and 2000, respectively. For all cases, the
selection accuracy for non-zero loading parameters (β1 , β2 ) was close to 100%. Their MSE
were in order of −2 to −4. For zero loading parameters (β3 , β4 , β5 ), the selection accuracy
for the case n = 1000 was around 97%. When the sample size increases to n = 2000, their
oracle percentages increased to 99%. Their MSE were in the order of −4 to −5. These
48

suggest that the algorithm could not shrink estimators to 0 in around 4% of the cases when
n = 1000, which is a common drawback of the LQA algorithm.
Table 7: Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0)
2

p = 50

3.4.2

p = 100

MSE
Oracle

Oracle %

Model

n = 1000

β1
β2
β3
β4
β5

100.0%
100.0%
97.3%
97.5%
96.8%

7.26E-04
1.36E-02
2.34E-04
1.90E-04
9.01E-04

n = 2000

β1
β2
β3
β4
β5

100.0%
100.0%
99.1%
99.6%
99.3%

2.76E-04
2.71E-04
5.49E-05
2.68E-05
4.58E-05

MSE
Oracle

Oracle %

Model

5.91E-04
1.11E-02
0
0
0

100.0%
100.0%
96.7%
96.7%
96.6%

6.75E-04
3.28E-03
6.22E-04
2.72E-04
2.95E-04

5.75E-04
5.78E-04
0
0
0

2.68E-04
2.66E-04
0
0
0

100.0%
100.0%
99.6%
99.4%
99.4%

2.77E-04
2.76E-04
1.57E-05
3.16E-05
2.22E-05

2.75E-04
2.74E-04
0
0
0

For discrete G

For the discrete case, each G was simulated from a multinomial distribution assuming a
minor allele frequency (MAF) Pa . G took values as 0, 1, 2 corresponds to aa, Aa, and
AA genotype, where a is the minor allele. G was simulated via the following probability
distribution function:

P (Gij = 0) = Pa2 , P (Gij = 1) = 2 ∗ Pa (1 − Pa ), P (Gij = 2) = (1 − Pa )2

where Gij is the j-th variable of the i-th subject, for i = 1, . . . , n, j = 1, . . . , p. The following
lists the choice of the coefficient functions mk (u).

49

Table 8: Setup for mk (u)
Function
m0 (u) = 2sin(2πu)
m1 (u) = 2cos(πu) + 2
m2 (u) = sin(2πu) + cos(πu) + 1
m3 (u) = 2cos(πu) + 2
m4 (u) = sin(2πu) + cos(πu) + 1
m5 (u) = 2cos(πu) + 2
m6 (u) = sin(2πu) + cos(πu) + 1
m7 (u) = 2
m8 (u) = 2
m9 (u) = 2
mk (u) = 0, k > 9

Pa of Gk
NA
0.5
0.5
0.3
0.3
0.1
0.1
0.5
0.3
0.1
U nif (0.05, 0.5)

In this setup, there were varying and constant coefficient functions corresponding to
different MAF. The purpose of setting varying MAFs is to check the selection and estimation
performance under different MAFs in G. For zero coefficients, Pa ranged uniformly from
0.05 to 0.5. X was simulated from a U nif (0, 1) distribution. Y was generated according
to model (3.1). We evaluated the performance of our proposed model with 1000 simulations
under p = 50, 100, n = 1000, 2000 and q = 5.
Table 9 and presents the selection and estimation accuracy of non-parametric function
mk (·) for discrete G. We observed sample size n and minor allele frequency (Pa ) of Gk were
the determining factors in the performance of the proposed model and we present figure 3
to better visualize their impact. With the increase of sample size (from 1000 to 2000), the
performance of the model increased. For example, the oracle percentage for m1 (·), . . . , m4 (·)
increased from around 80% to 100%. The corresponding IMSE decreased significantly. These
are consistent with the asymptotic theory of the proposed model. With the decrease of
minor allele frequency for Gk (from 0.5 to 0.1), we observed a decrease in performance,
both in terms of oracle percentage and model IMSE. For instance, in the case where n =
50

1000, the oracle percentages of {m1 (·), m2 (·)}(Pa = 0.5), {m3 (·), m4 (·)}(Pa = 0.3), and
{m5 (·), m6 (·)}(Pa = 0.1) were around 85%, 80%, and 23% respectively. Their IMSE also
increased from 0.4 (Pa = 0.5) to 0.5 (Pa = 0.3), then to 1.3 (Pa = 0.1). We believed this
is due to the fact that SNP with lower minor allele frequency provides less information.
Overall, the proposed variable selection approach works better with larger sample and with
common variant SNPs.
Table 9: Selection and estimation accuracy of mk (·) for discrete G

Oracle %

p = 50
IMSE
Model
Oracle

Oracle %

p = 100
IMSE
Model
Oracle

n = 1000

m0 (.)
m1 (.)
m2 (.)
m3 (.)
m4 (.)
m5 (.)
m6 (.)
m7 (.)
m8 (.)
m9 (.)
Zero

100.0%
83.4%
87.7%
76.0%
83.3%
23.2%
25.7%
100.0%
99.9%
100.0%
99.0%

1.63E-01
4.29E-01
3.82E-01
5.43E-01
4.37E-01
1.35E+00
1.27E+00
5.09E-02
6.08E-02
1.05E-01
2.74E-03

2.28E-01
5.61E-01
4.38E-01
5.87E-01
4.19E-01
1.05E+00
8.93E-01
6.22E-02
7.50E-02
1.24E-01
0

100.0%
83.1%
87.9%
75.5%
79.6%
22.1%
23.7%
99.9%
99.9%
99.9%
99.2%

1.75E-01
4.44E-01
3.69E-01
5.50E-01
5.11E-01
1.45E+00
1.31E+00
5.05E-02
5.83E-02
1.13E-01
2.15E-03

2.29E-01
6.08E-01
4.50E-01
6.17E-01
4.50E-01
1.18E+00
9.22E-01
5.86E-02
6.75E-02
1.21E-01
0

n = 2000

m0 (.)
m1 (.)
m2 (.)
m3 (.)
m4 (.)
m5 (.)
m6 (.)
m7 (.)
m8 (.)
m9 (.)
Zero

100.0%
100.0%
100.0%
100.0%
99.9%
77.5%
77.5%
100.0%
100.0%
100.0%
99.3%

7.47E-02
9.46E-02
9.52E-02
1.07E-01
1.05E-01
4.84E-01
4.93E-01
1.89E-02
2.17E-02
4.47E-02
9.15E-04

8.03E-02
1.38E-01
1.21E-01
1.55E-01
1.37E-01
2.99E-01
2.63E-01
2.11E-02
2.39E-02
4.95E-02
0

100.0%
99.9%
99.9%
99.8%
99.8%
75.1%
73.5%
100.0%
100.0%
100.0%
99.5%

7.42E-02
1.02E-01
9.47E-02
1.08E-01
1.05E-01
5.15E-01
5.32E-01
1.75E-02
2.08E-02
4.24E-02
6.81E-04

8.35E-02
1.49E-01
1.21E-01
1.50E-01
1.30E-01
3.08E-01
2.53E-01
1.97E-02
2.33E-02
4.56E-02
0

51

n=1000 p=50
n=1000 p=100
n=2000 p=50
n=2000 p=100

0.3

0.4

0.5

1.0
0.8
0.6

orable IMSE

0.4
0.2

0.3

0.4

Oracle percentage for constant effect Gk

Model IMSE for constant effect Gk
0.12

1.0

Minor Allele Freq.

0.5

0.1

0.2

0.4

0.5

0.5

0.10

n=1000 p=50
n=1000 p=100
n=2000 p=50
n=2000 p=100

orable IMSE

0.04
0.02
0.00

0.00
0.3
Minor Allele Freq.

0.4

0.08

0.10
IMSE

0.06
0.04
0.02

0.1

0.3

Orable IMSE for constant effect Gk

n=1000 p=50
n=1000 p=100
n=2000 p=50
n=2000 p=100

0.08

0.8
0.6
0.0

n=1000 p=50
n=1000 p=100
n=2000 p=50
n=2000 p=100

0.2

Minor Allele Freq.

0.4
0.2

Oracle Percentage

0.2
0.1

Minor Allele Freq.

0.12

0.2

0.06

0.1

0.0

0.0

n=1000 p=50
n=1000 p=100
n=2000 p=50
n=2000 p=100

0.0

0.2

0.2

0.4

0.6

IMSE

0.8

0.6
0.4

Oracle Percentage

1.0

0.8

1.2

n=1000 p=50
n=1000 p=100
n=2000 p=50
n=2000 p=100

1.2

1.0

Orable IMSE for varying effect Gk
1.4

Model IMSE for varying effect Gk
1.4

Oracle percentage for varying effect Gk

0.1

0.2

0.3
Minor Allele Freq.

0.4

0.5

0.1

0.2

0.3

0.4

0.5

Minor Allele Freq.

Figure 3: Selection and estimation accuracy of mk (·) for discrete G

Table 10 demonstrates the selection and estimation result of the loading parameter β.
The left and right panel correspond to the case where p = 50 and p = 100 case respectively
and the upper and lower panel correspond to the case where n = 1000 and n = 2000
respectively. We observed sample size n was the determining factor in model performance.
When sample size is large (n = 2000), the oracle percentage for non-zero loading covariates
(β1 , β2 ) was 100% and the oracle percentage for zero loading covariates (β3 , β4 , β5 ) was
around 99%. The MSE for β was in the order of −3 to −5. When sample size is relatively
small (n = 1000), although the oracle percentage was 100% for non-zero loading covariates,

52

the oracle percentage for zero loading parameters decreased to around 95%. Comparing
between the case p = 50 and p = 100, we detected a small deterioration in selection accuracy
for zero loading parameters with n = 1000. This is expected since model performance usually
decreases with the increase of complexity. However, we did not detect such difference in
performance with larger sample size (n = 2000).
Table 10: Estimation accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0)
2

Oracle %

p = 50
MSE
Model
Oracle

Oracle %

p = 100
MSE
Model
Oracle

n = 1000

β1
β2
β3
β4
β5

100.0%
100.0%
95.2%
97.0%
96.1%

6.64E-04
7.37E-03
3.64E-04
1.21E-04
2.04E-04

7.28E-04
7.32E-03
0
0
0

100.0%
100.0%
95.2%
96.1%
94.7%

5.84E-04
2.76E-03
3.73E-04
4.18E-04
5.46E-04

5.13E-04
3.27E-03
0
0
0

n = 2000

β1
β2
β3
β4
β5

100.0%
100.0%
98.9%
98.7%
98.9%

2.34E-04
2.31E-04
5.00E-05
5.44E-05
4.37E-05

2.22E-04
2.20E-04
0
0
0

100.0%
100.0%
98.9%
98.9%
99.0%

2.20E-04
2.30E-03
3.24E-05
4.49E-05
5.07E-05

2.12E-04
2.37E-03
0
0
0

Based on the simulation results with both continuous and discrete G variables, we inferred
the following characteristics of the proposed model. (1) The proposed model performs
reasonably well with large sample (n = 1000 or 2000). (2) The false positive rate for loading
parameter β was around 5% when n = 1000. This is due to the implementation of LQA
algorithm since it cannot shrink zero parameters to zero in some cases. (3) Compared to rare
variant SNP (Pa = 0.1), the model performs better with common variant SNP (Pa = 0.3 or
0.5). We believe this is due to the fact that SNP with lower minor allele frequency provides
less information.

53

3.5

Real Data Application

We demonstrated the utility of our model with a type 2 diabetes data set. The data set
contains genotypes (SNPs), environments and phenotypic trait of interest for type 2 diabetes.
The data set is consisted of two nested case-control cohort studies: the Nurses Health Study
(NHS) and the Health Professional Fellow-up Study (HPFS) from the Gene Environmental
Association Studies Consortium (GENVEA). Details of these two cohorts can be found from
Coldditz and Hankinson (2005) and Rimm et al. (1991). Originally, the data set contained
3,391 females (NHS) and 2,599 males (HPFS). After data cleaning by removing subjects with
unmatch genotypes and phenotypes, SNPs with more than 10% missing rate, MAF < 0.05,
and deviation from Hardy-Weinberg equilibrium (p-value < 0.001), the data set contains
655,002 SNPs and a total of 5,865 subjects (2,494 males and 3,371 females), of which there
were 2,733 cases and 3,132 controls.
There are 12 continuous covariates including: height, weight, age, alcohol consumption
etc. We fit a marginal logistic regression model for all 12 factors. Based on their marginal
p-values and the correlation between those factors, we decided to select 5 covariates as the
environmental variables in the model. They were total physical activity (X1 denoted as act),
BMI (X2 ), alcohol intake (X3 denoted as alcohol), heme iron intake (X4 denoted as heme),
and glycemic load (X5 denoted as gl). Based on the location of the SNPs, we mapped all
SNPs to all known genes. Then we selected the genes with more than 30 SNPs. As a result,
we obtained 2,178 genes. We applied the proposed variable selection approach by fitting one
gene at a time to select significant SNPs and identify their effects. Since the first element
of the loading parameter β, β1 has to be a non-zero positive number for identifiability
purpose, we fit the proposed model five times, each time with a different variable as the first

54

component in X. If a SNP is selected as varying or constant effect in all five models, this
would suggest a convincing signal. We only considered results with a convincing signal, that
is, SNPs showing non-zero effect in all the five fitted models by varying the order of the five
environmental variables in X.
In total, our model identified 13 varying effect SNPs and 26 constant effect SNPs. Here
we presented one of the selected varying SNP as an example. Please refer to the supplemental
material for the complete list of selected varying and constant SNPs. Fig 4 presents plots of
marginal environmental effect and the interaction effect of a SNP rs6537663 with heme iron
intake being the first loading covariate. With the increase of the index, we observed that
the marginal effect first decreases then increases, followed by a rapid decrease as the total
effect of the five environmental factors increases. For the interaction effect, it fluctuated
around 0 as the total effect of the five environmental variables increases, indicating that the
SNP is not responding (or insensitive) to the changes of the five environmental variables.
As the index X β increases, the SNP reacts to the environmental changes, with a dramatic
increased risk on type 2 diabetes after a certain threshold. This estimated effect implies
that the genetic sensitivity of the SNP to the total effect of the five environmental variables
follows a threshold model. This has practical applications as people are mostly OK with the
daily environmental changes including dietary changes. However, such changes cannot pass
a certain limit. Otherwise, disease may occur. Table 11 presents the selection and estimation
results of the loading parameters β with heme iron intake being the first loading covariate.
The model selects all loading parameters except the alcohol consumption (alcohol). We also
observed that body mass index (bmi) has the largest effect which makes practical sense since
BMI is positively associated with type 2 diabetes and is a risk factor for diabetes.

55

Effect of rs6537663

1.0
0.5
−0.5

0.0

Log Odds of Type II Diabetes

1
0
−1

Log Odds of Type II Diabetes

2

1.5

Environmental Effect

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

Xβ

0.4

0.6

0.8

1.0

Xβ

Figure 4: Plot of effects on a log odds scale for SNP rs6537663

Table 11: The estimated effect of β for SNP rs6537663.
act

bmi

alcohol

heme

gl

-0.1832

0.954445

0

0.215668

0.094647

Previous study (Sale et al. (2007) and Grant et al. (2006)) suggested that gene TCF7L2
is associated with type 2 diabetes across multiple populations. Sale et al. (2007) reported
strong association between type 2 diabetes and SNPs rs7903146 and rs7901695. Our model
also selected SNP rs7901695 showing a constant effect (rs7903146 was not in our data set).

3.6

Discussion

Gene-environment interaction has been one of the major components in genetic association
studies. In this paper, we developed a 3 stage iterative variable selection approach for
generalized varying multi-index coefficient model with binary responses. Our goal was to
56

identify varying, constant and zero effects as well as to select non-zero loading parameters.
Biologically speaking, our approach is attractive since it offered a novel way to look at
G×E interaction from a systems genetics perspective.

Our model is flexible to detect

non-linear interactions. It should be preferred when the gene effect is non-linearly modified by
simultaneous exposure to multiple environmental factors. Statistically speaking, gVMICM
treats the effect of multiple variables X as a single index, thus reducing model dimension
and alleviating the curse of dimensionality.
In a typical G×E study, there are usually hundreds of thousands of genotype (SNPs)
and a couple dozens environmental variables. It is important to reduce the dimension of
gene predictors first before apply our method. For example, one could fit a marginal model
between the response and every genotype, then select a reasonable number of gene predictors
to fit the variable selection model. In human genome, a pathway usually contains a wide
range of genes and each gene could contain ten to hundred of SNPs. In this chapter, we
implemented the proposed method focusing a gene as a unit to select varying and constant
effect SNPs. We could implement such method in every pathway to select significant genes
or SNPs of importance. Alternatively, we could apply principle component analysis (PCA)
(Jolliffe (2002)) or sparse principle component analysis (sPCA) (Zou et al. (2006)) to
summarize the SNP information in a gene or pathway to several principle components (PCs),
apply the proposed method to select significant PCs.
In chapter 2, we proposed a 3 stage variable selection approach for VMICM with
continuous responses. Then we generalized such approach to binary responses in chapter
3. The regression model applied in chapter 2 is essentially a mean regression model in which
one is interested in the conditional mean. When there are outliers or the nature of interest
is not on the mean rather than on different quantiles, a quantile regression model might be
57

a natural choice. For example, when studying the effect of genes on birth weight, people are
typically interested in the effect of SNPs on the lower or upper quantile of the birth weight
because extremely low or high birth weight may pose potential risk later in life. In such
cases, it is essential to extend the proposed method to a quantile regression setup to select
important genes modulated by multiple environmental exposures to affect a trait of interest
such as birth weight. This will be addressed in chapter 4.

58

Chapter 4
Variable Selection for Quantile
VMICM

4.1

Introduction

Over the past decades, there has been a growing interest in identifying gene-environment
(G×E) interaction in scientific research.

Gene-environment interaction was defined as

a different effect of a genotype on disease risk under different environmental exposures
(Ottman (1996)). Traditionally, G×E interactions has been investigated based on a single
environmental exposure model. However, more and more epidemiological studies reveal that
disease risk can be modified by simultaneously exposure to multiple environmental factors
(Carpenter et al. (2002) and Sexton and Hattis (2007)). When multiple environmental
factors are analyzed, the model dimension can increase dramatically with the inclusion of
the interaction terms, which lead to estimation instability and large standard errors (curse
of dimensionality). To ease such burden, a varying multi-index coefficient model(VMICM)
(Liu et al. (2016)) can be applied to model the interaction between genetic factors Gk and
a mixture of environmental factors X by
p

Y =

mk (Xβ)Gk +
k=0

59

where mk (u), k = 0, 1, . . . , p are continuous smooth functions; β is q-dimensional loading
parameter; and is the random error. We have proposed an iterative 3 steps variable selection
method for VMICM in chapter 2, which classifies the non-parametric smooth function mk (u)
into three categories: varying, constant and zero. The goal of this paper is to generalize such
approach to a quantile regression setting.
The quantile VMICM is a important alternative to the conditional mean models for
analyzing G×E interaction.

First, comparing conditional mean regression, modelling

conditional quantiles offers a far more comprehensive understanding of the distribution of the
response variable. In many applications, the impact of the genetic factor G on the response
varies at different quantiles of the distribution. People are more interested in the quantiles
rather than the mean. For example, in a study to find genes associated with birth weight,
extremely low or high birth weight could be problematic since it could cause complications
later in life. In this case, one would be interested in identifying genes or SNPs affecting low
or high birth weight with their effect modified by environmental exposures. Second, even
when we are interested in the center of the conditional distribution of the response variable,
median regression (quantile regression with τ = 0.5) can provides more robust estimators,
especially when there are outliers in the trait distribution.
With the recent advancement in biotechnoloty, we are now able to collect hundreds
of thousands of single nucleotide polymorphisms (SNPs) data, at the same time with
relatively low cost. Such advancement renders traditional model selection methods such as
forward/backword selection or methods based on AIC/BIC information obsolete. Recently,
variable selection via penalized regression has been gaining popularity. The idea is to
add a penalty term to the loss(likelihood) function.

With different choices of penalty

functions, the estimator could possess different properties. Fan and Li (2001) proposed three
60

important criteria for penalized estimator: sparsity, unbiasedness and continuity. They also
characterized oracle property, meaning that the model performs as well as if the true model
is known in advance. For instance, adaptive LASSO (Zou (2006)), smoothly clipped absolute
deviation (SCAD) (Fan and Li (2001)) and minimax concave penalty (MCP) (Zhang (2010))
possess oracle property.
In this work, we proposed a quantile regression based variable selection method built upon
the VMICM model to identify how genetic effects are modified by simultaneous exposure to
multiple environmental factors to affect a disease trait. We adopted the MCP penalty in
our modeling framework. We evaluated the method through both simulation and real data
applications. The proposed variable selection method for quantile regression enriches the
literature about variable selection for quantile regression.
The rest of the paper is organized as follows.

Section 2 introduces the proposed

variable selection method, how to formulate the penalized quantile loss function, how to
iteratively optimize the penalized quantile loss function as well as how to select various
tuning parameters and initial value for β. In section 3, we evaluated the finite sample
performance of our model via monte carlo simulation. We applied our approach to a birth
weight data set in section 4, followed by a discussion.

4.2

Variable selection for quantile regression with
VMICM

Throughout the paper, superscript T is used to denote matrix transpose; ||.||p is used to
denote Lp norm; log(a) is used to denote natural logarithm of a. For the sake of simplicity,
we use constant and non-zero constant interchangeably.
61

4.2.1

Model setup

For a random sample of the data {Y n×1 , X n×q , Gn×(p+1) } with size n, we assume the
following model:
p

Y =

mk (Xβ(τ ), τ )Gk + (τ )

(4.1)

k=0

where Y = (Y1 , Y2 , . . . , Yn )T is a continuous response variable; X = (X 1 , X 2 , . . . , X q )
is a continuous q-dim environmental variable; Gn×(p+1) = (G0 , G1 , . . . , Gp ) where G0 =
(1, . . . , 1)T and Gk is a continuous or discrete genetic vector of length n for k = 1, 2, . . . p.
Our parameters of interest {mk (u, τ )}k=0,1,...,p are p + 1 unknown non-parametric functions
conditional on quantile τ ; β(τ ) = (β1 (τ ), . . . , βq (τ ))T is the loading parameters conditional
on quantile τ .

(τ ) is an unknown random error satisfies P ( (τ ) < 0|X, G) = τ for some

specific quantile 0 < τ < 1. The case with τ = 0.5 corresponds to median regression. For
ease of presentation, all parameters of interest are τ -specific. For instance, we use β and
mk (u) to represents β(τ ) and mk (u, τ ), respectively. We denote the quantile VMICM model
as τ VMICM.

4.2.2

Estimation method

Our goal is to select and estimate unknown functions {mk (·)}k=0,1,...,p and unknown loading
parameter β = (β1 , . . . , βq )T . For the sake of identifiability, we assumed ||β||2 = 1 and β1 >
0, and mk (·) cannot has the form of mk (u) = αT uβ T u+γ T u+c. Please refer to Schumaker
(2007) for details of the construction of B-splines basis function. Given the number of interior
knots K and the degree of the B-spline basis function h, we can approximate mk (u) by

¯
mk (u) ≈ γk1 + B(u)γ
k∗
62

(4.2)

where γ k∗ = (γk2 , γk3 , . . . , γkL )T , γ k = (γk1 , γ Tk∗ )T and L = K + h + 1. Hence, (4.1) can
be rewritten as :
p

¯
[γk1 + B(Xβ)γ
k∗ ]Gk + .

Y =

(4.3)

k=0

Here, the estimation of non-parametric function mk (u)k=0,1,...,p and its loading parameter
β is transformed to the estimation of {γk1 , γ k∗ }k=0,1,...,p and β. This B-spline approximation
(4.2) also enable us to separate the constant effect of Gk on Y from its joint effect with X
on Y . (1) If ||γ k∗ ||2 = 0, then Gk and X jointly affect Y ; (2) If ||γ k∗ ||2 = 0 and |γk1 | = 0,
then Gk has a constant effect on Y ; (3) If ||γ k∗ ||2 = 0 and |γk1 | = 0 then Gk has no effect
on Y at all.
Following the idea proposed in previous chapters, we adopted the following penalized
regression approach and defined the objective function as Qτ (β, γ):
p

n

ρτ Yi −

Qτ (β, γ) =
i=1

p

¯
[γk1 + B(Xβ)γ
k∗ ]Gik + n
k=0

k=1

p

q

pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) + n

+n

pλ1 (||γ k∗ ||2 )

k=1

(4.4)

pλ3 (|βd |)
d=2

where ρτ (u) = u{τ − I(u < 0)} is the quantile loss function; pλ1 (·), pλ2 (·), pλ3 (·) are penalty
functions of the corresponding parameters; and I(·) is an indicator function which equals 1
if the condition in the parentheses is satisfied, and 0 otherwise. From the construction of
the penalty function, we only penalize γk1 if ||γ k∗ ||2 = 0, meaning that we are interested in
whether the non-parametric function mk (·) is zero or a non-zero constant only when it is not
varying. No penalty is applied to the intercept function m0 (·) as both γ 0∗ and γ01 are not
involved in the penalty term. No penalty is applied to the coefficient β1 due to the constrain:

63

β1 > 0. For the penalty function, we use MCP penalty proposed by Zhang (Zhang (2010))
which is defined as p(x, λ) = λ 0x (1 − τsλ )+ ds with the regularization parameters τ > 0 and
λ > 0.

4.2.3

Estimation algorithm

In the previous chapters, we proposed a 3 step iterative approach based on the VMICM model
for both continuous and binary responses. The methods classify the non-parametric functions
mk (·) into 3 categories: varying, constat or zero, denoted by V, C and Z respectively. Here,
we generalized the estimation algorithm to a quantile regression setting.
Step 1 :
(1)

ˆ (0) , the step 1 estimator of γ, γ
ˆ (1)
For given β, denoted by β

=

(1)T

ˆ k∗ }Tk=0,1,...,p can be obtained via optimizing the following grouped penalized
{ˆ
γk1 , γ
regression
ˆ (0) )
ˆ (1) = min Q1 (γ|λ1 , β
γ
γ
where

ˆ (0) ) =
Q1 (γ|λ1 , β

p

n

ρ τ Yi −
i=1

p
(0) )γ ]G
¯
[γk1 + B(Xβ
k∗ ik + n

k=0

pλ1 (||γ k∗ ||2 ).
k=1

Instead of penalizing each coordinate of γ k∗ = (γk2 , . . . , γkL )T separately, we penalized the
L2 norm of γ k∗ since we would like to assess the presence of the varying effect of Gk on
the response variable. Step 1 classifies mk (·), k = 1, . . . , p into two categories: varying(V) or
(1)

(1)

non-varying(NV) where mk (·) ∈ V if ||ˆ
γ k∗ ||2 > 0 and mk (·) ∈ N V if ||ˆ
γ k∗ ||2 = 0.
Step 2 : Based on the step 1 estimators of the B-spline coefficients γ, the step 2 estimators
(2)

(2)

(2)

ˆ (2) = {(ˆ
ˆ k∗ )k∈V , (ˆ
γ
γk1 , γ
γk1 )k∈C } can be obtained via the penalized regression. Note that

64

(2)

(1)

ˆ k∗ = 0 automatically if γ
ˆ k∗ = 0. We obtained the estimator by,
γ

ˆ (2) = min Q2 (γ|λ2 , β (0) , γ
ˆ (1) )
γ
γ

where
n

Q2 (γ|λ2

ˆ (1) )
, β (0) , γ

i=1

(2)

(0) )γ ]G −
¯
[γk1 + B(Xβ
k∗ ik

ρτ Yi −

=

k∈V

γk1 Gik
k∈C

p
(2)

(1)

pλ2 (|γk1 |)I(||ˆ
γ k∗ ||2 = 0).

+n

(4.5)

k=1

ˆ (0) , we can obtain the estimators of the B-splines
Based on the initial estimator of β, β
ˆ (2) and classify mk (·) k = 1, . . . , p into V, C or 0.
coefficients γ, γ
ˆ via the penalized regression by
Step 3 : We obtained β

ˆ=
β

ˆ (2) )
min Q3 (β|λ3 , γ
||β ||2 =1

where
n

Q3 (β|λ3

ˆ (2) )
,γ

=

p

q

(2)
ρ τ Yi −
[ˆ
γk1
i=1
k=0

(2)
¯
+ B(Xβ)ˆ
γ k∗ ]Gik

pλ3 (|βd |).

+n
d=2

ˆ (0) = β,
ˆ then iterate step 1 to 3 until convergence. Denote γ
ˆ as the
ˆ and β
Step 4 : Set β
converged estimators.
With this iteration approach, we still need to select the tuning parameters λ1 , λ2 , λ3 , order
h and number of interior knots K for the B-spline approximation, as well as a reasonable
initial value for β.

65

4.2.4

Selection of parameters

For a traditional linear mean regression model, Bayesian Information Criterion (BIC)
(Schwarz (1978)) has been a popular choice in the selection of shrinkage parameters. Lee et
al. (2014) provided theoretical justification in the use of BIC in quantile regression models.
Hence, we use BIC as our selection criterion for shrinkage parameters, and the order h and
the number of interior knots K of the B-spline basis functions.

4.2.4.1

Selection of the tuning parameters λ1 , λ2 , λ3

Step 1: We took λ1 as the minimizer of
p

n

ρ τ Yi −

BIC(λ1 ) = log
i=1

k=0

log(n)
(λ )
(λ )
ˆ (0) )ˆ
¯
∗ dfλ1
β
γ k∗1 ]Gik +
[ˆ
γk11 + B(X
2n

(λ ) (λ1 )
ˆ (0) ) defined above; β
ˆ (0) is
ˆ k∗ }k=0,1,...,p are the minimizers of Q1 (γ|λ1 , β
where {ˆ
γk11 , γ

chosen as the estimator from previous iteration; and dfλ1 is defined as the total number of
non-zero coefficients if λ1 is the penalized parameter.
Step 2: We took λ2 as the minimizer of
p

n

ρ τ Yi −

BIC(λ2 ) = log
i=1

log(n)
(λ )
(λ )
ˆ (0) )ˆ
¯
[ˆ
γk12 + B(X
β
γ k∗2 ]Gik +
∗ dfλ2
2n

k=0

(λ ) (λ2 )
ˆ (0) ) defined above and dfλ
ˆ k∗ }k=0,1,...,p are the minimizers of Q2 (γ|λ2 , β
where {ˆ
γk12 , γ
2

is defined as the total number of non-zero coefficients if λ2 is the penalized parameter.
Step 3: We took λ3 as the minimizer of
p

n

ρ τ Yi −

BIC(λ3 ) = log
i=1

log(n)
(2)
(2)
ˆ (λ3 ) )ˆ
¯
[ˆ
γk1 + B(X
β
γ k∗ ]Gik +
∗ dfλ3
2n

k=0

66

ˆ (λ3 ) are the minimizers of Q3 (β|λ3 , γ
ˆ (2) ) defined above and dfλ3 is defined as the
where β
total number of non-zero β if λ3 is the penalized parameter.
To find the optimal tuning parameters, λ1 , λ2 , λ3 are searched over a grid of exponentially
decreasing values with the minimum to be 1E-3, and the maximum of λ1 , λ2 , λ3 are set to be
the minimum value such that all of the penalized estimators are 0. For ease of computation,
the number of grid points is set to be 100.

4.2.4.2

Selection of the order h and the number of interior knots K

As we discussed in previous chapters, higher order of the B-spline basis function implies more
complex functions and leads to less interpretable effect. From a practical point of view, the
interaction effect is less likely to be highly nonlinear. Thus, we searched optimal h over the
set h ∈ {2, 3, 4} to avoid any complications in interpretation. As for the number of interior
knots K, we searched the optimal K in K = {2, 3, 4, 5}. For every combination of K and
h, we fit the following intercept only model,

Y = m0 (Xβ) +

(4.6)

Again, as we discussed in previous chapters, this is to avoid computational burden. The
optimal K and h are those that minimize log

n
ˆ
i=1 ρτ (Yi

log(n)

− Yi ) + 2n (K + h + 1), where

Yˆi is the estimation of the i-th subject under model (4.6).

4.2.5

Selection of the initial values

The initial value of β for a single index model is usually set to be (1, 0, . . . , 0)T or
( √1q , . . . , √1q )T . However, neither works well in our simulations. Hence, we get the initial
67

value by fitting the intercept only model (4.6).

4.3

Simulation

We conducted extensive simulation to investigate the finite sample performance of the
proposed selection approach under the τ VMICM model. The performance is evaluated in
several ways: (1) the selection accuracy (oracle percentage) of the non-parametric function
mk (u); (2) IMSE of m(u);
ˆ
(3) the selection accuracy (oracle percentage) of β and (4) MSE
of β. The performance is evaluated over 1000 simulation runs.
The oracle percentage of mk (u) is defined as the percentage of correct classification of
mk (u). For instance, if mk (u) ∈ V , and mk (·) is classified as varying for g times, then the
g
× 100%. IMSE of mk (·) is defined as
oracle percentage of mk (·) is calcualted as 1000

1
1000

1000
r=1

1
[
100

100

(m
ˆ k (uj )r − mk (uj ))2 ]
j=1

where m
ˆ k (uj )r is the fitted value of mk (uj ) for the r-th simulation and uj is taken at the
ˆ
j − % quantile among the range of X β

(r)

. The oracle percentage of β is defined as the

percentage of correct selection of β. For example, if βd = 0 and βd is selected to be non-zero
g
for g times, then the oracle percentage of βd is calculated as 1000
× 100%. MSE of βd
1
is calculated as 1000

1000 ˆ(r)
r=1 (βd

(r)
− βd )2 where βˆd is the estimator for βd in the r − th

simulation.

68

4.3.1

Simulation Setting

Our data was generated according to the model,
p

Y = m0 (Xβ) +

mk (Xβ)Gk + (τ ).
k=1

Five (q = 5) independent environmental factors were generated with each one being
generated from a U nif (0, 1) distribution. For the loading parameter β = (β1 , β2 , . . . , βq )T ,
we set β1 = β2 = √1 and the rest βj s were set as zeros.
2

was generated from a N (0, 1)

distribution and (τ ) = − F −1 (τ ) where F denotes the CDF of . F −1 (τ ) was subtracted
from

to make sure the τ th-quantile of (τ ) is zero. For the genetic factor G, we evaluated

the performance of our model with both continuous and discrete variables.

4.3.2

The continuous case

We first evaluated the performance of our approach with continuous genetic predictors G
which marginally followed a N (0, 1) distribution. The non-parametric functions mk (u) were
set to be: m0 (u) = 2sin(2πu), m1 (u) = 2cos(πu) + 2 and m2 (u) = sin(2πu) + cos(πu) + 1;
m3 (u) = 2 and m4 (u) = 2.5 which were non-zero constants; mk (u) = 0 for k = 5, . . . , p
which were zero effects. We simulated the data under p = 50, 100, τ = 0.25, 0.5, 0.75, and
n = 2000.
Table 12 presents the selection and estimation result for non-parametric functions mk (u)
with continuous predictor G. For varying and constant coefficient function (m1 (·) to m4 (·)),
the oracle percentage was at 100% and model IMSE was in the order of -2 for all cases. These
suggest that our model could correctly select and estimate non-zero effect predictors at all
quantiles. Between the median regression case (τ = 0.5) and the cases where τ = 0.25, 0.75,
69

the false positive rate was much lower for median regression case (around 1%) than that
of the case τ = 0.25, 0.75 (around 10%). Also the model IMSE of the case τ = 0.5 was
smaller than the model IMSE of the case τ = 0.25, 0.75. These are expected since median
regression usually provides the most accurate estimator among different quantiles. Between
the case p = 50 and the case p = 100, we did not observe an significant difference in
model performance. Overall, the proposed variable selection approach can correctly select
non-zero effect predictors at all quantiles. Compared with median regression, although the
false positive rate and model IMSE were higher in the case τ = 0.25, 0.75, they were still
decent.
Table 12: Selection and estimation accuracy of mk (·) for continuous G
p = 50

p = 100

m0 (.)
m1 (.)
m2 (.)
0.25
m3 (.)
m4 (.)
Zero

100.0%
100.0%
100.0%
100.0%
100.0%
88.8%

2.78E+00
8.16E-02
9.59E-02
2.58E-02
2.56E-02
1.03E-02

2.96E-02
6.16E-03
1.30E-02
8.77E-04
1.02E-03
0

100.0%
100.0%
100.0%
100.0%
99.8%
90.6%

2.67E+00
7.07E-02
7.80E-02
2.19E-02
2.12E-02
7.27E-03

2.97E-02
6.16E-03
1.31E-02
9.71E-04
9.80E-04
0

m0 (.)
m1 (.)
m2 (.)
m3 (.)
m4 (.)
Zero

100.0%
100.0%
100.0%
99.7%
99.9%
98.7%

7.49E-02
4.98E-02
5.72E-02
1.25E-03
1.50E-03
1.00E-04

2.88E-02 100.0%
5.07E-03 100.0%
1.22E-02 100.0%
7.90E-04 99.9%
7.91E-04 99.8%
0
99.1%

7.70E-02
4.89E-02
5.77E-02
1.30E-03
1.48E-03
6.26E-05

2.91E-02
5.31E-03
1.23E-02
7.90E-04
8.29E-04
0

m0 (.)
m1 (.)
0.75 m2 (.)
m3 (.)
m4 (.)
Zero

100.0%
100.0%
100.0%
99.9%
99.9%
88.5%

2.60E+00
6.25E-02
7.12E-02
2.42E-02
2.62E-02
1.05E-02

3.01E-02
6.08E-03
1.35E-02
8.58E-04
9.93E-04
0

2.60E+00
6.17E-02
7.14E-02
2.18E-02
2.20E-02
7.19E-03

3.13E-02
6.18E-03
1.37E-02
9.03E-04
9.63E-04
0

0.5

100.0%
100.0%
100.0%
100.0%
100.0%
90.7%

Table 13 presents the selection and estimation results for the loading parameters β. For
non-zero loading β’s (β1 and β2 ), the oracle percentage was close to 100% in all cases. This
70

suggests our model could correctly select non-zero loading parameters. For zero loading β’s
(β3 , β4 , and β5 ), the oracle percentage for the median regression case was around 98.5%.
Compared between the case τ = 0.25 and τ = 0.75, oracle percentage for the case τ = 0.25
(99%) was slightly higher than that of the case τ = 0.75 (96%), while the MSE for the
case τ = 0.25 is higher than that of the case τ = 0.75. Between the case p = 50 and
p = 100, we did not observed a difference in model performance. Overall, the proposed
model can correctively select and estimate loading covariates with reasonably high accuracy
at all quantiles.
Table 13: Selection and estimation accuracy of β
p = 50
τ

Model

Oracle

Oracle %

Model

Oracle

β1
β2
0.25 β3
β4
β5

100.0%
99.8%
99.9%
99.9%
99.9%

1.20E-02 4.76E-05
1.47E-02 4.75E-05
7.15E-05
0
2.09E-04
0
9.60E-05
0

100.0%
100.0%
99.9%
99.9%
99.9%

6.71E-03 4.36E-05
7.69E-03 4.37E-05
5.81E-05
0
2.93E-04
0
9.53E-05
0

β1
β2
β3
β4
β5

100.0%
100.0%
98.6%
98.3%
98.7%

5.29E-05 3.78E-05
5.29E-05 3.78E-05
1.52E-06
0
3.33E-06
0
1.85E-06
0

100.0%
100.0%
98.3%
98.8%
98.9%

5.33E-05 3.54E-05
5.34E-05 3.54E-05
1.23E-06
0
3.24E-06
0
1.45E-06
0

β1
β2
0.75 β3
β4
β5

100.0%
100.0%
95.3%
95.9%
95.8%

4.40E-03 4.81E-05
4.51E-03 4.80E-05
4.14E-05
0
5.31E-05
0
3.85E-05
0

100.0%
100.0%
96.3%
96.2%
95.8%

3.60E-03 4.89E-05
3.49E-03 4.88E-05
1.87E-05
0
4.23E-05
0
2.32E-05
0

0.5

4.3.3

Oracle %

p = 100

The discrete case

We continued to evaluate the performance of our model with discrete genetic predictors G.
One of many applications of our model is to select significant single nucleotide polymorphism
71

(SNP) in a gene or pathway. All SNPs take values of 0, 1 and 2 to represent aa, Aa, and
AA genotype under an additive genetic model. We simulated G according to the following
probability distribution function,

P (Gij = 0) = M AF 2 , P (Gij = 1) = 2 ∗ M AF (1 − M AF ), P (Gij = 2) = (1 − M AF )2

where Gij is the j-th predictor of the i-th subject, i = 1, . . . , n and j = 1, . . . , p. The
data were simulated under p = 50, 100, τ = 0.25, 0.5, 0.75, and n = 2000. We set the
non-parametric function mk (u) and the corresponding MAF for Gk as follows:
Table 14: Setup for mk (u)
Function
m0 (u) = 2sin(2πu)
m1 (u) = 2cos(πu) + 2
m2 (u) = sin(2πu) + cos(πu) + 1
m3 (u) = 2cos(πu) + 2
m4 (u) = sin(2πu) + cos(πu) + 1
m5 (u) = 2cos(πu) + 2
m6 (u) = sin(2πu) + cos(πu) + 1
m7 (u) = 2
m8 (u) = 2
m9 (u) = 2
mk (u) = 0, k > 9

MAF of Gk
NA
0.5
0.5
0.3
0.3
0.1
0.1
0.5
0.3
0.1
Unif(0.05,0.5)

From the simulation setup, we would be able to evaluate how our proposed variable
selection method perform with SNPs of a wide range of MAF. Table 15 presents the selection
and estimation result for the non-parametric functions with discrete genetic predictors.
Compared to the case where τ = 0.25, 0.75, the median regression case (τ = 0.5) performed
better. For instance, the oracle percentage for varying effect Gk was around 99% for median
regression, while that of the case τ = 0.25, 0.75 ranged from 90% to 97% (p = 50) and 84%
72

to 91% (p = 100). Further, the model IMSE for the median regression case was smaller than
that of the case τ = 0.25, 0.75. Between the case where p = 50 and p = 100, the case p = 50
performed slightly better both in terms of oracle percentage and model IMSE with varying
effect predictors. Overall, the model performed reasonably well across different quantiles. In
addition, we observed the model performance was associated with minor allele frequency of
Gk . We prepared figure 5 to better visualize such trend. With an increase of minor allele
frequency of Gk (from 0.1 to 0.5), we observed an increase in oracle percentage as well as a
decrease in model IMSE. These suggest that the model performs better with common variant
SNPs. It is expected since rare variant SNP has less information.

Table 15: Selection and estimation accuracy for mk (u) with discrete G

Oracle %

τ

p = 50
IMSE
Model
Oracle

Oracle %

p = 100
IMSE
Model
Oracle

Intercept
Varying
0.25
Constant
Zero

100.0%
97.5%
94.2%
93.9%

7.54E-01
1.44E-01
1.66E-02
4.71E-03

2.47E-02
2.30E-02
3.19E-03
0

100.0%
91.4%
95.5%
96.0%

7.98E-01
2.23E-01
1.66E-02
3.05E-03

2.44E-02
2.32E-02
3.22E-03
0

Intercept
Varying
Constant
Zero

100.0%
99.9%
95.4%
93.6%

4.84E-02
9.14E-02
7.52E-03
2.87E-03

2.35E-02
2.00E-02
2.68E-03
0

100.0%
99.3%
95.5%
93.8%

4.97E-02
9.84E-02
9.30E-03
2.87E-03

2.30E-02
1.98E-02
2.76E-03
0

Intercept
0.75 Varying
Constant
Zero

100.0%
90.8%
96.7%
96.8%

1.02E+00
2.29E-01
1.35E-02
2.09E-03

2.53E-02
2.33E-02
3.19E-03
0

100.0%
84.1%
98.6%
98.5%

1.09E+00 2.48E-02
3.35E-01 2.30E-02
1.26E-02 3.14E-03
9.24E-04
0

0.5

73

Model IMSE for varying effect Gk (p=50)

1.0

Oracle percentage for varying effect Gk (p=50)

Oracle IMSE for varying effect Gk (p=50)
tau = 0.25
tau = 0.5
tau = 0.75

0.1

0.2

0.3

0.4

0.0

0.0

tau = 0.25
tau = 0.5
tau = 0.75

0.02
0.00

0.1

0.2

0.01

0.2

IMSE

0.3

IMSE

0.6
0.4

Oracle Percentage

0.4

0.8

0.03

0.5

tau = 0.25
tau = 0.5
tau = 0.75

0.5

0.1

0.2

0.3

0.4

0.5

0.1

0.2

0.3

0.4

0.5

Minor Allele Freq.

Minor Allele Freq.

Oracle percentage for varying effect Gk (p=100)

Model IMSE for varying effect Gk (p=100)

Oracle IMSE for varying effect Gk (p=100)

1.0

Minor Allele Freq.

tau = 0.25
tau = 0.5
tau = 0.75

0.02

IMSE

0.4

IMSE

0.6
0.1

0.2

0.3
Minor Allele Freq.

0.4

0.5

0.0

0.0

tau = 0.25
tau = 0.5
tau = 0.75

0.00

0.2

0.01

0.4
0.2

Oracle Percentage

0.6

0.8

0.03

0.8

tau = 0.25
tau = 0.5
tau = 0.75

0.1

0.2

0.3
Minor Allele Freq.

0.4

0.5

0.1

0.2

0.3

0.4

0.5

Minor Allele Freq.

Figure 5: Selection and estimation accuracy of mk (·) for discrete G

Table 16 shows the selection and estimation of loading parameters β with discrete G.
The proposed model select and estimate non-zero loading β with high precision (100% in
all cases). For zero loading parameters (β3 , β4 , and β5 ), the false positive rate was around
0%, 1.5%, and 6.5% for the case τ = 0.25, τ = 0.5, and τ = 0.75 respectively. The MSE for
zero loading β’s was in the order of −5 to −7, suggesting the proposed model shrunk it very
close to 0.

74

Table 16: Selection and estimation accuracy for β with discrete G

τ

Oracle %

p = 50
MSE
Model
Oracle

Oracle %

p = 100
MSE
Model
Oracle

β1
β2
0.25 β3
β4
β5

100.0%
100.0%
100.0%
100.0%
100.0%

3.98E-04
4.04E-04
0
0
0

4.42E-05
4.43E-05
0
0
0

100.0%
100.0%
100.0%
100.0%
100.0%

4.63E-04
4.69E-04
0
0
0

β1
β2
β3
β4
β5

100.0%
100.0%
99.1%
98.9%
98.9%

5.20E-05 3.96E-05
5.21E-05 3.97E-05
2.27E-07
0
3.43E-07
0
5.45E-07
0

100.0%
100.0%
99.1%
98.4%
98.7%

5.30E-05 3.97E-05
5.30E-05 3.97E-05
1.00E-06
0
8.12E-07
0
5.87E-07
0

β1
β2
0.75 β3
β4
β5

100.0%
100.0%
93.6%
94.3%
93.5%

2.75E-04 5.13E-05
2.84E-04 5.15E-05
2.52E-05
0
2.23E-05
0
3.46E-05
0

100.0%
100.0%
93.0%
93.9%
93.3%

3.70E-04 4.69E-05
3.59E-04 4.69E-05
2.26E-05
0
1.80E-05
0
3.08E-05
0

0.5

4.75E-05
4.74E-05
0
0
0

Based on the simulation results described above, we were able to conclude the followings.
(1) The proposed model selects and estimates Gk with reasonably high precision. (2)
Compared to rare variant SNPs (Pa = 0.1), the model performed better with common
variant SNPs (Pa = 0.3, 0.5). (3) The model could correctly select and estimate non-zero
loading β. At last, (4) the false positive rate for zero loading β was low for τ = 0.25, 0.5,
and it was slightly higher for τ = 0.75.

4.4

Real Data Application

We applied the proposed variable selection approach to a birth weight data set, which is
obtained from Gene Environment Association Studies initiative (GENEVA) funded by the

75

Genes, Environment and Health Initiative (GEI). Epidemiological studies often suggested
that birth weight is strongly associated with morbidity and mortality risk during the first
year, and risk of many diseases in adulthood. Birth weight is affected by fetal genes and
maternal environment. We first performed data cleaning, removing SNPs with more that 5%
missing, SNPs with minor allele frequency < 0.05 and SNPs deviates from Hardy-Weinberg
equilibrium (p-value < 0.001). After this cleaning step, the data set contains 1,126 subjects
and 590,913 SNPs. For the environmental factors X, based on the marginal p-value (<
0.05) when regressing birth weight with each X variable, we select 3 environmental factors:
mother’s mean OGTT(oral glucose tolerance test) diastolic blood pressure (X1 ), mother’s
one hour OGTT glucose level (X2 ) and mother’s mean OGTT systolic blood pressure (X3 ).
We first map all SNPs to all known genes based on its location. Then we select the genes
which contain ≥ 30 SNPs. As a result, we get 2,076 genes. Then we fit the proposed model
to all genes at τ =0.25, 0.5, and 0.75. Since the first element of the loading parameter β has
to be a non-zero positive number (β > 0), we fit the proposed model three times by varying
the order of the X variable inside the index function. If the SNP is selected as varying or
constant in all the three cases, this would suggest a convincing signal. Therefore, we only
consider the cases where all three models return the same effect classification result.
Our model identified 122 genes with constat effect and no gene with varying effect,
indicating that these genes are not sensitive to the changes of those three environmental
variables, i.e., no significant G×E effect. Consider gene ST3GAL1 located on chromosome
8 as an example. It contains 39 SNPs in our data set. Table 17 presents the estimated SNP
effect in gene ST3GAL1. The left, middle, and right column correspond to the case where
τ = 0.25, τ = 0.50, and τ = 0.75, respectively. Among those, SNPs rs13267049, rs6986303,
and rs6990329 have effect on response in lower quantile (τ = 0.25). SNP rs2142306 only has
76

effect for the τ = 0.5 quantile, and SNPs rs2736860, rs9643299, and rs7460764 have effect
when τ = 0.75. Interestingly, for SNP rs7831227, we observed an increase in negative effect
with the increase of the quantile of the response variable. This suggests that the SNP has
different effect at different quantile of the birth weight. A genome-wide associations study in
2016 (Horikoshi et al. (2016)) suggested that only 15% of the variance in birth weight was
captured by genetic variations. Thus, it is not surprise to see the limited genetic variants
with relatively small effect being selected.

Table 17: Effect of SNPs in gene ST3GAL1
τ = 0.25 τ = 0.50 τ = 0.75
rs13267049
rs2736860
rs2142306
rs6986303
rs6990329
rs9643299
rs7460764
rs7831227

0.0260
0
0
0.1129
0.1411
0
0
-0.0294

0
0
0.0720
0
0
0
0
-0.0801

0
-0.0290
0
0
0
-0.0489
-0.1077
-0.1007

Table 18 presents the parameter estimates for the marginal effect of environmental factors
on birth weight. The first, second, and third row correspond to the case where τ = 0.25,
τ = 0.50, and τ = 0.75 respectively. We observed different results at different quantiles,
suggesting that the environmental effects are different at different quantiles for the birth
weight. For example, at lower quantile (0.25), X2 (mother’s one hour OGTT glucose level)
and X3 (mother’s systolic blood pressure) showed strongest effect. At higher quantile (0.75),
X1 (mother’s diastolic blood pressure) and X3 showed strongest effect. For median regression
case (τ = 0.5), we observed a strong effect for X1 , while X2 was not selected.
Figure 6 represents the effect of environmental factors on different quantiles of birth

77

Table 18: Estimated Loading Parameter for Gene ST 3GAL1
β1

β2

τ = 0.25 0.272
τ = 0.5 0.895
τ = 0.75 0.637

β3

0.707 0.653
0.000 -0.445
-0.288 -0.715

weight. The red, blue, and black curve correspond to the effect at quantile 0.25, 0.5,
and 0.75, respectively. Since the estimated β is different at different quantile, the span
differs. We observed a higher fitted birth weight at a higher quantile. For τ = 0.75, with
the increased index Xβ, we first observed a quick decrease in fitted birth weight, then it
fluctuates around 3.3. Since the loading coefficient estimates for β2 and β3 are negative, this
implies that higher values for mother’s one hour OGTT glucose level and mother’s systolic
blood pressure potentially contribute to higher birth weight at the upper quantile. For the
median regression, the fitted birth weight first decreases from 3.3 to 3.0, then slowly increases
to 3.2. Both larger values in Mother’s diastolic blood pressure and Mother’s systolic blood
pressure contribute to relatively higher birth weight. For τ = 0.25, we saw a positive trend
in the total effect as the index increases.

78

Environmental Effect on Birthweight

3.2
2.8

3.0

Birth Weight

3.4

3.6

tau=0.25
tau=0.5
tau=0.75

−1.0

−0.5

0.0

0.5

1.0

1.5

X*beta

Figure 6: Plot of interaction effect effects

4.5

Discussion

Varying multi-index coefficient model is a novel way to model non-linear interaction between
genetic variants G and a mixture of environmental factors X. In previous chapters, we
proposed a 3 step iterative variable selection approach with the goal of selecting varying
and constat effect genetic variants as well as non-zero loading parameters. In this paper,
we generalized such approach to a conditional quantile regression setting. Compared to
condition mean regression, condition quantile regression possesses several advantages. First,
modelling the data at several quantiles offers a more comprehensive way to understand the
distribution of the response variable. In many applications, the effect of Gk on the response
differs at different quantiles. Second, quantile regression is robust to extreme observations.
By looking at different quantiles, novel insights about the underlying genetic mechanism

79

could be revealed.
From the setup of τ VMICM, the environmental factors X have to be continuous due
to the model constrain. Nevertheless, discrete factors such as gender and smoking status
might possess significant effect. In this case, we could easily generalize τ VMICM to a partial
linear τ VMICM model as Y = Zα + (ZG)δ +

p
k=0 mk (Xβ)Gk +

(τ ) where Z represent

the discrete environmental factors and δ represent the interaction of gene with discrete
environmental variables. The proposed variable selection approach could be modified to
accommodate these changes.
Although varying multi-index coefficient model enjoys many advantages over traditional
G×E model, it is not without its limitations. One of which is the constrains on its loading
parameters, ||β||2 = 1 and β1 > 0. Potentially, it would lead to different selection results if
we vary the order of loading covariates. In the application to real data, we can fit several
different models by varying the first loading covariate. If all the models return the same
selection results, then we are convinced such finding is valid. The constrains also limit the
interpretation of fitted environmental parameters, it is difficult to characterize the effect of
a single environmental factor. If environmental effect is our primary interest, we should
consider other modelling technique.
In the case study, the estimated interaction coefficients are all constants, indicating no
G×E interaction for birth weight. As the study is focused on the Thai population, this
cannot rule out the possibility that the mother’s conditions may act on fetus’ genome to
affect fetal growth in other populations. Further investigation is needed to confirm this.

80

Chapter 5
Conclusion and future work

5.1

Conclusion

The main goal of this dissertation is to develop variable selection methods to identify
non-linear G×E interactions. We first proposed to use varying multi-index coefficient models
since it allows non-linear interaction between genetic factors and multiple environmental
factors. In Chapter 2, we proposed a 3 step iterative variable selection approach for VMICM
via a penalized regression. It separates gene effects into three categories: varying, constant
and zero. It could also select non-zero loading parameters for environmental factors. In
Chapter 3, we generalized such approach to a generalized regression setting with binary
responses. Following the work of Chapter 2, we extended the proposed variable selection
approach to a quantile regression setting in Chapter 4, since it provided a more comprehensive
understanding of the data and offered more robust estimators.
In conclusion, this dissertation contributes to the literature in two ways. From the
methodological perspective, it contributes to the methods development in variable selection
under a nonparametric varying index coefficients model framework. For the selection of
the nonparametric coefficients, we separated three types of effects rather than just selecting
zero vs non-zero functions. This complicates our selection procedure and distinguishes our
methods to existing ones in the literature (e.g., Feng and Xue (2013)). Theoretical properties

81

of the selection methods were evaluated. Our methods have practical meanings and enrich
the literature of variable selection.
From the application perspective, our method development is well motivated by empirical
studies to evaluate the joint effect of multiple environmental exposures and how they interact
with genes to affect a disease trait. By taking gene or pathway information into account,
we were able to select important players in a gene set. The method developed under the
quantile regression framework makes much biological sense in certain trait such as birth
weight. Novel insights are expected under the proposed models.

5.2

Future work

In the simulation studies, we assumed any two genetic variants Gk and Gl are independent.
However, it would be more desirable if we consider linkage disequilibrium structure among
SNPs. More specifically, we could set the correlation between Gk and Gl to be ρ|k−l| where
ρ = 0.3 for low correlation case, ρ = 0.5 for median correlation case, and ρ = 0.9 for high
correlation case. To demonstrate the robustness of median regression compared to mean
regression, we could consider the case where

follows a t distribution with 3 degree of

freedom.
In Chapter 2 and 3, we theoretically proved that our penalized estimators are consistent in
both estimation and selection under a fixed number of parameters. It could be more desirable
if we could prove the selection consistency when the number of parameters increases as the
sample size increases. Also it could be beneficial if we could demonstrate the consistency
of the penalized estimators in the quantile regression setting in Chapter 4.

Further,

generalizations of the proposed selection approach to other generalized regression setting

82

such as poisson or categorical variable could be done with modification to the likelihood
function. At last, extension of the proposed model to longitudinal data could be considered.
We will consider the aforementioned mentioned future work and continue to investigate
along those line.

83

APPENDICES

84

Appendix A
Real Data Results

A.1

Real data results of gVMICM

Table 19 presented all the varying effect SNPs selected. Similarly, Table 20 presented all the
constant effect SNPs selected.
Table 19: List of SNPs with a varying effect.
GeneID

SNP

GeneID:440600
GeneID:2590
GeneID:729993
GeneID:54768
GeneID:117532
GeneID:758
GeneID:23395
GeneID:647107
GeneID:8633
GeneID:2185
GeneID:4915
GeneID:19
GeneID:286205

rs6537663
rs6666516
rs1015431
rs4788621
rs7509377
rs5766384
rs4311249
rs2404825
rs3775049
rs6557991
rs6559870
rs4742969
rs2416996

85

Table 20: List of SNPs with a constant effect.
GeneID

SNP

GeneID:114827
GeneID:2899
GeneID:260425
GeneID:9857
GeneID:2590
GeneID:6934
GeneID:55742
GeneID:867
GeneID:10867
GeneID:57494
GeneID:196385
GeneID:64328
GeneID:23348
GeneID:23348
GeneID:57099
GeneID:11060
GeneID:25780
GeneID:100505498
GeneID:117532
GeneID:29780
GeneID:25814
GeneID:9620
GeneID:23429
GeneID:80254
GeneID:8633
GeneID:157680

rs3815792
rs12118788
rs11102660
rs2293990
rs9308482
rs7901695
rs7101596
rs4489755
rs740771
rs11047510
rs11058132
rs1961415
rs7326971
rs7991210
rs16962542
rs16970994
rs6708570
rs6730602
rs11696526
rs5765571
rs713999
rs11090812
rs17009630
rs11710699
rs10516957
rs1788161

86

Appendix B
Algorithm

B.1

Algorithm for VMICM

Here we presents the algorithm used in each steps:
ˆ (0) ), we implemented the group
Step 1 : To minimize the objective function Q1 (γ|λ1 , β
coordinate descent algorithm. The design matrix has the form

ˆ (0) )G0 , G1 , B(X
ˆ (0) )G1 , . . . , Gp , B(X
ˆ (0) )Gp )
¯
¯
¯
D = (G0 , B(X
β
β
β
n×(L∗(p+1))

with the corresponding parameters (γ01 , γ 0∗ , γ11 , γ 1∗ , . . . , γ0p , γ p∗ ), where γ k∗ has a length
of L−1. We assigned a grouping index for each of the parameters (from 0 to M , where M +1
is the number of groups). Parameters with the same grouping index were in the same group
and they were penalized as a group, parameters with grouping index 0 were not penalized.
Denote D m as the design matrix for group m, m = 0, 1, . . . M . Given the penalty
parameter λ1 and the MCP tuning parameter γ M CP , γ (1) is obtained through the following
iteration:
(0) Perform Q-R decomposition on all D m , i.e., D m = Qm Rm , m = 0, 1, 2 . . . M , where
QTm Qm = I and Rm is an upper triangular matrix. Hence, Qm is the normalized
design matrix for group m.

87

ˆ initial = {ˆ
(1) For given initial values for γ, denoted as γ
γ initial
}, m = 0, 1, . . . , M .
m
ˆ OLS
ˆ OLS
˜ −m ) = QTm Y −
We obtained OLS estimate γ
via γ
= QTm (Y − Q−m γ
m
m
˜ −m where subscript Q−m represents the normalized design matrix without
QTm Q−m γ
˜ −m represents the most updated values for γ without group m.
group m and γ
ˆ0 = γ
ˆ OLS
(2) For group 0, set γ
.
0
ˆ m via
(3) For all other groups m = 1, . . . , M , obtain the MCP estimate γ
ˆm = γ
ˆ OLS
γ
if ||ˆ
γ OLS
m
m ||2 > λ ∗ τ
λ
τ S(ˆ
ˆ OLS
ˆ OLS
ˆ m = τ −1
γ OLS
≤ λ ∗ τ where S(ˆ
γ OLS
γ
m , λ) if γ
m
m , λ) = (1 − ˆ OLS )+ ∗ γ
m
||γ m ||2

ˆ initial
ˆ m . Iterate (1) through (4) until convergence
(4) Updated γ
in step (1) by γ
m
ˆ m by γ
ˆ fminal = R−1
ˆ m , for m = 0, 1, . . . , M .
(5) We adjusted the converged estimator γ
m γ
ˆ (2) = minγ Q2 (γ|λ2 , β (0) , γ
ˆ (1) ) where
Step 2 : To obtained γ
n

Q2 (γ|λ2

ˆ (1) )
, β (0) , γ

(2)
(0) )γ (2) ]G −
¯
[γk1 + B(Xβ
k
k∗

{Yi −

=
i=1

k∈V

(2)

γk1 Gk }2
k∈C

p
(2)

(1)

pλ2 (|γk1 |)I(||ˆ
γ k∗ ||2 = 0)

+n
k=1

We implement standard coordinate descent algorithm and the design matrix is:

D (2) =

ˆ (0) )Gj
ˆ (0)
¯
¯
Gj , B(X
β
j∈V , Gk , B(X β )Gk k∈C

ˆ we minimized
Step 3 : To obtain β,
p

Q3 (β|λ3

ˆ (2) )
,γ

= ||Y −

q

¯
[ˆ
γk1 + B(Xβ)ˆ
γ k∗ ]Gk ||2 + n
k=0

pλ3 (|βd |)
d=2

88

˜ and doing a local linear approximation of B(Xβ)ˆ
˜
¯
For an given initial value β
γ k∗ Gk at β,
we have
˜ γ Gk + B
˜ γ XGk (β − β)
˜
¯
¯ β)ˆ
¯ (X β)ˆ
B(Xβ)ˆ
γ k∗ Gk ≈ B(X
k∗
k∗
For the d − th coordinate of β, βd , we have

˜ γ ∗ Gk + B
˜ γ ∗ X d Gk (βd − β˜d )
¯
¯ β)ˆ
¯ (X β)ˆ
B(Xβ)ˆ
γ ∗k Gk ≈ B(X
k
k

Thus we could obtain βˆd by minimizing the following:

Qd = ||Y ∗d − X ∗d βd ||22 + npλ3 (|βd |)

where
Y ∗d = Y −
X ∗d =

p
γk1 Gk
k=0 [ˆ

p
¯
k=0 B

˜ γ ∗ Gk − B
˜ γ ∗ Gk X d β˜d ]
¯ T β)ˆ
¯ (X β)ˆ
+ B(X
k
k

˜ γ ∗ Gk X d
(X β)ˆ
k

ˆ ∗ = (βˆ∗ , . . . , βˆ∗ )T via the following
Hence, we obtained the MCP penalized estimator β
q
1
˜ calculate Y ∗ andX ∗ according to the
(1) Given an initial estimate of β, denoted as β,
d
d
above formula.
(2) Normalize X ∗d by X ∗d = X ∗d /||X ∗d ||2 .
(3) Calculate βˆdOLS = X ∗d T Y ∗d .
ˆOLS

(β
−λ)+
(4) Set βˆ1∗ = βˆ1OLS and for d = 1, βˆd∗ = d M CP if |βˆdOLS | ≤ λγ M CP and βˆd∗ = βˆdOLS
1−1/γ
if |βˆOLS | > λγ M CP .
d

adjusted
(5) Adjust βˆd∗ by βˆd
= βˆd∗ ∗ ||X ∗d ||2 .

89

(6) Repeat steps (1) through (5) for all βd , d = 1, . . . , q.
ˆ adjusted , i.e. βˆd =
(7) Normalize β

adjusted
βˆ
d
ˆ adjusted ||
||β
2

adjusted

∗ sign(β1

).

˜ in step (1) with β
ˆ = (βˆ1 , . . . , βˆq )T and iterate step (1) through (6) until
(8) Update β
convergence.

B.2

Algorithm for model (2.6)

For the intercept only model, Y = m0 (Xβ)+ , the following contain the steps in estimation.
¯
(0) We approximated m0 (Xβ) with B-spline basis function: m0 (Xβ) ≈ γ01 + B(Xβ)γ
0∗ .
ˆ let the design matrix D = (1, B(X
ˆ denote
¯
(1) Given initial value for β, denoted as β,
β)),
ˆ = (D T D)−1 D T Y .
γ = (γ01 , γ 0∗ ), we estimated γ with γ
¯
(2) Obtain β updated via minimizing ||Y − γˆ01 − B(Xβ)ˆ
γ 0∗ ||22 with Newton-Raphson
algorithm.
ˆ by β updated in step (1), and iterate until convergence.
(3) Replace β

B.3

Algorithm for gVMICM

Here we showed the algorithm used in gVMICM:
Step 1 & Step 2 followed directly from the group coordinate descent algorithm described
before, hence the details are omitted.
Step 3 :
q

ˆ = max M3 (β|λ3 , γ
ˆ (2) ) = max
pλ3 (|βd |)
β
l(ˆ
γ (2) , β) − n
||β ||2 =1
||β ||2 =1
d=2
90

We implemented the local quadratic approximation technique proposed by Fan and
˜ to be the most updated value of β, by Taylor
Li(2001)(Fan and Li (2001)). Denote β
˜ we have
expansion at β,

˜ T
˜ T (β − β)
˜ + 1 (β − β)
l(ˆ
γ (2) , β)
2

˜ +
l(ˆ
γ (2) , β) ≈ l(ˆ
γ (2) , β)

2

˜
˜
l(ˆ
γ (2) , β)(β
− β)

where
(2)

(2)

γ , β)
∂l(ˆ
γ , β) T
˜ = ( ∂l(ˆ
l(ˆ
γ (2) , β)
,...,
)
is the gradient
˜
∂β1
∂βq
β =β
2 l(ˆ
˜
γ (2) , β)

=

∂ 2 l(ˆ
γ (2) , β)
is the hessian matrix
˜
∂βj ∂βl
1≤j,l≤q β =β

and
∂l(ˆ
γ (2) , β)
= (Y −
∂βj
∂ 2 l(ˆ
γ (2) , β)
= (Y −
∂βj ∂βl

B (3)

)T

1 + e−µ
(3)

∂µB
)T
−
∂βj ∂βl
B (3)
−
µ
1+e
1

(3)

1

∂µB
∂βj
B (3)

e−µ

B (3) 2
)

(1 + e−µ

T

(3)

∂µB
∂βj

with
(3)
µB

p
(2)

(2)

¯
[ˆ
γk1 + B(Xβ)ˆ
γ k∗ ] · Gk

=

k=0
(3)

∂µB
∂βj
(3)

∂µB
=
∂βj ∂βl

p
(2)
¯ (Xβ)ˆ
B
γ k∗ · X j · Gk

=
k=0
p

(2)
¯ (Xβ)ˆ
B
γ k∗ · X j · X l · Gk
k=0

ˆ

ˆ

|β2 |

|βq |

p (β )
p ( βq )
˜ = diag(0, λ3 2 , . . . , λ3
Let Σλ3 (β)
). Then, we updated β with
ˆ
ˆ

˜−
β∗ = β

2

˜ + nΣλ (β)
˜
l(ˆ
γ (2) , β)
3
91

−1

˜ + nΣλ (β)
˜ β
˜
l(ˆ
γ (2) , β)
3

∂µB
·
∂β

(3)

l

β∗
Then we standardized β with β updated = sign(β1∗ ) ∗ . We iterate the above steps until
||β ||2
convergence.

B.4

Algorithm for quantile VMICM

Here, we presented the algorithm used in the quantile VMICM model.
Step 1&2 : The algorithm used in these steps followed directly from Peng and Wang(2015)
and we implemented the R-package “rqPen” (Sherwood and Maidman (2016)).
ˆ = min
ˆ (2) ) where
Step 3: Obtain β
||β ||2 =1 Q3 (β|λ3 , γ
p

n

ρτ Yi −

ˆ) =
Q3 (β|λ3 , γ
i=1

q

¯
[ˆ
γk1 + B(Xβ)ˆ
γ k∗ ]Gik + n
k=0

pλ3 (|βd |)
d=2

¯
Since B(Xβ)
is not a linear function of β, we adopted the idea of first order approximation
˜ as the most updated value of β. We have
¯
for B(Xβ).
Denote β

˜ γ Gk + B
˜ γ XGk (β − β)
˜
¯
¯ β)ˆ
¯ (X β)ˆ
B(Xβ)ˆ
γ k∗ Gk ≈ B(X
k∗
k∗

For individual βd , d = 1, . . . q, we have

˜ γ ∗ X d Gk (βd − β˜d )
˜ γ ∗ Gk + B
¯
¯ β)ˆ
¯ (X β)ˆ
B(Xβ)ˆ
γ ∗k Gk ≈ B(X
k
k

Then we could obtain βˆd by minimizing the following:

Qd = ρτ (Y ∗d − X ∗d βd ) + npλ3 (|βd |)

92

where Y ∗d = Y −
p
¯
k=0 B

˜ γ ∗ Gk X d .
(X β)ˆ
k

p
γk1 Gk
k=0 [ˆ

˜ γ ∗ Gk − B
˜ γ ∗ Gk X d β˜d ] and X ∗ =
¯ T β)ˆ
¯ (X β)ˆ
+ B(X
k
k
d

ˆ∗ =
Hence, we obtained the MCP penalized estimator β

(β1∗ , . . . , βq∗ )T via the following steps:
˜ calculate Y ∗ and X ∗ according to the
(1) For given initial value for β, denoted as β,
d
d
above formula.
ˆ ∗ = (β ∗ , . . . , β ∗ )T via implementing the iterative coordinate descent algorithm
(2) Obtain β
q
1
for quantile regression.
∗

ˆ
ˆ ∗ via β
ˆ = sign(βˆ∗ ) β∗ .
(3) Standardize β
1 ˆ
||β ||2

(4) Iterate steps (1) to (3) until convergence.

B.5

Algorithm for model (4.6)

Consider the intercept only model Y = m0 (Xβ) + , we proposed an iterative algorithm to
estimate its parameters in the following.
(0) Approximate m0 (Xβ) with the B-spline basis function, i.e., m0 (Xβ) ≈ γ01 +
¯
B(Xβ)γ
0∗ .
ˆ obtain γˆ01 and γ
ˆ 0∗ via the “rq” function
(1) For given initial value for β, denoted as β,
ˆ being the design matrix.
¯ β)
in R (in package “quantreg”), with B(X
(2) Obtain β updated via minimizing

n
i=1 ρτ (Y i

¯
− γˆ01 − B(Xβ)ˆ
γ 0∗ ) with the linear

approximation method described in Appendix B.4.
ˆ by β updated , and iterate until convergence.
(3) Update β

93

Appendix C
Proof of Theorems
Some notations: Denote the space of Lipschitz continuous functions for any fixed constant
c as Lip([a, b], c) = {f : |f (x1 ) − f (x2 )| ≤ c|x1 − x2 |, ∀x1 , x2 ∈ [a, b]}. Let C (p) [a, b] = {f :
f (p) ∈ C[a, b]} be the space of the pth order smooth functions.

C.1

Proof of Theorem 2.3.1

We assume the following regularity conditions:
(A1) The density function fU (β ) (·) of random variable U (β) = Xβ is bounded away
from 0 on Ω = {Xβ, X ∈ X }, where X is the compact support of X. There exists a
constant c1 , such that fU (β ) (·) ∈ Lip([a, b], c1 ).
(A2) For k = 0, 1, . . . , p, the non-parametric function mk (·) ∈ C (r) and r ≥ 2.
(A3) E(||G||6 ) < ∞ and E(| |6 ) < ∞.
(A4) The matrix M (u) = E(GGT |Xβ = u) is positive definite, each element of M (u) ∈
Lip([a, b], c4 ).
(A5) Let bn = maxk,l {pλ (||γ 0k∗ ||2 ), pλ (|γ 0k1 |), pλ (|βd0 |), γ 0k∗ = 0, γ 0k1 = 0, βl0 = 0} for
1
2
3
k = 1, . . . , p, d = 2, . . . , q. Then bn → 0 as n → 0
(A6)
1
|p (||γ k∗ ||2 )| > 0 for k = v + 1, . . . , p
n→∞ ||γ || →0+ λ1 λ1
k∗ 2

lim inf

lim inf

94

1
lim inf lim inf
|p (|γ k1 |)| > 0 for k = c + 1, . . . , p
n→∞ |γ |→0+ λ2 λ2
k1
lim inf lim inf

n→∞ |β |→0+
d

1
|p (|β |)| > 0 for d = s + 1, . . . , q
λ3 λ3 d

(A7) Let c1 , . . . , cK be the interior knots of [a, b], where a = inf{u : u ∈ Ω}, b = sup{u :
u ∈ Ω} and c0 = 1, cK+1 = b, hi = ci − ci−1 , h = max{hi }. Then exists a constant C7 such
h
that min{h
< C7 and max{hi+1 hi } = o(K −1 ).
i}

Lemma C.1.1. If mk (u), k = 0, 1, . . . , p satisfy condition (A2), then there exists a constant
C > 0 such that
−r
¯
sup |mk (u) − γk1 − B(u)γ
k∗ | ≤ CK .
u∈Ω

Proof : This result follows directly from a standard B-spline theory.
Denote φ = (β2 , . . . βq )T , hence β = (

1 − ||φ||22 , φT )T and φ0 is the true value of φ.

ˆ it is equivalent to show the consistency of φ
ˆ = (βˆ2 , · · · , βˆq ).
To show the consistency of β,
Let αn = n−r/(2r+1) + an , γ = γ 0 + αn τ 1 , φ = φ0 + αn τ 2 and τ = (τ 1 , τ 2 ), where
τ 1 = (τ01 , τ 0∗ , . . . , τp1 , τ p∗ ) and {τk1 , τ k∗ } corresponds to the B-spline coefficients γk1 , γ k∗ ;
φ

φ

φ

τ 2 = (τ1 , . . . , τq−1 ) and τl corresponds to φl .
ˆ we need to show that ∀ > 0, ∃ a large enough C,
ˆ and φ,
To show the consistency of γ
so that:
inf {Q1 (γ, φ)} > Q1 (γ 0 , φ0 ) ≥ 1 −

P

(C.1)

||τ ||=C

where
p

pλ1 (||γ k∗ ||2 ) + n

Q1 (γ, φ) = g(γ, φ) + n
k=1

and g(γ, φ) = ||Y −

p

q−1

pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) + n
k=1

p
2
¯
k=0 (γk1 + B(φ)γ k∗ )Gk ||2

95

pλ3 (|φl |)
l=1

is the squared error loss. If (C.1) holds, we

can say with probability at least 1− , there exists a local minimum in the ball {(γ 0 , φ0 )+αn ∗
ˆ − (γ 0 , φ0 )|| = Op (αn ).
τ ; ||τ || ≤ C}. Hence, there exists a local minimizer such that ||(ˆ
γ , φ)
Let

1
1
{Q1 (γ, φ) − Q1 (γ 0 , φ0 )} = {Q1 (γ 0 + αn τ 1 , φ0 + αn τ 2 ) − Q1 (γ 0 , φ0 )}
K
K
1
=
g(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − g(γ 0 , φ0 )
K

Dn (τ ) =

p

pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 )

+n
k=1
p

0 + α τ |)I(||γ 0 + α τ || = 0) − p (|γ 0 |)I(||γ 0 || = 0)
pλ2 (|γk1
n k1
n k∗ 2
λ2 k1
k∗
k∗ 2

+n
k=1
q−1

φ

pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |)

+n
j=1

Since pλ1 (||γ 0k∗ ||2 )] = 0 for k = v + 1, . . . , p and pλ3 (|φ0j |) = 0 for j = s + 1, . . . , q − 1
and I(||γ 0k∗ ||2 = 0) = 0 for k = 1, . . . , v , we have

Dn (τ ) ≥

1
g(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − g(γ 0 , φ0 )
K
v

[pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 )]

+n

k=1
p
0 + α τ |) − p (|γ 0 |)
pλ2 (|γk1
n k1
λ2 k1

+n
k=v+1
s−1

+n

φ

[pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |)]
j=1

96

Then by Taylor series expansion at (γ 0 , φ0 ), we have

Dn (τ ) ≥

αn ∂g ∂g
1
( 0,
nα2 τ I(γ 0 , φ0 )τ T (1 + op (1))
)τ T −
0
K ∂γ ∂φ
2K n

n
+
K
+

+

n
K
n
K

v
k=1
p

γ0
αn pλ (||γ 0k∗ ||2 ) 0k∗ τ Tk∗ + αn2 pλ (||γ 0k∗ ||2 )τ k∗ τ Tk∗ (1 + op (1))
1
1
||γ ||
k∗ 2

0 |)sign(γ 0 )τ + α2 p (|γ 0 |)(τ )2 (1 + o (1))
αn pλ (|γk1
p
k1
n λ
k1
k1 k1
2

2

k=v+1
s−1

φ
φ
αn pλ (|φ0j |)sign(φ0j )τj + αn2 pλ (|φ0j |)(τj )2 (1 + op (1))
j=1

3

3

:= S1 + S2 + S3 + S4 + S5

where I(γ 0 , φ0 ) is the Fisher’s information matrix.
By standard arguments, S1 is of the order Op (1 + nr/(2r+1) αn )||τ ||, S2 is of the order
Op (1 + 2nr/(2r+1) αn )||τ ||2 and for large enought C, S2 dominates S1 uniformly in ||τ || = C.
Further, by Taylor expansion at γ 0 , we have

n
S3 ≤ αn an
K

v

γ 0k∗

||γ 0k∗ ||2
k=1

τ Tk∗

n
+ αn2 max{pλ (||γ 0k∗ ||2 )}
1
K
k

v

τ k∗ τ Tk∗
k=1

n √
n
≤ αn2 v + αn2 C max{pλ (||γ 0k∗ ||2 )}
1
K
K
k

Since maxk {pλ } → 0, S3 is dominated by S2 .
1
For S4 and S5 , we have
n
S4 ≤ αn an
K

p
k=v+1

n
0 |)}
τk1 + αn2 max{pλ (|γk1
2
K
k

n 2
n
0 |)}
≤
αn C + αn2 C 2 max{pλ (|γk1
2
K
K
k

97

p

(τk1 )2
k=v+1

and

n
S5 ≤ αn an
K

s
φ
τj
j=1

n
+ αn2 max{pλ (|φ0j |)}
3
K
j

s
φ
(τj )2
j=1

n 2
n
≤
αn C + αn2 C 2 max{pλ (|φ0j |)}
3
K
K
j

Similarly, we have S4 and S5 being dominated by S2 . Hence, by choosing a large enough
ˆ − (γ 0 , φ0 )|| = Op (αn ). This proves the consistency of the penalized least
C, we have ||(ˆ
γ , φ)
ˆ
square estimator (ˆ
γ , φ).

C.2

Proof of Theorem 2.3.2

Without loss of generality, we denote φ = (φnz , φz ), where φnz = (φ1 , . . . , φs−1 ) and
φz = (φs , . . . , φq−1 ). From λmax → 0, it is very easy to see that an = 0 for large n. Then
by Theorem 2.3.1, it is sufficient to show for φnz ,

||φj − φ0j ||2 = Op (n−r/(2r+1) ), j = 1, . . . , s − 1

and for φz , for some given small

= Cn−r/(2r+1) , when n → ∞, with probability

approaching 1, for j = s, . . . , q − 1, we have

∂Q1 (γ, φ)
∂Q1 (γ, φ)
> 0 when 0 < φj < and
< 0 when − < φj < 0
∂φj
∂φj
Since we have
∂Q1 (γ, φ)
∂g(γ, φ)
=
+ npλ3 (|φj |)sign(φj ).
∂φj
∂φj

98

Do Taylor expansion at φ0 for

∂g(γ ,φ)
∂φj

only, we have

q−1 2
∂Q1 (γ, φ)
∂g(γ, φ0 )
∂ g(γ, φ0 )
=
+
(φl − φ0l )
∂φj
∂φj
∂φj ∂φl
l=1
q−1 q−1

+
l=1 k=1

∂ 3 g(γ, φ∗ )
(φ − φ0l )(φk − φ0k ) + npλ3 (|φj |)sign(φj )
∂φj ∂φl ∂φk l

where φ∗ lies between φ0 and φ. After some calculation, we have

1
1
∂Q1 (γ, φ)
= nλ3 { pλ (|φj |)sign(φj ) + Op ( n−r/(2r+1) )}
3
∂φj
λ3
λ3
Since limn→∞ lim inf φj →0 λ1 pλ (|φj |) > 0 and λ1 n−r/(2r+1) → 0, we can conclude that the
3 3
3
∂Q1 (γ ,φ)
sign of
is completely determined by sign of φj . Hence, we have proven βˆj = 0 for
∂φ
j

j = s + 1, . . . , q
For (ii) & (iii), applying similar arguments as in (i), we immediately have, with probability
ˆ k∗ = 0 for k = v + 1, . . . , p and γˆk1 = 0 for k = c + 1, . . . , p. Then
approaching 1, γ
ˆ γ , we have proven m
¯
by supu B(u) = O(1) and m
ˆ k (·) = γˆk0 + B(X
β)ˆ
ˆ k (·) = ck for
k∗
k = v + 1, . . . , c where ck is some constant and m
ˆ k (·) = 0 for k = c + 1, . . . , p.

C.3

Proof of Theorem 3.3.1

We assume the following regularity conditions:
(A1) The density function fU (β ) (·) of a random variable U (β) = Xβ is bounded away
from 0 on Ω = {Xβ, X ∈ X }, where X is the compact support of X. And there exists a
constant c1 , such that fU (β ) (·) ∈ Lip([a, b], c1 ).
(A2) For k = 0, 1, . . . , p, the non-parametric function mk (·) ∈ C (r) and r ≥ 2.
99

(A3) E(||G||6 ) < ∞.
(A4) The matrix M (u) = E(GGT |Xβ = u) is positive definite, and each element of
M (u) ∈ Lip([a, b], c4 )
(A5) Let bn = maxk,l {pλ (||γ 0k∗ ||2 ), pλ (|γ 0k1 |), pλ (|βd0 |), γ 0k∗ = 0, γ 0k1 = 0, βl0 = 0} for
1
2
3
k = 1, . . . , p, d = 2, . . . , q, then bn → 0 as n → 0.
(A6)
lim inf

lim inf

n→∞ ||γ || →0+
k∗ 2

1
|p (||γ k∗ ||2 )| > 0 for k = v + 1, . . . , p
λ1 λ1

1
lim inf lim inf
|p (|γ k1 |)| > 0 for k = c + 1, . . . , p
n→∞ |γ |→0+ λ2 λ2
k1
1
|p (|βd |)| > 0 for d = s + 1, . . . , q
n→∞ |β |→0+ λ3 λ3
d

lim inf lim inf

(A7) Let c1 , . . . , cK be the interior knots of [a, b], a = inf{u : u ∈ Ω}, b = sup{u : u ∈ Ω}
and c0 = 1, cK+1 = b, hi = ci − ci−1 , h = max{hi }. Then there exists a constant C7 such
h
< C7 and max{hi+1 hi } = o(K −1 ).
that min{h
}
i

Let φ = (β2 , . . . , βq )T , and we have β = ( 1 − ||φ||22 , φT )T , hence the restriction ||β|| =
ˆ it is enough to show
1 and β1 > 0 is equivalent to ||φ||2 < 1. To show the consistency of β,
ˆ Let αn = n−r/(2r+1) + an , γ = γ 0 + αn τ 1 , φ = φ0 + αn τ 2 and
the consistency of φ.
τ = (τ 1 , τ 2 ), where τ1 = (τ01 , τ 0∗ , . . . , τp1 , τ p∗ ) and {τk1 , τ k∗ } corresponds to the B-spline
φ

φ

φ

coefficients γk1 , γ k∗ ; τ 2 = (τ1 , . . . , τq−1 ) and τl corresponds to φl .
ˆ we need to show ∀ > 0, ∃ a large enough C, so that:
ˆ and φ,
To show consistency of γ

P

sup {M (γ, φ)} < M (γ 0 , φ0 ) ≥ 1 −
||τ ||=C

100

(C.2)

where
p

M (γ, φ) = l(γ, φ) − n

p

pλ1 (||γ k∗ ||2 ) − n
k=1

q−1

pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) − n
k=1

pλ3 (|φl |)
l=1

and l(γ, φ) is the log-likelihood function defined above. If (C.2) holds, we can see, with
probability at least 1 − , there exists a local maximum in the ball {(γ 0 , φ0 ) + αn ∗ τ ; ||τ || ≤
ˆ − (γ 0 , φ0 )|| = Op (αn ).
C}. Hence, there exists a local maximizer such that ||(ˆ
γ , φ)
Let

1
1
{M (γ, φ) − M (γ 0 , φ0 )} = {M (γ 0 + αn τ 1 , φ0 + αn τ 2 ) − M (γ 0 , φ0 )}
K
K
1
=
l(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − l(γ 0 , φ0 )
K

Dn (τ ) =

p

pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 )

−n
k=1
p

0 + α τ |)I(||γ 0 + α τ || = 0) − p (|γ 0 |)I(||γ 0 || = 0)
pλ2 (|γk1
n k1
n k∗ 2
λ2 k1
k∗
k∗ 2

−n
k=1
q−1

φ

pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |)

−n
j=1

Since pλ1 (||γ 0k∗ ||2 )] = 0 for k = v + 1, . . . , p and pλ3 (|φ0j |) = 0 for j = s + 1, . . . , q − 1
and I(||γ 0k∗ ||2 = 0) = 0 for k = 1, . . . , v , we have

101

Dn (τ ) ≤

1
l(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − l(γ 0 , φ0 )
K
v

[pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 )]

−n

k=1
p
0 + α τ |) − p (|γ 0 |)
pλ2 (|γk1
n k1
λ2 k1

−n
k=v+1
s−1

φ

[pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |)]

−n

j=1

Then by Taylor expansion at (γ 0 , φ0 ), we have

Dn (τ ) ≤

αn ∂l ∂l
T − 1 nα2 τ I(γ 0 , φ0 )τ T (1 + o (1))
)τ
( 0,
p
K ∂γ ∂φ0
2K n
n
−
K
−

−

n
K
n
K

v
k=1
p

αn pλ (||γ 0k∗ ||2 )
1

γ 0k∗

τ Tk∗
0
||γ k∗ ||2

+ αn2 pλ (||γ 0k∗ ||2 )τ k∗ τ Tk∗ (1 + op (1))
1

0 |)sign(γ 0 )τ + α2 p (|γ 0 |)(τ )2 (1 + o (1))
αn pλ (|γk1
p
k1
n λ
k1 k1
k1
k=v+1
s−1
j=1

2

2

φ
φ
αn pλ (|φ0j |)sign(φ0j )τj + αn2 pλ (|φ0j |)(τj )2 (1 + op (1))
3
3

:= S1 − S2 − S3 − S4 − S5

where I(γ 0 , φ0 ) is the Fisher’s information matrix. By standard arguments of likelihood
theory, S1 is of the order Op (1 + nr/(2r+1) αn )||τ ||, S2 is of the order Op (1 +
2nr/(2r+1) αn )||τ ||2 and for large enough C, S2 dominates S1 uniformly in ||τ || = C .

102

Further, we have

n
S3 ≤ αn an
K
≤

v
k=1

γ 0k∗
||γ 0k∗ ||2

τ Tk∗

n
+ αn2 max{pλ (||γ 0k∗ ||2 )}
1
K
k

v

τ k∗ τ Tk∗
k=1

n 2√
n
αn v||τ || + αn2 ||τ ||2 max{pλ (||γ 0k∗ ||2 )}
1
K
K
k

Since maxk {pλ } → 0, we have S3 dominated by S2 .
1

For S4 and S5 , we have
n
S4 ≤ αn an
K

p
k=v+1

n
0 |)}
τk1 + αn2 max{pλ (|γk1
2
K
k

p

(τk1 )2
k=v+1

n
n 2
0 |)}
αn ||τ || + αn2 ||τ ||2 max{pλ (|γk1
≤
2
K
K
k

and
s−1

s−1

n
n
φ
φ
S5 ≤ αn an
τl + αn2 max{pλ (|φ0l |)}
(τl )2
3
K
K
l
l=1
l=1
n 2
n 2
2
0
α ||τ || + αn ||τ || max{pλ (|φl |)}
≤
3
K n
K
l

Similarly, we have S4 and S5 dominated by S2 . Hence, by choosing a large enough C,
ˆ − (γ 0 , φ0 )|| = Op (αn ). Hence the consistency of penalized least squares
we have ||(ˆ
γ , φ)
ˆ is proven.
estimator (ˆ
γ , φ)

C.4

Proof of Theorem 3.3.2

For ease of notation, we denote φ = (φnz , φz ), where φnz = (φ1 , . . . , φs−1 ) and φz =
(φs , . . . , φq−1 ). From λmax → 0, it is very easy to see an = 0 for large n. Then by Theorem

103

3.3.1, it is sufficient to show for φnz , it satisfies,

||φl − φ0l ||2 = Op (n−r/(2r+1) ), l = 1, . . . , s − 1

and for φz and for some given small

= Cn−r/(2r+1) , when n → ∞, with probability

approaching 1, for l = s, . . . , q − 1, it satisfies,

∂M (γ, φ)
∂M (γ, φ)
< 0 when 0 < φl < and
> 0 when − < φl < 0
∂φl
∂φl
Since we have
∂l(γ, φ)
∂M (γ, φ)
=
− npλ (|φl |)sign(φl )
3
∂φl
∂φl
Do Taylor expansion at φ0 for

∂l(γ ,φ)
∂φl

only, we have

q−1

∂M (γ, φ)
∂l(γ, φ0 )
∂ 2 l(γ, φ0 )
=
+
(φk − φ0k )
∂φl
∂φl
∂φl ∂φk
k=1
q−1 q−1

+
k=1 j=1

∂ 3 l(γ, φ∗ )
(φ − φ0k )(φj − φ0j ) − npλ (|φj |)sign(φj )
3
∂φl ∂φk ∂φj k

where φ∗ lies between φ0 and φ. After some calculation, we have

∂M (γ, φ)
1
1
= nλ3 − pλ (|φl |)sign(φl ) + Op ( n−r/(2r+1) )
∂φl
λ3 3
λ3
Since limn→∞ lim inf φj →0 λ1 pλ (|φj |) > 0 and λ1 n−r/(2r+1) → 0, we can conclude that the
3 3
3
∂M (γ ,φ)
is completely determined by the sign of φj . Hence, we prove βˆj = 0 for
sign of
∂φ
j

j = s + 1, . . . , q
For (ii) & (iii), applying similar arguments as in (i), we immediately have, with probability
104

ˆ k∗ = 0 for k = v + 1, . . . , p and γˆk1 = 0 for k = c + 1, . . . , p. Then by
approaching 1, γ
ˆ γ , we prove m
¯
supu B(u) = O(1) and m
ˆ k (·) = γˆk0 + B(X
β)ˆ
ˆ k (·) = ck for k = v + 1, . . . , c,
k∗
where ck is some constant and m
ˆ k (·) = 0 for k = c + 1, . . . , p.

105

BIBLIOGRAPHY

106

BIBLIOGRAPHY

[1] Antoniadis, A., & Fan, J. (2001). Regularization of wavelet approximations. Journal of
the American Statistical Association, 96(455), 939-967.
[2] Ben Sherwood and Adam Maidman (2016). rqPen: Penalized Quantile Regression. R
package version 1.5.1. https://CRAN.R-project.org/package=rqPen
[3] Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized
regression, with applications to biological feature selection. The annals of applied
statistics, 5(1), 232.
[4] Cai, Z., Fan, J., & Li, R. (2000). Efficient estimation and inferences for varying-coefficient
models. Journal of the American Statistical Association, 95(451), 888-902.
[5] Carpenter, D. O., Arcaro, K., & Spink, D. C. (2002). Understanding the human health
effects of chemical mixtures. Environmental Health Perspectives, 110(Suppl 1), 25.
[6] Cheverud, J. M. (2001). A simple correction for multiple comparisons in interval mapping
genome scans. Heredity, 87(1), 52-58.
[7] Chiang, C. T., Rice, J. A., & Wu, C. O. (2001). Smoothing spline estimation for
varying coefficient models with repeatedly measured dependent variables. Journal of the
American Statistical Association, 96(454), 605-619.
[8] Colditz, G. A., & Hankinson, S. E. (2005). The Nurses’ Health Study: lifestyle and health
among women. Nature Reviews Cancer, 5(5), 388-396.
[9] Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The
Annals of statistics, 32(2), 407-499.
[10] Fan, J., & Zhang, W. (1999). Statistical estimation in varying coefficient models. Annals
of Statistics, 1491-1518.
[11] Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American statistical Association, 96(456), 1348-1360.
[12] Feng, S., & Xue, L. (2013). Variable selection for single-index varying-coefficient model.
Frontiers of Mathematics in China, 8(3), 541-565.
[13] Frank, L. E., & Friedman, J. H. (1993). A statistical view of some chemometrics
regression tools. Technometrics, 35(2), 109-135.
107

[14] Friedman, J., Hastie, T., Hfling, H., & Tibshirani, R. (2007). Pathwise coordinate
optimization. The Annals of Applied Statistics, 1(2), 302-332.
[15] Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of statistical software, 33(1), 1.
[16] Grant, S. F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz,
J., ... & Styrkarsdottir, U. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene
confers risk of type 2 diabetes. Nature genetics, 38(3), 320-323.
[17] Hastie, T., & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal
Statistical Society. Series B (Methodological), 757-796.
[18] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1), 55-67.
[19] Horikoshi, M., Beaumont, R. N., Day, F. R., Warrington, N. M., Kooijman, M. N.,
Fernandez-Tajes, J., ... & Bradfield, J. P. (2016). Genome-wide associations for birth
weight and correlations with adult disease. Nature, 538(7624), 248-252.
[20] Hoover, D. R., Rice, J. A., Wu, C. O., & Yang, L. P. (1998). Nonparametric smoothing
estimates of time-varying coefficient models with longitudinal data. Biometrika, 809-822.
[21] Huang, J. Z., Wu, C. O., & Zhou, L. (2004). Polynomial spline estimation and inference
for varying coefficient models with longitudinal data. Statistica Sinica, 763-788.
[22] Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd.
[23] Lee, E. R., Noh, H., & Park, B. U. (2014). Model selection via Bayesian information
criterion for quantile regression models. Journal of the American Statistical Association,
109(505), 216-229.
[24] Liu, X., Cui, Y., & Li, R. (2016). Partial linear varying multi-index coefficient model
for integrative gene-environment interactions. Statistica Sinica, 26, 1037-1060.
[25] Ma, S., Yang, L., Romero, R., & Cui, Y. (2011). Varying coefficient model for
geneenvironment interaction: a non-linear look. Bioinformatics, 27(15), 2119-2126.
[26] Ma, S., & Song, P. X. K. (2015). Varying index coefficient models. Journal of the
American Statistical Association, 110(509), 341-356.
[27] Ma, S., Xu, S. (2015). Semiparametric nonlinear regression for detecting gene and
environment interactions. Journal of Statistical Planning and Inference, 156, 31-47.

108

[28] Nyholt, D. R. (2004). A simple correction for multiple testing for single-nucleotide
polymorphisms in linkage disequilibrium with each other. The American Journal of
Human Genetics, 74(4), 765-769.
[29] Ottman, R. (1996). Geneenvironment interaction:
Preventive medicine, 25(6), 764.

definitions and study designs.

[30] Peng, B., & Wang, L. (2015). An iterative coordinate descent algorithm for
high-dimensional nonconvex penalized quantile regression. Journal of Computational and
Graphical Statistics, 24(3), 676-694.
[31] Rimm, E. B., Giovannucci, E. L., Willett, W. C., Colditz, G. A., Ascherio, A., Rosner,
B., & Stampfer, M. J. (1991). Prospective study of alcohol consumption and risk of
coronary disease in men. The Lancet, 338(8765), 464-468.
[32] Sale, M. M., Smith, S. G., Mychaleckyj, J. C., Keene, K. L., Langefeld, C. D., Leak, T.
S., ... & Freedman, B. I. (2007). Variants of the transcription factor 7-like 2 (TCF7L2)
gene are associated with type 2 diabetes in an African-American population enriched for
nephropathy. Diabetes, 56(10), 2638-2642.
[33] Schadt, E. E., Molony, C., Chudin, E., Hao, K., Yang, X., Lum, P. Y., ... & Zhu, J.
(2008). Mapping the genetic architecture of gene expression in human liver. PLoS Biol,
6(5), e107. Chicago
[34] Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press.
[35] Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 6(2),
461-464.
[36] Sexton, K., & Hattis, D. (2007). Assessing cumulative health risks from exposure to
environmental mixturesthree fundamental questions. Environmental Health Perspectives,
825-832.
[37] Tang, Y., Wang, H. J., Zhu, Z., & Song, X. (2012). A unified variable selection approach
for varying coefficient models. Statistica Sinica, 601-628.
[38] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), 267-288.
[39] Yang, X., Zhang, B., Molony, C., Chudin, E., Hao, K., Zhu, J., ... & Guengerich, F. P.
(2010). Systematic genetic and genomic analysis of cytochrome P450 enzyme activities
in human liver. Genome research, 20(8), 1020-1036.
[40] Yu, Y., & Ruppert, D. (2002). Penalized spline estimation for partially linear
single-index models. Journal of the American Statistical Association, 97(460), 1042-1054.
109

[41] Xia, Y., & Li, W. K. (1999). On the estimation and testing of functional-coefficient
linear models. Statistica Sinica, 735-757.
[42] Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
68(1), 49-67.
[43] Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty.
The Annals of statistics, 38(2), 894-942.
[44] Zhou, X., & You, J. (2004). Wavelet estimation in varying-coefficient partially linear
regression models. Statistics & probability letters, 68(1), 91-104.
[45] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American
statistical association, 101(476), 1418-1429.
[46] Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood
models. Annals of statistics, 36(4), 1509.
[47] Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis.
Journal of computational and graphical statistics, 15(2), 265-286.

110