FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC
GENE-ENVIRONMENT INTERACTIONS WITH LONGITUDINAL DATA
By
Jingyi Zhang

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Statistics—Doctor of Philosophy
2017

ABSTRACT
FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC
GENE-ENVIRONMENT INTERACTIONS WITH LONGITUDINAL DATA
By
Jingyi Zhang
Rooted in genetics, human complex diseases are largely influenced by environmental factors. Existing literature has shown the power of integrative gene-environment interaction analysis by considering the joint effect of environmental mixtures on disease risk. Motivated by that, we propose a
functional varying index coefficient model for longitudinal measurements of a phenotypic trait and
multiple environmental variables, and assess how the genetic effects on a longitudinal disease trait
are nonlinearly modified by a mixture of environmental influences. We derive an estimation procedure for the nonparametric functional varying index coefficients under the quadratic inference
functions and penalized splines framework. Theoretical results such as estimation consistency and
asymptotic normality of the estimates are established. In addition, we propose a hypothesis testing
procedure to assess the significance of the nonparametric index coefficient function. We evaluate
the performance of our estimation and testing procedure through Monte Carlo simulation studies.
The proposed method is illustrated by applying to a real data set from a pain sensitivity study
in which SNP effects are nonlinearly modulated by the combination of dosage levels and other
environmental variables to affect blood pressure and heart rate of patients.
In order to deal with discrete measurements for risk of disease, we further extend our proposed
FVICM to a generalized varying index coefficient model (gFVICM) to binary longitudinal traits.
We apply penalized splines to approximate the nonparametric varying index coefficients and develop an estimation procedure based on the quadratic inference functions. The asymptotic normality
established in the theoretical results enables us to develop a model selection criteria and construct
a test statistic based on the quadratic inference function. In hypothesis test, we investigate the
linearity of G×E interactions using the proposed testing procedure. The utility of the method is

further demonstrated through a pain sensitivity case study in which SNP effects are nonlinearly
modulated by the combination of environmental mixtures to affect high blood pressure.
Genetic pleiotropy refers to the situation in which a single gene influences multiple traits and so
it is considered as a major factor that underlies genetic correlation among traits. For some complex
diseases, there are multiple phenotypes that can used to diagnose or to quantify the risk of diseases
and usually they have shared genetic determinations. In multivariate longitudinal data, multiple
response variables are jointly measured over time from the same individual. It is appropriate
to take into account the correlation between multivariate longitudinal responses. Therefore, we
propose the joint partially linear varying coefficient models and the testing framework to jointly
test the association of genetic factors with bivariate phenotypic values adjusting for environmental
factors. We extended the quadratic inference functions to deal with the longitudinal correlations
and used penalized splines for the approximation of nonparametric coefficients. The proposed
method is illustrated by applying to a real data set from a pain sensitivity study, in which systolic
blood pressure (SBP) and diastolic blood pressure (DBP) were correlated longitudinal quantified
phenotypes of SNP effects.

ACKNOWLEDGEMENTS

I would like to express the deepest appreciation to my advisor and committee chair Professor
Yuehua Cui for his excellent guidance and continuous support in my Ph.D. research. Without his
guidance and persistent help this dissertation would not have been possible. I also would like to
thank my committee members, Professor Ping-Shou Zhong, Professor Lyudmila Sakhanenko and
Professor Qing Lu, for their insightful comments and suggestions in my research.

iv

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
CHAPTER 1

INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CHAPTER 2
2.1
2.2
2.3
2.4

2.5

2.6

2.7
2.8

FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS . . . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Quadratic inference function for FVICM with longitudinal data . . . . . . . . .
Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.1 Algorithm for estimation . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.2 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.3 Choice of the basis for the inverse of the correlation matrix . . . . . . .
2.4.4 Choice of the tuning parameter . . . . . . . . . . . . . . . . . . . . . .
Hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.5.1 Linear mixed model representation for FVICM model . . . . . . . . . .
2.5.2 LRT and pseudo-LRT in LMM . . . . . . . . . . . . . . . . . . . . . .
2.5.2.1 LRT for one variance component . . . . . . . . . . . . . . .
2.5.2.2 Pseudo-LRT for multiple variance components . . . . . . . .
2.5.3 Pseudo-LRT in FVICM model . . . . . . . . . . . . . . . . . . . . . .
Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.6.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.6.2 Performance of estimation . . . . . . . . . . . . . . . . . . . . . . . .
2.6.3 Performance of hypothesis tests . . . . . . . . . . . . . . . . . . . . .
Real data application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

6
6
8
11
12
12
12
13
14
14
14
15
15
17
18
18
18
19
23
25
28

GENERALIZED FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The model and estimation methods . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 The model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.2 Quadratic inference function for gFVICM . . . . . . . . . . . . . . . . . .
3.2.3 Theoretical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Model selection and hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.1 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.2 Nonparametric goodness-of-fit test based on QIF . . . . . . . . . . . . . .
3.3.3 Test for linearity of interaction function in gFVICM . . . . . . . . . . . . .
Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.1 Performance of estimation . . . . . . . . . . . . . . . . . . . . . . . . . .

30
30
32
32
33
35
36
36
37
38
39
40

CHAPTER 3
3.1
3.2

3.3

3.4

v

3.5
3.6

3.4.2 Performance of hypothesis tests . . . . . . . . . . . . . . . . . . . . . . . 45
Real data application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

CHAPTER 4

4.1
4.2

4.3

4.4
4.5

DETECTING GENETIC ASSOCIATIONS WITH MULTIVARIATE PARTIALLY LINEAR VARYING-COEFFICIENTS MODELS FOR MULTIPLE LONGITUDINAL TRAITS . . . . . . . . . . . . . . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Joint models and statistical methods . . . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Joint multivariate models . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.2 Objective function based on QIF . . . . . . . . . . . . . . . . . . . . . .
4.2.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.4 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.5 Nonparametric goodness-of-fit test . . . . . . . . . . . . . . . . . . . . .
4.2.6 Two-step hypothesis testing procedure . . . . . . . . . . . . . . . . . . .
4.2.6.1 Step 1: Joint test . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.6.2 Step 2: Marginal tests . . . . . . . . . . . . . . . . . . . . . .
Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Simulation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.2 Performance of estimation . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.3 Performance of hypothesis tests . . . . . . . . . . . . . . . . . . . . . .
4.3.3.1 Performance for joint test . . . . . . . . . . . . . . . . . . . .
4.3.3.2 Performance for marginal tests . . . . . . . . . . . . . . . . . .
Real data application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

52
52
54
54
55
57
59
59
60
60
61
62
62
63
65
65
66
68
71

CHAPTER 5 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . 73
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

vi

LIST OF TABLES

Table 2.1

Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and
correlation ρ=0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Table 2.2

Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and
correlation ρ=0.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Table 2.3

List of SNPs with MAF, allele, p-values under different hypothesis testing and
MSE for SBP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Table 2.4

List of SNPs with MAF, allele, p-values under different hypothesis testing and
MSE for DBP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Table 2.5

List of SNPs with MAF, allele, p-values under different hypothesis testing and
MSE for HR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Table 3.1

Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size
N = 200, 500, T = 10 and correlation ρ=0.5. . . . . . . . . . . . . . . . . . . . 40

Table 3.2

Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size
N = 200, 500, T = 20 and correlation ρ=0.5. . . . . . . . . . . . . . . . . . . . 41

Table 3.3

List of SNPs with MAF, allele, p-values under different hypothesis testing and
values of objective function QN . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Table 3.4

Estimated odds for different genotypes at each dosage levels. . . . . . . . . . . . 50

Table 4.1

Estimation results for parameters α1 and α2 with sample size N = 200, 500. . . . 63

Table 4.2

List of SNPs with MAF, allele, p-values under the joint and marginal testing
for SBP and DBP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Table 4.3

List of SNPs with MAF, allele, estimation of coefficients, p-values of significance for coefficients corresponding to SBP and DBP, respectively. . . . . . . . 69

vii

LIST OF FIGURES

Figure 2.1

The estimation of function m0 (·) under different MAFs when N=200, 500
and ρ=0.5. The estimated and true functions are denoted by the solid and
dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Figure 2.2

The estimation of function m1 (·) under different MAFs when N=200, 500
and ρ=0.5. The estimated and true functions are denoted by the solid and
dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Figure 2.3

The estimation of function m0 (·) under different MAFs when N=200, 500
and ρ=0.8. The estimated and true functions are denoted by the solid and
dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Figure 2.4

The estimation of function m1 (·) under different MAFs when N=200, 500
and ρ=0.8 The estimated and true functions are denoted by the solid and
dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Figure 2.5

The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=200, 500 and ρ=0.5. . . . . . . . . . . . 24

Figure 2.6

The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=200, 500 and ρ=0.8. . . . . . . . . . . . 24

Figure 2.7

Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for
SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is SBP. . . . . . . . . . . 27

Figure 2.8

Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for
SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is DBP. . . . . . . . . . . 28

Figure 2.9

Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for
SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is HR. . . . . . . . . . . . 29

Figure 3.1

The estimation of function m0 (·) when N=200, 500 and T =10. The estimated
and true functions are denoted by the solid and dashed lines respectively. The
95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 42

viii

Figure 3.2

The estimation of function m0 (·) when N=200, 500 and T =20. The estimated
and true functions are denoted by the solid and dashed lines respectively. The
95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 43

Figure 3.3

The estimation of function m1 (·) when N=200, 500 and T =10. The estimated
and true functions are denoted by the solid and dashed lines respectively. The
95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 44

Figure 3.4

The estimation of function m1 (·) when N=200, 500 and T =20. The estimated
and true functions are denoted by the solid and dashed lines respectively. The
95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 45

Figure 3.5

The empirical size and power of testing the linearity of nonparametric function m1 when N=200, 500 and T =10, 20. . . . . . . . . . . . . . . . . . . . . . 46

Figure 3.6

The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=500 and T =10. . . . . . . . . . . . . . . 47

Figure 3.7

Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for
SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed lines. . . . . . . . . . . . . . . . . . . 49

Figure 4.1

The estimation of nonparametric functions β01 (·) and β11 (·) when N=200,
500. The estimated and true functions are denoted by the solid and dashed
lines respectively. The 95% confidence bands are denoted by the dotted-dash line. 64

Figure 4.2

The estimation of nonparametric functions β02 (·) and β12 (·) when N=200,
500. The estimated and true functions are denoted by the solid and dashed
lines respectively. The 95% confidence bands are denoted by the dotted-dash line. 65

Figure 4.3

The power plot for the joint test under different sample sizes N=200, 500
when T =10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

Figure 4.4

The power plots for the marginal test for N=200, 500 and T =10. . . . . . . . . 67

Figure 4.5

Plot of the estimate (solid curve) of the nonparametric function β1SBP (·) for
SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed line. . . . . . . . . . . . . . . . . . . . 70

Figure 4.6

Plot of the estimate (solid curve) of the nonparametric function β1DBP (·) for
SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed line. . . . . . . . . . . . . . . . . . . . 71

ix

CHAPTER 1
INTRODUCTION

Gene-environment (G×E) interaction is defined as how genotypes influence phenotypes differently
under different environmental conditions (Falconer, 1952). An increasing number of studies have
confirmed the role of G×E interaction in many human diseases. One classic example is Phenylketonuria (PKU). PKU is caused by a defect in the gene coding for a particular enzyme which is
needed to break down phenylalanine. Newborns found to have high levels of phenylalanine in
their blood can be put on a special, phenylalanine-free diet to avoid the severe effects of PKU
(Baker, 2004). This example confirms the role of gene-environment interaction by showing that a
change in environment can affect the phenotype of a particular trait.
In genetic epidemiology, G×E interactions are useful for understanding the risk of some complex human diseases. Famous studies such as Parkinson disease (Ross and Smith (2007)) and type
2 diabetes (Zimmet, Alberti, and Shaw (2001)) both indicate the importance of G×E interaction in
complex human diseases. However, the underlying mechanism of G×E interaction is still poorly
understood due to the lack of powerful statistical methods. The traditional way to investigate G×E
interaction is based on a single environment exposure model. Parametric models such as additive
linear models, use the products of two variables to denote the interaction effects. However, this
product may not capture the true interaction effect of gene and environment, since it could not be
able to capture the possible nonlinear G×E interactions.
In order to assess possible nonlinear G×E interactions, some nonparametric and semiparametric models have been developed, such as varying coefficient models (VCM) proposed by Hastie
and Tibishirani (1993). In a VCM model, the coefficients of covariates are allowed to change with
some other variables through smooth functions, so nonlinear interactions can be assessed. A VCM
has the form
L

Y=

∑ ml (X)Zl + ε,

(1.1)

l=1

where Y is the response variable, (X, ZT )T is a vector of predictors consisting of a scalar X and a
1

L-dimensional vector Z = (Z1 , Z2 , ..., ZL )T . ml (·), l = 1, ..., L, are unknown nonparametric smooth
functions. In particular, dealing with G×E interaction problems, one can replace the predicts Z to
be genetic variants, for example, single nucleotide polymorphisms (SNPs).
Epidemiological evidences suggested that a disease risk can be modified by simultaneous exposure to multiple environmental agents with effect larger than simple addition of individual factor
acting along. The specification of VCM in (1.1) may be limited in dealing with simultaneous
exposure to multiple environmental factors. It will result in difficulties in the estimation of the
coefficient function ml (X) when variable X = (X1 , ..., X p )T in Model (1.1) is multidimensional.
To overcome such challenge, Ma and Song (2015) proposed the varying index coefficient model
(VICM) with a form
L

Y=

∑ ml (βlT X)Zl + ε,

(1.2)

l=1

where β l = (βl1 , ..., βl p )T are the coefficients for covariate vector X with βlk be the loading weight
for the k-th covariate of X, i.e. Xk associated with Zl . The single index β T X is actually a linear
combination of several environmental effects. The VICM model is able to pursue the interaction
between a set of environmental factors as a whole and genetic variables on the disease risk, if
the covariate Z is specified to be some genetic effect, such as a SNP variable. Liu et al. (2016)
extended the model to a partial linear varying multi-index coefficient model (PLVMICM):
L

Y = m0 (β0T X) + α T0 Z + ∑ {ml (βlT X)Gl + α Tl ZGl } + ε,

(1.3)

l=1

where Gl , l = 1, . . . , L are genetic variables of interest, ml (·), l = 0, 1, ..., L are unknown index
functions, α 0 , ..., α L and β 0 , ..., β L are parametric parameters. The main genetic effect for each
Gl is captured by the function ml (βlT X) when the function is approximated by some nonparametric techniques such as B-spline approximation. Model (1.3) is an extension of Model (1.2) by
considering partial linear covariates Z with application in capturing nonlinear G×E interactions. However, the above mentioned medoels can not be used directly for longitudinal data because
of the assumption of independence among observations. Little work has been done to deal with

2

nonlinear G×E interaction in longitudinal studies. This motivates us to extend the varying index
coefficient model to longitudinal traits.
Longitudinal studies play an important role in epidemiology, biological and clinical research.
Longitudinal studies are used to characterize normal growth and aging, to assess the effect of risk
factors on human health, and to evaluate the effectiveness of treatments. Some researches, such as
Sitlani et al. (2015), Furlotte et al. (2015) and Xu et al. (2014) demonstrate the longitudinal design
is more powerful in detecting genetic associations than cross-sectional designs.
The traditional way to analyze longitudinal data is using the regression models (Belle et al.,
2004). However, standard regression methods assume that all observations are independent. The
assumption of independence would result in invalid inferences. One approach to deal with the
issue is using a complete model which specify the correlation structure among observations for
each subject. In linear mixed effect models, we can make specific assumptions for the covariance
structure in observations. Based on the parametric assumptions for the covariance components,
we can apply the maximum likelihood methods to estimate the regression parameters. Another
regression approach for inference with longitudinal data is known as generalized estimating equation (GEE) approach proposed by Liang and Zeger in 1986. The GEE method is very popular in
recent decades and has been widely used in longitudinal data analysis. However, the application
of the GEE method has several disadvantages. One of those disadvantages is that GEE may fail to
produce consistent estimators if the nuisance correlation parameters are not consistently estimated
(Crowder 1986, 1995). In addition, model selection and hypothesis testing can be complicated
since there is no objective function in the estimation procedure of the GEE approach.
The quadratic inference function (QIF) approach proposed by Qu et al. (2000) is one of the
improvements of the GEE method. The benefits of using QIF approach in longitudinal analysis
have been discussed in some researches, such as Qu et al. (2000), Qu and Li (2006) and Song et al.
(2009). The QIF method avoids estimating the nuisance correlation parameters by using a linear
combination of several basis matrices to approximate the inverse of correlation matrix. When the
working correlation structure is correctly specified, both the QIF and GEE are equally efficient.

3

However, when the working correlation structure is misspecified, the QIF is more efficient than the
GEE. In addition, the QIF has an asymptotic χ 2 distribution, based on which we can implement
the model selection criteria like BIC, and construct testing statistic. These advantages of using QIF
in longitudinal analysis motivates us to extend the QIF method to the varying coefficient model in
detecting G×E interactions on risk of diseases.
We organise the rest of this dissertation in the following way: In Chapter 2, in order to capture
the dynamic nonlinear G×E interaction with the combined effect of environmental factors for
longitudinal data, we propose a functional varying index coefficient model (FVICM) for correlated
responses, i.e.,
β T0 Xi j ) + m1 (β
β T1 Xi j )Gi + εi j ,
Yi j = m0 (β

(1.4)

where Yi j is the response variable which measures the risk of certain disease on the ith subject at
the jth time point, where i = 1, · · · , N, j = 1, · · · , ni ; Xi j is a p-dimensional vector of environmental
variables, which can be either time-dependent or time invariant; Gi denotes the genetic variable;
εi j is an error term with mean 0 and some correlation structure; m0 (·) and m1 (·) are unknown
functions; β 0 and β 1 are p-dimensional vectors of index loading coefficients. Compared with
Model (1.2) and (1.3), Model (1.4) assumes correlations among observations from the same subject, which can be applied to capture the longitudinal correlation among different time points for
a subject. We use penalized splines to approximate the nonparametric functions in the model and
then develop an estimation procedure based on QIF method. In addition, we are interested to see
whether the interaction function m1 (·) is significantly nonlinear or not. This is a natural concern
in our model since if a linear interaction function is sufficient, a varying coefficient model would
not be necessary. We develop the testing procedure by representing our model in a linear mixed
model form and then apply the pseudo-likelihood ratio test. On the other hand, discrete longitudinal traits, for example, the binary traits are very common in clinical researches. To deal with
binary responses, in Chapter 3, we extend Model (1.4) to a generalized functional varying-index

4

coefficient model (gFVICM) with a form
β T0 Xi j ) + m1 (β
β T1 Xi j )Gi j ,
g E(Yi j |Xi j , Gi j ) = m0 (β

(1.5)

where g(·) is the logit link function, m0 (·) and m1 (·) are unknown nonparametric functions, β 0 and
β 1 are p-dimensional vectors of index coefficients. The QIF method is modified to accommodate
binary response. To test the linearity of interaction function m1 (·), we develop a hypothesis testing
procedure which is built on the asymptotical property of the QIF objective function. In Chapter 4,
we consider G×E interactions when multiple longitudinal traits are measured to improve the power
of association test, especially to identify any pleiotropic effect (i.e., one gene can affect multiple
traits). For this purpose, we propose a multivariate partially linear varying coefficients model
and derive a testing framework to jointly test the association of genetic factors with multivariate
phenotypic values adjusting for environmental factors. The joint models are written as
Yli j = Yli (ti j ) = β0l {X(ti j )} + β1l {X(ti j )}Gi + α l Zi j + εli j ,

(1.6)

where Yli j is the response variable which measures the l-th phenotype on the i-th subject at the j-th
time point, X(ti j ) is a time-varying covariate, Zi j is a p-dimensional vector of covariates, which
can either depend or be independent of time, Gi denotes the genetic variable within subject, εli j is
an error term and




εi = 




ε 1i
..
.



 ∼ N 0, Σ



ε Li
with Σ be some covariance structure, β0l (·) and β1l (·) are unknown functions. The robustness of
QIF method in the variance of the estimators helps us to build the estimation and hypothesis testing
procedure. In Chapter 5, we conclude the thesis with a brief conclusion about the contributions of
this thesis and point out some future research directions. Proofs are provided in the Appendix.

5

CHAPTER 2
FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC
GENE-ENVIRONMENT INTERACTIONS

2.1

Introduction

It has been broadly recognized that gene-environment (G×E) interaction plays important role in
human complex diseases. A growing number of scientific researches have confirmed the role of
G×E interaction in many human diseases, such as Parkinson disease (Ross and Smith, 2007) and
type 2 diabetes (Zimmet et al., 2001). G×E interaction is defined as how genotypes influence
phenotypes differently under different environmental conditions (Falconer, 1952). It also refers
to the genetic sensitivity to environmental changes. Usually, G×E has been investigated based
on a single environment exposure model. Evidence from epidemiological studies has suggested
that disease risk can be modified by simultaneous exposure to multiple environmental factors. The
effect of simultaneous exposure is higher than the simple addition of the effects of factors acting
alone (Carpenter et al., 2002; Sexton and Hattis, 2007). This motivated us to assess the combined
effect of environmental mixtures, and how they, as a whole, interact with genes to affect disease
risk (Liu et al. 2016). In our previous model, we proposed a varying-index coefficient model to
capture the nonlinear interaction between a gene and environmental mixtures (Liu et al. 2016). The
method was extended for any univariate trait distribution in a generalized linear model framework
(Liu et al. 2017).
In biomedical studies, longitudinal traits are often observed, with repeated measures of the
same subject over time. The increased power of a longitudinal design to detect genetic associations
over cross-sectional designs has been evaluated (Sitlani et al. 2015; Furlotte et al. 2015; Xu et al.
2014). With longitudinal disease traits, one can study the dynamic gene effect over time. Coupling
with longitudinal measure of environmental exposures, one can study how genes respond to the
dynamic change of environmental factors to affect a disease trait. This motivates us to extend the

6

varying index coefficient model to longitudinal traits.
To explore time-dependent effects in longitudinal data analysis, some nonparametric and semiparametric models such as varying coefficient models have been proposed, for example, Hoover
et al. (1998), Wu, Chiang, and Hoover (1998), Fan and Zhang (2000), Martinussen and Scheike
(2001), Chiang, Rice, and Wu (2001), Huang, Wu, and Zhou (2002), Ma and Song (2015). However, these methods do not fit to our purpose. In order to capture the dynamic nonlinear G×E interaction with combined effect of environmental factors for longitudinal data, we propose a functional
varying index coefficient model (FVICM) for correlated response, i.e.,
β T1 Xi j )Gi + εi j ,
β T0 Xi j ) + m1 (β
Yi j = m0 (β

(2.1)

where Yi j is the response variable which measures the risk of certain disease on the ith subject at
the jth time point, where i = 1, · · · , N, j = 1, · · · , ni ; Xi j is a p-dimensional vector of environmental
variables, which can be either time-dependent or time invariant; Gi denotes the genetic variable;
εi j is an error term with mean 0 and some correlation structure; m0 (·) and m1 (·) are unknown functions; β 0 and β 1 are p-dimensional vectors of index loading coefficients. For model identifiability,
we have the constraints β 0 = β 1 = 1 and the first elements of β 0 and β 1 are positive.
Qu et al. (2000) proposed the quadratic inference function (QIF) for longitudinal data analysis,
as an improvement of the generalized estimation equation (GEE) approach introduced by Liang
and Zeger (1986). The QIF approach avoids estimating the nuisance correlation parameters by
assuming that the inverse of the correlation matrix can be approximated by a linear combination
of several basis matrices. Qu et al. (2000) found that the QIF estimator could be generally more
efficient than the GEE estimator. Qu and Li (2006) applied the QIF method to the varying coefficient model for longitudinal data. Bai et al. (2009) developed an estimating procedure for single
index models with longitudinal data also based on QIF method. Motivated by that, in this paper,
we extend the QIF method to the FVICM model for dynamic G×E interactions.
Our goal in this work is to develop a set of statistical estimation and hypothesis testing procedure for model (2.1). We first approximate the varying index coefficient function by the penalized

7

splines (Ruppert and Carroll, 2000) and then extend the QIF approach to our model in order to estimate the index loading coefficients and the penalized spline coefficients. Under certain regularity
conditions, we establish the consistency and asymptotic normality of the resulting estimators.
Another goal of this work is to test the linearity of the G×E interaction effect. This is of particular interest in our model setting since if the G×E interaction is linear, a simple linear regression
model should be fit, and fitting any higher order nonlinear functions would be unnecessary. With
a mixed effects model representation of the penalized spline approximations (Speed, 1991; Ruppert, Wand, and Carroll, 2003; Wand, 2003), we can transform the problem of testing an unknown
function into testing some fixed effects and a variance component in a linear mixed effects model
setup with multiple variance components, which will be evaluated in this study.
This chapter is organized as follows: in Section 2.2, we propose an estimation procedure under
the FVICM model, and further establish the consistency and asymptotic normality of the proposed
estimator in Section 2.3. In Section 2.4, we discuss some practical issues to implement the proposed estimation procedures. In Section 2.5, a pseudo-likelihood ratio test procedure with a linear
mixed effects model representation is illustrated. We assess the finite sample performance of the
proposed procedure with Monte Carlo simulation in Section 2.6 and illustrate the proposed method
by an analysis of a pain sensitivity data set in Section 2.7, followed by discussions in Section 2.8.
Technical details are rendered in Appendix.

2.2

Quadratic inference function for FVICM with longitudinal data

For longitudinal data, suppose the response yi j , p-dimensional covariate vector x i j , and SNP variable Gi are observed from the ith observation at the jth time point. SNP variable {Gi , i = 1, ..., N}
does not change over time. Assume the model satisfies
β T0 xi j ) + m1 (β
β T1 xi j )Gi ,
E(yi j |xxi j , Gi ) = m0 (β

8

We can approximate the unknown coefficient functions m0 (u0 ) and m1 (u1 ) by a q-degree truncated
power spline basis, i.e.
m0 (u0 ) = m0 (u0 , β ) ≈ B(u0 )T γ 0 ,
m1 (u1 ) = m1 (u1 , β ) ≈ B(u1 )T γ 1 ,
q

q

β T0 , β T1 )T , B(u) = (1, u, u2 , · · · , uq , (u−κ1 )+ , · · · , (u−κK )+ )T is a q-degree truncated
where β = (β
power spline basis with K knots κ1 , · · · , κK . γ 0 and γ 1 are (q + K + 1)-dimensional vectors of
spline coefficients. Let γ = (γγ T0 , γ T1 )T .
For longitudinal data, the conditional variance-covariance matrix of the response need to be
modelled. The method of generalized estimation equation (GEE) is often applied to estimate the
unknowns. The GEE is defined as,
N

∑ µ˙ Ti V−1
i (yi − µ i ) = 0,

i=1

where Vi is the covariance matrix of yi , yi = (yi1 , ..., yini )T , µ i = E(yi ) is the mean function and
µ˙ i is the first derivative of µ i with respect to the parameters. Based on the spline approximation,
the mean function can be written as

 µ i1 (θθ )

..
µ i = µ i (θθ ) = 
.


µ in (θθ )





 
 
=
 
 

i

and the first derivative of µ i is

T β T x )γγ xT
0 i1 0 i1
 Bd (β

..
µ˙ i = 
.


β T0 xini )γγ 0 xTin
BTd (β

i

β T0 xi1 )γγ 0 + BT (β
β T1 xi1 )γγ 1 Gi
BT (β
..
.
β T0 xini )γγ 0 + BT (β
β T1 xini )γγ 1 Gi
BT (β




,



β T1 xi1 )γγ 1 Gi xTi1
BTd (β

β T0 xi1 )
BT (β

β T1 xi1 )Gi
BT (β

..
.

..
.

..
.

β T1 xini )γγ 1 Gi xTin
BTd (β

∂ B(u)

i

q−1

β T0 xini ) BT (β
β T1 xini )Gi
BT (β




,



q−1

β T , γ T )T .
where Bd (u) = ∂ u = (0, 1, 2u, · · · , quq−1 , q(u − κ1 )+ , · · · , q(u − κK )+ ), θ = (β
When Vi is unknown, Liang and Zeger (1986) suggested that Vi can be simplified as Vi =
1/2

Ai

1/2

R(ρ)Ai

with Ai being a diagonal matrix of marginal variances and R(ρ) being a common

9

working correlation matrix with a small number of nuisance parameters ρ. When ρ is consistently estimated, the GEE estimators of the regression coefficients are consistent. When such
consistent estimators for the nuisance parameters do not exist, Qu et al. (2000) suggested that
the inverse of R(ρ) can be represented by a linear combination of a class of basis matrices such
as R−1 (ρ) ≈ a1 M1 + a2 M2 · · · + ah Mh , where M1 is the identity matrix and M2 , · · · , Mh are
symmetric matrices. The advantage of this method is that the estimation of nuisance parameters
a1 , · · · , ah are not required. Following this idea, we define the estimation function as,


N µ˙ T A−1/2 M A−1/2 (y − µ )
1 i
i
i 
 ∑i=1 i i
N

1
1
.

..
g¯N (θθ ) = ∑ gi (θθ ) = 

N i=1
N


−1/2
−1/2
TA
˙
µ
M
A
(y
−
µ
)
∑N
h i
i
i
i=1 i i

(2.2)

Because the dimension of the estimation equation g¯N is greater than the number of parameters, we
cannot obtain the estimators by simply setting each element in g¯N to be zero. Qu et al. (2000)
introduced the Quadratic Inference Function (QIF) based on the generalized method of moments
(Hansen, 1982). Thus, we can estimate the parameters by minimizing the QIF, which is defined as
−1 g¯ ,
QN (θθ ) = N g¯TN C¯N
N

(2.3)

T
where C¯N = N1 ∑N
i=1 gi gi is a consistent estimator for var(gi ). By minimizing the quadratic infer-

ence function, we can obtain the estimation of the parameters
θ = arg min QN (θθ ).
θ

To overcome the well known over-parameterization issue, Qu et al. (2000) further proposed the
penalized quadratic inference function
N −1 QN (θθ ) + λ θ T Dθθ ,

(2.4)

where D is a diagonal matrix with element 1 if the corresponding parameters are spline coefficients
associated with the knots and 0 otherwise, i.e., D = diag(0T(2p+q+1)×1 , 1TK×1 , 0T(q+1)×1 , 1TK×1 ).
Then the estimator is given by
θ = arg min(N −1 QN (θθ ) + λ θ T Dθθ ).
θ

10

(2.5)

2.3

Asymptotic properties

In this section, we establish the asymptotic properties for the penalized quadratic inference function
estimators with fixed knots. Assume θ 0 is the parameter satisfying Eθ (gi ) = 0. Theorem 1
0
provides the consistency of the resulting estimators. We show the asymptotic normality of the
estimators in Theorem 2. The theoretical results are similar to those provided in Qu and Li (2006).
The difference is that we have constraints for the index loading parameters in our model, i.e.
β 0 = β 1 =1, and β01 > 0, β11 > 0. To handle the constraints, we do the reparameterization as
βl1 =

1 − β l,−1 22 with β l,−1 = (βl2 , ..., βl p )T for l=1, 2 (Yu and Ruppert, 2002; Cui et al.,

2011; Ma and Song, 2015). Then the parameters space of β l , l=1,2, becomes
[{( 1 − β l,−1 22 , βl2 , ..., βl p )T } : β l,−1 22 < 1].
Let


Jl =

∂β l
∂ β Tl,−1


=

β Tl,−1 /
−β

1−

β l,−1 22

I p−1





β T0,−1 , β T1,−1 )T , and θ ∗ =
be the Jacobian matrix of dimension p × (p − 1). Denote β −1 = (β
β −1 , γ )T . From θ to θ ∗ , we have Jacobian matrix J = diag(J0 , J1 , Iq+K+1 , Iq+K+1 ).
(β
Theorem 1 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing parameter λN = o(1), then the estimator θ , which is obtained by minimizing the penalized
quadratic function in (2.4), exists and converges to θ 0 in probability.
Theorem 2 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing
parameter λN = o(N −1/2 ), then the estimator θ obtained by minimizing the penalized quadratic
function in (2.4) is asymptotically normally distributed, i.e.,
√
d
−1 T
N(θ − θ 0 ) −
→ N(0, J(GT0 C−1
0 G0 ) J ),
where G0 and C0 are given in the Appendix.

11

2.4

Practical issues

In this section, we discuss some practical issues when we implement the proposed method.

2.4.1

Algorithm for estimation

A two-step iterative Newton-Raphson algorithm is applied when we estimate the index loading
parameters and the varying spline coefficients. The algorithm of the estimation procedure can be
summarized in the following steps.
Step 0 Choose initial values for β and γ . Denote them by β (old) and γ (old) .
Step 1 Estimate γ (new) by
γ (new) = arg min(N −1 QN (γγ , β (old) ) + λ γ T Dγγ .
γ

The Newton-Raphson algorithm is used for the minimization.
Step 2 Estimate β (new) by
β , γ (new) ).
β (new) = arg min QN (β
β

Also use Newton-Raphson for minimization.
(old)

Step 3 Update β l

(old)

by β l

(new)

= sign(βl1

(new)

βl
)β

(new)

/ βl

2,

l = 1, 2. Update γ (old) by

setting γ (old) = γ (new) .
Step 4 Repeat Steps 1-3 until convergence.

2.4.2

Model selection

It is important to determine the order and number of knots in the spline approximation since too
many knots in the model might overfit the data. Under the assumption E(g) = 0 (g is the estimation
function in (2.2) for a single observation) and the number of estimating equations is larger than
2 in distribution (Hansen, 1982), where r is the
the number of parameters, we have Q(θ ) → χr−k

12

dimension of g¯N (θθ ), k is the dimension of θ , θ is the estimator by minimizing the QIF when
certain order and number of knots are chosen. This asymptotic property of the QIF provides a
goodness-of-fit test, which can be useful to determine the order and number of knots to be selected
in our model.
However, it is also possible that the goodness-of-fit tests fail to reject several different models
which may not be nested. Since Q(θ ) is asymptotically chi-square distributed, we can use BIC
to penalize Q(θ ) for the difference of the numbers of estimating equations and parameters. In
particular, the BIC criterion for a model with r estimating equation and k parameters is defined as
Q(θ ) + (r − k) ln N,
The model with minimum BIC would be considered better. If we choose h basis matrices in (2.2),
then r − k = hk − k = (h − 1)k. As we discussed in Section 2.4.3, we usually use h=2 in our setting.
Thus, the BIC criterion is actually
Q(θ ) + k ln N,
where k is the number of parameters in the model.
In our simulation and real data application, we search the optimal order and the number of
knots over a set of combinations of q and K using BIC. Knots are evenly distributed in the range
of u(= β T X).

2.4.3

Choice of the basis for the inverse of the correlation matrix

Qu and Li (2006) offered several choices of basis matrixes. For exchangeable working correlation,
M1 is identity matrix and M2 has 0 on the diagonal and 1 off-diagonal. If the working correlation
is AR(1), we can set M2 to have 1 on its two subdiagonals and 0 elsewhere. Prior information
on correlation can help us to determine the choice of appropriate basis matrices. The effect of
choosing different basis matrices is discussed in Qu and Li (2006) through simulation studies. Qu
and Lindsay (2003) also proposed an adaptive estimation method to approximate the correlation
empirically when there is no prior information available.
13

2.4.4

Choice of the tuning parameter

Since the penalized spline is used to approximate the unknown functions, we need to determine
the tuning parameter λ involved in the method. As Qu and Li (2006) suggested, we can extend
the generalized cross-validation (Ruppert, 2002) to the penalized QIF and define the generalized
cross-validation statistic as
GCV(λ ) =

N −1 QN
(1 − N −1 df)2

where df = tr[(Q¨ N + 2Nλ D)−1 Q¨ N ] is the effective degree of freedom, QN is defined in (2.3) and
Q¨ N is the second derivative of QN . The desirable choice of tuning parameter λ is
λ = arg min GCV(λ ).
λ

In the implementation of GCV, the golden search method can be applied in order to reduce the
computational time.

2.5
2.5.1

Hypothesis test
Linear mixed model representation for FVICM model

In our proposed FVICM model (2.1), it is of interest to test the unspecified coefficient function.
In particular, we are interested in testing whether a linear function is good enough to describe the
G×E interaction. Given β , let u0 = β T1 X, u1 = β T0 X, with the truncated power spline basis, the
coefficient function can be modeled by
q
m1 (u1 ) = γ10 + γ11 u1 + γ12 u21 + · · · + γ1q u1 +

K

q

∑ b1k (u1 − κk )+.
k=1

Our goal is to test the linearity of m1 (u1 ), which is equivalent to test
H0 : γ12 = · · · = γ1q = 0, b11 = · · · = b1K = 0.
q

q T

q

Let w0i j = (1, u0i j , · · · , u0i j )T , z0i j = (u0i j − κ1 )+ , · · · , (u0i j − κK )+
q

q

, γ˜ 0 = (γ00 , · · · , γ0q )T ,
q T

b0 = (b01 , · · · , b0K )T , w1i j = (1, u1i j , · · · , u1i j )T , z1i j = (u1i j −κ11 )+ , · · · , (u1i j −κ1K )+
(b11 , · · · , b1K )T , γ˜ 1 = (γ10 , · · · , γ1q )T ,
14

,b1 =

m0 (u0i j ) = wT0i j γ˜ 0 + zT0i j b0 ,
m1 (u1i j ) = wT1i j γ˜ 1 + zT1i j b1 .
We further define Yi = (yi1 , · · · , yini )T , W0i = (w0i1 , · · · , w0ini )T , W1i = (w1i1 Gi , · · · , w1ini Gi )T ,
Z0i = (z0i1 , · · · , z0ini )T , Z1i = (z1i1 Gi , · · · , z1ini Gi )T , then a linear mixed model (LMM) representation (Wang and Chen, 2012) can be obtained as,
Yi = W0i γ˜ 0 + W1i γ˜ 1 + Z0i b0 + Z1i b1 + 1i ai + ε i , i = 1, · · · , n,

(2.6)

bl ∼ N(0, σb2 IK ), l = 0, 1, ε i ∼ N(0, σε2 I),
l
where the random incept effects ai are assumed to be independent as N(0, σa2 ) which model the
correlation in the response.
With the LMM representation, testing the linearity of the varying index coefficients is equivalent to test some fixed effects and a variance component in model (2.6). To be specific, we want to
test
H0 : γ12 = · · · = γ1q = 0 and σb2 = 0.
1

2.5.2
2.5.2.1

(2.7)

LRT and pseudo-LRT in LMM
LRT for one variance component

Crainiceanu and Ruppert (2004) proposed the likelihood ratio test in linear mixed effect models
with one variance component. Consider a LMM with one variance component
  

  

2
0 
 b   0K 
 b   σb Σ
β + Zb + ε , E   = 
Y = Xβ
 , Cov   = 
,
ε
0n
ε
0 σε2 In

(2.8)

where β is a p-dimensional vector of fixed effect coefficients, b is a K-dimensional vector of
random effects, 0K is a K-dimensional vector of zeros, Σ is a known K × K symmetric positive
definite matrix. Let λ = σb2 /σε2 be the signal-to-noise ratio and then the covariance matrix of
15

ΣZT . Consider testing for the null
Y cab be written as Cov(Y) = σε2 Vλ , where Vλ = In + λ ZΣ
hypothesis
H0 : β p+1−p = 0, · · · , β p = 0, σb2 = 0

(2.9)

for p > 0.
The LRT statistic is defined as
β , λ , σε2 ) − sup L(β
β , λ , σε2 ) .
LRTn ∝ 2 sup L(β
HA

H0

If we substitute the parameters β and σε2 with their profile estimators
Y,
β (λ ) = (XT V−1
X)−1 XT V−1
λ
λ
σε2 (λ ) =

{Y − Xβ (λ )}T Vλ−1 {Y − Xβ (λ )}
n

,

for fixed λ , we obtain the LRT statistic
LRTn = sup {n log(YT S0 Y) − n log(YT PTλ V−1
Pλ Y) − log |Vλ |},
λ

(2.10)

λ ≥0

where Pλ = In − X(XT V−1
X)−1 XT V−1
, X0 denotes the design matrix of fixed effects under the
λ
λ
null hypothesis, S0 = In − X0 (XT0 X0 )−1 XT0 .
Theorem 1 in Crainiceanu and Ruppert (2004) provides the distribution of LRT statistic (2.10).
Σ1/2 , ξs be the eigenvalues of Σ 1/2 ZT ZΣ
Σ1/2 , s = 1, · · · , K,
Let µs be the eigenvalues of Σ 1/2 ZT P0 ZΣ
then
p
∑1 u2s

d

LRTn = n 1 + n−p
∑1 w2s
iid

+ sup fn (λ ),
λ ≥0

iid

where us ∼ N(0, 1) for s = 1, · · · , K, ws ∼ N(0, 1) for s = 1, · · · , n − p, and
fn (λ ) = n log 1 +

K
Nn (λ )
− ∑ log(1 + λ µs ),
Dn (λ )
s=1

with
K

Nn (λ ) =

λ µs

∑ 1 + λ µs w2s ,

s=1

16

(2.11)

n−p
w2s
Dn (λ ) = ∑
+ ∑ w2s .
s=1 1 + λ µs s=K+1
K

The distribution in (2.11) only depends on the eigenvalues µs and ξs . Based on the spectral decomposition, simulation from this distribution can be done very rapidly. Detailed algorithm for
this simulation can be found in Crainiceanu and Ruppert (2004).

2.5.2.2

Pseudo-LRT for multiple variance components

For a LMM with multiple variance components
β + Zb1 + · · · + ZbL + ε ,
Y = Xβ

(2.12)

bl ∼ N(0, σb2 Σ l ), l = 1, · · · , L, ε ∼ N(0, σε2 In ),
l

where bl , l = 1, · · · , L are random effects and L > 1. Suppose we are interested in testing
H0 : β p+1−p = 0, · · · , β p = 0, σb2 = 0.
L
Greven et al. (2008) proposed to approximate the distribution of LRT for the model (2.12)
based on the pseudo-likelihood ratio test theory (Liang and Self, 1996) by using a pseudo-outcome.
In the framework of model (2.12), bi , i = L, are nuisance random parameters. We can define the
pseudo-outcome as
Y = Y − ∑ Zi bi ,
i=L

where bi are the best linear unbiased predictors (BLUP) of nuisance random effects bi , i = L. The
the model (2.12) can be reduced to
β + ZL bL + ε .
Y = Xβ

(2.13)

Then the method for testing one variance component introduced by Crainiceanu and Ruppert
(2004) can be applied to the model in (2.13).

17

2.5.3

Pseudo-LRT in FVICM model

For the model in (2.6), we can use the idea of Greven et al. (2008) and define the pseudo-outcome
Yi = Yi − Z0i b0 − Ui ai , i = 1, · · · , n,
where b0 and ai are BLUPs of b0 and ai , respectively. The reduced model using pseudo-outcome
for model (2.6) can be written as
Yi = W0i γ˜ 0 + W1i γ˜ 1 + Z1i b1 + ε i . i = 1, · · · , n.

(2.14)

For the new model (2.14) using pseudo-response, we can apply the method for the single variance
component model introduced in Section 2.10 to test hypothesis (2.7). Statistical significance can
be assessed through the resampling approach described in section 2.5.2.1.

2.6
2.6.1

Simulation study
Simulation

In this section, the finite sample performance of the proposed method is evaluated through Monte
Carlo simulation studies. We generate three covariates X1 , X2 , X3 . For each subject i, X1i j , X2i j , X3i j
are generated independently from uniform distribution U(0, 1). We set the minor allele frequency (MAF) as pA =(0.1, 0.3, 0.5) and assume Hardy-Weinberg equilibrium. We use AA, Aa and
aa to denote three different SNP genotypes, where allele A is the minor allele. These genotypes
are simulated from a multinomial distribution with frequencies p2A , 2pA (1 − pA ) and (1 − pA )2 ,
respectively. Variable G takes value in the set {0,1,2}, corresponding to genotypes {aa, Aa, AA}
respectively. The error term εi = (εi1 , · · · , εini ) are independently generated from the multivariate
normal distribution N(0, 0.1R(ρ)). The true correlation structure R(ρ) is assumed to be exchangeable with ρ=0.5 and 0.8.
√
√
We set m0 (u0 ) = cos(πu0 ) and m1 (u1 ) = sin[π(u1 − A)/(B − A)] with A = 3/2 − 1.645/ 12
√
√ √ √ √
√
√
and B = 3/2+1.645/ 12. The true parameters are β 0 = ( 5, 4, 4)/ 13 and β 1 = (1, 1, 1)/ 3.
18

To simplify the simulation and save computational time, we consider the balanced case, which
means each observation has the same number of time points. We draw 1000 data sets with sample
size N = 200, 500 and time points ni = T = 10. Since the true correlation structure is exchangeable, we set M1 to be the identity matrix and M2 to be 0 on the diagonal and 1 off-diagonal. The
order and number of knots of the splines are chosen by using the BIC method.

2.6.2

Performance of estimation

Table 2.1 summarizes the results based on 1000 replications. In this table, the average bias (Bias),
the standard deviation of the 1000 estimates (SD), the average of the estimated standard error (SE)
based on the theoretical results, and the estimated coverage probability (CP) at 95% confidence
level are reported. Note that the estimation of the loading parameter β 1 improves as MAF pA
increases, while the estimation of β 0 show a opposite direction. This is because we have limited
data information to estimate the marginal effects m0 (·) when pA increases. As the sample size
increases, the performance of the estimation improves by showing smaller bias, SD and SE.
Table 2.1 Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and correlation
ρ=0.5.
N Param True
200 β01 0.620
β02 0.555
β03 0.555
β11 0.577
β12 0.577
β13 0.577
500

β01
β02
β03
β11
β12
β13

0.620
0.555
0.555
0.577
0.577
0.577

pA = 0.1
Bias
SD
7.3E-04 0.008
-3.9E-04 0.008
-6.2E-04 0.008
-2.3E-05 0.018
-6.3E-04 0.018
-3.9E-04 0.018

SE
0.008
0.009
0.008
0.020
0.020
0.020

CP
95.6
93.2
94.4
91.0
91.3
91.0

pA = 0.3
Bias
SD
1.7E-03 0.009
-1.0E-03 0.010
-1.2E-03 0.010
-3.1E-04 0.011
-3.0E-04 0.011
2.8E-04 0.011

SE
0.010
0.010
0.010
0.011
0.011
0.011

CP
96.2
92.5
94.2
93.7
94.3
94.8

pA = 0.5
Bias
SD
1.5E-03 0.011
-1.2E-03 0.012
-8.5E-04 0.012
-8.6E-04 0.009
-6.8E-05 0.009
7.1E-04 0.009

SE
0.011
0.011
0.011
0.009
0.009
0.009

CP
95.0
92.4
93.0
94.7
93.8
93.1

7.5E-04
-5.7E-04
-3.4E-04
6.4E-04
-6.0E-04
-4.1E-04

0.005
0.005
0.005
0.012
0.012
0.012

95.5
94.4
93.9
93.8
93.6
94.6

1.7E-03
-1.1E-03
-8.7E-04
-1.7E-04
-1.5E-05
6.0E-05

0.006
0.006
0.006
0.007
0.007
0.007

95.1
94.6
94.1
95.6
96.1
95.0

1.6E-03
-8.8E-04
-1.1E-03
-7.3E-04
5.3E-04
1.1E-04

0.007
0.007
0.007
0.006
0.006
0.006

95.8
95.2
94.7
95.1
94.7
95.6

0.005
0.005
0.005
0.012
0.012
0.012

0.006
0.006
0.006
0.007
0.007
0.007

0.007
0.007
0.007
0.006
0.006
0.006

The plots for the estimations of m0 (u0 ) and m1 (u1 ) under different sample size and MAFs are
shown in Figure 2.1 and Figure 2.2. The estimated and true functions are denoted by the solid

19

and dashed lines, respectively. The 95% confidence band is denoted by the dotted-dash line. The
estimated curves almost overlap with the corresponding true curves as shown in the plots. The
confidence bands are tight, especially under a large sample size. Note that the estimation for the
interaction effects m1 (u1 ) improves as MAF pA increases, while the estimation for the marginal
effects m0 (u0 ) show a opposite direction, which coincides with the results for the parametric estimation in Table 2.1.
N=200, pA=0.1

N=200, pA=0.3
0.5

0.0

−0.5

0.5

m0(u0)

m0(u0)

m0(u0)

0.5

0.0

−0.5

0.25

0.50

0.75
u0

1.00

1.25

−1.0
0.25

N=500, pA=0.1

0.50

0.75
u0

1.00

1.25

0.0

−0.5

−0.5

0.75
u0

1.00

1.25

1.00

1.25

0.0

−0.5

−1.0

−1.0

−1.0

0.75
u0

0.5

m0(u0)

m0(u0)

0.0

0.50

n=500, pA=0.5

0.5

0.50

0.25

N=500, pA=0.3

0.5

0.25

0.0

−0.5

−1.0

−1.0

m0(u0)

N=200, pA=0.5

0.25

0.50

0.75
u0

1.00

1.25

0.25

0.50

0.75
u0

1.00

1.25

Figure 2.1 The estimation of function m0 (·) under different MAFs when N=200, 500 and ρ=0.5.
The estimated and true functions are denoted by the solid and dashed lines respectively. The 95%
confidence band is denoted by the dotted-dash line.

20

N=200, pA=0.1

N=200, pA=0.3

N=200, pA=0.5

0.5

0.5

0.5

0.0

m1(u1)

1.0

m1(u1)

1.0

m1(u1)

1.0

0.0

0.0

−0.5

−0.5

−0.5
0.25

0.50

0.75
u1

1.00

1.25

0.25

N=500, pA=0.1

0.50

0.75
u1

1.00

1.25

0.25

N=500, pA=0.3

0.5

0.5

−0.5

−0.5
0.50

0.75
u1

1.00

1.25

1.25

0.0

0.0

0.25

1.00

m1(u1)

0.5

m1(u1)

1.0

m1(u1)

1.0

−0.5

0.75
u1

N=500, pA=0.5

1.0

0.0

0.50

0.25

0.50

0.75
u1

1.00

1.25

0.25

0.50

0.75
u1

1.00

1.25

Figure 2.2 The estimation of function m1 (·) under different MAFs when N=200, 500 and ρ=0.5.
The estimated and true functions are denoted by the solid and dashed lines respectively. The 95%
confidence band is denoted by the dotted-dash line.

The performance of the estimation for ρ = 0.8 is shown in Table 2.2, Figure 2.3 and Figure
2.4. It is seen that the SD and SE are smaller when ρ is larger compared to the results when
ρ = 0.5. The confidence bands are a little bit wider, especially for m0 when pA =0.5 and for m1
when pA =0.1 for larger ρ. In summary, the simulation results show that the estimation method
performs reasonably well under different simulation settings in finite samples.

21

Table 2.2 Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and correlation
ρ=0.8
N Param True
200 β01 0.620
β02 0.555
β03 0.555
β11 0.577
β12 0.577
β13 0.577
500

β01
β02
β03
β11
β12
β13

0.620
0.555
0.555
0.577
0.577
0.577

pA = 0.1
Bias
SD
4.4E-04 0.005
-2.2E-04 0.006
-3.6E-04 0.006
-6.7E-05 0.014
-2.4E-04 0.014
-1.8E-04 0.014

SE
0.005
0.005
0.005
0.012
0.012
0.012

CP
95.8
92.3
94.1
90.3
91.8
89.9

pA = 0.3
Bias
SD
5.5E-04 0.006
-3.2E-04 0.007
-4.0E-04 0.006
-2.4E-04 0.007
-1.3E-04 0.007
2.4E-04 0.007

SE
0.006
0.006
0.006
0.007
0.007
0.007

CP
95.8
91.8
94.6
94.3
94.2
93.7

pA = 0.5
Bias
SD
-5.0E-06 0.007
-2.8E-04 0.008
1.4E-04 0.007
-7.7E-04 0.006
3.3E-05 0.006
6.4E-04 0.006

SE
0.007
0.007
0.007
0.006
0.006
0.006

CP
95.3
92.7
93.7
92.7
93.4
93.5

5.3E-04
-4.0E-04
-2.3E-04
2.5E-04
-2.9E-04
-1.1E-04

0.003
0.003
0.003
0.007
0.007
0.007

94.0
93.8
93.1
94.0
93.5
95.4

5.8E-04
-4.2E-04
-2.8E-04
-1.5E-04
4.4E-05
6.1E-05

0.004
0.004
0.004
0.004
0.004
0.004

95.4
94.8
94.2
95.3
95.7
95.4

3.3E-04
-1.3E-04
-2.9E-04
-6.8E-04
4.5E-04
1.9E-04

0.005
0.004
0.004
0.004
0.004
0.004

95.6
95.0
94.7
93.9
93.5
95.2

0.004
0.003
0.004
0.008
0.008
0.007

N=200, pA=0.1

0.004
0.004
0.004
0.004
0.004
0.004

N=200, pA=0.3

0.004
0.005
0.004
0.004
0.004
0.004

N=200, pA=0.5
1.0

0.5

0.0

−0.5

0.5

m0(u0)

m0(u0)

m0(u0)

0.5

0.0

−0.5

−0.5

−1.0

−1.0

−1.0
0.25

0.50

0.75
u0

1.00

1.25

0.25

N=500, pA=0.1

0.75
u0

1.00

1.25

−0.5

0.0

−0.5

0.25

0.50

0.75
u0

1.00

1.25

0.75
u0

1.00

1.25

N=500, pA=0.5

0.0

−0.5

−1.0

−1.0

−1.0

0.50

0.5

m0(u0)

0.0

0.25

1.0

0.5

m0(u0)

m0(u0)

0.50

N=500, pA=0.3

0.5

0.0

0.25

0.50

0.75
u0

1.00

1.25

0.25

0.50

0.75
u0

1.00

1.25

Figure 2.3 The estimation of function m0 (·) under different MAFs when N=200, 500 and ρ=0.8.
The estimated and true functions are denoted by the solid and dashed lines respectively. The 95%
confidence band is denoted by the dotted-dash line.

22

N=200, pA=0.3

N=200, pA=0.5
1.0

0.5

0.5

0.5

0.0

m1(u1)

1.0

1.0

m1(u1)

m1(u1)

N=200, pA=0.1

0.0

0.0

−0.5
−0.5

−0.5
0.25

0.50

0.75
u1

1.00

1.25

0.25

0.75
u1

1.00

1.25

0.25

N=500, pA=0.3

0.5

0.5

0.5

0.50

0.75
u1

1.00

1.25

1.25

−0.5

−0.5
0.25

1.00

0.0

0.0

−0.5

0.75
u1

m1(u1)

1.0

1.0

0.0

0.50

N=500, pA=0.5

1.0

m1(u1)

m1(u1)

N=500, pA=0.1

0.50

0.25

0.50

0.75
u1

1.00

1.25

0.25

0.50

0.75
u1

1.00

1.25

Figure 2.4 The estimation of function m1 (·) under different MAFs when N=200, 500 and ρ=0.8
The estimated and true functions are denoted by the solid and dashed lines respectively. The 95%
confidence band is denoted by the dotted-dash line.

2.6.3

Performance of hypothesis tests

We evaluate the performance of the test for the nonparametric function under the null hypothesis
H0 : m1 (·) = m01 (·), where m01 (u1 ) = δ0 + δ1 u1 , δ0 and δ1 are some constants, which corresponds
to a linear G×E interaction. If we fail to reject the null, then a linear model can be fit to further
assess the linear G×E interaction. Otherwise, we conclude nonlinear G×E interaction. Power is
evaluated under a sequence of alternative models with different values of τ, which is denoted by
H1τ : mτ1 (·) = m01 (·) + τ{m1 (·) − m01 (·)}. When τ = 0, the corresponding power is the false positive
rate.
Figure 2.5 shows the size (when τ = 0) and power (when τ > 0) at significance level 0.05. We
obtain 1000 Monte Carlo simulations each with 5000 replications to access the null distribution of
test statistic under sample sizes N = 200, 500 with ρ = 0.5. The empirical Type I error under three
MAFs are very close to the nominal level 0.05 and the power increases dramatically when MAF

23

increases from 0.1 to 0.3. Results for ρ = 0.8 is presented in Figure 2.6. Similarly, the empirical
Type I error is close to 0.05 and the power increases rapidly when MAF increases from 0.1 to 0.3.
Compared to the performance when ρ = 0.5 shown in Figure 2.5, the power increases a little bit
slower when ρ = 0.8. The results indicate that our method can reasonably control the false positive
rates and has appropriate power to detect the genetic variation.
N=200, ρ=0.5

N=500, ρ=0.5

0.7

0.7
Power

1.0

Power

1.0

0.5

0.5

0.3

0.3

pA

0.1
0.3
0.5

0.1
0.00

0.05

0.10
τ

0.15

pA

0.1
0.3
0.5

0.1

0.20

0.00

0.05

0.10
τ

0.15

0.20

Figure 2.5 The empirical size and power of testing the linearity of nonparametric function m1 under
different MAFs when N=200, 500 and ρ=0.5.

N=200, ρ=0.8

N=500, ρ=0.8

0.7

0.7
Power

1.0

Power

1.0

0.5

0.5

0.3

0.3

pA

0.1
0.3
0.5

0.1
0.00

0.05

0.10
τ

0.15

pA

0.1
0.3
0.5

0.1

0.20

0.00

0.05

0.10
τ

0.15

0.20

Figure 2.6 The empirical size and power of testing the linearity of nonparametric function m1 under
different MAFs when N=200, 500 and ρ=0.8.

24

2.7

Real data application

We applied the proposed FVICM model to a real data set from a study examining the association of
the A118G SNP in OPRM1 to experimental pain sensitivity (Jonson and Terra, 2002). A group of
163 men and women in ages from 32 to 86 years participated in the study. Systolic blood pressure
(SBP), diastolic blood pressure (DBP) and heart rate (HR) were measured at 6 Dobutamine dosage
levels for each subject. Dobutamine is a medication that is used to treat congestive heart failure
by increasing heart rate and cardiac contractility. Dobutamine was injected into these subjects to
investigate their response in heart rate and blood pressure to this drug, at different dosage levels:
0 (baseline), 5, 10, 20, 30 and 40 mcg/min. In this study, dosage levels are treated as time and
measurements at different dosage levels are considered as longitudinal measures. In addition to
that, age and body mass index (BMI) were also recorded.
Total five SNPs in genes Beta1 AR and Beta2 AR were genotyped, namely, codon16, codon27,
codon49, codon389, and codon492. We choose X1 = dosage level as the "time-varying" variable,
and X2 = age and X3 = BMI as the "time-invariant" variable. Our goal is to evaluate how the SNPs
interact with age, BMI and dose level to affect SBP, DBP and HR. With the proposed FVICM
model, we can model the dynamic gene effect on drug response under different dosage levels.
In this analysis, we test whether any SNP is associated with the drug response based on the
hypothesis test H0 : m1 (u1 ) = δ0 + δ1 u1 with p-value denoted by pm1 in Table 2.3 - 2.5. We also
reported the p-values for testing the significance of coefficients β11 , β12 and β13 , which are labeled
by pβ , pβ and pβ , based on the asymptotic normality of the estimates. We also compare our
11
12
13
∗ (X )+β ∗ X +
proposed model to an additive varying-coefficient model (AVCM) E(Y |X, G) = β01
1
02 2
∗ X + {β ∗ (X ) + β ∗ X + β ∗ X }G, where β ∗ (·) and β ∗ (·) are unknown functions of X . To
β03
3
1
11 1
12 2
13 3
01
11

see the relative gain by integrative analysis, we calculate the MSEs of both models. The p-values
∗ (·) = β ∗ = β ∗ = 0 for AVCM is also reported in the tables and denoted by
for testing H0 : β11
12
13

pAVCM .
Table 2.3 summarizes the performance of our method for response SBP. In the table, pm1 for all

25

the 5 SNPs are smaller than the significance level 0.05, which implies the nonlinear function of the
SNPs on SBP in response to the dosage level, age and BMI as a whole. The MSEs in the last two
columns shows that FVICM fits the data better than AVCM, indicating the benefit of integrative
analysis. Besides, the testing results for AVCM do not show significance of the coefficients, which
further implies that the genetic effects of SNPs are nonlinearly modified by the mixture of these
three variables. Figure 2.7 shows the fitted nonlinear functions for each SNP, along with the 95%
confidence bands.
Table 2.3 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for
SBP.
SNP ID
codon16
codon27
codon49
codon389
codon492

MAF Alleles
0.3990
0.4160
0.1387
0.3045
0.4250

A/G
G/C
G/A
G/C
T/C

p m1

pβ

11

p-value
pβ

12

pβ

13

<1.0E-04 0.0011 <1.0E-04 0.0917
<1.0E-04 <1.0E-04 0.0027 0.1675
<1.0E-04 <1.0E-04 0.3614 0.8668
<1.0E-04 <1.0E-04 <1.0E-04 0.7552
<1.0E-04 0.4102 <1.0E-04 0.0182

pAVCM

MSE
FVICM AVCM

0.5308
0.6748
0.2910
0.3927
0.2990

0.0403
0.0388
0.0398
0.0397
0.0392

0.0421
0.0415
0.0410
0.0431
0.0409

Table 2.4 presents similar results for response DBP. The values of pm1 shows that the test
results for all 5 SNPs are significant, indicating nonlinear interactions for all 5 SNPs, while no
significance is shown for AVCM model. MSEs further support our method by showing smaller
value for FVICM comparing with AVCM. The estimated interaction curves with 95% confidence
bands are shown in Figure 2.8.
Table 2.4 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for
DBP.
SNP ID
codon16
codon27
codon49
codon389
codon492

MAF Alleles
0.3990
0.4160
0.1387
0.3045
0.4250

A/G
G/C
G/A
G/C
T/C

pm1

pβ

11

p-value
pβ
12

pβ

13

pAVCM

0.0066 <1.0E-04 0.2834
0.0007 0.3160
0.0004 0.8431 <1.0E-04 <1.0E-04 0.0946
0.0003 0.5750 <1.0E-04 0.0042 0.7986
0.0001 <1.0E-04 0.9675 <1.0E-04 0.2615
0.0001 0.7934 <1.0E-04 <1.0E-04 0.5837

MSE
VICM AVCM
0.0366
0.0360
0.0369
0.0369
0.0369

0.0372
0.0386
0.0395
0.0377
0.0389

In Table 2.5, the performance of our method for trait HR also leads to similar conclusion expect
26

codon16

codon27

0.05

codon49

0.3
0.10

0.1

m1(u1)

0.05

0.2

m1(u1)

m1(u1)

0.00

−0.05

0.00

−0.10
−0.05
−0.15

0.0
−1.00

−0.75

−0.50
u1

−0.25

0.00

−0.10
0.00

0.25

0.50
u1

0.75

codon389

1.00

0.00

0.25

0.75

1.00

0.50
u1

0.75

1.00

codon492
0.6

0.0

m1(u1)

m1(u1)

0.4

−0.1

0.2

0.0
−0.2
0.00

0.25

0.50
u1

0.75

1.00

0.00

0.25

0.50
u1

Figure 2.7 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs
codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by
the dashed line. Response is SBP.
for SNP codon16, which shows significant test results for both models. For all the other SNPs,
FVICM outperforms AVCM in terms of MSE. Figure 2.9 displays the corresponding estimated
nonlinear interaction curves.
Table 2.5 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for
HR.
SNP ID
codon16
codon27
codon49
codon389
codon492

MAF Alleles
0.3990
0.4160
0.1387
0.3045
0.4250

A/G
G/C
G/A
G/C
T/C

pm1

pβ

11

p-value
pβ

12

pβ

13

pAVCM

<1.0E-04 <1.0E-04 0.1158 0.0028 0.0328
<1.0E-04 0.0007 0.6434 0.0001 0.9620
0.0001
0.0147 0.0172 0.0133 0.8371
<1.0E-04 <1.0E-04 0.0024 0.0021 0.8959
0.0002 <1.0E-04 0.0011 0.0582 0.3732

27

MSE
VICM AVCM
0.0309
0.0320
0.0298
0.0311
0.0315

0.0308
0.0325
0.0300
0.0313
0.0316

codon16

codon27

codon49

0.3

0.15
0.3

0.2

0.10

0.05

m1(u1)

m1(u1)

m1(u1)

0.2
0.1

0.1

0.00

0.0
0.0

−0.05

−0.1

−0.1
−0.2

0.2

u1

0.6

1.0

0.3

0.6
u1

0.9

codon389

−1.00

−0.75

−0.50
u1

−0.25

0.00

codon492
0.3

0.05
0.2
m1(u1)

m1(u1)

0.00
−0.05
−0.10

0.1

0.0

−0.15
0.0

u1

0.5

1.0

0.3

0.6
u1

0.9

Figure 2.8 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs
codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by
the dashed line. Response is DBP.

2.8

Discussion

In this paper, we propose a functional varying index coefficient modeling procedure to study gene
effects nonlinearly modified by a mixture of environmental variables in a longitudinal design.
We implement the quadratic inference function (QIF) method to estimate the index loading and
spline coefficients. Furthermore, we apply the pseudo likelihood ratio test in a linear mixed model
representation to test the linearity of the nonparametric coefficient function. Simulation study has
been conducted to illustrate the estimation and testing procedures and confirm the asymptotical
property. Real analysis shows that our model outperforms the additive varying coefficient model,
which considers the G×E effect for each single environmental factor separately.
Our FVICM model distinguishes the varying coefficient model for longitudinal data. In fact,
the varying coefficient model is a special case of our model when the dimension of the X variable

28

codon16

codon27

codon49
0.2

0.0

−0.1

0.00

0.1
m1(u1)

0.05

m1(u1)

m1(u1)

0.10

−0.2

0.0
−0.3

−0.05
0.00

0.25

0.50
u1

0.75

1.00

−0.50

−0.25
u1

0.00

0.25

codon389

0.00

0.25

0.50
u1

0.75

1.00

codon492
0.15

0.10
m1(u1)

m1(u1)

0.1

0.0

0.05

0.00

−0.05
0.00

0.25

0.50
u1

0.75

1.00

0.5

u1

1.0

Figure 2.9 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs
codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by
the dashed line. Response is HR.
reduces to one. FVICM is able to capture the effect of genes nonlinearly modified by the joint
effect of multiple environmental variables as a whole. In addition, it can reduce multiple testing
burden by treating multiple environmental variables as a single index variable.
We apply the model to a pain sensitivity study. Testing results indicate that all five SNPs have
significant nonlinear interaction effects with environmental factors, which makes practical sense
since these SNPs were genotyped from candidate genes. Our model was motivated by a practical
need in G×E study. However, the method can be applied to any longitudinal data in which the
purpose is to model nonlinear interaction effects. For example, we can consider gene expressions
in a pathway (denoted as X ) and model how they regulate downstream genes (G) to affect a disease
trait. Both the trait and gene expressions can be measured over time. Thus, one can understand the
dynamic effect of genes nonlinearly regulated by a pathway to affect a disease trait.

29

CHAPTER 3
GENERALIZED FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR
DYNAMIC GENE-ENVIRONMENT INTERACTIONS

3.1

Introduction

Longitudinal data analysis is very common in epidemiological studies when the response variables
are measured over time on objectives. Many studies demonstrated the increased power of a longitudinal design in detecting genetic associations over cross-sectional designs (Sitlani et al. 2015;
Furlotte et al. 2015; Xu et al. 2014). On the other hand, there has been growing interest in the role
of G×E interaction in many human diseases, such as Parkinson disease (Ross and Smith, 2007)
and type 2 diabetes (Zimmet et al., 2001). In many studies, G×E has been traditionally investigated based on a single environment exposure model. However, evidence from an increasing number
of studies has shown that risk of disease can be modified by simultaneous exposure to multiple environmental factors, which might be higher than what would be expected from simple addition of
the single effects of environmental factors (Carpenter et al., 2002; Sexton and Hattis, 2007). Thus,
of particular interest and complexity are assessing the combined effect of environmental mixtures
and the mechanism in which they interact with genes to affect disease risk. Some researches have
been done to assess nonlinear interactions between environmental mixtures and genes by applying some nonparametric or semiparametric models, such as the varying index coefficients model
(VICM) proposed by Ma and Song (2015) and the partial linear multi-varying index coefficients
model (PLMVICM) by Liu et al. (2016) and the generalized PLMVICM by Liu et al. (2017).
However, these methods were developed for cross-sectional data and they can not be used for longitudinal data. This motivates us to extend the varying index coefficient model to longitudinal
traits.
In our previous work, we proposed a functional varying index coefficient model to capture the
nonlinear G×E interaction for continuous longitudinal traits. However, in practice, it is possible

30

that the response measured over time is a discrete variable, for example, a binary measure of
a disease status. In human genetics, many disease traits are binary in nature, being affected vs
unaffected (or cases vs controls). In order to investigate the dynamic nonlinear G×E interaction
with environmental mixtures as a whole for a binary longitudinal trait, we propose the following
generalized functional varying index coefficient model (gFVICM):
β T0 Xi j ) + m1 (β
β T1 Xi j )Gi j ,
g E(Yi j |Xi j , Gi j ) = m0 (β

(3.1)

where Yi j (= 0 or 1) denotes the binary longitudinal response variable observed for the ith subject
at the jth time point; Xi j is a p-dimensional vector of environmental variables, which can be either
time-variant or time-invariant variables; Gi denotes the SNP variable which does not depend on
time; g(·) is a known link function; m0 (·) and m1 (·) are unknown nonparametric smooth functions
which depend on the data; and β 0 and β 1 are p-dimensional vectors of index loading parameβ T1 X) captures the interaction effect between environmental
ters. In this model, the function m1 (β
mixtures and the genetic variable (e.g., a single nucleotide polymorphism (SNP)) on the risk of
disease.
The aim of this paper is to develop a set of statistical estimation and hypothesis testing procedure for model (3.1). The Generalized Estimation Equation (GEE) method, proposed by Liang
and Zeger in 1986, has been widely used in longitudinal data analysis. However, there are several
disadvantages of GEE method due to some of its critical assumptions (Song et al., 2009). One
disadvantage is that the consistency of GEE estimators are based on the consistency of estimators
for the nuisance correlation parameter (Crowder, 1986, 1995). Another shortcoming of the GEE
method is that model selection and hypothesis testing are complicated. This is because the estimation procedure of the GEE method does not involve an objective function. The quadratic inference
function (QIF) approach proposed by Qu et al. (2000) is one of the improvements of the GEE
method. The QIF avoids estimating the nuisance correlation parameters and has been confirmed
by Qu et al. (2000) to be generally more efficient than the GEE. In addition, since the QIF is
built upon an objective function which is asymptotically chi-square distributed, we can naturally

31

implement the model selection criterion such as BIC to the QIF. The asymptotic property can also
allows us to conduct hypothesis tests. This motivates us to extend the QIF method to our model
for estimation and hypothesis testing.
In our proposed estimation procedure, we first use penalized splines (Ruppert and Carroll,
2000) to approximate the nonparametric smooth functions ml (·), l=0, 1. Then we develop a profile
estimation procedure to estimate the index loading parameters and spline coefficients iteratively
based on the QIF approach. In order to avoid overfitting and reduce the number of parameters in
spline approximation, we use BIC method by adding a penalty to the objective function. Under
certain regularity conditions we establish the asymptotic normality of the resulting estimators. In
addition, we are interested in testing the linearity of G×E interaction, i.e. the linearity of function
m1 (·). The QIF can be regarded as an inference function which has properties similar to the
likelihood ratio test. Based on that, we construct a testing procedure for linearity of nonparametric
interaction function, where the test statistic asymptotically follows a χ 2 distribution.
The rest of this chapter is organized in the following way: In Section 3.2, we propose an estimation procedure for model (3.1) and also provide the consistency and asymptotic normality of the
proposed estimator; In Section 3.3, we derive a testing procedure for the linearity of nonparametric
interaction function based on the goodness-of-fit test of QIF. The finite sample performance of the
proposed procedure are accessed by Monte Carlo simulations illustrated in Section 3.4; In Section
3.5, the application of the proposed methodology is shown through the analysis of a pain sensitivity
data with a binary response variable indicating whether a subject has hypertension or not (Yes=1,
No=0); Some discussions are given in Section 3.6; The proofs of are rendered in Appendix.

3.2
3.2.1

The model and estimation methods
The model

For a longitudinal disease trait, suppose the binary response yi j , the p-dimensional covariate vector
x i j , and the SNP variable Gi are observed for the ith observation at the jth time point, where
32

i=1,...,N, j=1,...,ni . Assume that the observations from different subjects are independent, but
those within the same subject are correlated. We also assume the model satisfies the first moment
assumption:
β T0 x i j ) + m1 (β
β T1 x i j )Gi ,
µi j (xxi j , Gi ) = E(yi j |xxi j , Gi ) = g−1 m0 (β
where g−1 (·) is a given inverse link function. If we use a logit link function for binary response,
then the model can be written as
µi j (xxi j , Gi ) = P(yi j = 1|xxi j , Gi ) =

β T0 xi j ) + m1 (β
β T1 xi j )Gi }
exp{m0 (β
β T0 x i j ) + m1 (β
β T1 x i j )Gi }
1 + exp{m0 (β

.

For model identifiability, we have the constraints β 0 = β 1 = 1 and the first elements of β 0
and β 1 are positive.

3.2.2

Quadratic inference function for gFVICM

First, we approximate the unknown coefficient functions m0 (u0 ) and m1 (u1 ) by truncated power
spline basis as
ml (ul ) = ml (ul , β ) ≈ B(ul )T γ l , for l = 0, 1,
q

(3.2)

q

β T0 , β T1 )T , B(u) = (1, u, u2 , ..., uq , (u − κ1 )+ , ..., (u − κK )+ )T is a q-degree truncated
where β = (β
power spline basis with K knots κ1 , ..., κK ; γ 0 and γ 1 are (q + K + 1)-dimensional vectors of spline
coefficients.
A marginal approach such as the GEE assumes that the marginal mean µi j is a function of the
covariates through a link function and the variance of yi j is a function of the mean var(yi j ) = V (µi ).
The generalized estimation equation for longitudinal data is
N

∑ µ˙ Ti V−1
i (yi − µ i ) = 0,

i=1

∂µ

where yi = (yi1 , ..., yini )T , µ i = E(yi ) is the mean function and µ˙ i = ∂ θi is the first derivative of
β T , γ T )T , with γ = (γγ T0 , γ T1 )T . The covariance matrix Vi can
µ i with respect to parameters θ = (β
1/2

be decomposed as Vi = Ai

1/2

R(ρ)Ai

with Ai being a diagonal matrix of marginal variances and
33

R(ρ) being a common working correlation matrix with a small number of nuisance parameters ρ.
Using the spline approximation in (3.2), the mean function can be written as

 
−1 T β T x )γγ + BT (β
β T1 xi1 )γγ 1 Gi }
0 i1 0
 µi1 (θθ )   g {B (β

 
..
..
=
µ i = µ i (θθ ) = 
.
.

 

 
β T0 xini )γγ 0 + BT (β
β T1 xini )γγ 1 Gi }
µini (θθ )
g−1 {BT (β
and the first derivative of µ i is



µ˙ i = 


βT
γ T
(g−1 ) BT
d (β 0 xi1 )γ 0 xi1
..
.

T
βT
γ
(g−1 ) BT
d (β 1 xi1 )γ 1 Gi xi1
..
.

βT
γ T
(g−1 ) BT
d (β 0 xini )γ 0 xini

T
βT
γ
(g−1 ) BT
d (β 1 xini )γ 1 Gi xini

∂ B(u)

βT
(g−1 ) BT (β
0 xi1 )
..
.




,



βT
(g−1 ) BT (β
1 xi1 )Gi
..
.

−1 T β T x )G
βT
(g−1 ) BT (β
0 xini ) (g ) B (β
1 ini i

q−1

q−1

where Bd (u) = ∂ u = (0, 1, 2u, ..., quq−1 , q(u − κ1 )+ , ..., q(u − κK )+ ).
In the QIF method, the inverse of the working correlation matrix can be approximated by a
linear combination of several basis matrices, i.e.
R−1 (ρ) ≈ a1 M1 + ... + ah Mh ,
where M1 is the identity matrix and M2 , ..., Mh are known basis matrixes. For example, if the
working correlation is exchangeable, R−1 ≈ a1 M1 + a2 M2 with M2 having 0 on the diagonal and
1 off-diagonal. If the working correlation is AR(1), then R−1 ≈ a∗1 M1 + a∗2 M∗2 and we can set
M∗2 to have 1 on its two subdiagonals and 0 elsewhere. The advantage of this method is that the
estimation of nuisance parameters a1 , ..., ah are not required.
Following this idea, we can derive the estimation function

−1/2
−1/2
µ˙ Ti Ai
M1 Ai
(yi − µ i )
∑N
i=1

1
1 N
..
g¯N (θθ ) = ∑ gi (θθ ) = 
.
N i=1
N

˙ T −1/2 Mh A−1/2
(yi − µ i )
∑N
i
i=1 µ i Ai








(3.3)

We cannot obtain the estimators by simply setting each element in g¯N to be zero since the number
of equations is more than the number of unknown parameters. To deal with this issue, we can
estimate the parameters by minimizing the following quadratic inference function,
−1 g¯ ,
QN (θθ ) = N g¯TN C¯N
N

34



,


T
where C¯N = N1 ∑N
i=1 gi gi is a consistent estimator for var(gi ). By minimizing the quadratic infer-

ence function, we can obtain the estimation of the parameters as,
θ = arg min QN (θθ ).
θ

In order to overcome the issue of over-parameterization, we can add a penalty term to QIF to
penalize the number of knots in the approximation. The penalized QIF is written as
N −1 QN (θθ ) + λ θ T Dθθ ,

(3.4)

where D is a diagonal matrix with 1 if the corresponding parameter is the spline coefficient associated with knots and 0 otherwise, that is, D = diag(0T(2p+q+1)×1 , 1TK×1 , 0T(q+1)×1 , 1TK×1 ). Then the
estimator is given by
θ = arg min N −1 QN (θθ ) + λ θ T Dθθ .

(3.5)

θ

To determine the tuning parameter λ , we can extend the generalized cross-validation (Ruppert,
2002; Qu and Li, 2006; Bai et al., 2009) to the penalized QIF. The generalized cross-validation
statistic is defined as
GCV(λ ) =

N −1 QN
(1 − N −1 df)2

with the effective degree of freedom df = tr (Q¨ N +2Nλ D)−1 Q¨ N , where Q¨ N is the second derivative of QN . The desirable choice of tuning parameter λ is which minimize the GCV(λ ). In the
implementation of GCV, the desired value of λ can be found using a grid search by predefining a
set of values for λ .

3.2.3

Theoretical results

To establish the asymptotic properties for the estimators of the index loading parameters and the
penalized spline regression coefficients, we assume θ 0 is the parameter satisfying Eθ (gi ) = 0.
0
√
Theorem 3 provides the consistency of the resulting estimators. We show the N-consistency
and asymptotic normality of the estimators in Theorem 4. The theoretical results are similar to
Theorem 1 and 2 in Chapter 2.
35

First, to handle the constraints β 0 = β 1 =1, and β01 > 0, β11 > 0, we set βl1 =

1 − β l,−1 22

with β l,−1 = (βl2 , ..., βl p )T for l=1, 2. Then the parameters space of β l , l=1,2, becomes
[{( 1 − β l,−1 22 , βl2 , ..., βl p )T } : β l,−1 22 < 1].
Let


Jl =

∂β l
∂ β Tl,−1


=

β Tl,−1 /
−β

1−

β l,−1 22

I p−1





β T0,−1 , β T1,−1 )T , and θ ∗ =
be the Jacobian matrix of dimension p × (p − 1). Denote β −1 = (β
β −1 , γ )T . From θ to θ ∗ , we have Jacobian matrix J = diag(J0 , J1 , Iq+K+1 , Iq+K+1 ).
(β
Theorem 3 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing parameter λN = o(1), then the estimator θ , which is obtained by minimizing the penalized
quadratic function in (3.4), exists and converges to θ 0 in probability.
Theorem 4 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing
parameter λN = o(N −1/2 ), then the estimator θ obtained by minimizing the penalized quadratic
function in (3.4) is asymptotically normally distributed, i.e.,
√
d
−1 T
N(θ − θ 0 ) −
→ N(0, J(GT0 C−1
0 G0 ) J ),
where the detailed calculation of G0 and C0 are given in the Appendix.

3.3
3.3.1

Model selection and hypothesis test
Model selection

Model selection is important in the spline approximation since too many parameters in the model
might result in the overfitting issue. According to the theocratical property of generalized method
of moments estimator (Hansen, 1982), under the assumption E(g1 ) = 0 and also the number of
2 in distribution,
estimating equations is larger than the number of parameters, we have Q(θ ) → χr−k

36

where r is the dimension of g¯N (θθ ), k is the dimension of θ and θ is the estimator by minimizing
the QIF when certain order and number of knots are chosen. This asymptotic property of the
QIF provides a goodness-of-fit test, which can be useful to determine the order and number of
knots to be selected in our model. However, it is also possible that the goodness-of-fit tests fail to
reject several different models which may not be nested. Since Q(θ ) is asymptotically chi-square
distributed, it is natural to extend the BIC to the QIF approach, by replacing twice the negative
log-likelihood function by the QIF objective function. In particular, the BIC criterion for a model
with r estimating equation and k parameters is given as,
Q(θ ) + (r − k) ln N,
The model with the minimum BIC would be considered the optimal one. If we choose h basis
matrices in (3.3), then r − k = hk − k = (h − 1)k.
In our simulation and real data application, we search the optimal order and number of knots
over a set of combinations of q and K using the BIC criterion. Knots are evenly distributed in the
range of the single index β T X.

3.3.2

Nonparametric goodness-of-fit test based on QIF

The QIF can also be regarded as an inference function since it has properties similar to the likeψ , ζ ),
lihood ratio test. Suppose that the d-dimensional parameter vector γ is partitioned into (ψ
where ψ is the parameter of interest with dimension d1 , and ζ is the nuisance parameter with
dimension d2 = d − d1 . If we are interested in testing
H0 : ψ = ψ 0 ,
the test statistic
ψ 0 , ζ ) − Q(ψ , ζ )
Q(ψ

37

follows an asymptotically chi-square distribution with d1 degrees of freedom. The following theorem introduced by Qu et al. (2000) provided a way to conduct hypothesis testing in the QIF
framework.
Theorem 5 (Qu et al., 2000) Suppose that all required regularity conditions are satisfied and ψ
ψ 0 , ζ ) − Q(ψ , ζ ) is asymptotically chi-square
has dimension d1 . Under the null hypothesis, Q(ψ
distribution with d1 degrees of freedom, where
ψ 0 , ζ ), (ψ , ζ ) = arg min Q(ψ
ψ , ζ ).
ζ = arg min Q(ψ

(3.6)

When there is no nuisance parameter, which is a special case of the condition in Theorem 5,
Q(γγ 0 ) − Q(γ ) has an asymptotical chi-square distribution with d degree of freedom under the null
hypothesis.

3.3.3

Test for linearity of interaction function in gFVICM

In our proposed gFVICM model (3.1), it is of interest to test the unspecified coefficient function.
In particular, we are interested in testing whether a linear function is good enough to describe the
G×E interaction. If the we fail to reject the linearity of the coefficient function, then a parametric
linear interaction function should be fitted to further assess if there exists linear G×E interaction;
otherwise, we conclude there exists nonlinear G×E interaction. Let u1 = β T1 X. With the truncated
power spline basis, the coefficient function can be modeled by
q
m1 (u1 ) ≈ γ10 + γ11 u1 + γ12 u21 + · · · + γ1q u1 +

K+q+1

∑

q

γ1k (u1 − κk )+ .

k=q+1

Our goal is to test the linearity of m1 (u1 ), which is equivalent to test
H0 : γ12 = · · · = γ1,K+q+1 = 0.
β T , γ T )T under the null hypothesis with
Let θ be the estimator of the full parameter θ = (β
T

θ = (β , γ T0 , γ10 , γ11 , 0T )T =

38

arg min
γ12 =···=γ1,K+q+1 =0

QN (θθ ),

and the estimator of θ under the alternative as
θ = arg min QN (θθ ).
Then the test statistic
TN = QN (θ ) − QN (θ ),
asymptotically follows a chi-square distribution with K + q − 1 degrees of freedom, following
Theorem 5.

3.4

Simulation study

The finite sample performance of the proposed method was evaluated through Monte Carlo simulation studies. We considered the following logistic regression model
P(yi j = 1|Xi j , Gi , β ) =

exp{η(Xi j , Gi , β )}
,
1 + exp{η(Xi j , Gi , β )}

where
β T0 Xi j ) + m1 (β
β T1 Xi j )Gi .
η(Xi j , Gi , β ) = m0 (β
We simulated a three-dimensional environmental variables X = (X1 , X2 , X3 ). For the ith subject,
X1i j , X2i j , X3i j are independently generated from a uniform distribution U(0, 1). We set the minor
allele frequency (MAF) as pA = 0.1, 0.3, 0.5 and assumed Hardy-Weinberg equilibrium. We used
AA, Aa and aa to denote three different SNP genotypes. These genotypes were simulated from a
multinomial distribution with frequencies p2A , 2pA (1 − pA ) and (1 − pA )2 , respectively. Variable G
was coded as {0,1,2}, corresponding to genotypes {aa, Aa, AA} respectively. To create correlated
responses, we implemented the R package ‘bindata’ developed by Leisch et al. (1998) under an
AR(1) correlation structure with correlation parameter ρ=0.5. When implementing the function
‘rmvbin’ to generate the correlated binary data, one should specify the marginal probabilities and
the correlation structure.
√
√
We set m0 (u0 ) = cos(πu0 ) and m1 (u1 ) = sin[π(u1 − A)/(B − A)] with A = 3/2 − 1.645/ 12
√
√ √ √ √
√
and B = 3/2 + 1.645/ 12. The true parameters were β 0 = ( 5, 4, 4)/ 13 and β 1 =
39

√
(1, 1, 1)/ 3. We drew 500 data sets with sample size N = 200, 500 and time points ni = T = 10, 20,
respectively. The basis matrix M2 was set to have 1 on its two subdiagonals and 0 elsewhere. The
order and number of knots of the splines were selected through the BIC method.

3.4.1

Performance of estimation

Table 3.1 and Table 3.2 summarize the parameters estimation results under different sample sizes
and measurement times respectively. In these two tables, the average bias (Bias), the standard
deviation of the 500 estimates (SD), the average of the estimated standard error (SE) based on
the theoretical results, and the estimated coverage probability (CP) at 95% confidence level are
reported. It is shown from each table that, as the sample size increases, the performance of the
estimation improves by showing smaller bias, SD and SE. More repeated measurement for each
subject also results in improvement in estimations, which can be shown when we compare Table
3.1 and Table 3.2. For example, the CP for β01 improves from 86.8% to 90% when the number
of measurement time increases from 10 to 20, under a sample size of 200. The estimation of the
loading parameter β 1 improves as MAF pA increases, while the estimation of β 0 show a opposite
direction. This is because we have limited data information to estimate the marginal effects m0 (·)
when pA increases.
Table 3.1 Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size N =
200, 500, T = 10 and correlation ρ=0.5.
N Param True
200 β01 0.620
β02 0.555
β03 0.555
β11 0.577
β12 0.577
β13 0.577
500

β01
β02
β03
β11
β12
β13

0.620
0.555
0.555
0.577
0.577
0.577

pA = 0.1
Bias
SD
-0.008 0.058
-0.002 0.066
-4.1E-05 0.064
-0.013 0.134
-0.024 0.139
-0.013 0.140
0.002
0.003
-0.003
-0.007
-0.003
-0.005

0.038
0.039
0.038
0.078
0.079
0.075

SE
0.057
0.056
0.056
0.091
0.090
0.091

CP
93.6
90.0
91.8
82.2
79.2
82.4

0.038
0.037
0.037
0.065
0.066
0.066

94.8
93.0
93.8
89.6
88.0
90.6

pA = 0.3
Bias
SD
-0.003 0.074
-0.004 0.080
-0.008 0.074
-0.003 0.092
-0.10 0.096
-0.11 0.095
-0.002
-0.002
-6.3E-04
-0.002
-0.001
-0.004

40

0.043
0.043
0.045
0.055
0.052
0.054

SE
0.064
0.062
0.062
0.072
0.071
0.071

CP
90.6
87.2
88.8
87.2
85.2
86.0

0.043
0.042
0.042
0.049
0.049
0.049

95.4
93.4
93.0
92.0
92.8
92.8

pA = 0.5
Bias
SD
-0.007 0.084
-0.008 0.093
-0.006 0.094
0.007 0.088
-0.013 0.085
-0.014 0.088
-0.003
-2.1E-04
-0.004
0.002
-0.003
-0.005

0.047
0.052
0.050
0.045
0.049
0.047

SE
0.068
0.065
0.066
0.063
0.062
0.062

CP
86.8
84.5
86.6
84.9
87.2
85.4

0.048
0.046
0.046
0.044
0.043
0.043

95.0
92.4
92.5
94.2
91.4
92.0

Table 3.2 Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size N =
200, 500, T = 20 and correlation ρ=0.5.
N Param True
200 β01 0.620
β02 0.555
β03 0.555
β11 0.577
β12 0.577
β13 0.577
500

β01
β02
β03
β11
β12
β13

0.620
0.555
0.555
0.577
0.577
0.577

pA = 0.1
Bias
SD
-0.002 0.044
-0.004 0.048
2.3E-04 0.048
-0.006 0.074
-0.004 0.075
-0.005 0.072

SE
0.043
0.042
0.042
0.064
0.064
0.064

CP
94.4
90.2
91.2
90.0
91.4
91.8

pA = 0.3
Bias SD
SE
-0.002 0.053 0.049
-0.003 0.056 0.048
-0.002 0.057 0.048
0.008 0.062 0.054
-0.009 0.063 0.053
-0.009 0.063 0.054

CP
94.2
90.2
90.2
91.8
89.8
90.0

-0.002
-3.2E04
3.5E04
-0.005
-0.002
0.003

0.028
0.027
0.027
0.044
0.045
0.042

97.2
94.8
93.2
92.2
91.0
96.4

-0.002
0.001
-0.002
0.004
-0.003
-0.005

95.0
94.0
92.8
92.4
93.0
93.4

0.028
0.028
0.029
0.044
0.045
0.042

0.033
0.033
0.034
0.039
0.037
0.038

0.033
0.032
0.032
0.036
0.036
0.036

pA = 0.5
Bias
SD
-0.011 0.062
-0.003 0.065
0.005 0.063
0.004 0.063
-0.009 0.064
-0.006 0.064
-0.004
7.9E-05
6.3E-04
0.006
-0.004
-0.005

0.037
0.038
0.037
0.038
0.036
0.035

SE
0.052
0.050
0.050
0.050
0.050
0.050

CP
90.0
88.3
87.4
88.2
87.1
86.8

0.036
0.035
0.035
0.035
0.035
0.035

93.2
92.8
94.2
93.0
94.2
93.6

The plots for the estimations of m0 (u0 ) and m1 (u1 ) under different sample sizes and number of
replications are shown in Figure 3.1–3.4. The estimated and true functions are denoted by the solid and dashed lines, respectively. The 95% confidence bands are denoted by the dotted-dash line.
The estimated curves almost overlap with the corresponding true curves as shown in the plots, indicating the estimation accuracy of the method. Also the confidence bands are tight, especially for
large sample size and large number of measurement times. Note that the estimation for the interaction effects m1 (u1 ) improves as MAF pA increases, while the estimation for the marginal effects
m0 (u0 ) show a opposite direction, which coincides with the results for the parametric estimation
in Table 3.1 and Table 3.2.

41

0.6

0.8

1.0

1.2

1.4

2.0
1.5

m0(u0)

−1.5

−0.5 0.0

0.5

1.0

1.5
−1.5

−0.5 0.0

m0(u0)

0.5

1.0

1.5
1.0
0.5

0.4

0.6

0.8

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

N=500, T=10, pA=0.1

N=500, T=10, pA=0.3

N=500, T=10, pA=0.5

0.8

u0

1.0

1.2

1.4

1.5

m0(u0)

−0.5 0.0
−1.5

−1.5
0.6

0.5

1.0

1.5

m0(u0)

0.5

1.0

1.5
1.0
0.5
−0.5 0.0
−1.5

0.4

1.4

2.0

u0

2.0

u0

2.0

u0

−0.5 0.0

m0(u0)

−0.5 0.0
−1.5

0.4

m0(u0)

N=200, T=10, pA=0.5

2.0

N=200, T=10, pA=0.3

2.0

N=200, T=10, pA=0.1

0.4

0.6

0.8

u0

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

1.4

u0

Figure 3.1 The estimation of function m0 (·) when N=200, 500 and T =10. The estimated and true
functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are
denoted by the dotted-dash lines.

42

0.6

0.8

1.0

1.2

1.4

2.0
1.5

m0(u0)

−1.5

−0.5 0.0

0.5

1.0

1.5
−1.5

−0.5 0.0

m0(u0)

0.5

1.0

1.5
1.0
0.5

0.4

0.6

0.8

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

N=500, T=20, pA=0.1

N=500, T=20, pA=0.3

N=500, T=20, pA=0.5

0.8

u0

1.0

1.2

1.4

1.5

m0(u0)

−0.5 0.0
−1.5

−1.5
0.6

0.5

1.0

1.5

m0(u0)

0.5

1.0

1.5
1.0
0.5
−0.5 0.0
−1.5

0.4

1.4

2.0

u0

2.0

u0

2.0

u0

−0.5 0.0

m0(u0)

−0.5 0.0
−1.5

0.4

m0(u0)

N=200, T=20, pA=0.5

2.0

N=200, T=20, pA=0.3

2.0

N=200, T=20, pA=0.1

0.4

0.6

0.8

u0

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

1.4

u0

Figure 3.2 The estimation of function m0 (·) when N=200, 500 and T =20. The estimated and true
functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are
denoted by the dotted-dash lines.

43

0.6

0.8

1.0

1.2

1.4

1.5
1.0
−1.5

−1.0

−0.5

0.0

m1(u1)

0.5

1.0
−1.5

−1.0

−0.5

0.0

m1(u1)

0.5

1.0
0.5
0.0

0.4

0.6

0.8

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

N=500, T=10, pA=0.1

N=500, T=10, pA=0.3

N=500, T=10, pA=0.5

0.8

u1

1.0

1.2

1.4

1.0
−0.5
−1.5

−1.0

−0.5
−1.0
−1.5
0.6

0.0

m1(u1)

0.5

1.0

m1(u1)

0.5

1.0
0.5
0.0
−0.5
−1.0
−1.5

0.4

1.4

1.5

u1

1.5

u1

1.5

u1

0.0

m1(u1)

−0.5
−1.0
−1.5

0.4

m1(u1)

N=200, T=10, pA=0.5

1.5

N=200, T=10, pA=0.3

1.5

N=200, T=10, pA=0.1

0.4

0.6

0.8

u1

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

1.4

u1

Figure 3.3 The estimation of function m1 (·) when N=200, 500 and T =10. The estimated and true
functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are
denoted by the dotted-dash lines.

44

0.6

0.8

1.0

1.2

1.5
1.0
−1.5

−1.0

−0.5

0.0

m1(u1)

0.5

1.0
−1.5

−1.0

−0.5

0.0

m1(u1)

0.5

1.0
0.5
0.0

1.4

0.4

0.6

0.8

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

N=500, T=20, pA=0.1

N=500, T=20, pA=0.3

N=500, T=20, pA=0.5

0.8

1.0

1.2

1.4

1.0
−0.5
−1.5

−1.0

−0.5
−1.0
−1.5
0.6

0.0

m1(u1)

0.5

1.0

m1(u1)

0.5

1.0
0.5
0.0
−0.5
−1.0
−1.5

0.4

1.4

1.5

u1

1.5

u1

1.5

u1

0.0

m1(u1)

−0.5
−1.0
−1.5

0.4

m1(u1)

N=200, T=20, pA=0.5

1.5

N=200, T=20, pA=0.3

1.5

N=200, T=20, pA=0.1

0.4

u1

0.6

0.8

u1

1.0

1.2

1.4

0.4

0.6

0.8

1.0

1.2

1.4

u1

Figure 3.4 The estimation of function m1 (·) when N=200, 500 and T =20. The estimated and true
functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are
denoted by the dotted-dash lines.

3.4.2

Performance of hypothesis tests

We evaluated the performance of the test for the nonparametric function under the null hypothesis
H0 : m1 (·) = m01 (·), where m01 (u1 ) = δ0 + δ1 u1 , δ0 and δ1 are some constants, which corresponds
to a linear G×E interaction. Power is evaluated under a sequence of alternative models with
different values of τ, which is denoted by H1τ : mτ1 (·) = m01 (·) + τ{m1 (·) − m01 (·)}. When τ =
0, the corresponding power is the false positive rate.
Figure 3.5 shows the size (when τ = 0) and power (when τ > 0) at significance level 0.05
based on 500 Monte Carlo simulations for N=200, 500 under different measurement times T =10
(left panel) and T =20 (right panel). The empirical Type I error is large when N = 200, which
decreases dramatically when the sample size increases to N=500. The power increases when the
45

sample size increases from 200 to 500. The results indicate that our method can reasonably control
the false positive rates and has appropriate power to detect the linearity function under a relatively
large sample size. Comparing the results for T =10 and T =20, we can see that the testing power
improves when the number of measurement time increases.
Power plot when N=200, 500, T=10

Power plot when N=200, 500, T=20

0.85
0.85

0.70

Power

Power

0.70
0.50

0.50

0.30

0.30
sample_size

sample_size

0.10

200
500

0.10
0.0

0.1

0.2

τ

0.3

0.4

0.5

200
500

0.0

0.1

0.2

τ

0.3

0.4

0.5

Figure 3.5 The empirical size and power of testing the linearity of nonparametric function m1 when
N=200, 500 and T =10, 20.

To assess how the values of MAF affect the testing performance, we plot the power plot under
different MAFs pA =0.1, 0.3, 0.5 when N=500, T =10, which is shown in Figure 3.6. Note that the
power increases dramatically when MAF increases from 0.1 to 0.3. The values of power are very
close when pA =0.3 and 0.5.

46

N=500, T=10

Power

0.7

0.5

0.3
pA

0.1
0.3
0.5

0.1
0.0

0.1

0.2

τ

0.3

0.4

0.5

Figure 3.6 The empirical size and power of testing the linearity of nonparametric function m1 under
different MAFs when N=500 and T =10.

3.5

Real data application

We applied the proposed gFVICM model to a data set from a study examining the association of
the A118G SNP of OPRM1 to experimental pain sensitivity. A sample of 163 healthy volunteers
were evolved in this study. For each volunteer, Systolic Blood Pressure (SBP) and Diastolic Blood
Pressure (DBP) were measured at 6 Dobutamine dosage levels: 0 (baseline), 5, 10, 20, 30 and
40mcg/min. Clinically, a person is said to be hypertensive if the individuals SBP is greater than
140mm Hg or DBP is greater than 90mm Hg (Choi et al. 2014). Thus, the response variable Y is
a binary variable indicating whether a person has hypertension or not, i.e. Y = 1 for hypertension
and Y = 0 for non-hypertension.
One longitudinal covariate X1 = dosage, two time-invariant covariates X2 = age and X3 = BMI
are included as the environmental factors in the model. The genetic variables are five SNPs located
at codon16, codon27, codon49, codon389, and codon492 in the gene. Our purpose is to evaluate
how the mixture of age, BMI and dosage modifies the SNP effect on the risk of hypertension. In
particular, we test the hypothesis H0 : m1 (u1 ) = δ0 + δ1 u1 with p-value denoted by pm1 in Ta-

47

ble 2.7. We also reported the p-values for testing the significance of three components of index
loading coefficients β 1 = (β11 , β12 , β13 ), which are labeled by pβ , pβ and pβ , based on
11
12
13
the asymptotic property of the estimations. We also compared our proposed model to a gener∗ (X ) + β ∗ X + β ∗ X +
alized additive varying-coefficient model (gAVCM) E(Y |X, G) = η β01
1
03 3
02 2
∗ (X ) + β ∗ X + β ∗ X G , where β ∗ (·) and β ∗ (·) are unknown functions of X . To see the
β11
1
1
12 2
13 3
01
11

relative gain by integrative analysis, we calculated the objective function QN in both models. The
∗ (·) = β ∗ = β ∗ = 0 for gAVCM are also reported in the tables and
p-values for testing H0 : β11
13
12

denoted by pgAVCM .
In Table 3.3, pm1 for all the 5 SNPs are smaller than the significance level 0.05, which means
the functions capturing the G×E interactions are nonlinear for all these 5 SNPs. The objective
function QN in the last two columns shows that gFVICM fits the data better than gAVCM, indicating the benefit of integrative analysis. Besides, the testing results for gAVCM do not show
significance of the coefficients for interactions. The results imply that the genetic effects of SNPs are modified by the mixture of environmental variables, rather than separately. Figure 3.7
exhibits the fitted nonlinear curves indicating G×E interactions for each SNP, along with the 95%
confidence bands.
Table 3.3 List of SNPs with MAF, allele, p-values under different hypothesis testing and values of
objective function QN .
SNP ID
codon16
codon27
codon49
codon389
codon492

MAF Alleles
0.3990
0.4160
0.1387
0.3045
0.4250

A/G
G/C
G/A
G/C
T/C

p m1

pβ

11

p-value
pβ

12

pβ

13

pgAVCM

<1.0E-04 <1.0E-04 0.0207 0.3475 0.2960
<1.0E-04 0.2329 <1.0E-04 0.0014 0.6982
<1.0E-04 <1.0E-04 <1.0E-04 0.6325 0.1777
<1.0E-04 <1.0E-04 <1.0E-04 0.3329 0.8436
<1.0E-04 0.6731 <1.0E-04 0.0008 0.5001

QN
gFVICM gAVCM
3.9240
6.9502
6.6303
3.3678
6.0766

11.2082
9.3500
12.2648
10.5593
7.4877

Table 3.4 displays the estimated odds for different genotypes at different dosage levels. The
changes in the values of odds demonstrate the interaction between SNP and environmental mixtures at different dosage levels. For example, we noted that the odds for genotype AA in SNP
codon16 does not change too much as the dosage level increases, which means that the genetic

48

codon16
1.0

m1(u1)

0.5

0.0

−0.5
0.00

0.25

0.50
u1

0.75

1.00

codon27

codon49

0

m1(u1)

m1(u1)

5.0

2.5

−2

−4

0.0
0.00

0.25

0.50
u1

0.75

1.00

0.00

0.25

codon389

0.50
u1

0.75

1.00

codon492

0.5

5.0

m1(u1)

m1(u1)

0.0

−0.5

2.5

−1.0
0.0
−1.5
0.25

0.50
u1

0.75

0.25

0.50
u1

0.75

Figure 3.7 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs
codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted
by the dashed lines.

49

effect of this genotype remains the same when subjects are exposed to different environments at
different doses. While for the other two genotypes, there is an increase in the value of odds until
dosage level four, indicating increased blood pressure as dosage level increases from 0mcg/min to
20mcg/min.
Table 3.4 Estimated odds for different genotypes at each dosage levels.
SNP ID Genotyoe 1
codon16
AA
0.92
GG
1.22
GA
1.35
codon27
GG
1.25
CC
1.06
CG
1.08
codon49
GG
4.19
AA
1.13
GA
1.12
codon389
GG
1.78
CC
1.41
CG
0.91
codon492
TT
1.11
CC
1.12
CT
1.03

3.6

Dosage level
2
3
4
0.92 0.92 0.92
1.89 2.63 3.66
1.66 1.94 2.28
2.07 2.75 2.96
1.75 2.34 2.58
1.63 2.07 2.21
3.56 3.05 2.28
1.50 1.79 1.98
2.17 3.41 4.66
1.78 1.78 1.77
1.87 2.21 2.35
1.57 2.14 2.33
1.99 2.69 2.77
1.97 2.61 2.63
1.67 2.12 2.12

5
0.92
3.17
2.14
2.32
2.06
1.81
1.76
1.79
3.88
1.77
2.09
1.86
2.20
2.05
1.70

6
0.93
1.73
1.63
1.96
1.73
1.56
1.38
1.52
2.80
1.77
1.91
1.65
2.33
2.14
1.78

Discussion

In this paper, we proposed a generalized varying index coefficient modeling procedure to assess
the interaction effect of multiple environmental factors as a whole with a genetic factor. The model
was motivated by empirical evidence and was developed under an longitudinal design with a binary
disease response. We developed a profile estimation procedure to estimate the index coefficients
and nonparametric interaction functions iteratively. The estimation was conducted under the QIF
framework. To estimate the nonparametric functions, we first approximated the function using
truncated power spline basis, then estimated the spline coefficients based on QIF. Furthermore, we
proposed a nonparametric hypothesis test to assess the linearity of the nonparametric interaction
function. Simulation study has been conducted to illustrate the estimation and testing procedures

50

to evaluate the finite sample performance. The results indicate reasonable estimation performance
of the method under different sample sizes and measurement times.
Our method was proposed to evaluate the joint interaction effect between multiple environmental variables as a whole with genetic variables. Compared to the generalized additive varying
coefficient model (gAVCM), which considers the G×E effect for each single environmental factor
separately, our model presents two advantages: 1) it is biologically more attractive if there are
synergistic effects between multiple exposures; and 2) it can potentially increase the testing power
for detecting interactions since it can reduce multiple testing burden by treating multiple exposures
as a single index variable. Although our method was motivated by a genetic association study, the
developed model and inference procedures can be applied to other disciplines with the purpose to
model the synergistic effect of multiple variables as a whole.
We applied our method to a real data set from a pain sensitivity study. Testing results indicate
that all of the five SNPs are nonlinear moderated, by the synergistic effect of the three variables
with dosage as a “time"-varying variable, to affect the risk of high blood pressure. These five
SNPs were genotyped from a candidate gene which has been shown to be related to blood pressure
changes (Johnson and Terra, 2002). Although the purpose of the data was not generated to evaluate
the genetic effect on “hypertension", we simply applied the method to this data set to demonstrate
the utility of the method. The estimated odds of different genotypes for a particular SNP at different
dosage levels does give insights into the effect of the SNPs nonlinearly modulated by dosages. Of
particular interest is SNP condon49 in which individuals carrying genotype GG show a decreasing
risk of developing high blood pressure as the dosage level increases, indicating a protective effect
of this genotype. For the same SNP, individuals carrying genotype GA show a different pattern of
developing high blood pressure as the dosage level increases. Such a dynamic change of genetic
effect over different dosage levels cannot be revealed by a cross-sectional study, indicating the
relative merit of a longitudinal design.

51

CHAPTER 4
DETECTING GENETIC ASSOCIATIONS WITH MULTIVARIATE PARTIALLY
LINEAR VARYING-COEFFICIENTS MODELS FOR MULTIPLE LONGITUDINAL
TRAITS

4.1

Introduction

Cross-sectional disease traits have been the primary focus in genetic association studies. Given the
improved power to identify disease genes with phenotypic data measured over time, longitudinal
designs are becoming popular in genetic association studies (Sitlani et al. 2015; Macgregor et al.
2005; Furlotte et al. 2012; Xu et al. 2014). Most statistical methods developed so far focus on a
single outcome of interest. When multiple outcomes are measured over time, for example, multiple
measures of heart function in a longitudinal study of cardiac function, methods focusing on just a
single outcome over time may not provide a complete picture of cardiac function.
In genetics, the phenomenon that a single gene or locus influences more than one trait is known
as pleiotropy (Wang et al., 2014; Gratten and Visscher, 2016). Genetic pleiotropy plays a crucial
role in many complex diseases. One of the most well-known examples is the phenylketonuria
(PKU) disease (Lobo, 2008). The conventional approach to identify genetic pleiotropic effects
on multiple traits is to test the association between a gene and each trait individually and then
determine whether the genetic effect is significantly associated with more than one trait. The
disadvantages of this approach, such as the inflation in the family-wise Type I error and incomplete
information in individual tests compared to a combined analysis for multiple traits, have been
discussed in some studies (e.g. Wang et al., 2014). Therefore, a joint genetic association test on
multiple traits is more desirable to control the family-wise Type I error and enhance the power of
tests.
In real life, timing is a very important factor in the development of a disease. Genetic effects
on a disease trait vary during the life span of an individual. The function of a gene depends largely
on when it turns on and off, which could show a temporal pattern. In order to capture the dynamic

52

effect of a gene on a disease trait over time, it is natural to model the dynamic effect as a potential
nonlinear function over time. Considering multiple longitudinal traits, we proposed the following
partially linear varying coefficients model,
Yli j = Yli (ti j ) = β0l (ti j ) + β1l (ti j )Gi + α l Zi j + εli j ,

(4.1)

where Yli j is the response variable which measures the l-th phenotype on the i-th subject at the
j-th time point; Zi j is a p-dimensional vector of covariates, which can be either dependent or
independent of time; Gi denotes the time-invariant genetic variable within subject; β0l (·) and β1l (·)
are unknown functions; and εli j is an error term which is assumed to following the following joint
distribution,



εi = 




ε 1i
..
.



 ∼ N 0, Σ



ε Li
with Σ be some covariance structure. If we use a time-varying environmental factor Xi j instead of
ti j in the model, i.e.
Yli j = β0l (Xi j ) + β1l (Xi j )Gi + α l Zi j + εli j ,
then the model can be used for jointly modeling dynamic G×E interactions for multiple longitudinal traits.
Models for multivariate longitudinal traits are necessarily complex, because they must consider
different types of correlations for each independent subject: correlation between measurements for
the same trait at different time points, correlation between measurements at the same time point
on different traits, and correlation between measurements at different time points and on different
traits.
Qu and Li (2006) applied the method of quadratic inference functions (QIF) to the varying
coefficients models for longitudinal data. One important advantage is that the QIF method only
requires correct specification of the mean structure and does not require any likelihood or approximation of the likelihood in hypothesis testing. In addition, when the working correlation structure

53

is misspecified, the QIF is more efficient than the generalized estimation equation (GEE) approach.
Another advantage of QIF approach is that the inference function has an asymptotic form, which
provides a model selection criteria similar to AIC and BIC. It also allows us to test whether coefficients are significantly time-varying based on the asymptotic results.
The purpose of this paper is to develop a set of hypothesis testing procedure, including joint
testing for multiple traits and marginal testing for each individual trait, for model (4.1). We first use
penalized splines (Ruppert and Carroll, 2000) to approximate the nonparametric functions in the
model. Then we develop a 2-step testing procedure to first jointly test the genetic effect on multiple
traits and then separately test marginal genetic effect on each trait based on the QIF approach.
Estimation of the parametric coefficients and nonparametric spline coefficients are obtained under
the QIF framework.
This chapter is organized as follows: we state our proposed model in Section 4.2.1, and define
the objective function in a QIF method in Section 4.2.2. Estimation procedure and asymptotical
properties of estimators are provided in Section 4.2.3. A model selection criteria using BIC is
provided in Section 4.2.4. A theorem for goodness-of-fit test in QIF approach is established in
Section 4.2.5 and we propose a 2-step testing procedure based on that in Section 4.2.6. We assess
the finite sample performance of the proposed procedure with Monte Carlo simulation in Section
4.3 and illustrate the proposed methodology by the analysis of a pain sensitivity data set in Section
4.4. Conclusions and discussion are made in Section 4.5. Proofs are included in Appendix.

4.2
4.2.1

Joint models and statistical methods
Joint multivariate models

In multivariate longitudinal studies, suppose yli j is the l-th continuous outcome collected on the
i-th observation at the time point ti j , where l = 1, . . . , L, i = 1, . . . , N, j = 1, . . . , ni . The joint

54

partially linear varying coefficient models are defined as
yli j = yli (ti j ) = β0l (ti j ) + β1l (ti j )Gi + α l Z(ti j ) + εli j ,
where Gi is the SNP variable which is not depend on time and type of measurement, Z (ti j ) is the
p-dimensional covariate vector, which can be either time-dependent or time-independent. εli j is
an error term and




εi = 




ε 1i
..
.



 ∼ N 0, Σ



ε Li
with Σ be some covariance structure. β0l (·) and β1l (·) are unknown smooth nonparametric functions. To illustrate the idea, in the following we demonstrate the methods assuming L=2. For the
situation where there are more than two traits (L > 2), the technique can be easily extended. For
the case when L=2, the joint models can be written as
y1i j = y1i (ti j ) = β01 (ti j ) + β11 (ti j )Gi + α 1 Z(ti j ) + ε1i j ,
y2i j = y2i (ti j ) = β02 (ti j ) + β12 (ti j )Gi + α 2 Z(ti j ) + ε2i j ,
where







 

σ12 Σ 11



ρ12 σ1 σ2 Σ 12 
 ε 1i 
 0  
εi = 
 ∼ N   , 

ε 2i
0
ρ12 σ1 σ2 Σ 21
σ22 Σ 22

4.2.2

Objective function based on QIF

To construct the objective function using the QIF approach, we first approximate the unknown
functions β01 , β11 , β02 and β12 by a q-degree truncated power spline basis, i.e.
βsl (t) ≈ Bsl (t)T γ sl , for s = 0, 1 and l = 1, 2,
q

(4.2)

q

where Bsl (t) = (1,t,t 2 , ...,t qsl , (t − κ1 )+sl , ..., (t − κKsl )+sl )T is a truncated power spline basis with
degree qsl and Ksl knots κ1 , ..., κKsl . γ sl is a (qsl + Ksl + 1)-dimensional vector of spline coefficients.
55

In a GEE approach we solve
N

∑ µ˙ Ti V−1
i (yi − µ i ) = 0,

(4.3)

i=1

where yi = (yT1i , yT2i )T , yli = (yli1 , ..., ylini )T ; µ i = E(yi ) is the mean function and µ˙ i is the first
derivative of µ i with respect to the parameters; Vi is the covariance matrix of yi and can be de1/2

composed as Vi = Ai

1/2

ρ )Ai
R(ρ

ρ)
with Ai being a diagonal matrix of marginal variances and R(ρ

being a correlation matrix with nuisance parameters ρ . QIF approach considers the inverse of the
correlation matrix R as a linear combination of several known basis matrices in a form
R−1 ≈ a1 M1 + a2 M2 + ... + ah Mh ,

(4.4)

where M1 is the identity matrix and M2 , ..., Mh are symmetric basis matrixes. For exchangeable
working correlation, M2 has 0 on the diagonal and 1 elsewhere. If the working correlation is
AR(1), we can set M2 to have 1 on its two subdiagonals and 0 elsewhere. Plugging the expression
of R−1 (4.4) into the GEE stated in (4.3), we define the estimation function as

−1/2
−1/2
µ˙ Ti Ai
M1 Ai
(yi − µ i )
∑N
i=1

1
1 N
..
g¯N (θθ ) = ∑ gi (θθ ) = 
.
N i=1
N

˙ T −1/2 Mh A−1/2
(yi − µ i )
∑N
i
i=1 µ i Ai








Using the spline approximation, the mean function µ i can be written as

 
T
T
 µ1i1 (θθ )   B01 (ti1 )γγ 01 + B11 (ti1 )γγ 11 Gi + α 1 Z(ti1 )

 
..
..

 
.
.




 
 

 
µ1ini (θθ )   BT01 (tini )γγ 01 + BT11 (tini )γγ 11 Gi + α 1 Z(tini )
 µ 1i (θθ )  
=
µ i (θθ ) = 
=

  T
T

 
µ 2i (θθ )
 µ2i1 (θθ )   B02 (ti1 )γγ 02 + B12 (ti1 )γγ 12 Gi + α 2 Z(ti1 )

 
..
..

 
.
.

 

 
µ2ini (θθ )
BT02 (tini )γγ 02 + BT12 (tini )γγ 12 Gi + α 2 Z(tini )

56

(4.5)









,








and the first derivative of µ i is

T
T
0
0
0
 B01 (ti1 ) B11 (ti1 )Gi Z(ti1 )

..
..
..
..
..
..

.
.
.
.
.
.


 T
0
0
0
 B01 (tini ) BT11 (tini )Gi Z(tini )
µ˙ i = 


0
0
0
BT02 (ti1 ) BT12 (ti1 )Gi Z(ti1 )


..
..
..
..
..
..

.
.
.
.
.
.


0
0
0
BT02 (tini ) BT12 (tini )Gi Z(tini )









,








where θ = (γγ T01 , γ T11 , α T1 , γ T02 , γ T12 , α T2 )T .
Setting each component in (4.5) to be zero will result in more equations than unknown parameters. Following the idea of generalized method of moments (Hansen, 1982), the QIF is defined
as
−1 g¯ ,
QN (θθ ) = N g¯TN C¯N
N

(4.6)

T
where C¯N = N1 ∑N
i=1 gi gi is a consistent estimator for var(gi ). Minimizing the objective function

(4.6) provides the estimations of parameters.

4.2.3

Estimation

The estimation of the parameters can be obtained through minimizing the objective function, i.e.
θ = arg min QN (θθ ).
θ

To avoid over-fitting, we can define a penalized QIF in a form
N −1 QN (θθ ) + λ θ T Dθθ ,

(4.7)

where D is a diagonal matrix with 1 if the corresponding parameter is the spline coefficient associated with knots and 0 otherwise. Minimizing the penalized QIF provides
θ = arg min(N −1 QN (θθ ) + λ θ T Dθθ ).
θ

57

(4.8)

To estimate the tuning parameter λ , we can extend the generalized cross-validation (Ruppert, 2002;
Qu and Li, 2006; Bai et al., 2009) to the penalized QIF and define the generalized cross-validation
statistic as
GCV(λ ) =

N −1 QN
(1 − N −1 d f )2

with the effective degree of freedom
df = tr (Q¨ N + 2Nλ D)−1 Q¨ N ,
where Q¨ N is the second derivative of QN . The optimized tuning parameter λ is given as
λ = arg min GCV(λ ).
λ

To establish the asymptotic properties for the penalized quadratic inference function estimators
with fixed knots, we assume θ 0 is the parameter satisfying Eθ (gi ) = 0. Similar theoretical results
0

are provided in Qu and Li (2006). Following their idea and extend those results to the estimators
√
in our model, we get the strong consistency of the resulting estimators in Theorem 6. The Nconsistency and asymptotic normality of the estimators are given in Theorem 7 .
Theorem 6 Suppose conditions (B1)-(B6) in the Appendix hold and the smoothing parameter
λN = o(1), then the estimator θ , which is obtained by minimizing the penalized quadratic function
in (4.7), exists and converges to θ 0 almost surely.
Theorem 7 Suppose conditions (B1)-(B6) in the Appendix hold and the smoothing parameter
λN = o(N −1/2 ), then the estimator θ obtained by minimizing the penalized quadratic function
in (4.7) is asymptotically normally distributed with the limiting distribution,
√
d
−1
N(θ − θ 0 ) −
→ N(0, (GT0 C−1
0 G0 ) ),
where the calculation of G0 and C0 can be found in Appendix.

58

4.2.4

Model selection

In contrast to the complicated model selection in GEE method due to the lack of an objective function in its estimation procedure, it is natural to extend the BIC method to the QIF approach since
the QIF objective function and twice the negative log-likelihood function have similar asymptotic
properties. Under the assumption E(g) = 0 and the number of estimating equations is larger than
d

2 (Hansen, 1982), where r is the dimension of
the number of parameters, we have Q(θ ) −
→ χr−k

g¯N (θθ ), k is the dimension of θ , θ is the estimator by minimizing the QIF when certain order and
number of knots are chosen. Based on the asymptotic property of QIF, the BIC criterion for a
model with r estimating equations and k parameters is
Q(θ ) + (r − k) ln N.
The model with minimum BIC would be considered optimal.

4.2.5

Nonparametric goodness-of-fit test

Compared to GEE, an advantage of QIF approach is that QIF provides a goodness-of-fit test without estimations for second moment parameters. In Model (4.1), it is of interest to test whether the
spline approximations for the varying coefficient functions are appropriate.
Qu et al. (2000) constructed a test statistic based on QIF. Suppose that the d-dimension paramψ , ζ ), where ψ is the parameter of interest with dimension d1 ,
eter vector γ is partitioned into (ψ
and ζ is a nuisance parameter with dimension d2 = d − d1 . If we are interested in testing
H0 : ψ = ψ 0 ,
the test statistic
ψ 0 , ζ ) − Q(ψ , ζ )
Q(ψ
follows an asymptotically chi-square distribution with d1 degrees of freedom.

59

Theorem 8 (Qu et al., 2000) Suppose that all required regularity conditions are satisfied and ψ
ψ 0 , ζ ) − QN (ψ , ζ ) is asymptotically chi-square
has dimension d1 . Under the null hypothesis, QN (ψ
distribution with d1 degrees of freedom, where
ψ 0 , ζ ), (ψ , ζ ) = arg min QN (ψ
ψ , ζ ).
ζ = arg min QN (ψ

4.2.6

(4.9)

Two-step hypothesis testing procedure

In Model (4.1), it is of interest to test whether the genetic effects on multiple traits are significant or
not. Based on Theorem 8, we develop a 2-step testing procedure for testing the significance of the
varying coefficient functions. In the first step, the joint test is performed to see whether a genetic
factor has significant effect on at least one longitudinal trait. If the testing result in the first step
is significant, we will further conduct the marginal tests in the second step to assess if the genetic
effect is significant on both traits or just one trait. So the first step is a joint test of significance
followed by a marginal test to assess individual significance.

4.2.6.1

Step 1: Joint test

First, we are interested in testing whether the genetic factor G has effect on at least one longitudinal
trait. The hypothesis is stated as
H0 : β11 (·) = β12 (·) = 0 v.s. H1 : β11 (·) = 0 or β12 (·) = 0.
This can be handled through the truncated power spline approximation of the nonparametric functions stated in (4.2). In particular, test this hypothesis is equivalent to test the following null
hypothesis
H0 : γ 11 = γ 12 = 0.
According to Theorem 8, we can construct a test statistic
TN = QN (θ ) − QN (θ ),

60

where
θ = arg min QN γ 01 , γ 11 , α 1 , γ 02 , γ 12 , α 2 | y1 , y2 , G, Z ,
γ 11 =γγ 12 =0

and
θ = arg min QN γ 01 , γ 11 , α 1 , γ 02 , γ 12 , α 2 | y1 , y2 , G, Z .
The test statistic TN has an asymptotical χ 2 distribution with degree of freedom equals the number
of constraints under H0 , according to Theorem 8.

4.2.6.2

Step 2: Marginal tests

From the joint test, if there exist a significant genetic effect on at least one longitudinal trait, then
we can further test the marginal effects:
H0 : β1l (·) = 0 v.s. H1 : β1l (·) = 0, l = 1, 2.
Based on (4.2), this is equivalent to test the following two hypotheses
H0 : γ 11 = 0
and
H0 : γ 12 = 0
separately.
For testing H0 : γ 11 = 0, we use test statistic
TN1 = QN (γ 01 , 0, α 1 ) − QN (γ 01 , γ 11 , α 1 ),
where
(γ 01 , 0, α 1 ) = arg min QN γ 01 , γ 11 , α 1 | y1 , G, Z ,
γ 11 =0

and
(γ 01 , γ 11 , α 1 ) = arg min QN γ 01 , γ 11 , α 1 | y1 , G, Z .

61

We can also construct another test statistic
TN2 = QN (γ 02 , 0, α 2 ) − QN (γ 02 , γ 12 , α 2 )
for testing H0 : γ 12 = 0, where
(γ 02 , 0, α 2 ) = arg min QN γ 02 , γ 12 , α 2 | y2 , G, Z ,
γ 12 =0

and
(γ 02 , γ 12 , α 2 ) = arg min QN γ 01 , γ 12 , α 2 | y2 , G, Z .
The asymptotical distribution of test statistics TN1 and TN2 can be obtained from Theorem 8.

4.3
4.3.1

Simulation study
Simulation setup

In this section, the finite sample performance of the proposed method is evaluated through Monte
Carlo simulation studies. The two continuous variables are generated from the models
y1i (ti j ) = β01 (ti j ) + β11 (ti j )Gi + α1 z(ti j ) + ε1i j ,
y2i (ti j ) = β02 (ti j ) + β12 (ti j )Gi + α2 z(ti j ) + ε2i j ,
where β01 (ti j ) = 0.5 cos(2πti j ), β11 = sin(π(ti j − 0.2)), β02 (ti j ) = sin(πti j ) − 0.5, β12 (ti j ) =
cos(πti j − 0.8), α1 = 0.2 and α2 = 0.3. We generate T time points ti = (ti1 , . . . ,tiT ) from a uniform distribution U(0, 1). The predictor variable z(ti j ) is also generated from U(0, 1). We set
the minor allele frequency (MAF) as pA = 0.5 and assume Hardy-Weinberg equilibrium. Three
different SNP genotypes AA, Aa and aa are simulated from a multinomial distribution with frequencies p2A , 2pA (1 − pA ) and (1 − pA )2 , respectively. Variable G takes value in the set {0,1,2},
corresponding to genotypes {aa, Aa, AA}. We assume ε1i j and ε2i j are jointly normally distributed
with the correlation corr(ε1i j , ε2i j ) = 0.5. Then we generate the error terms from a multivariate

62

normal distribution




 

 

σ12 Σ11

 

0.5σ1 σ2 Σ12  
 ε 1i 
  0  

 ∼ N   , 
 
ε 2i
0
0.5σ1 σ2 Σ12
σ22 Σ22
We set the marginal variances σ12 = σ22 = 0.1. The true correlation structure of Σ11 , Σ22 , and Σ12
are all exchangeable with ρ1 = 0.5, ρ2 = 0.5 and ρ12 = 0.2, respectively.
We draw 1000 data sets with sample size N = 200, 500 and time points ni = T = 10, in order
to compare the performances of our proposed method under different sample sizes. We set M1 to
be the identity matrix and M2 to has 1 on subdiagonals and 0 elsewhere. The order and number of
knots of the splines are chosen by the BIC method.

4.3.2

Performance of estimation

Table 4.1 summarizes the parameter estimation for unknown coefficients. In this table, the average bias (Bias), the standard deviation of the 1000 estimates (SD), the average of the estimated
standard error (SE) based on the theoretical results, and the estimated coverage probability (CP)
at 95% confidence level are reported. In general, the biases for all parameter estimations are very
small, the coverage probabilities are very close to the confidence level 95%, which indicate good
performance of our proposed estimation procedure. As the sample size increases, the performance
of the estimation improves by showing smaller bias, SD and SE.
Table 4.1 Estimation results for parameters α1 and α2 with sample size N = 200, 500.
N Parameter True
200
α1
0.2
α2
0.3
500

α1
α2

0.2
0.3

Bias
SD
SE CP
0.0004 0.018 0.018 94.6
0.0006 0.018 0.018 94.2
-5.2E-05 0.012 0.012 95.2
-0.0003 0.012 0.012 94.8

The plots for the estimations of nonparametric functions β01 (·) and β11 (·) under different sample sizes are shown in Figure 4.1. The estimated and true functions are denoted by the solid and
dashed lines, respectively. The 95% confidence bands are denoted by the dotted-dash line. The
63

estimated curves almost overlap with the corresponding true curves as shown in the plots, indicating good estimate of the function. Larger sample size leads to tighter confidence bands. Figure
4.2 displays the estimations for functions β02 (·) and β12 (·), which are included in the model corresponding to response variable y2 . The results are similar to those in Figure 4.1 and further
demonstrate the good performance of our estimation methods.
N=200

0.3

0.25
β01(t)

β01(t)

N=500

0.50

0.0

0.00

−0.25

−0.3

−0.50
−0.6

0.25

0.50
t

0.75

0.25

0.75

N=500

1.0

1.0

0.5

0.5

β11(t)

β11(t)

N=200

0.50
t

0.0

0.0

0.25

0.50
t

0.75

0.25

0.50
t

0.75

Figure 4.1 The estimation of nonparametric functions β01 (·) and β11 (·) when N=200, 500. The
estimated and true functions are denoted by the solid and dashed lines respectively. The 95%
confidence bands are denoted by the dotted-dash line.

64

N=200

N=500

0.4

0.4

0.2

0.2

β02(t)

β02(t)

0.6

0.0

0.0

−0.2

−0.2
0.25

0.50
t

0.75

0.25

N=200

0.50
t

0.75

N=500

0.5
β12(t)

0.5
β12(t)

1.0

1.0

0.0

0.0

−0.5

−0.5
0.25

0.50
t

0.75

0.25

0.50
t

0.75

Figure 4.2 The estimation of nonparametric functions β02 (·) and β12 (·) when N=200, 500. The
estimated and true functions are denoted by the solid and dashed lines respectively. The 95%
confidence bands are denoted by the dotted-dash line.

4.3.3
4.3.3.1

Performance of hypothesis tests
Performance for joint test

We evaluate the performance of the joint test under the null hypothesis H0 : β11 (·) = β12 (·) = 0.
Power is evaluated under a sequence of alternative models with different values of τ, which is
τ (·) = τβ (·) and β τ (·) = τβ (·).
denoted by H1τ : β11
11
12
12

Figure 4.3 shows the empirical size (when τ = 0) and power function (when τ > 0) at significance level 0.05. We obtain 1000 Monte Carlo simulations to assess the null distribution of test
statistic under sample sizes N = 200, 500. The empirical Type I error under both sample sizes are

65

close to the nominal level 0.05 and the power increases dramatically when τ increases from 0 to
0.05. To see the effect of sample size, we compare the performances under N=200 and N=500.
As expected, the Type I error is closer to 0.05 and the power increase faster for larger sample size
when N=500. For a relatively small sample size N=200, the Type I error is a little inflated and the
power function increases slower compared to the case with N = 500. Overall, the results indicate
good performance of the proposed joint testing procedure.

Power plot of joint test
1.0

Power

0.7
0.5
0.3
sample_size

200
500

0.1
0.00

0.01

0.02

τ

0.03

0.04

0.05

Figure 4.3 The power plot for the joint test under different sample sizes N=200, 500 when T =10.

4.3.3.2

Performance for marginal tests

The performance of the marginal tests for the nonparametric functions corresponding to different
traits is evaluated through simulations. Two null hypotheses H0 : β11 (·) = 0 and H0 : β12 (·) = 0
were considered separately. For each test, power is evaluated under a sequence of alternative
τ (·) = τβ (·), l = 1, 2, correspondingly.
models, denoted by Haτ : β1l
1l

Figure 4.4 displays the power for both marginal tests under different sample sizes N=200 and
500.The empirical Type I error under both sample sizes are very close to the nominal level 0.05
and the power increases dramatically when τ increases from 0 to 0.05. It is obvious that the power

66

increases more rapidly with larger sample size, while the overall performances under N=500 and
N=200 are close, which indicates that our method performances well and does not require very
large sample size.

Power for testing significance of β11
1.0

Power

0.7
0.5
0.3
sample_size

0.1

200
500

0.000

0.025

0.050
τ

0.075

0.100

Power for testing significance of β12
1.0

Power

0.7
0.5
0.3
sample_size

200
500

0.1
0.000

0.025

0.050
τ

0.075

0.100

Figure 4.4 The power plots for the marginal test for N=200, 500 and T =10.

In summary, the simulation results indicate that our proposed estimation method works well.
The test results indicate that the asymptotic χ 2 distribution for the proposed joint and marginal test
works reasonably well under a finite sample size.
67

4.4

Real data application

We applied the proposed multivariate partially linear varying coefficients model and the two-step
hypothesis testing procedure to a data set from a study examining the association of the A118G
SNP of OPRM1 to experimental pain sensitivity. A sample of 163 healthy volunteers were evolved
in this study. For each subject, Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP)
were measured at 6 dosage levels of Dobutamine to assess the genetic effect on drug response. The
6 dosage levels are: 0 (baseline), 5, 10, 20, 30 and 40mcg/min. We treat the dosage levels as time
in this analysis.
We consider the partially linear varying coefficient model with two longitudinal traits, with the
form
YiSBP
= β0SBP (Xi j ) + β1SBP (Xi j )Gi + α SBP Zi + εiSBP
j
j ,
YiDBP
= β0DBP (Xi j ) + β1DBP (Xi j )Gi + α DBP Zi + εiDBP
j
j .
The two longitudinal traits are SBP and DBP. One time-invariant covariate Z = age are included
in the model. Five SNPs codon16, codon27, codon49, codon389, codon492 are considered.
Table 4.2 displays the results of the joint and marginal testing for responses SBP and DBP,
respectively. We note that condon49 and condon389 show significant result with p-values smaller
than the significance level 0.05. Further marginal tests tell us that SNP condon49 has significant
association with SBP but not DBP, while SNP condon389 has significant association with DBP but
not SBP. The results indicate no pleiotropic effect of the two SNPs. In addition, we note that for
SNP condon16, the joint test does not show significant result but the p-value of the marginal test
for SBP is smaller than the significance level. If we choose α = 0.1 significance level, then the
SNP shows a potential pleiotropic effect on the two traits.

68

Table 4.2 List of SNPs with MAF, allele, p-values under the joint and marginal testing for SBP and
DBP.
SNP ID
codon16
codon27
codon49
codon389
codon492

MAF Alleles
0.3990 A/G
0.4160 G/C
0.1387 G/A
0.3045 G/C
0.4250 T/C

β1SBP = β1DBP = 0
0.0866
0.3048
0.0229
0.0343
0.3234

p-values under different null hypotheses
β1SBP = 0 β1DBP = 0 β1SBP is linear
0.0428
0.0514
0.8311
0.3819
0.1018
0.5938
0.0410
0.8349
0.7846
0.3550
0.0242
0.2150
0.5779
0.6957
0.2611

β1DBP is linear
0.3094
0.1012
0.4439
0.2938
0.7875

Table 4.3 shows the estimation results for the age effect α1 and α2 . The results show that age
does not have significant contribution to the two blood traits in this data set.
Table 4.3 List of SNPs with MAF, allele, estimation of coefficients, p-values of significance for
coefficients corresponding to SBP and DBP, respectively.
SNP ID
codon16
codon27
codon49
codon389
codon492

MAF Alleles
0.3990 A/G
0.4160 G/C
0.1387 G/A
0.3045 G/C
0.4250 T/C

α SBP

α DBP

0.0109
0.0235
0.0356
0.0103
0.0213

-0.0435
-0.0365
-0.0336
-0.0644
-0.0494

p-value
H0 : α SBP = 0
0.8562
0.7078
0.5637
0.8743
0.7366

H0 : α DBP = 0
0.5005
0.5902
0.6079
0.3409
0.4679

Figure 4.5 illustrates the estimated shapes for nonparametric coefficient functions for different
SNPs for trait SBP. The 95% confidence bands cover the zero line for SNPs condon27, condon389
and condon492, which agrees with the testing results that these SNPs do not show significance. For
SNP condon49 which shows significance at α = 0.05 level, the estimated increasing effect function
as dosage increases suggests that this SNP positively responds to dosage increase to affect SBP.
Figure 4.6 illustrates the estimated shapes for nonparametric coefficient functions for different
SNPs for trait DBP. Again, we observe that the 95% confidence bands cover the zero line for SNPs
condon27, condon49 and condon429 which agrees with the testing results that these SNPs are
not significant. For SNP condon389, the estimated effect function shows a marginal significance,
indicating the role of this SNP to drug response on DBP.

69

codon16, SBP
0.12

β1(dose)

0.08

0.04

0.00
0.00

0.25

0.50
dose

0.75

codon27, SBP

codon49, SBP

0.05

0.15

β1(dose)

β1(dose)

1.00

0.00

0.10
0.05

−0.05
0.00
−0.10
0.00

0.25

0.50
dose

0.75

−0.05
0.00

1.00

codon389, SBP

0.50
dose

0.75

1.00

codon492, SBP
0.05

β1(dose)

0.05

β1(dose)

0.25

0.00

−0.05

0.00

−0.05

−0.10
0.00

0.25

0.50
dose

0.75

−0.10
0.00

1.00

0.25

0.50
dose

Figure 4.5 Plot of the estimate (solid curve) of the nonparametric function
codon16, codon27, codon49, codon389 and codon492. The 95% confidence
by the dashed line.

70

0.75

1.00

β1SBP (·) for SNPs
bands are denoted

codon16, DBP
0.12

β1(dose)

0.08

0.04

0.00

0.00

0.25

0.50
dose

0.75

codon27, DBP

1.00

codon49, DBP

0.05

0.10

β1(dose)

β1(dose)

0.05
0.00

0.00
−0.05

−0.05

−0.10
0.00

0.25

0.50
dose

0.75

1.00

0.00

codon389, DBP

0.25

0.50
dose

0.75

1.00

codon492, DBP

0.05
0.04

β1(dose)

β1(dose)

0.00

−0.05

0.00

−0.04
−0.10
0.00

0.25

0.50
dose

0.75

1.00

0.00

0.25

0.50
dose

0.75

1.00

Figure 4.6 Plot of the estimate (solid curve) of the nonparametric function β1DBP (·) for SNPs
codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by
the dashed line.

4.5

Discussion

Identification of genetic pleiotropy effects has been an important task in genetic association studies. If one gene is associated with multiple traits, special attention should be paid to such genes
71

when designing drug target on those genes. In this paper, we propose a joint multivariate varying coefficient modeling procedure to accommodate correlated longitudinal traits and propose a
testing procedure to find the dynamic genetic association between SNPs and multiple traits. Both
simulation and real data analysis demonstrate the utility of the proposed method.
One difficulty in jointly modeling multiple longitudinal traits is to model the complex correlation structure. For each subject, we should consider correlation between measurements for the
same trait at different time points, correlation between measurements at the same time point on
different traits, and correlation between measurements at different time points and on different
traits. We implement the QIF approach in estimation and testing procedures. There are several
advantages for QIF approach. First, the QIF approach only requires correct specification of the
mean structure and does not require any joint likelihood in hypothesis testing. Second, it avoids
estimating the nuisance correlation structure parameters by assuming that the inverse of working
correlation matrix can be approximated by a linear combination of several known basis matrices.
Third, when the working correlation structure is misspecified, the QIF is more efficient than the
GEE approach. Forth, the inference function of the QIF approach has an explicit asymptotic form,
which provides a model selection criteria and allows us to test whether coefficients are significant
or time varying based on the asymptotic results.
Our method was demonstrated with the L = 2 case. The proposed method can be extended
to multiple longitudinal traits with L > 2, although the computational cost might increase. In the
real application, we investigate association of SNPs in a candidate gene with two longitudinal
traits SBP and DBP. Although the data were not longitudinal in terms of time measurement, the
increasing dosage levels can be treated in a time scale. So we can apply the proposed method.
The results indicate a weakly pleiotropic effect for SNP condon16 to affect both SBP and DBP.
As the application shows, our method is not restricted to a longitudinal study. It also applies to
other studies where a certain trait can be measured in a linear scale. Therefore, our method is
directly applicable to neuro-genetics studies in which the purpose is to identify SNPs associated
with spatial distribution of neuroimages in brain.

72

CHAPTER 5
CONCLUSION AND FUTURE WORK

5.1

Conclusion

The originalities and contributions of our work can be summarized in two respects. From the respect of statistical methodology, we propose a functional varying index coefficient model (FVICM)
to capture the nonlinear G×E interactions under a longitudinal design. In the development of the
estimation procedure, penalized spline method is implemented to approximate the nonparametric
unctions in the model. A profile estimation procedure is proposed to estimate two sets of parameters: the index loading coefficients and spline coefficients. Then the quadratic inference function
approach for analysing longitudinal data is extended to estimate the index loading coefficients and
spline coefficients in the profile estimation method. To test the linearity of nonparametric G×E
interaction function, we apply the pseudo-likelihood ratio test using a linear mixed model representation of our proposed model. Consistency and asymptotic normality of the estimators are
established.
To deal with binary longitudinal traits, it is a natural extension of the FVICM to a generalized
functional varying index coefficient model (gFVICM) for investigating nonlinear G×E interactions. We modify the profile estimation procedure with QIF approach to the gFVICM and establish
theoretical results of the estimators. Then we proposed a testing procedure based on the asymptotic
χ 2 distribution of the objective function in QIF approach. The testing procedure can be used to
assess the linearity of the interaction function.
For some complex diseases, there are multiple phenotypes that can be used to quantify the risk
of diseases and sometimes they have shared genetic determinations and this phenomenon is termed
genetic pleiotropy. A joint modeling for multiple longitudinal traits using varying index coefficient
model is proposed in our work to deal with correlated longitudinal traits in G×E interaction problems. One difficulty of the joint model is the specification of the complicated correlation structure.

73

The important advantage of QIF approach is that a misspecified working correlation does not affect the consistency of the regression parameter estimation, and the QIF provides a robust sandwich
estimator for the variance of the regression parameter estimator. When the working correlation
structure is misspecified, the QIF is more efficient than the GEE.
From the application perspective, the varying index coefficient modeling is a powerful tool
when we consider the joint effect of environmental mixtures and how they interact with genes to
affect disease risk. Our methods are well motivated by epidemiological studies with the hope to
identify any synergistic G×E interaction effects. Real data application shows that, compared to the
additive varying coefficient model, which consider the G×E for each single environmental factor
separately, our models outperform in detecting the significant interaction effect since it can reduce
multiple testing burden by treating the serval environmental variables as a single index variable.
Also, the assumption of a nonparametric interaction is flexible for possible nonlinear interactions
in practice. We also provide a framework to assess the simultaneous genetic effect on multiple
phenotypes with longitudinal data.

5.2

Future work

In the future, we plan to extend the functional varying index coefficient model to joint modeling
of binary and continuous longitudinal traits. This is practically important for some diseases. For
example, over-weighted people will have a higher chance to develop hypertension. Both obese and
hypertension might share some common genetic determinants. Jointly modeling the binary hypertension and continuous body weight or BMI could shed novel insights into the genetic etiology of
the disease. The main difficulty of joint modeling for binary and continuous longitudinal traits is
the lack of a joint distribution. To overcome this difficulty, many researchers introduce a continuous latent variable underlying the binary response and assuming a joint normal distribution for the
latent variable and the continuous variable. Catalano and Ryan (1992) suggested to decompose the
joint distribution into two components: a marginal distribution for the continuous response and a

74

conditional distribution for the binary distribution given the continuous response. The first component can be easily modeled. The conditional distribution for binary response can be modeled
using the underlying latent continuous variable through a probit link function. Kürüm et al. (2016)
proposed the time-varying coefficient models for joint modeling binary and continuous longitudinal responses based on the above idea. However, they only focus on the estimation part and did
not provide a testing method for the nonparametric functions in the model. Motivated by their
work, we can extend their method to varying index coefficient models for joint modeling binary
and continuous longitudinal traits and develop a testing method for joint testing of the significance
or linearity of the interaction functions. This will be investigated in our future work.

75

APPENDIX

76

Regularity conditions
To establish the asymptotic properties for the estimator of θ , we need the following regularity
conditions.
(A1) {ni } is a bounded sequence of integers.
(A2) The parameter space Ω is compact and θ ∗0 is an interior point of Ω.
(A3) The parameter θ ∗ is identified, that is, there is a unique θ ∗0 ∈ Ω such that the mean zero
model assumption E[g(θθ ∗0 )] = 0.
(A4) E[g(θθ )] is continuous in θ .
∗
∗ T
∗
(A5) C¯N (θ ) = N1 ∑N
i=1 gi (θ )gi (θ ) converges almost surely to C0 , which is a constant and

invertible matrix.
∂ g¯

∗

(A6) The first derivative of g¯N exists and is continuous. ∂ θN∗ (θ ) converges in probability to G0
∗
if θ converges in probability to θ ∗0 .
(B1) {ni } is a bounded sequence of integers.
(B2) The parameter space Ωθ is compact and θ 0 is an interior point of Ωθ .
(B3) The parameter θ is identified, that is, there is a unique θ 0 ∈ Ωθ such that the first moment
assumption E[gi (θθ 0 )] = 0 holds for i = 1, ..., N, and E[gi (θθ )] is continuous.
(B4) E[g(θθ )] is continuous in θ .
T
(B5) C¯N (θ ) = N1 ∑N
i=1 gi (θ )gi (θ ) converges almost surely to C0 , which is a constant and invert-

ible matrix.
∂ g¯

(B6) The first derivative of g¯N exists and is continuous. ∂ θN (θ ) converges in probability to G0 if
θ converges in probability to θ 0 .

77

Proof of Theorem 1:
∗

If we can prove that θ exist and converges to θ ∗0 almost surely, then we can prove the consistency
∗

of θ directly. θ = arg min(N −1 QN (θθ ∗ ) + λ θ ∗T Dθθ ∗ ) exists because (2.4) has zero as a lower
∗

bound and the global minimum exists. To prove the consistency, first, the estimator θ is obtained
by minimizing N −1 QN (θθ ∗ ) + λ θ ∗T Dθθ ∗ , then we have
∗
∗T
∗
1
1
θ ∗0 .
QN (θ ) + λN θ Dθ ≤ QN (θθ ∗0 ) + λN θ ∗T
0 Dθ
N
N

(1)

Since
1
−1 (θ
θ ∗0 )g¯N (θθ ∗0 ) = o(1)
QN (θθ ∗0 ) = g¯TN (θθ ∗0 )C¯N
N
by the strong law of large number and (A5), and λN = o(1),
1
a.s.
QN (θθ ∗0 ) + λN θ ∗T
Dθθ ∗0 −−→ 0.
0
N
Thus, we can obtain from (1) that
∗
∗ −1 ∗
∗ a.s.
1
QN (θ ) = g¯TN (θ )C¯N
(θ )g¯N (θ ) −−→ 0.
N

(2)

Since the parameter space Ω is compact, by Glvenko-Cantelli theorem,
a.s.

sup g¯N (θθ ∗ ) − E[g(θθ ∗ )] −−→ 0.

θ ∗ ∈Ω

Hence, by (A5) and the continuous mapping theorem,
∗ −1 ∗
∗
∗
∗
a.s.
g¯TN (θ )C¯N
(θ )g¯N (θ ) − E[g(θ )]T C−1
E[g(
θ
)] −−→ 0.
0

Combined with (2), we get
∗

∗

a.s.

E[g(θ )]T C−1
0 E[g(θ )] −−→ 0.

(3)

∗

Then we will show that it is impossible that θ remains outside of U, where U is any neighbor∗

hood of the true parameter θ ∗0 . Suppose there exist a neighborhood U such that θ ∈ U c . Since
∗

θ ∗ )] is a continuous function and U c is compact, there exists a point θ ∈ U c
E[g(θθ ∗ )]T C−1
0 E[g(θ
such that
∗

∗

E[g(θ )]T C−1
0 E[g(θ )]
78

achieve its minimum in U c . By the identification of θ ∗ in (A3), there is a unique θ ∗0 ∈ Ω satisfying
E[g(θθ 0 )] = 0, we have
θ ∗ )] > 0,
E[g(θθ ∗ )]T C−1
0 E[g(θ
∗

which contradicts (3). Then we can prove that θ converges almost surely to θ ∗ . Thus, θ is a
consistent estimator of θ .

Proof of Theorem 2
The estimate of θ satisfies
0=

∗
1 ∂ QN ∗
(θ ) + 2λN Dθ .
∗
N ∂θ

By Taylor expansion, we obtain
0=

∗
1 ∂ QN ∗ ∗
1 ∂ 2 QN ∗
(
θ
)
+
2λ
D
(
θ
− θ ∗0 ),
(θθ 0 ) + 2λN Dθθ ∗0 +
N
∗2
θ
N ∂
N ∂θ

∗

∗

where θ is some value between θ and θ ∗0 . Thus, we can have
−1 1 ∂ QN
∗
1 ∂ 2 QN ∗
θ − θ ∗0 = −
(
θ
)
+
2λ
D
(θθ ∗ ) + 2λN Dθθ ∗0 .
N
N ∂ θ ∗2
N ∂θ∗ 0
∗

∗

∗

(4)

Since θ converges to θ ∗0 in probability and θ is between θ and θ ∗0 , by (A5) and (A6) we can
get
1 ∂ 2 QN ∗
(θ )
N ∂ θ ∗2

=

2

∂ g¯N T ∗ ¯ −1 ∗ ∂ g¯N ∗
(θ )CN (θ ) ∗ (θ ) + o p (1)
∂θ∗
∂θ

p

−
→ 2GT0 C−1
0 G0
When λN = o(N −1/2 ),
−1
1 ∂ 2 QN ∗
1
−1
−1/2 ).
(
θ
)
+
2λ
D
= (GT0 C−1
N
0 G0 ) + o p (N
∗2
N ∂θ
2

Similarly, since
∂ g¯N T ∗ ¯ −1 ∗
1 ∂ QN ∗
θ
(θ
)
=
(θθ 0 )CN (θθ 0 )g¯N (θθ ∗0 )
N ∂θ∗ 0
∂θ∗

79

and λN = o(N −1/2 ), we have
1 ∂ QN ∗
(θθ ) + 2λN Dθθ ∗0 = GT0 C0−1 g¯N (θθ ∗0 ) + o(N −1/2 ).
N ∂θ∗ 0
Therefore, (4) can be written as
√
√
∗
−1 T −1
θ ∗0 ) + o p (1).
N(θ − θ ∗0 ) = − N(GT0 C−1
0 G0 ) G0 C0 g¯N (θ

(5)

By Central Limit Theorem,
√
d
N g¯N (θθ ∗0 ) −
→ N(0, C0 ).
Using (5) and (6), we obtain
√
∗
d
−1
N(θ − θ ∗0 ) −
→ N(0, (GT0 C−1
0 G0 ) ),
and directly,
√
d
−1 T
N(θ − θ 0 ) −
→ N(0, J(GT0 C−1
0 G0 ) J ).

80

(6)

BIBLIOGRAPHY

81

BIBLIOGRAPHY

[1] Bai,Y., Fung W. K. and Zhu, Z. (2009). Penalized quadratic inference functions for singleindex models with longitudinal data. Journal of Multivariate Analysis 100, 152-161.
[2] Baker, C. (2004). "Chapter 3. Environment Illustrated". Behavioral Genetics AAAS. ISBN
978-0871686978.
[3] Belle, G., Fisher, L. D., Heagerty, P. J. and Lumley, T. (2004). "Chapter 18: Longitudinal
data analysis". Biostatistics: A Methodology For the Health Sciences, 2nd Edition.
[4] Carpenter, D. O., Arcaro, K., and Spink, D. C. (2002). Understanding the human health
effects of chemical mixtures. Environmental Health Perspectives 110(suppl 1), 25-42.
[5] Catalano, P. J. and Ryan, L. M. (1992) Bivariate latent variable models for clustered discrete
and continuous outcomes. Journal of the American Statistical Association 87(419), 651-658.
[6] Choi, Y., Chowdhury, R. and Swaminathan, B. (2014). Prediction of hypertension based
on the genetic analysis of longitudinal phenotypes: a comparison of different modeling approaches for the binary trait of hypertension. BMC Proceedings 8(Suppl 1):S78.
doi:10.1186/1753-6561-8-S1-S78.
[7] Crainiceanu, C. and Ruppert, D. (2004). Likelihood ratio tests in linear mixed models with
one variance component. Journal of the Royal Statistical Society, Series B 65, 165-185.
[8] Crainiceanu, C., Ruppert, D., Claeskens, G., and Wand, P. (2005). Exact likelihood ratio tests
for penalized splines. Biometrica 92, 91-103.
[9] Crowder M. (1986). On consistency and inconsistency of estimating equations. Econometric
Theory 2, 305-330.
[10] Crowder M. (1995). On the use of a working correlation matrix in using generalized linear
models for repeated measures. Biometrika 82(2), 407-410.
[11] Cui, X., Härdle, W., and Zhu, L. (2011). The EFM approach for single-index models. The
Annals of Statistics 39, 16581688.
[12] Falconer, D. S. (1952). The problem of environment and selection. The American Naturalist
86, 293-298.
[13] Furlotte, N. A., Eskin, E. and Eyheramendy, S. (2014). Genome-wide association mapping
with longitudinal data. Genet Epidemiol 36, 463-471.
[14] Gratten, J. and Visscher, P. M. (2016). Genetic pleiotropy in complex traits and diseases:
implications for genomic medicine. Genome Medicine 8:78, doi: 10.1186/s13073-016-0332x.

82

[15] Greven, S., Crainiceanu, C., Kühenhoff, H., and Peters, A. (2008). Restricted likelihood ratio
testing for zero variance components in linear mixed models. Journal of Computational and
Graphical Statistics 17, 870-891.
[16] Johnson, J. A. and Terra, S. G. (2002). Beta-adrenergic receptor polymorphisms: cardiovascular disease associations and pharmacogenetics. Pharm Res 19, 1779-1787.
[17] Kürüm, E., Li, R., Shiffman, S. and Yao, W. (2016). Time-varying coefficient models for
joint modeling binary and continuous outcomes in longitudinal data. Statistica Sinica 26,
979-1000.
[18] Leisch, F., Weingessel, A., and Hornik, K. (1998). On the generation of correlated artificial binary data. Working Paper Series, SFB Adaptive Information Systems and Modelling in
Economics and Management Science, Vienna University of Economics.
[19] Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalised linear
models. Biometrika 73, 12-22.
[20] Liu, X., Cui, Y. and Li, R. (2016). Partial linear varying multi-index coefficient model for
integrative gene-environment interactions. Statistica Sinica 26, 1037-1060.
[21] Liu, X., Gao, B. and Cui Y. (2017) Generalized partial linear varying multi-index coefficient
model for gene-environment interactions. Statistical Applications in Genetics and Molecular
Biology 16(1), 59-74.
[22] Lobo, I. (2008). Pleiotropy: one gene can affect multiple traits. Nature Education 1, 10.
[23] Ma, S. and Song, P. (2015). Varying Index Coefficient Models. Journal of the American
Statistical Association 110, 341-356.
[24] Macgregor, S., Knott, S. A., White, I. and Visscher, P. M. (2005) Quantitative trait locus
analysis of longitudinal quantitative trait data in complex pedigrees. Genetics 171, 13651376.
[25] Song, P., Jiang, Z., Park, E. and Qu A. (2009) Quadratic inferance functions in marginal
models for longitudinal data. Statistics in Medicine 28, 3683-3696.
[26] Qu, A. and Li, R. (2006). Quadratic inference functions for varying coefficient models with
longitudinal data. Biometrics 62, 379-391.
[27] Qu, A., Lindsay, B. G. and Li, B. (2000). Improving generalised estimation equations using
quadratic inference fucntions. Biometrika 87, 823-836.
[28] Ross, C. A., and Smith, W. W. (2007). Gene-enviroment interactions in Parkinson’s disease.
Parkinsonism and Related Disorders 13, S309-S315.
[29] Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11, 735-757.

83

[30] Ruppert, D. and Carroll, R. J. (2000). Spatially-adaptive penalties for spline fitting. Australian
and New Zealand Journal of Statistics 42, 205-223.
[31] Self, S. G. and Liang, K. Y. (1987). Assessing cumulative health risks from exposure to
environmental mixtures - three fundamental questions. Journal of the American Statistical
Association 82, 605-610.
[32] Sexton, K. and Hattis, D. (2007). Asymptotic properties of maximum likelihood estimators
and likelihood ratio under non-standard conditions. Environmental Health Perspectives 115,
825-832.
[33] Sitlani, C. M., Rice, K. M., Lumley, T. et al. (2015). Generalized estimating equations for
genome-wide association studies using longitudinal phenotype data, Stat Med 34, 118-130.
[34] Stram, D. O. and Lee, J. W. (1994). Variance components testing in the longitudinal mixed
effects model. Biometrics 50, 1171-1177.
[35] Wang, Y. and Chen, H. (2012). On testing an unspecified function through a linear mixed
effects model with multiple variance compnents. Biometrics 68, 1113-1125.
[36] Wang, W., Feng, Z., Bull, S. B. and Wang Z. (2014). A 2-step strategy for detecting pleiotropic effects on multiple longitudinal traits. Front. Genet. doi.org/10.3389/fgene.2014.00357.
[37] Xu, Z., Shen, X., Pan, W. (2014) Longitudinal analysis is more powerful than cross-sectional
analysis in detecting genetic association with neuroimaging phenotypes. PLoS One 9(8),
e102312.
[38] Yu, Y. and Ruppert, D. (2002). Penalized spline estimation for partially linear single-index
models. Journal of the American Statistical Association 97(460), 10421054.
[39] Zhang, D. (2004). Genaralized linear mixed models with varying coefficients for longitudinal
data. Biometrics 60, 8-15.
[40] Zimmet, P., Alberti, K., and Shaw, J. (2001). Global and societal implications of the diabetes
epidemic. Nature 414, 782-787.

84