FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS WITH LONGITUDINAL DATA By Jingyi Zhang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy 2017 ABSTRACT FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS WITH LONGITUDINAL DATA By Jingyi Zhang Rooted in genetics, human complex diseases are largely influenced by environmental factors. Existing literature has shown the power of integrative gene-environment interaction analysis by considering the joint effect of environmental mixtures on disease risk. Motivated by that, we propose a functional varying index coefficient model for longitudinal measurements of a phenotypic trait and multiple environmental variables, and assess how the genetic effects on a longitudinal disease trait are nonlinearly modified by a mixture of environmental influences. We derive an estimation procedure for the nonparametric functional varying index coefficients under the quadratic inference functions and penalized splines framework. Theoretical results such as estimation consistency and asymptotic normality of the estimates are established. In addition, we propose a hypothesis testing procedure to assess the significance of the nonparametric index coefficient function. We evaluate the performance of our estimation and testing procedure through Monte Carlo simulation studies. The proposed method is illustrated by applying to a real data set from a pain sensitivity study in which SNP effects are nonlinearly modulated by the combination of dosage levels and other environmental variables to affect blood pressure and heart rate of patients. In order to deal with discrete measurements for risk of disease, we further extend our proposed FVICM to a generalized varying index coefficient model (gFVICM) to binary longitudinal traits. We apply penalized splines to approximate the nonparametric varying index coefficients and develop an estimation procedure based on the quadratic inference functions. The asymptotic normality established in the theoretical results enables us to develop a model selection criteria and construct a test statistic based on the quadratic inference function. In hypothesis test, we investigate the linearity of G×E interactions using the proposed testing procedure. The utility of the method is further demonstrated through a pain sensitivity case study in which SNP effects are nonlinearly modulated by the combination of environmental mixtures to affect high blood pressure. Genetic pleiotropy refers to the situation in which a single gene influences multiple traits and so it is considered as a major factor that underlies genetic correlation among traits. For some complex diseases, there are multiple phenotypes that can used to diagnose or to quantify the risk of diseases and usually they have shared genetic determinations. In multivariate longitudinal data, multiple response variables are jointly measured over time from the same individual. It is appropriate to take into account the correlation between multivariate longitudinal responses. Therefore, we propose the joint partially linear varying coefficient models and the testing framework to jointly test the association of genetic factors with bivariate phenotypic values adjusting for environmental factors. We extended the quadratic inference functions to deal with the longitudinal correlations and used penalized splines for the approximation of nonparametric coefficients. The proposed method is illustrated by applying to a real data set from a pain sensitivity study, in which systolic blood pressure (SBP) and diastolic blood pressure (DBP) were correlated longitudinal quantified phenotypes of SNP effects. ACKNOWLEDGEMENTS I would like to express the deepest appreciation to my advisor and committee chair Professor Yuehua Cui for his excellent guidance and continuous support in my Ph.D. research. Without his guidance and persistent help this dissertation would not have been possible. I also would like to thank my committee members, Professor Ping-Shou Zhong, Professor Lyudmila Sakhanenko and Professor Qing Lu, for their insightful comments and suggestions in my research. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Quadratic inference function for FVICM with longitudinal data . . . . . . . . . Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Algorithm for estimation . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.2 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.3 Choice of the basis for the inverse of the correlation matrix . . . . . . . 2.4.4 Choice of the tuning parameter . . . . . . . . . . . . . . . . . . . . . . Hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Linear mixed model representation for FVICM model . . . . . . . . . . 2.5.2 LRT and pseudo-LRT in LMM . . . . . . . . . . . . . . . . . . . . . . 2.5.2.1 LRT for one variance component . . . . . . . . . . . . . . . 2.5.2.2 Pseudo-LRT for multiple variance components . . . . . . . . 2.5.3 Pseudo-LRT in FVICM model . . . . . . . . . . . . . . . . . . . . . . Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.2 Performance of estimation . . . . . . . . . . . . . . . . . . . . . . . . 2.6.3 Performance of hypothesis tests . . . . . . . . . . . . . . . . . . . . . Real data application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 8 11 12 12 12 13 14 14 14 15 15 17 18 18 18 19 23 25 28 GENERALIZED FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The model and estimation methods . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 The model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Quadratic inference function for gFVICM . . . . . . . . . . . . . . . . . . 3.2.3 Theoretical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model selection and hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.2 Nonparametric goodness-of-fit test based on QIF . . . . . . . . . . . . . . 3.3.3 Test for linearity of interaction function in gFVICM . . . . . . . . . . . . . Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Performance of estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 30 30 32 32 33 35 36 36 37 38 39 40 CHAPTER 3 3.1 3.2 3.3 3.4 v 3.5 3.6 3.4.2 Performance of hypothesis tests . . . . . . . . . . . . . . . . . . . . . . . 45 Real data application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 CHAPTER 4 4.1 4.2 4.3 4.4 4.5 DETECTING GENETIC ASSOCIATIONS WITH MULTIVARIATE PARTIALLY LINEAR VARYING-COEFFICIENTS MODELS FOR MULTIPLE LONGITUDINAL TRAITS . . . . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joint models and statistical methods . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Joint multivariate models . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Objective function based on QIF . . . . . . . . . . . . . . . . . . . . . . 4.2.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.4 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.5 Nonparametric goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . 4.2.6 Two-step hypothesis testing procedure . . . . . . . . . . . . . . . . . . . 4.2.6.1 Step 1: Joint test . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.6.2 Step 2: Marginal tests . . . . . . . . . . . . . . . . . . . . . . Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Simulation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.2 Performance of estimation . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.3 Performance of hypothesis tests . . . . . . . . . . . . . . . . . . . . . . 4.3.3.1 Performance for joint test . . . . . . . . . . . . . . . . . . . . 4.3.3.2 Performance for marginal tests . . . . . . . . . . . . . . . . . . Real data application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 52 54 54 55 57 59 59 60 60 61 62 62 63 65 65 66 68 71 CHAPTER 5 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . 73 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 vi LIST OF TABLES Table 2.1 Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and correlation ρ=0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Table 2.2 Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and correlation ρ=0.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Table 2.3 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for SBP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Table 2.4 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for DBP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Table 2.5 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for HR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Table 3.1 Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size N = 200, 500, T = 10 and correlation ρ=0.5. . . . . . . . . . . . . . . . . . . . 40 Table 3.2 Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size N = 200, 500, T = 20 and correlation ρ=0.5. . . . . . . . . . . . . . . . . . . . 41 Table 3.3 List of SNPs with MAF, allele, p-values under different hypothesis testing and values of objective function QN . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Table 3.4 Estimated odds for different genotypes at each dosage levels. . . . . . . . . . . . 50 Table 4.1 Estimation results for parameters α1 and α2 with sample size N = 200, 500. . . . 63 Table 4.2 List of SNPs with MAF, allele, p-values under the joint and marginal testing for SBP and DBP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Table 4.3 List of SNPs with MAF, allele, estimation of coefficients, p-values of significance for coefficients corresponding to SBP and DBP, respectively. . . . . . . . 69 vii LIST OF FIGURES Figure 2.1 The estimation of function m0 (·) under different MAFs when N=200, 500 and ρ=0.5. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 2.2 The estimation of function m1 (·) under different MAFs when N=200, 500 and ρ=0.5. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Figure 2.3 The estimation of function m0 (·) under different MAFs when N=200, 500 and ρ=0.8. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Figure 2.4 The estimation of function m1 (·) under different MAFs when N=200, 500 and ρ=0.8 The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotteddash line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Figure 2.5 The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=200, 500 and ρ=0.5. . . . . . . . . . . . 24 Figure 2.6 The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=200, 500 and ρ=0.8. . . . . . . . . . . . 24 Figure 2.7 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is SBP. . . . . . . . . . . 27 Figure 2.8 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is DBP. . . . . . . . . . . 28 Figure 2.9 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is HR. . . . . . . . . . . . 29 Figure 3.1 The estimation of function m0 (·) when N=200, 500 and T =10. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 42 viii Figure 3.2 The estimation of function m0 (·) when N=200, 500 and T =20. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 43 Figure 3.3 The estimation of function m1 (·) when N=200, 500 and T =10. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 44 Figure 3.4 The estimation of function m1 (·) when N=200, 500 and T =20. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. . . . . . . . . . . . 45 Figure 3.5 The empirical size and power of testing the linearity of nonparametric function m1 when N=200, 500 and T =10, 20. . . . . . . . . . . . . . . . . . . . . . 46 Figure 3.6 The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=500 and T =10. . . . . . . . . . . . . . . 47 Figure 3.7 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed lines. . . . . . . . . . . . . . . . . . . 49 Figure 4.1 The estimation of nonparametric functions β01 (·) and β11 (·) when N=200, 500. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash line. 64 Figure 4.2 The estimation of nonparametric functions β02 (·) and β12 (·) when N=200, 500. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash line. 65 Figure 4.3 The power plot for the joint test under different sample sizes N=200, 500 when T =10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Figure 4.4 The power plots for the marginal test for N=200, 500 and T =10. . . . . . . . . 67 Figure 4.5 Plot of the estimate (solid curve) of the nonparametric function β1SBP (·) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed line. . . . . . . . . . . . . . . . . . . . 70 Figure 4.6 Plot of the estimate (solid curve) of the nonparametric function β1DBP (·) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed line. . . . . . . . . . . . . . . . . . . . 71 ix CHAPTER 1 INTRODUCTION Gene-environment (G×E) interaction is defined as how genotypes influence phenotypes differently under different environmental conditions (Falconer, 1952). An increasing number of studies have confirmed the role of G×E interaction in many human diseases. One classic example is Phenylketonuria (PKU). PKU is caused by a defect in the gene coding for a particular enzyme which is needed to break down phenylalanine. Newborns found to have high levels of phenylalanine in their blood can be put on a special, phenylalanine-free diet to avoid the severe effects of PKU (Baker, 2004). This example confirms the role of gene-environment interaction by showing that a change in environment can affect the phenotype of a particular trait. In genetic epidemiology, G×E interactions are useful for understanding the risk of some complex human diseases. Famous studies such as Parkinson disease (Ross and Smith (2007)) and type 2 diabetes (Zimmet, Alberti, and Shaw (2001)) both indicate the importance of G×E interaction in complex human diseases. However, the underlying mechanism of G×E interaction is still poorly understood due to the lack of powerful statistical methods. The traditional way to investigate G×E interaction is based on a single environment exposure model. Parametric models such as additive linear models, use the products of two variables to denote the interaction effects. However, this product may not capture the true interaction effect of gene and environment, since it could not be able to capture the possible nonlinear G×E interactions. In order to assess possible nonlinear G×E interactions, some nonparametric and semiparametric models have been developed, such as varying coefficient models (VCM) proposed by Hastie and Tibishirani (1993). In a VCM model, the coefficients of covariates are allowed to change with some other variables through smooth functions, so nonlinear interactions can be assessed. A VCM has the form L Y= ∑ ml (X)Zl + ε, (1.1) l=1 where Y is the response variable, (X, ZT )T is a vector of predictors consisting of a scalar X and a 1 L-dimensional vector Z = (Z1 , Z2 , ..., ZL )T . ml (·), l = 1, ..., L, are unknown nonparametric smooth functions. In particular, dealing with G×E interaction problems, one can replace the predicts Z to be genetic variants, for example, single nucleotide polymorphisms (SNPs). Epidemiological evidences suggested that a disease risk can be modified by simultaneous exposure to multiple environmental agents with effect larger than simple addition of individual factor acting along. The specification of VCM in (1.1) may be limited in dealing with simultaneous exposure to multiple environmental factors. It will result in difficulties in the estimation of the coefficient function ml (X) when variable X = (X1 , ..., X p )T in Model (1.1) is multidimensional. To overcome such challenge, Ma and Song (2015) proposed the varying index coefficient model (VICM) with a form L Y= ∑ ml (βlT X)Zl + ε, (1.2) l=1 where β l = (βl1 , ..., βl p )T are the coefficients for covariate vector X with βlk be the loading weight for the k-th covariate of X, i.e. Xk associated with Zl . The single index β T X is actually a linear combination of several environmental effects. The VICM model is able to pursue the interaction between a set of environmental factors as a whole and genetic variables on the disease risk, if the covariate Z is specified to be some genetic effect, such as a SNP variable. Liu et al. (2016) extended the model to a partial linear varying multi-index coefficient model (PLVMICM): L Y = m0 (β0T X) + α T0 Z + ∑ {ml (βlT X)Gl + α Tl ZGl } + ε, (1.3) l=1 where Gl , l = 1, . . . , L are genetic variables of interest, ml (·), l = 0, 1, ..., L are unknown index functions, α 0 , ..., α L and β 0 , ..., β L are parametric parameters. The main genetic effect for each Gl is captured by the function ml (βlT X) when the function is approximated by some nonparametric techniques such as B-spline approximation. Model (1.3) is an extension of Model (1.2) by considering partial linear covariates Z with application in capturing nonlinear G×E interactions. However, the above mentioned medoels can not be used directly for longitudinal data because of the assumption of independence among observations. Little work has been done to deal with 2 nonlinear G×E interaction in longitudinal studies. This motivates us to extend the varying index coefficient model to longitudinal traits. Longitudinal studies play an important role in epidemiology, biological and clinical research. Longitudinal studies are used to characterize normal growth and aging, to assess the effect of risk factors on human health, and to evaluate the effectiveness of treatments. Some researches, such as Sitlani et al. (2015), Furlotte et al. (2015) and Xu et al. (2014) demonstrate the longitudinal design is more powerful in detecting genetic associations than cross-sectional designs. The traditional way to analyze longitudinal data is using the regression models (Belle et al., 2004). However, standard regression methods assume that all observations are independent. The assumption of independence would result in invalid inferences. One approach to deal with the issue is using a complete model which specify the correlation structure among observations for each subject. In linear mixed effect models, we can make specific assumptions for the covariance structure in observations. Based on the parametric assumptions for the covariance components, we can apply the maximum likelihood methods to estimate the regression parameters. Another regression approach for inference with longitudinal data is known as generalized estimating equation (GEE) approach proposed by Liang and Zeger in 1986. The GEE method is very popular in recent decades and has been widely used in longitudinal data analysis. However, the application of the GEE method has several disadvantages. One of those disadvantages is that GEE may fail to produce consistent estimators if the nuisance correlation parameters are not consistently estimated (Crowder 1986, 1995). In addition, model selection and hypothesis testing can be complicated since there is no objective function in the estimation procedure of the GEE approach. The quadratic inference function (QIF) approach proposed by Qu et al. (2000) is one of the improvements of the GEE method. The benefits of using QIF approach in longitudinal analysis have been discussed in some researches, such as Qu et al. (2000), Qu and Li (2006) and Song et al. (2009). The QIF method avoids estimating the nuisance correlation parameters by using a linear combination of several basis matrices to approximate the inverse of correlation matrix. When the working correlation structure is correctly specified, both the QIF and GEE are equally efficient. 3 However, when the working correlation structure is misspecified, the QIF is more efficient than the GEE. In addition, the QIF has an asymptotic χ 2 distribution, based on which we can implement the model selection criteria like BIC, and construct testing statistic. These advantages of using QIF in longitudinal analysis motivates us to extend the QIF method to the varying coefficient model in detecting G×E interactions on risk of diseases. We organise the rest of this dissertation in the following way: In Chapter 2, in order to capture the dynamic nonlinear G×E interaction with the combined effect of environmental factors for longitudinal data, we propose a functional varying index coefficient model (FVICM) for correlated responses, i.e., β T0 Xi j ) + m1 (β β T1 Xi j )Gi + εi j , Yi j = m0 (β (1.4) where Yi j is the response variable which measures the risk of certain disease on the ith subject at the jth time point, where i = 1, · · · , N, j = 1, · · · , ni ; Xi j is a p-dimensional vector of environmental variables, which can be either time-dependent or time invariant; Gi denotes the genetic variable; εi j is an error term with mean 0 and some correlation structure; m0 (·) and m1 (·) are unknown functions; β 0 and β 1 are p-dimensional vectors of index loading coefficients. Compared with Model (1.2) and (1.3), Model (1.4) assumes correlations among observations from the same subject, which can be applied to capture the longitudinal correlation among different time points for a subject. We use penalized splines to approximate the nonparametric functions in the model and then develop an estimation procedure based on QIF method. In addition, we are interested to see whether the interaction function m1 (·) is significantly nonlinear or not. This is a natural concern in our model since if a linear interaction function is sufficient, a varying coefficient model would not be necessary. We develop the testing procedure by representing our model in a linear mixed model form and then apply the pseudo-likelihood ratio test. On the other hand, discrete longitudinal traits, for example, the binary traits are very common in clinical researches. To deal with binary responses, in Chapter 3, we extend Model (1.4) to a generalized functional varying-index 4 coefficient model (gFVICM) with a form β T0 Xi j ) + m1 (β β T1 Xi j )Gi j , g E(Yi j |Xi j , Gi j ) = m0 (β (1.5) where g(·) is the logit link function, m0 (·) and m1 (·) are unknown nonparametric functions, β 0 and β 1 are p-dimensional vectors of index coefficients. The QIF method is modified to accommodate binary response. To test the linearity of interaction function m1 (·), we develop a hypothesis testing procedure which is built on the asymptotical property of the QIF objective function. In Chapter 4, we consider G×E interactions when multiple longitudinal traits are measured to improve the power of association test, especially to identify any pleiotropic effect (i.e., one gene can affect multiple traits). For this purpose, we propose a multivariate partially linear varying coefficients model and derive a testing framework to jointly test the association of genetic factors with multivariate phenotypic values adjusting for environmental factors. The joint models are written as Yli j = Yli (ti j ) = β0l {X(ti j )} + β1l {X(ti j )}Gi + α l Zi j + εli j , (1.6) where Yli j is the response variable which measures the l-th phenotype on the i-th subject at the j-th time point, X(ti j ) is a time-varying covariate, Zi j is a p-dimensional vector of covariates, which can either depend or be independent of time, Gi denotes the genetic variable within subject, εli j is an error term and    εi =     ε 1i .. .    ∼ N 0, Σ   ε Li with Σ be some covariance structure, β0l (·) and β1l (·) are unknown functions. The robustness of QIF method in the variance of the estimators helps us to build the estimation and hypothesis testing procedure. In Chapter 5, we conclude the thesis with a brief conclusion about the contributions of this thesis and point out some future research directions. Proofs are provided in the Appendix. 5 CHAPTER 2 FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS 2.1 Introduction It has been broadly recognized that gene-environment (G×E) interaction plays important role in human complex diseases. A growing number of scientific researches have confirmed the role of G×E interaction in many human diseases, such as Parkinson disease (Ross and Smith, 2007) and type 2 diabetes (Zimmet et al., 2001). G×E interaction is defined as how genotypes influence phenotypes differently under different environmental conditions (Falconer, 1952). It also refers to the genetic sensitivity to environmental changes. Usually, G×E has been investigated based on a single environment exposure model. Evidence from epidemiological studies has suggested that disease risk can be modified by simultaneous exposure to multiple environmental factors. The effect of simultaneous exposure is higher than the simple addition of the effects of factors acting alone (Carpenter et al., 2002; Sexton and Hattis, 2007). This motivated us to assess the combined effect of environmental mixtures, and how they, as a whole, interact with genes to affect disease risk (Liu et al. 2016). In our previous model, we proposed a varying-index coefficient model to capture the nonlinear interaction between a gene and environmental mixtures (Liu et al. 2016). The method was extended for any univariate trait distribution in a generalized linear model framework (Liu et al. 2017). In biomedical studies, longitudinal traits are often observed, with repeated measures of the same subject over time. The increased power of a longitudinal design to detect genetic associations over cross-sectional designs has been evaluated (Sitlani et al. 2015; Furlotte et al. 2015; Xu et al. 2014). With longitudinal disease traits, one can study the dynamic gene effect over time. Coupling with longitudinal measure of environmental exposures, one can study how genes respond to the dynamic change of environmental factors to affect a disease trait. This motivates us to extend the 6 varying index coefficient model to longitudinal traits. To explore time-dependent effects in longitudinal data analysis, some nonparametric and semiparametric models such as varying coefficient models have been proposed, for example, Hoover et al. (1998), Wu, Chiang, and Hoover (1998), Fan and Zhang (2000), Martinussen and Scheike (2001), Chiang, Rice, and Wu (2001), Huang, Wu, and Zhou (2002), Ma and Song (2015). However, these methods do not fit to our purpose. In order to capture the dynamic nonlinear G×E interaction with combined effect of environmental factors for longitudinal data, we propose a functional varying index coefficient model (FVICM) for correlated response, i.e., β T1 Xi j )Gi + εi j , β T0 Xi j ) + m1 (β Yi j = m0 (β (2.1) where Yi j is the response variable which measures the risk of certain disease on the ith subject at the jth time point, where i = 1, · · · , N, j = 1, · · · , ni ; Xi j is a p-dimensional vector of environmental variables, which can be either time-dependent or time invariant; Gi denotes the genetic variable; εi j is an error term with mean 0 and some correlation structure; m0 (·) and m1 (·) are unknown functions; β 0 and β 1 are p-dimensional vectors of index loading coefficients. For model identifiability, we have the constraints β 0 = β 1 = 1 and the first elements of β 0 and β 1 are positive. Qu et al. (2000) proposed the quadratic inference function (QIF) for longitudinal data analysis, as an improvement of the generalized estimation equation (GEE) approach introduced by Liang and Zeger (1986). The QIF approach avoids estimating the nuisance correlation parameters by assuming that the inverse of the correlation matrix can be approximated by a linear combination of several basis matrices. Qu et al. (2000) found that the QIF estimator could be generally more efficient than the GEE estimator. Qu and Li (2006) applied the QIF method to the varying coefficient model for longitudinal data. Bai et al. (2009) developed an estimating procedure for single index models with longitudinal data also based on QIF method. Motivated by that, in this paper, we extend the QIF method to the FVICM model for dynamic G×E interactions. Our goal in this work is to develop a set of statistical estimation and hypothesis testing procedure for model (2.1). We first approximate the varying index coefficient function by the penalized 7 splines (Ruppert and Carroll, 2000) and then extend the QIF approach to our model in order to estimate the index loading coefficients and the penalized spline coefficients. Under certain regularity conditions, we establish the consistency and asymptotic normality of the resulting estimators. Another goal of this work is to test the linearity of the G×E interaction effect. This is of particular interest in our model setting since if the G×E interaction is linear, a simple linear regression model should be fit, and fitting any higher order nonlinear functions would be unnecessary. With a mixed effects model representation of the penalized spline approximations (Speed, 1991; Ruppert, Wand, and Carroll, 2003; Wand, 2003), we can transform the problem of testing an unknown function into testing some fixed effects and a variance component in a linear mixed effects model setup with multiple variance components, which will be evaluated in this study. This chapter is organized as follows: in Section 2.2, we propose an estimation procedure under the FVICM model, and further establish the consistency and asymptotic normality of the proposed estimator in Section 2.3. In Section 2.4, we discuss some practical issues to implement the proposed estimation procedures. In Section 2.5, a pseudo-likelihood ratio test procedure with a linear mixed effects model representation is illustrated. We assess the finite sample performance of the proposed procedure with Monte Carlo simulation in Section 2.6 and illustrate the proposed method by an analysis of a pain sensitivity data set in Section 2.7, followed by discussions in Section 2.8. Technical details are rendered in Appendix. 2.2 Quadratic inference function for FVICM with longitudinal data For longitudinal data, suppose the response yi j , p-dimensional covariate vector x i j , and SNP variable Gi are observed from the ith observation at the jth time point. SNP variable {Gi , i = 1, ..., N} does not change over time. Assume the model satisfies β T0 xi j ) + m1 (β β T1 xi j )Gi , E(yi j |xxi j , Gi ) = m0 (β 8 We can approximate the unknown coefficient functions m0 (u0 ) and m1 (u1 ) by a q-degree truncated power spline basis, i.e. m0 (u0 ) = m0 (u0 , β ) ≈ B(u0 )T γ 0 , m1 (u1 ) = m1 (u1 , β ) ≈ B(u1 )T γ 1 , q q β T0 , β T1 )T , B(u) = (1, u, u2 , · · · , uq , (u−κ1 )+ , · · · , (u−κK )+ )T is a q-degree truncated where β = (β power spline basis with K knots κ1 , · · · , κK . γ 0 and γ 1 are (q + K + 1)-dimensional vectors of spline coefficients. Let γ = (γγ T0 , γ T1 )T . For longitudinal data, the conditional variance-covariance matrix of the response need to be modelled. The method of generalized estimation equation (GEE) is often applied to estimate the unknowns. The GEE is defined as, N ∑ µ˙ Ti V−1 i (yi − µ i ) = 0, i=1 where Vi is the covariance matrix of yi , yi = (yi1 , ..., yini )T , µ i = E(yi ) is the mean function and µ˙ i is the first derivative of µ i with respect to the parameters. Based on the spline approximation, the mean function can be written as   µ i1 (θθ )  .. µ i = µ i (θθ ) =  .   µ in (θθ )       =     i and the first derivative of µ i is  T β T x )γγ xT 0 i1 0 i1  Bd (β  .. µ˙ i =  .   β T0 xini )γγ 0 xTin BTd (β i β T0 xi1 )γγ 0 + BT (β β T1 xi1 )γγ 1 Gi BT (β .. . β T0 xini )γγ 0 + BT (β β T1 xini )γγ 1 Gi BT (β    ,   β T1 xi1 )γγ 1 Gi xTi1 BTd (β β T0 xi1 ) BT (β β T1 xi1 )Gi BT (β .. . .. . .. . β T1 xini )γγ 1 Gi xTin BTd (β ∂ B(u) i q−1 β T0 xini ) BT (β β T1 xini )Gi BT (β    ,   q−1 β T , γ T )T . where Bd (u) = ∂ u = (0, 1, 2u, · · · , quq−1 , q(u − κ1 )+ , · · · , q(u − κK )+ ), θ = (β When Vi is unknown, Liang and Zeger (1986) suggested that Vi can be simplified as Vi = 1/2 Ai 1/2 R(ρ)Ai with Ai being a diagonal matrix of marginal variances and R(ρ) being a common 9 working correlation matrix with a small number of nuisance parameters ρ. When ρ is consistently estimated, the GEE estimators of the regression coefficients are consistent. When such consistent estimators for the nuisance parameters do not exist, Qu et al. (2000) suggested that the inverse of R(ρ) can be represented by a linear combination of a class of basis matrices such as R−1 (ρ) ≈ a1 M1 + a2 M2 · · · + ah Mh , where M1 is the identity matrix and M2 , · · · , Mh are symmetric matrices. The advantage of this method is that the estimation of nuisance parameters a1 , · · · , ah are not required. Following this idea, we define the estimation function as,   N µ˙ T A−1/2 M A−1/2 (y − µ ) 1 i i i   ∑i=1 i i N  1 1 .  .. g¯N (θθ ) = ∑ gi (θθ ) =   N i=1 N   −1/2 −1/2 TA ˙ µ M A (y − µ ) ∑N h i i i i=1 i i (2.2) Because the dimension of the estimation equation g¯N is greater than the number of parameters, we cannot obtain the estimators by simply setting each element in g¯N to be zero. Qu et al. (2000) introduced the Quadratic Inference Function (QIF) based on the generalized method of moments (Hansen, 1982). Thus, we can estimate the parameters by minimizing the QIF, which is defined as −1 g¯ , QN (θθ ) = N g¯TN C¯N N (2.3) T where C¯N = N1 ∑N i=1 gi gi is a consistent estimator for var(gi ). By minimizing the quadratic infer- ence function, we can obtain the estimation of the parameters θ = arg min QN (θθ ). θ To overcome the well known over-parameterization issue, Qu et al. (2000) further proposed the penalized quadratic inference function N −1 QN (θθ ) + λ θ T Dθθ , (2.4) where D is a diagonal matrix with element 1 if the corresponding parameters are spline coefficients associated with the knots and 0 otherwise, i.e., D = diag(0T(2p+q+1)×1 , 1TK×1 , 0T(q+1)×1 , 1TK×1 ). Then the estimator is given by θ = arg min(N −1 QN (θθ ) + λ θ T Dθθ ). θ 10 (2.5) 2.3 Asymptotic properties In this section, we establish the asymptotic properties for the penalized quadratic inference function estimators with fixed knots. Assume θ 0 is the parameter satisfying Eθ (gi ) = 0. Theorem 1 0 provides the consistency of the resulting estimators. We show the asymptotic normality of the estimators in Theorem 2. The theoretical results are similar to those provided in Qu and Li (2006). The difference is that we have constraints for the index loading parameters in our model, i.e. β 0 = β 1 =1, and β01 > 0, β11 > 0. To handle the constraints, we do the reparameterization as βl1 = 1 − β l,−1 22 with β l,−1 = (βl2 , ..., βl p )T for l=1, 2 (Yu and Ruppert, 2002; Cui et al., 2011; Ma and Song, 2015). Then the parameters space of β l , l=1,2, becomes [{( 1 − β l,−1 22 , βl2 , ..., βl p )T } : β l,−1 22 < 1]. Let  Jl = ∂β l ∂ β Tl,−1  = β Tl,−1 / −β 1− β l,−1 22 I p−1    β T0,−1 , β T1,−1 )T , and θ ∗ = be the Jacobian matrix of dimension p × (p − 1). Denote β −1 = (β β −1 , γ )T . From θ to θ ∗ , we have Jacobian matrix J = diag(J0 , J1 , Iq+K+1 , Iq+K+1 ). (β Theorem 1 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing parameter λN = o(1), then the estimator θ , which is obtained by minimizing the penalized quadratic function in (2.4), exists and converges to θ 0 in probability. Theorem 2 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing parameter λN = o(N −1/2 ), then the estimator θ obtained by minimizing the penalized quadratic function in (2.4) is asymptotically normally distributed, i.e., √ d −1 T N(θ − θ 0 ) − → N(0, J(GT0 C−1 0 G0 ) J ), where G0 and C0 are given in the Appendix. 11 2.4 Practical issues In this section, we discuss some practical issues when we implement the proposed method. 2.4.1 Algorithm for estimation A two-step iterative Newton-Raphson algorithm is applied when we estimate the index loading parameters and the varying spline coefficients. The algorithm of the estimation procedure can be summarized in the following steps. Step 0 Choose initial values for β and γ . Denote them by β (old) and γ (old) . Step 1 Estimate γ (new) by γ (new) = arg min(N −1 QN (γγ , β (old) ) + λ γ T Dγγ . γ The Newton-Raphson algorithm is used for the minimization. Step 2 Estimate β (new) by β , γ (new) ). β (new) = arg min QN (β β Also use Newton-Raphson for minimization. (old) Step 3 Update β l (old) by β l (new) = sign(βl1 (new) βl )β (new) / βl 2, l = 1, 2. Update γ (old) by setting γ (old) = γ (new) . Step 4 Repeat Steps 1-3 until convergence. 2.4.2 Model selection It is important to determine the order and number of knots in the spline approximation since too many knots in the model might overfit the data. Under the assumption E(g) = 0 (g is the estimation function in (2.2) for a single observation) and the number of estimating equations is larger than 2 in distribution (Hansen, 1982), where r is the the number of parameters, we have Q(θ ) → χr−k 12 dimension of g¯N (θθ ), k is the dimension of θ , θ is the estimator by minimizing the QIF when certain order and number of knots are chosen. This asymptotic property of the QIF provides a goodness-of-fit test, which can be useful to determine the order and number of knots to be selected in our model. However, it is also possible that the goodness-of-fit tests fail to reject several different models which may not be nested. Since Q(θ ) is asymptotically chi-square distributed, we can use BIC to penalize Q(θ ) for the difference of the numbers of estimating equations and parameters. In particular, the BIC criterion for a model with r estimating equation and k parameters is defined as Q(θ ) + (r − k) ln N, The model with minimum BIC would be considered better. If we choose h basis matrices in (2.2), then r − k = hk − k = (h − 1)k. As we discussed in Section 2.4.3, we usually use h=2 in our setting. Thus, the BIC criterion is actually Q(θ ) + k ln N, where k is the number of parameters in the model. In our simulation and real data application, we search the optimal order and the number of knots over a set of combinations of q and K using BIC. Knots are evenly distributed in the range of u(= β T X). 2.4.3 Choice of the basis for the inverse of the correlation matrix Qu and Li (2006) offered several choices of basis matrixes. For exchangeable working correlation, M1 is identity matrix and M2 has 0 on the diagonal and 1 off-diagonal. If the working correlation is AR(1), we can set M2 to have 1 on its two subdiagonals and 0 elsewhere. Prior information on correlation can help us to determine the choice of appropriate basis matrices. The effect of choosing different basis matrices is discussed in Qu and Li (2006) through simulation studies. Qu and Lindsay (2003) also proposed an adaptive estimation method to approximate the correlation empirically when there is no prior information available. 13 2.4.4 Choice of the tuning parameter Since the penalized spline is used to approximate the unknown functions, we need to determine the tuning parameter λ involved in the method. As Qu and Li (2006) suggested, we can extend the generalized cross-validation (Ruppert, 2002) to the penalized QIF and define the generalized cross-validation statistic as GCV(λ ) = N −1 QN (1 − N −1 df)2 where df = tr[(Q¨ N + 2Nλ D)−1 Q¨ N ] is the effective degree of freedom, QN is defined in (2.3) and Q¨ N is the second derivative of QN . The desirable choice of tuning parameter λ is λ = arg min GCV(λ ). λ In the implementation of GCV, the golden search method can be applied in order to reduce the computational time. 2.5 2.5.1 Hypothesis test Linear mixed model representation for FVICM model In our proposed FVICM model (2.1), it is of interest to test the unspecified coefficient function. In particular, we are interested in testing whether a linear function is good enough to describe the G×E interaction. Given β , let u0 = β T1 X, u1 = β T0 X, with the truncated power spline basis, the coefficient function can be modeled by q m1 (u1 ) = γ10 + γ11 u1 + γ12 u21 + · · · + γ1q u1 + K q ∑ b1k (u1 − κk )+. k=1 Our goal is to test the linearity of m1 (u1 ), which is equivalent to test H0 : γ12 = · · · = γ1q = 0, b11 = · · · = b1K = 0. q q T q Let w0i j = (1, u0i j , · · · , u0i j )T , z0i j = (u0i j − κ1 )+ , · · · , (u0i j − κK )+ q q , γ˜ 0 = (γ00 , · · · , γ0q )T , q T b0 = (b01 , · · · , b0K )T , w1i j = (1, u1i j , · · · , u1i j )T , z1i j = (u1i j −κ11 )+ , · · · , (u1i j −κ1K )+ (b11 , · · · , b1K )T , γ˜ 1 = (γ10 , · · · , γ1q )T , 14 ,b1 = m0 (u0i j ) = wT0i j γ˜ 0 + zT0i j b0 , m1 (u1i j ) = wT1i j γ˜ 1 + zT1i j b1 . We further define Yi = (yi1 , · · · , yini )T , W0i = (w0i1 , · · · , w0ini )T , W1i = (w1i1 Gi , · · · , w1ini Gi )T , Z0i = (z0i1 , · · · , z0ini )T , Z1i = (z1i1 Gi , · · · , z1ini Gi )T , then a linear mixed model (LMM) representation (Wang and Chen, 2012) can be obtained as, Yi = W0i γ˜ 0 + W1i γ˜ 1 + Z0i b0 + Z1i b1 + 1i ai + ε i , i = 1, · · · , n, (2.6) bl ∼ N(0, σb2 IK ), l = 0, 1, ε i ∼ N(0, σε2 I), l where the random incept effects ai are assumed to be independent as N(0, σa2 ) which model the correlation in the response. With the LMM representation, testing the linearity of the varying index coefficients is equivalent to test some fixed effects and a variance component in model (2.6). To be specific, we want to test H0 : γ12 = · · · = γ1q = 0 and σb2 = 0. 1 2.5.2 2.5.2.1 (2.7) LRT and pseudo-LRT in LMM LRT for one variance component Crainiceanu and Ruppert (2004) proposed the likelihood ratio test in linear mixed effect models with one variance component. Consider a LMM with one variance component         2 0   b   0K   b   σb Σ β + Zb + ε , E   =  Y = Xβ  , Cov   =  , ε 0n ε 0 σε2 In (2.8) where β is a p-dimensional vector of fixed effect coefficients, b is a K-dimensional vector of random effects, 0K is a K-dimensional vector of zeros, Σ is a known K × K symmetric positive definite matrix. Let λ = σb2 /σε2 be the signal-to-noise ratio and then the covariance matrix of 15 ΣZT . Consider testing for the null Y cab be written as Cov(Y) = σε2 Vλ , where Vλ = In + λ ZΣ hypothesis H0 : β p+1−p = 0, · · · , β p = 0, σb2 = 0 (2.9) for p > 0. The LRT statistic is defined as β , λ , σε2 ) − sup L(β β , λ , σε2 ) . LRTn ∝ 2 sup L(β HA H0 If we substitute the parameters β and σε2 with their profile estimators Y, β (λ ) = (XT V−1 X)−1 XT V−1 λ λ σε2 (λ ) = {Y − Xβ (λ )}T Vλ−1 {Y − Xβ (λ )} n , for fixed λ , we obtain the LRT statistic LRTn = sup {n log(YT S0 Y) − n log(YT PTλ V−1 Pλ Y) − log |Vλ |}, λ (2.10) λ ≥0 where Pλ = In − X(XT V−1 X)−1 XT V−1 , X0 denotes the design matrix of fixed effects under the λ λ null hypothesis, S0 = In − X0 (XT0 X0 )−1 XT0 . Theorem 1 in Crainiceanu and Ruppert (2004) provides the distribution of LRT statistic (2.10). Σ1/2 , ξs be the eigenvalues of Σ 1/2 ZT ZΣ Σ1/2 , s = 1, · · · , K, Let µs be the eigenvalues of Σ 1/2 ZT P0 ZΣ then p ∑1 u2s d LRTn = n 1 + n−p ∑1 w2s iid + sup fn (λ ), λ ≥0 iid where us ∼ N(0, 1) for s = 1, · · · , K, ws ∼ N(0, 1) for s = 1, · · · , n − p, and fn (λ ) = n log 1 + K Nn (λ ) − ∑ log(1 + λ µs ), Dn (λ ) s=1 with K Nn (λ ) = λ µs ∑ 1 + λ µs w2s , s=1 16 (2.11) n−p w2s Dn (λ ) = ∑ + ∑ w2s . s=1 1 + λ µs s=K+1 K The distribution in (2.11) only depends on the eigenvalues µs and ξs . Based on the spectral decomposition, simulation from this distribution can be done very rapidly. Detailed algorithm for this simulation can be found in Crainiceanu and Ruppert (2004). 2.5.2.2 Pseudo-LRT for multiple variance components For a LMM with multiple variance components β + Zb1 + · · · + ZbL + ε , Y = Xβ (2.12) bl ∼ N(0, σb2 Σ l ), l = 1, · · · , L, ε ∼ N(0, σε2 In ), l where bl , l = 1, · · · , L are random effects and L > 1. Suppose we are interested in testing H0 : β p+1−p = 0, · · · , β p = 0, σb2 = 0. L Greven et al. (2008) proposed to approximate the distribution of LRT for the model (2.12) based on the pseudo-likelihood ratio test theory (Liang and Self, 1996) by using a pseudo-outcome. In the framework of model (2.12), bi , i = L, are nuisance random parameters. We can define the pseudo-outcome as Y = Y − ∑ Zi bi , i=L where bi are the best linear unbiased predictors (BLUP) of nuisance random effects bi , i = L. The the model (2.12) can be reduced to β + ZL bL + ε . Y = Xβ (2.13) Then the method for testing one variance component introduced by Crainiceanu and Ruppert (2004) can be applied to the model in (2.13). 17 2.5.3 Pseudo-LRT in FVICM model For the model in (2.6), we can use the idea of Greven et al. (2008) and define the pseudo-outcome Yi = Yi − Z0i b0 − Ui ai , i = 1, · · · , n, where b0 and ai are BLUPs of b0 and ai , respectively. The reduced model using pseudo-outcome for model (2.6) can be written as Yi = W0i γ˜ 0 + W1i γ˜ 1 + Z1i b1 + ε i . i = 1, · · · , n. (2.14) For the new model (2.14) using pseudo-response, we can apply the method for the single variance component model introduced in Section 2.10 to test hypothesis (2.7). Statistical significance can be assessed through the resampling approach described in section 2.5.2.1. 2.6 2.6.1 Simulation study Simulation In this section, the finite sample performance of the proposed method is evaluated through Monte Carlo simulation studies. We generate three covariates X1 , X2 , X3 . For each subject i, X1i j , X2i j , X3i j are generated independently from uniform distribution U(0, 1). We set the minor allele frequency (MAF) as pA =(0.1, 0.3, 0.5) and assume Hardy-Weinberg equilibrium. We use AA, Aa and aa to denote three different SNP genotypes, where allele A is the minor allele. These genotypes are simulated from a multinomial distribution with frequencies p2A , 2pA (1 − pA ) and (1 − pA )2 , respectively. Variable G takes value in the set {0,1,2}, corresponding to genotypes {aa, Aa, AA} respectively. The error term εi = (εi1 , · · · , εini ) are independently generated from the multivariate normal distribution N(0, 0.1R(ρ)). The true correlation structure R(ρ) is assumed to be exchangeable with ρ=0.5 and 0.8. √ √ We set m0 (u0 ) = cos(πu0 ) and m1 (u1 ) = sin[π(u1 − A)/(B − A)] with A = 3/2 − 1.645/ 12 √ √ √ √ √ √ √ and B = 3/2+1.645/ 12. The true parameters are β 0 = ( 5, 4, 4)/ 13 and β 1 = (1, 1, 1)/ 3. 18 To simplify the simulation and save computational time, we consider the balanced case, which means each observation has the same number of time points. We draw 1000 data sets with sample size N = 200, 500 and time points ni = T = 10. Since the true correlation structure is exchangeable, we set M1 to be the identity matrix and M2 to be 0 on the diagonal and 1 off-diagonal. The order and number of knots of the splines are chosen by using the BIC method. 2.6.2 Performance of estimation Table 2.1 summarizes the results based on 1000 replications. In this table, the average bias (Bias), the standard deviation of the 1000 estimates (SD), the average of the estimated standard error (SE) based on the theoretical results, and the estimated coverage probability (CP) at 95% confidence level are reported. Note that the estimation of the loading parameter β 1 improves as MAF pA increases, while the estimation of β 0 show a opposite direction. This is because we have limited data information to estimate the marginal effects m0 (·) when pA increases. As the sample size increases, the performance of the estimation improves by showing smaller bias, SD and SE. Table 2.1 Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and correlation ρ=0.5. N Param True 200 β01 0.620 β02 0.555 β03 0.555 β11 0.577 β12 0.577 β13 0.577 500 β01 β02 β03 β11 β12 β13 0.620 0.555 0.555 0.577 0.577 0.577 pA = 0.1 Bias SD 7.3E-04 0.008 -3.9E-04 0.008 -6.2E-04 0.008 -2.3E-05 0.018 -6.3E-04 0.018 -3.9E-04 0.018 SE 0.008 0.009 0.008 0.020 0.020 0.020 CP 95.6 93.2 94.4 91.0 91.3 91.0 pA = 0.3 Bias SD 1.7E-03 0.009 -1.0E-03 0.010 -1.2E-03 0.010 -3.1E-04 0.011 -3.0E-04 0.011 2.8E-04 0.011 SE 0.010 0.010 0.010 0.011 0.011 0.011 CP 96.2 92.5 94.2 93.7 94.3 94.8 pA = 0.5 Bias SD 1.5E-03 0.011 -1.2E-03 0.012 -8.5E-04 0.012 -8.6E-04 0.009 -6.8E-05 0.009 7.1E-04 0.009 SE 0.011 0.011 0.011 0.009 0.009 0.009 CP 95.0 92.4 93.0 94.7 93.8 93.1 7.5E-04 -5.7E-04 -3.4E-04 6.4E-04 -6.0E-04 -4.1E-04 0.005 0.005 0.005 0.012 0.012 0.012 95.5 94.4 93.9 93.8 93.6 94.6 1.7E-03 -1.1E-03 -8.7E-04 -1.7E-04 -1.5E-05 6.0E-05 0.006 0.006 0.006 0.007 0.007 0.007 95.1 94.6 94.1 95.6 96.1 95.0 1.6E-03 -8.8E-04 -1.1E-03 -7.3E-04 5.3E-04 1.1E-04 0.007 0.007 0.007 0.006 0.006 0.006 95.8 95.2 94.7 95.1 94.7 95.6 0.005 0.005 0.005 0.012 0.012 0.012 0.006 0.006 0.006 0.007 0.007 0.007 0.007 0.007 0.007 0.006 0.006 0.006 The plots for the estimations of m0 (u0 ) and m1 (u1 ) under different sample size and MAFs are shown in Figure 2.1 and Figure 2.2. The estimated and true functions are denoted by the solid 19 and dashed lines, respectively. The 95% confidence band is denoted by the dotted-dash line. The estimated curves almost overlap with the corresponding true curves as shown in the plots. The confidence bands are tight, especially under a large sample size. Note that the estimation for the interaction effects m1 (u1 ) improves as MAF pA increases, while the estimation for the marginal effects m0 (u0 ) show a opposite direction, which coincides with the results for the parametric estimation in Table 2.1. N=200, pA=0.1 N=200, pA=0.3 0.5 0.0 −0.5 0.5 m0(u0) m0(u0) m0(u0) 0.5 0.0 −0.5 0.25 0.50 0.75 u0 1.00 1.25 −1.0 0.25 N=500, pA=0.1 0.50 0.75 u0 1.00 1.25 0.0 −0.5 −0.5 0.75 u0 1.00 1.25 1.00 1.25 0.0 −0.5 −1.0 −1.0 −1.0 0.75 u0 0.5 m0(u0) m0(u0) 0.0 0.50 n=500, pA=0.5 0.5 0.50 0.25 N=500, pA=0.3 0.5 0.25 0.0 −0.5 −1.0 −1.0 m0(u0) N=200, pA=0.5 0.25 0.50 0.75 u0 1.00 1.25 0.25 0.50 0.75 u0 1.00 1.25 Figure 2.1 The estimation of function m0 (·) under different MAFs when N=200, 500 and ρ=0.5. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotted-dash line. 20 N=200, pA=0.1 N=200, pA=0.3 N=200, pA=0.5 0.5 0.5 0.5 0.0 m1(u1) 1.0 m1(u1) 1.0 m1(u1) 1.0 0.0 0.0 −0.5 −0.5 −0.5 0.25 0.50 0.75 u1 1.00 1.25 0.25 N=500, pA=0.1 0.50 0.75 u1 1.00 1.25 0.25 N=500, pA=0.3 0.5 0.5 −0.5 −0.5 0.50 0.75 u1 1.00 1.25 1.25 0.0 0.0 0.25 1.00 m1(u1) 0.5 m1(u1) 1.0 m1(u1) 1.0 −0.5 0.75 u1 N=500, pA=0.5 1.0 0.0 0.50 0.25 0.50 0.75 u1 1.00 1.25 0.25 0.50 0.75 u1 1.00 1.25 Figure 2.2 The estimation of function m1 (·) under different MAFs when N=200, 500 and ρ=0.5. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotted-dash line. The performance of the estimation for ρ = 0.8 is shown in Table 2.2, Figure 2.3 and Figure 2.4. It is seen that the SD and SE are smaller when ρ is larger compared to the results when ρ = 0.5. The confidence bands are a little bit wider, especially for m0 when pA =0.5 and for m1 when pA =0.1 for larger ρ. In summary, the simulation results show that the estimation method performs reasonably well under different simulation settings in finite samples. 21 Table 2.2 Simulation results for pA = 0.1, 0.3, 0.5 with sample size N = 200, 500 and correlation ρ=0.8 N Param True 200 β01 0.620 β02 0.555 β03 0.555 β11 0.577 β12 0.577 β13 0.577 500 β01 β02 β03 β11 β12 β13 0.620 0.555 0.555 0.577 0.577 0.577 pA = 0.1 Bias SD 4.4E-04 0.005 -2.2E-04 0.006 -3.6E-04 0.006 -6.7E-05 0.014 -2.4E-04 0.014 -1.8E-04 0.014 SE 0.005 0.005 0.005 0.012 0.012 0.012 CP 95.8 92.3 94.1 90.3 91.8 89.9 pA = 0.3 Bias SD 5.5E-04 0.006 -3.2E-04 0.007 -4.0E-04 0.006 -2.4E-04 0.007 -1.3E-04 0.007 2.4E-04 0.007 SE 0.006 0.006 0.006 0.007 0.007 0.007 CP 95.8 91.8 94.6 94.3 94.2 93.7 pA = 0.5 Bias SD -5.0E-06 0.007 -2.8E-04 0.008 1.4E-04 0.007 -7.7E-04 0.006 3.3E-05 0.006 6.4E-04 0.006 SE 0.007 0.007 0.007 0.006 0.006 0.006 CP 95.3 92.7 93.7 92.7 93.4 93.5 5.3E-04 -4.0E-04 -2.3E-04 2.5E-04 -2.9E-04 -1.1E-04 0.003 0.003 0.003 0.007 0.007 0.007 94.0 93.8 93.1 94.0 93.5 95.4 5.8E-04 -4.2E-04 -2.8E-04 -1.5E-04 4.4E-05 6.1E-05 0.004 0.004 0.004 0.004 0.004 0.004 95.4 94.8 94.2 95.3 95.7 95.4 3.3E-04 -1.3E-04 -2.9E-04 -6.8E-04 4.5E-04 1.9E-04 0.005 0.004 0.004 0.004 0.004 0.004 95.6 95.0 94.7 93.9 93.5 95.2 0.004 0.003 0.004 0.008 0.008 0.007 N=200, pA=0.1 0.004 0.004 0.004 0.004 0.004 0.004 N=200, pA=0.3 0.004 0.005 0.004 0.004 0.004 0.004 N=200, pA=0.5 1.0 0.5 0.0 −0.5 0.5 m0(u0) m0(u0) m0(u0) 0.5 0.0 −0.5 −0.5 −1.0 −1.0 −1.0 0.25 0.50 0.75 u0 1.00 1.25 0.25 N=500, pA=0.1 0.75 u0 1.00 1.25 −0.5 0.0 −0.5 0.25 0.50 0.75 u0 1.00 1.25 0.75 u0 1.00 1.25 N=500, pA=0.5 0.0 −0.5 −1.0 −1.0 −1.0 0.50 0.5 m0(u0) 0.0 0.25 1.0 0.5 m0(u0) m0(u0) 0.50 N=500, pA=0.3 0.5 0.0 0.25 0.50 0.75 u0 1.00 1.25 0.25 0.50 0.75 u0 1.00 1.25 Figure 2.3 The estimation of function m0 (·) under different MAFs when N=200, 500 and ρ=0.8. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotted-dash line. 22 N=200, pA=0.3 N=200, pA=0.5 1.0 0.5 0.5 0.5 0.0 m1(u1) 1.0 1.0 m1(u1) m1(u1) N=200, pA=0.1 0.0 0.0 −0.5 −0.5 −0.5 0.25 0.50 0.75 u1 1.00 1.25 0.25 0.75 u1 1.00 1.25 0.25 N=500, pA=0.3 0.5 0.5 0.5 0.50 0.75 u1 1.00 1.25 1.25 −0.5 −0.5 0.25 1.00 0.0 0.0 −0.5 0.75 u1 m1(u1) 1.0 1.0 0.0 0.50 N=500, pA=0.5 1.0 m1(u1) m1(u1) N=500, pA=0.1 0.50 0.25 0.50 0.75 u1 1.00 1.25 0.25 0.50 0.75 u1 1.00 1.25 Figure 2.4 The estimation of function m1 (·) under different MAFs when N=200, 500 and ρ=0.8 The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence band is denoted by the dotted-dash line. 2.6.3 Performance of hypothesis tests We evaluate the performance of the test for the nonparametric function under the null hypothesis H0 : m1 (·) = m01 (·), where m01 (u1 ) = δ0 + δ1 u1 , δ0 and δ1 are some constants, which corresponds to a linear G×E interaction. If we fail to reject the null, then a linear model can be fit to further assess the linear G×E interaction. Otherwise, we conclude nonlinear G×E interaction. Power is evaluated under a sequence of alternative models with different values of τ, which is denoted by H1τ : mτ1 (·) = m01 (·) + τ{m1 (·) − m01 (·)}. When τ = 0, the corresponding power is the false positive rate. Figure 2.5 shows the size (when τ = 0) and power (when τ > 0) at significance level 0.05. We obtain 1000 Monte Carlo simulations each with 5000 replications to access the null distribution of test statistic under sample sizes N = 200, 500 with ρ = 0.5. The empirical Type I error under three MAFs are very close to the nominal level 0.05 and the power increases dramatically when MAF 23 increases from 0.1 to 0.3. Results for ρ = 0.8 is presented in Figure 2.6. Similarly, the empirical Type I error is close to 0.05 and the power increases rapidly when MAF increases from 0.1 to 0.3. Compared to the performance when ρ = 0.5 shown in Figure 2.5, the power increases a little bit slower when ρ = 0.8. The results indicate that our method can reasonably control the false positive rates and has appropriate power to detect the genetic variation. N=200, ρ=0.5 N=500, ρ=0.5 0.7 0.7 Power 1.0 Power 1.0 0.5 0.5 0.3 0.3 pA 0.1 0.3 0.5 0.1 0.00 0.05 0.10 τ 0.15 pA 0.1 0.3 0.5 0.1 0.20 0.00 0.05 0.10 τ 0.15 0.20 Figure 2.5 The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=200, 500 and ρ=0.5. N=200, ρ=0.8 N=500, ρ=0.8 0.7 0.7 Power 1.0 Power 1.0 0.5 0.5 0.3 0.3 pA 0.1 0.3 0.5 0.1 0.00 0.05 0.10 τ 0.15 pA 0.1 0.3 0.5 0.1 0.20 0.00 0.05 0.10 τ 0.15 0.20 Figure 2.6 The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=200, 500 and ρ=0.8. 24 2.7 Real data application We applied the proposed FVICM model to a real data set from a study examining the association of the A118G SNP in OPRM1 to experimental pain sensitivity (Jonson and Terra, 2002). A group of 163 men and women in ages from 32 to 86 years participated in the study. Systolic blood pressure (SBP), diastolic blood pressure (DBP) and heart rate (HR) were measured at 6 Dobutamine dosage levels for each subject. Dobutamine is a medication that is used to treat congestive heart failure by increasing heart rate and cardiac contractility. Dobutamine was injected into these subjects to investigate their response in heart rate and blood pressure to this drug, at different dosage levels: 0 (baseline), 5, 10, 20, 30 and 40 mcg/min. In this study, dosage levels are treated as time and measurements at different dosage levels are considered as longitudinal measures. In addition to that, age and body mass index (BMI) were also recorded. Total five SNPs in genes Beta1 AR and Beta2 AR were genotyped, namely, codon16, codon27, codon49, codon389, and codon492. We choose X1 = dosage level as the "time-varying" variable, and X2 = age and X3 = BMI as the "time-invariant" variable. Our goal is to evaluate how the SNPs interact with age, BMI and dose level to affect SBP, DBP and HR. With the proposed FVICM model, we can model the dynamic gene effect on drug response under different dosage levels. In this analysis, we test whether any SNP is associated with the drug response based on the hypothesis test H0 : m1 (u1 ) = δ0 + δ1 u1 with p-value denoted by pm1 in Table 2.3 - 2.5. We also reported the p-values for testing the significance of coefficients β11 , β12 and β13 , which are labeled by pβ , pβ and pβ , based on the asymptotic normality of the estimates. We also compare our 11 12 13 ∗ (X )+β ∗ X + proposed model to an additive varying-coefficient model (AVCM) E(Y |X, G) = β01 1 02 2 ∗ X + {β ∗ (X ) + β ∗ X + β ∗ X }G, where β ∗ (·) and β ∗ (·) are unknown functions of X . To β03 3 1 11 1 12 2 13 3 01 11 see the relative gain by integrative analysis, we calculate the MSEs of both models. The p-values ∗ (·) = β ∗ = β ∗ = 0 for AVCM is also reported in the tables and denoted by for testing H0 : β11 12 13 pAVCM . Table 2.3 summarizes the performance of our method for response SBP. In the table, pm1 for all 25 the 5 SNPs are smaller than the significance level 0.05, which implies the nonlinear function of the SNPs on SBP in response to the dosage level, age and BMI as a whole. The MSEs in the last two columns shows that FVICM fits the data better than AVCM, indicating the benefit of integrative analysis. Besides, the testing results for AVCM do not show significance of the coefficients, which further implies that the genetic effects of SNPs are nonlinearly modified by the mixture of these three variables. Figure 2.7 shows the fitted nonlinear functions for each SNP, along with the 95% confidence bands. Table 2.3 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for SBP. SNP ID codon16 codon27 codon49 codon389 codon492 MAF Alleles 0.3990 0.4160 0.1387 0.3045 0.4250 A/G G/C G/A G/C T/C p m1 pβ 11 p-value pβ 12 pβ 13 <1.0E-04 0.0011 <1.0E-04 0.0917 <1.0E-04 <1.0E-04 0.0027 0.1675 <1.0E-04 <1.0E-04 0.3614 0.8668 <1.0E-04 <1.0E-04 <1.0E-04 0.7552 <1.0E-04 0.4102 <1.0E-04 0.0182 pAVCM MSE FVICM AVCM 0.5308 0.6748 0.2910 0.3927 0.2990 0.0403 0.0388 0.0398 0.0397 0.0392 0.0421 0.0415 0.0410 0.0431 0.0409 Table 2.4 presents similar results for response DBP. The values of pm1 shows that the test results for all 5 SNPs are significant, indicating nonlinear interactions for all 5 SNPs, while no significance is shown for AVCM model. MSEs further support our method by showing smaller value for FVICM comparing with AVCM. The estimated interaction curves with 95% confidence bands are shown in Figure 2.8. Table 2.4 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for DBP. SNP ID codon16 codon27 codon49 codon389 codon492 MAF Alleles 0.3990 0.4160 0.1387 0.3045 0.4250 A/G G/C G/A G/C T/C pm1 pβ 11 p-value pβ 12 pβ 13 pAVCM 0.0066 <1.0E-04 0.2834 0.0007 0.3160 0.0004 0.8431 <1.0E-04 <1.0E-04 0.0946 0.0003 0.5750 <1.0E-04 0.0042 0.7986 0.0001 <1.0E-04 0.9675 <1.0E-04 0.2615 0.0001 0.7934 <1.0E-04 <1.0E-04 0.5837 MSE VICM AVCM 0.0366 0.0360 0.0369 0.0369 0.0369 0.0372 0.0386 0.0395 0.0377 0.0389 In Table 2.5, the performance of our method for trait HR also leads to similar conclusion expect 26 codon16 codon27 0.05 codon49 0.3 0.10 0.1 m1(u1) 0.05 0.2 m1(u1) m1(u1) 0.00 −0.05 0.00 −0.10 −0.05 −0.15 0.0 −1.00 −0.75 −0.50 u1 −0.25 0.00 −0.10 0.00 0.25 0.50 u1 0.75 codon389 1.00 0.00 0.25 0.75 1.00 0.50 u1 0.75 1.00 codon492 0.6 0.0 m1(u1) m1(u1) 0.4 −0.1 0.2 0.0 −0.2 0.00 0.25 0.50 u1 0.75 1.00 0.00 0.25 0.50 u1 Figure 2.7 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is SBP. for SNP codon16, which shows significant test results for both models. For all the other SNPs, FVICM outperforms AVCM in terms of MSE. Figure 2.9 displays the corresponding estimated nonlinear interaction curves. Table 2.5 List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE for HR. SNP ID codon16 codon27 codon49 codon389 codon492 MAF Alleles 0.3990 0.4160 0.1387 0.3045 0.4250 A/G G/C G/A G/C T/C pm1 pβ 11 p-value pβ 12 pβ 13 pAVCM <1.0E-04 <1.0E-04 0.1158 0.0028 0.0328 <1.0E-04 0.0007 0.6434 0.0001 0.9620 0.0001 0.0147 0.0172 0.0133 0.8371 <1.0E-04 <1.0E-04 0.0024 0.0021 0.8959 0.0002 <1.0E-04 0.0011 0.0582 0.3732 27 MSE VICM AVCM 0.0309 0.0320 0.0298 0.0311 0.0315 0.0308 0.0325 0.0300 0.0313 0.0316 codon16 codon27 codon49 0.3 0.15 0.3 0.2 0.10 0.05 m1(u1) m1(u1) m1(u1) 0.2 0.1 0.1 0.00 0.0 0.0 −0.05 −0.1 −0.1 −0.2 0.2 u1 0.6 1.0 0.3 0.6 u1 0.9 codon389 −1.00 −0.75 −0.50 u1 −0.25 0.00 codon492 0.3 0.05 0.2 m1(u1) m1(u1) 0.00 −0.05 −0.10 0.1 0.0 −0.15 0.0 u1 0.5 1.0 0.3 0.6 u1 0.9 Figure 2.8 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is DBP. 2.8 Discussion In this paper, we propose a functional varying index coefficient modeling procedure to study gene effects nonlinearly modified by a mixture of environmental variables in a longitudinal design. We implement the quadratic inference function (QIF) method to estimate the index loading and spline coefficients. Furthermore, we apply the pseudo likelihood ratio test in a linear mixed model representation to test the linearity of the nonparametric coefficient function. Simulation study has been conducted to illustrate the estimation and testing procedures and confirm the asymptotical property. Real analysis shows that our model outperforms the additive varying coefficient model, which considers the G×E effect for each single environmental factor separately. Our FVICM model distinguishes the varying coefficient model for longitudinal data. In fact, the varying coefficient model is a special case of our model when the dimension of the X variable 28 codon16 codon27 codon49 0.2 0.0 −0.1 0.00 0.1 m1(u1) 0.05 m1(u1) m1(u1) 0.10 −0.2 0.0 −0.3 −0.05 0.00 0.25 0.50 u1 0.75 1.00 −0.50 −0.25 u1 0.00 0.25 codon389 0.00 0.25 0.50 u1 0.75 1.00 codon492 0.15 0.10 m1(u1) m1(u1) 0.1 0.0 0.05 0.00 −0.05 0.00 0.25 0.50 u1 0.75 1.00 0.5 u1 1.0 Figure 2.9 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence band is denoted by the dashed line. Response is HR. reduces to one. FVICM is able to capture the effect of genes nonlinearly modified by the joint effect of multiple environmental variables as a whole. In addition, it can reduce multiple testing burden by treating multiple environmental variables as a single index variable. We apply the model to a pain sensitivity study. Testing results indicate that all five SNPs have significant nonlinear interaction effects with environmental factors, which makes practical sense since these SNPs were genotyped from candidate genes. Our model was motivated by a practical need in G×E study. However, the method can be applied to any longitudinal data in which the purpose is to model nonlinear interaction effects. For example, we can consider gene expressions in a pathway (denoted as X ) and model how they regulate downstream genes (G) to affect a disease trait. Both the trait and gene expressions can be measured over time. Thus, one can understand the dynamic effect of genes nonlinearly regulated by a pathway to affect a disease trait. 29 CHAPTER 3 GENERALIZED FUNCTIONAL VARYING INDEX COEFFICIENT MODEL FOR DYNAMIC GENE-ENVIRONMENT INTERACTIONS 3.1 Introduction Longitudinal data analysis is very common in epidemiological studies when the response variables are measured over time on objectives. Many studies demonstrated the increased power of a longitudinal design in detecting genetic associations over cross-sectional designs (Sitlani et al. 2015; Furlotte et al. 2015; Xu et al. 2014). On the other hand, there has been growing interest in the role of G×E interaction in many human diseases, such as Parkinson disease (Ross and Smith, 2007) and type 2 diabetes (Zimmet et al., 2001). In many studies, G×E has been traditionally investigated based on a single environment exposure model. However, evidence from an increasing number of studies has shown that risk of disease can be modified by simultaneous exposure to multiple environmental factors, which might be higher than what would be expected from simple addition of the single effects of environmental factors (Carpenter et al., 2002; Sexton and Hattis, 2007). Thus, of particular interest and complexity are assessing the combined effect of environmental mixtures and the mechanism in which they interact with genes to affect disease risk. Some researches have been done to assess nonlinear interactions between environmental mixtures and genes by applying some nonparametric or semiparametric models, such as the varying index coefficients model (VICM) proposed by Ma and Song (2015) and the partial linear multi-varying index coefficients model (PLMVICM) by Liu et al. (2016) and the generalized PLMVICM by Liu et al. (2017). However, these methods were developed for cross-sectional data and they can not be used for longitudinal data. This motivates us to extend the varying index coefficient model to longitudinal traits. In our previous work, we proposed a functional varying index coefficient model to capture the nonlinear G×E interaction for continuous longitudinal traits. However, in practice, it is possible 30 that the response measured over time is a discrete variable, for example, a binary measure of a disease status. In human genetics, many disease traits are binary in nature, being affected vs unaffected (or cases vs controls). In order to investigate the dynamic nonlinear G×E interaction with environmental mixtures as a whole for a binary longitudinal trait, we propose the following generalized functional varying index coefficient model (gFVICM): β T0 Xi j ) + m1 (β β T1 Xi j )Gi j , g E(Yi j |Xi j , Gi j ) = m0 (β (3.1) where Yi j (= 0 or 1) denotes the binary longitudinal response variable observed for the ith subject at the jth time point; Xi j is a p-dimensional vector of environmental variables, which can be either time-variant or time-invariant variables; Gi denotes the SNP variable which does not depend on time; g(·) is a known link function; m0 (·) and m1 (·) are unknown nonparametric smooth functions which depend on the data; and β 0 and β 1 are p-dimensional vectors of index loading parameβ T1 X) captures the interaction effect between environmental ters. In this model, the function m1 (β mixtures and the genetic variable (e.g., a single nucleotide polymorphism (SNP)) on the risk of disease. The aim of this paper is to develop a set of statistical estimation and hypothesis testing procedure for model (3.1). The Generalized Estimation Equation (GEE) method, proposed by Liang and Zeger in 1986, has been widely used in longitudinal data analysis. However, there are several disadvantages of GEE method due to some of its critical assumptions (Song et al., 2009). One disadvantage is that the consistency of GEE estimators are based on the consistency of estimators for the nuisance correlation parameter (Crowder, 1986, 1995). Another shortcoming of the GEE method is that model selection and hypothesis testing are complicated. This is because the estimation procedure of the GEE method does not involve an objective function. The quadratic inference function (QIF) approach proposed by Qu et al. (2000) is one of the improvements of the GEE method. The QIF avoids estimating the nuisance correlation parameters and has been confirmed by Qu et al. (2000) to be generally more efficient than the GEE. In addition, since the QIF is built upon an objective function which is asymptotically chi-square distributed, we can naturally 31 implement the model selection criterion such as BIC to the QIF. The asymptotic property can also allows us to conduct hypothesis tests. This motivates us to extend the QIF method to our model for estimation and hypothesis testing. In our proposed estimation procedure, we first use penalized splines (Ruppert and Carroll, 2000) to approximate the nonparametric smooth functions ml (·), l=0, 1. Then we develop a profile estimation procedure to estimate the index loading parameters and spline coefficients iteratively based on the QIF approach. In order to avoid overfitting and reduce the number of parameters in spline approximation, we use BIC method by adding a penalty to the objective function. Under certain regularity conditions we establish the asymptotic normality of the resulting estimators. In addition, we are interested in testing the linearity of G×E interaction, i.e. the linearity of function m1 (·). The QIF can be regarded as an inference function which has properties similar to the likelihood ratio test. Based on that, we construct a testing procedure for linearity of nonparametric interaction function, where the test statistic asymptotically follows a χ 2 distribution. The rest of this chapter is organized in the following way: In Section 3.2, we propose an estimation procedure for model (3.1) and also provide the consistency and asymptotic normality of the proposed estimator; In Section 3.3, we derive a testing procedure for the linearity of nonparametric interaction function based on the goodness-of-fit test of QIF. The finite sample performance of the proposed procedure are accessed by Monte Carlo simulations illustrated in Section 3.4; In Section 3.5, the application of the proposed methodology is shown through the analysis of a pain sensitivity data with a binary response variable indicating whether a subject has hypertension or not (Yes=1, No=0); Some discussions are given in Section 3.6; The proofs of are rendered in Appendix. 3.2 3.2.1 The model and estimation methods The model For a longitudinal disease trait, suppose the binary response yi j , the p-dimensional covariate vector x i j , and the SNP variable Gi are observed for the ith observation at the jth time point, where 32 i=1,...,N, j=1,...,ni . Assume that the observations from different subjects are independent, but those within the same subject are correlated. We also assume the model satisfies the first moment assumption: β T0 x i j ) + m1 (β β T1 x i j )Gi , µi j (xxi j , Gi ) = E(yi j |xxi j , Gi ) = g−1 m0 (β where g−1 (·) is a given inverse link function. If we use a logit link function for binary response, then the model can be written as µi j (xxi j , Gi ) = P(yi j = 1|xxi j , Gi ) = β T0 xi j ) + m1 (β β T1 xi j )Gi } exp{m0 (β β T0 x i j ) + m1 (β β T1 x i j )Gi } 1 + exp{m0 (β . For model identifiability, we have the constraints β 0 = β 1 = 1 and the first elements of β 0 and β 1 are positive. 3.2.2 Quadratic inference function for gFVICM First, we approximate the unknown coefficient functions m0 (u0 ) and m1 (u1 ) by truncated power spline basis as ml (ul ) = ml (ul , β ) ≈ B(ul )T γ l , for l = 0, 1, q (3.2) q β T0 , β T1 )T , B(u) = (1, u, u2 , ..., uq , (u − κ1 )+ , ..., (u − κK )+ )T is a q-degree truncated where β = (β power spline basis with K knots κ1 , ..., κK ; γ 0 and γ 1 are (q + K + 1)-dimensional vectors of spline coefficients. A marginal approach such as the GEE assumes that the marginal mean µi j is a function of the covariates through a link function and the variance of yi j is a function of the mean var(yi j ) = V (µi ). The generalized estimation equation for longitudinal data is N ∑ µ˙ Ti V−1 i (yi − µ i ) = 0, i=1 ∂µ where yi = (yi1 , ..., yini )T , µ i = E(yi ) is the mean function and µ˙ i = ∂ θi is the first derivative of β T , γ T )T , with γ = (γγ T0 , γ T1 )T . The covariance matrix Vi can µ i with respect to parameters θ = (β 1/2 be decomposed as Vi = Ai 1/2 R(ρ)Ai with Ai being a diagonal matrix of marginal variances and 33 R(ρ) being a common working correlation matrix with a small number of nuisance parameters ρ. Using the spline approximation in (3.2), the mean function can be written as    −1 T β T x )γγ + BT (β β T1 xi1 )γγ 1 Gi } 0 i1 0  µi1 (θθ )   g {B (β    .. .. = µ i = µ i (θθ ) =  . .       β T0 xini )γγ 0 + BT (β β T1 xini )γγ 1 Gi } µini (θθ ) g−1 {BT (β and the first derivative of µ i is   µ˙ i =   βT γ T (g−1 ) BT d (β 0 xi1 )γ 0 xi1 .. . T βT γ (g−1 ) BT d (β 1 xi1 )γ 1 Gi xi1 .. . βT γ T (g−1 ) BT d (β 0 xini )γ 0 xini T βT γ (g−1 ) BT d (β 1 xini )γ 1 Gi xini ∂ B(u) βT (g−1 ) BT (β 0 xi1 ) .. .    ,   βT (g−1 ) BT (β 1 xi1 )Gi .. . −1 T β T x )G βT (g−1 ) BT (β 0 xini ) (g ) B (β 1 ini i q−1 q−1 where Bd (u) = ∂ u = (0, 1, 2u, ..., quq−1 , q(u − κ1 )+ , ..., q(u − κK )+ ). In the QIF method, the inverse of the working correlation matrix can be approximated by a linear combination of several basis matrices, i.e. R−1 (ρ) ≈ a1 M1 + ... + ah Mh , where M1 is the identity matrix and M2 , ..., Mh are known basis matrixes. For example, if the working correlation is exchangeable, R−1 ≈ a1 M1 + a2 M2 with M2 having 0 on the diagonal and 1 off-diagonal. If the working correlation is AR(1), then R−1 ≈ a∗1 M1 + a∗2 M∗2 and we can set M∗2 to have 1 on its two subdiagonals and 0 elsewhere. The advantage of this method is that the estimation of nuisance parameters a1 , ..., ah are not required. Following this idea, we can derive the estimation function  −1/2 −1/2 µ˙ Ti Ai M1 Ai (yi − µ i ) ∑N i=1  1 1 N .. g¯N (θθ ) = ∑ gi (θθ ) =  . N i=1 N  ˙ T −1/2 Mh A−1/2 (yi − µ i ) ∑N i i=1 µ i Ai       (3.3) We cannot obtain the estimators by simply setting each element in g¯N to be zero since the number of equations is more than the number of unknown parameters. To deal with this issue, we can estimate the parameters by minimizing the following quadratic inference function, −1 g¯ , QN (θθ ) = N g¯TN C¯N N 34   ,  T where C¯N = N1 ∑N i=1 gi gi is a consistent estimator for var(gi ). By minimizing the quadratic infer- ence function, we can obtain the estimation of the parameters as, θ = arg min QN (θθ ). θ In order to overcome the issue of over-parameterization, we can add a penalty term to QIF to penalize the number of knots in the approximation. The penalized QIF is written as N −1 QN (θθ ) + λ θ T Dθθ , (3.4) where D is a diagonal matrix with 1 if the corresponding parameter is the spline coefficient associated with knots and 0 otherwise, that is, D = diag(0T(2p+q+1)×1 , 1TK×1 , 0T(q+1)×1 , 1TK×1 ). Then the estimator is given by θ = arg min N −1 QN (θθ ) + λ θ T Dθθ . (3.5) θ To determine the tuning parameter λ , we can extend the generalized cross-validation (Ruppert, 2002; Qu and Li, 2006; Bai et al., 2009) to the penalized QIF. The generalized cross-validation statistic is defined as GCV(λ ) = N −1 QN (1 − N −1 df)2 with the effective degree of freedom df = tr (Q¨ N +2Nλ D)−1 Q¨ N , where Q¨ N is the second derivative of QN . The desirable choice of tuning parameter λ is which minimize the GCV(λ ). In the implementation of GCV, the desired value of λ can be found using a grid search by predefining a set of values for λ . 3.2.3 Theoretical results To establish the asymptotic properties for the estimators of the index loading parameters and the penalized spline regression coefficients, we assume θ 0 is the parameter satisfying Eθ (gi ) = 0. 0 √ Theorem 3 provides the consistency of the resulting estimators. We show the N-consistency and asymptotic normality of the estimators in Theorem 4. The theoretical results are similar to Theorem 1 and 2 in Chapter 2. 35 First, to handle the constraints β 0 = β 1 =1, and β01 > 0, β11 > 0, we set βl1 = 1 − β l,−1 22 with β l,−1 = (βl2 , ..., βl p )T for l=1, 2. Then the parameters space of β l , l=1,2, becomes [{( 1 − β l,−1 22 , βl2 , ..., βl p )T } : β l,−1 22 < 1]. Let  Jl = ∂β l ∂ β Tl,−1  = β Tl,−1 / −β 1− β l,−1 22 I p−1    β T0,−1 , β T1,−1 )T , and θ ∗ = be the Jacobian matrix of dimension p × (p − 1). Denote β −1 = (β β −1 , γ )T . From θ to θ ∗ , we have Jacobian matrix J = diag(J0 , J1 , Iq+K+1 , Iq+K+1 ). (β Theorem 3 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing parameter λN = o(1), then the estimator θ , which is obtained by minimizing the penalized quadratic function in (3.4), exists and converges to θ 0 in probability. Theorem 4 Suppose the assumptions (A1)-(A6) in the Appendix are satisfied, and the smoothing parameter λN = o(N −1/2 ), then the estimator θ obtained by minimizing the penalized quadratic function in (3.4) is asymptotically normally distributed, i.e., √ d −1 T N(θ − θ 0 ) − → N(0, J(GT0 C−1 0 G0 ) J ), where the detailed calculation of G0 and C0 are given in the Appendix. 3.3 3.3.1 Model selection and hypothesis test Model selection Model selection is important in the spline approximation since too many parameters in the model might result in the overfitting issue. According to the theocratical property of generalized method of moments estimator (Hansen, 1982), under the assumption E(g1 ) = 0 and also the number of 2 in distribution, estimating equations is larger than the number of parameters, we have Q(θ ) → χr−k 36 where r is the dimension of g¯N (θθ ), k is the dimension of θ and θ is the estimator by minimizing the QIF when certain order and number of knots are chosen. This asymptotic property of the QIF provides a goodness-of-fit test, which can be useful to determine the order and number of knots to be selected in our model. However, it is also possible that the goodness-of-fit tests fail to reject several different models which may not be nested. Since Q(θ ) is asymptotically chi-square distributed, it is natural to extend the BIC to the QIF approach, by replacing twice the negative log-likelihood function by the QIF objective function. In particular, the BIC criterion for a model with r estimating equation and k parameters is given as, Q(θ ) + (r − k) ln N, The model with the minimum BIC would be considered the optimal one. If we choose h basis matrices in (3.3), then r − k = hk − k = (h − 1)k. In our simulation and real data application, we search the optimal order and number of knots over a set of combinations of q and K using the BIC criterion. Knots are evenly distributed in the range of the single index β T X. 3.3.2 Nonparametric goodness-of-fit test based on QIF The QIF can also be regarded as an inference function since it has properties similar to the likeψ , ζ ), lihood ratio test. Suppose that the d-dimensional parameter vector γ is partitioned into (ψ where ψ is the parameter of interest with dimension d1 , and ζ is the nuisance parameter with dimension d2 = d − d1 . If we are interested in testing H0 : ψ = ψ 0 , the test statistic ψ 0 , ζ ) − Q(ψ , ζ ) Q(ψ 37 follows an asymptotically chi-square distribution with d1 degrees of freedom. The following theorem introduced by Qu et al. (2000) provided a way to conduct hypothesis testing in the QIF framework. Theorem 5 (Qu et al., 2000) Suppose that all required regularity conditions are satisfied and ψ ψ 0 , ζ ) − Q(ψ , ζ ) is asymptotically chi-square has dimension d1 . Under the null hypothesis, Q(ψ distribution with d1 degrees of freedom, where ψ 0 , ζ ), (ψ , ζ ) = arg min Q(ψ ψ , ζ ). ζ = arg min Q(ψ (3.6) When there is no nuisance parameter, which is a special case of the condition in Theorem 5, Q(γγ 0 ) − Q(γ ) has an asymptotical chi-square distribution with d degree of freedom under the null hypothesis. 3.3.3 Test for linearity of interaction function in gFVICM In our proposed gFVICM model (3.1), it is of interest to test the unspecified coefficient function. In particular, we are interested in testing whether a linear function is good enough to describe the G×E interaction. If the we fail to reject the linearity of the coefficient function, then a parametric linear interaction function should be fitted to further assess if there exists linear G×E interaction; otherwise, we conclude there exists nonlinear G×E interaction. Let u1 = β T1 X. With the truncated power spline basis, the coefficient function can be modeled by q m1 (u1 ) ≈ γ10 + γ11 u1 + γ12 u21 + · · · + γ1q u1 + K+q+1 ∑ q γ1k (u1 − κk )+ . k=q+1 Our goal is to test the linearity of m1 (u1 ), which is equivalent to test H0 : γ12 = · · · = γ1,K+q+1 = 0. β T , γ T )T under the null hypothesis with Let θ be the estimator of the full parameter θ = (β T θ = (β , γ T0 , γ10 , γ11 , 0T )T = 38 arg min γ12 =···=γ1,K+q+1 =0 QN (θθ ), and the estimator of θ under the alternative as θ = arg min QN (θθ ). Then the test statistic TN = QN (θ ) − QN (θ ), asymptotically follows a chi-square distribution with K + q − 1 degrees of freedom, following Theorem 5. 3.4 Simulation study The finite sample performance of the proposed method was evaluated through Monte Carlo simulation studies. We considered the following logistic regression model P(yi j = 1|Xi j , Gi , β ) = exp{η(Xi j , Gi , β )} , 1 + exp{η(Xi j , Gi , β )} where β T0 Xi j ) + m1 (β β T1 Xi j )Gi . η(Xi j , Gi , β ) = m0 (β We simulated a three-dimensional environmental variables X = (X1 , X2 , X3 ). For the ith subject, X1i j , X2i j , X3i j are independently generated from a uniform distribution U(0, 1). We set the minor allele frequency (MAF) as pA = 0.1, 0.3, 0.5 and assumed Hardy-Weinberg equilibrium. We used AA, Aa and aa to denote three different SNP genotypes. These genotypes were simulated from a multinomial distribution with frequencies p2A , 2pA (1 − pA ) and (1 − pA )2 , respectively. Variable G was coded as {0,1,2}, corresponding to genotypes {aa, Aa, AA} respectively. To create correlated responses, we implemented the R package ‘bindata’ developed by Leisch et al. (1998) under an AR(1) correlation structure with correlation parameter ρ=0.5. When implementing the function ‘rmvbin’ to generate the correlated binary data, one should specify the marginal probabilities and the correlation structure. √ √ We set m0 (u0 ) = cos(πu0 ) and m1 (u1 ) = sin[π(u1 − A)/(B − A)] with A = 3/2 − 1.645/ 12 √ √ √ √ √ √ and B = 3/2 + 1.645/ 12. The true parameters were β 0 = ( 5, 4, 4)/ 13 and β 1 = 39 √ (1, 1, 1)/ 3. We drew 500 data sets with sample size N = 200, 500 and time points ni = T = 10, 20, respectively. The basis matrix M2 was set to have 1 on its two subdiagonals and 0 elsewhere. The order and number of knots of the splines were selected through the BIC method. 3.4.1 Performance of estimation Table 3.1 and Table 3.2 summarize the parameters estimation results under different sample sizes and measurement times respectively. In these two tables, the average bias (Bias), the standard deviation of the 500 estimates (SD), the average of the estimated standard error (SE) based on the theoretical results, and the estimated coverage probability (CP) at 95% confidence level are reported. It is shown from each table that, as the sample size increases, the performance of the estimation improves by showing smaller bias, SD and SE. More repeated measurement for each subject also results in improvement in estimations, which can be shown when we compare Table 3.1 and Table 3.2. For example, the CP for β01 improves from 86.8% to 90% when the number of measurement time increases from 10 to 20, under a sample size of 200. The estimation of the loading parameter β 1 improves as MAF pA increases, while the estimation of β 0 show a opposite direction. This is because we have limited data information to estimate the marginal effects m0 (·) when pA increases. Table 3.1 Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size N = 200, 500, T = 10 and correlation ρ=0.5. N Param True 200 β01 0.620 β02 0.555 β03 0.555 β11 0.577 β12 0.577 β13 0.577 500 β01 β02 β03 β11 β12 β13 0.620 0.555 0.555 0.577 0.577 0.577 pA = 0.1 Bias SD -0.008 0.058 -0.002 0.066 -4.1E-05 0.064 -0.013 0.134 -0.024 0.139 -0.013 0.140 0.002 0.003 -0.003 -0.007 -0.003 -0.005 0.038 0.039 0.038 0.078 0.079 0.075 SE 0.057 0.056 0.056 0.091 0.090 0.091 CP 93.6 90.0 91.8 82.2 79.2 82.4 0.038 0.037 0.037 0.065 0.066 0.066 94.8 93.0 93.8 89.6 88.0 90.6 pA = 0.3 Bias SD -0.003 0.074 -0.004 0.080 -0.008 0.074 -0.003 0.092 -0.10 0.096 -0.11 0.095 -0.002 -0.002 -6.3E-04 -0.002 -0.001 -0.004 40 0.043 0.043 0.045 0.055 0.052 0.054 SE 0.064 0.062 0.062 0.072 0.071 0.071 CP 90.6 87.2 88.8 87.2 85.2 86.0 0.043 0.042 0.042 0.049 0.049 0.049 95.4 93.4 93.0 92.0 92.8 92.8 pA = 0.5 Bias SD -0.007 0.084 -0.008 0.093 -0.006 0.094 0.007 0.088 -0.013 0.085 -0.014 0.088 -0.003 -2.1E-04 -0.004 0.002 -0.003 -0.005 0.047 0.052 0.050 0.045 0.049 0.047 SE 0.068 0.065 0.066 0.063 0.062 0.062 CP 86.8 84.5 86.6 84.9 87.2 85.4 0.048 0.046 0.046 0.044 0.043 0.043 95.0 92.4 92.5 94.2 91.4 92.0 Table 3.2 Simulation results under different MAFs pA = 0.1, 0.3, 0.5 with sample size N = 200, 500, T = 20 and correlation ρ=0.5. N Param True 200 β01 0.620 β02 0.555 β03 0.555 β11 0.577 β12 0.577 β13 0.577 500 β01 β02 β03 β11 β12 β13 0.620 0.555 0.555 0.577 0.577 0.577 pA = 0.1 Bias SD -0.002 0.044 -0.004 0.048 2.3E-04 0.048 -0.006 0.074 -0.004 0.075 -0.005 0.072 SE 0.043 0.042 0.042 0.064 0.064 0.064 CP 94.4 90.2 91.2 90.0 91.4 91.8 pA = 0.3 Bias SD SE -0.002 0.053 0.049 -0.003 0.056 0.048 -0.002 0.057 0.048 0.008 0.062 0.054 -0.009 0.063 0.053 -0.009 0.063 0.054 CP 94.2 90.2 90.2 91.8 89.8 90.0 -0.002 -3.2E04 3.5E04 -0.005 -0.002 0.003 0.028 0.027 0.027 0.044 0.045 0.042 97.2 94.8 93.2 92.2 91.0 96.4 -0.002 0.001 -0.002 0.004 -0.003 -0.005 95.0 94.0 92.8 92.4 93.0 93.4 0.028 0.028 0.029 0.044 0.045 0.042 0.033 0.033 0.034 0.039 0.037 0.038 0.033 0.032 0.032 0.036 0.036 0.036 pA = 0.5 Bias SD -0.011 0.062 -0.003 0.065 0.005 0.063 0.004 0.063 -0.009 0.064 -0.006 0.064 -0.004 7.9E-05 6.3E-04 0.006 -0.004 -0.005 0.037 0.038 0.037 0.038 0.036 0.035 SE 0.052 0.050 0.050 0.050 0.050 0.050 CP 90.0 88.3 87.4 88.2 87.1 86.8 0.036 0.035 0.035 0.035 0.035 0.035 93.2 92.8 94.2 93.0 94.2 93.6 The plots for the estimations of m0 (u0 ) and m1 (u1 ) under different sample sizes and number of replications are shown in Figure 3.1–3.4. The estimated and true functions are denoted by the solid and dashed lines, respectively. The 95% confidence bands are denoted by the dotted-dash line. The estimated curves almost overlap with the corresponding true curves as shown in the plots, indicating the estimation accuracy of the method. Also the confidence bands are tight, especially for large sample size and large number of measurement times. Note that the estimation for the interaction effects m1 (u1 ) improves as MAF pA increases, while the estimation for the marginal effects m0 (u0 ) show a opposite direction, which coincides with the results for the parametric estimation in Table 3.1 and Table 3.2. 41 0.6 0.8 1.0 1.2 1.4 2.0 1.5 m0(u0) −1.5 −0.5 0.0 0.5 1.0 1.5 −1.5 −0.5 0.0 m0(u0) 0.5 1.0 1.5 1.0 0.5 0.4 0.6 0.8 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 N=500, T=10, pA=0.1 N=500, T=10, pA=0.3 N=500, T=10, pA=0.5 0.8 u0 1.0 1.2 1.4 1.5 m0(u0) −0.5 0.0 −1.5 −1.5 0.6 0.5 1.0 1.5 m0(u0) 0.5 1.0 1.5 1.0 0.5 −0.5 0.0 −1.5 0.4 1.4 2.0 u0 2.0 u0 2.0 u0 −0.5 0.0 m0(u0) −0.5 0.0 −1.5 0.4 m0(u0) N=200, T=10, pA=0.5 2.0 N=200, T=10, pA=0.3 2.0 N=200, T=10, pA=0.1 0.4 0.6 0.8 u0 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 1.4 u0 Figure 3.1 The estimation of function m0 (·) when N=200, 500 and T =10. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. 42 0.6 0.8 1.0 1.2 1.4 2.0 1.5 m0(u0) −1.5 −0.5 0.0 0.5 1.0 1.5 −1.5 −0.5 0.0 m0(u0) 0.5 1.0 1.5 1.0 0.5 0.4 0.6 0.8 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 N=500, T=20, pA=0.1 N=500, T=20, pA=0.3 N=500, T=20, pA=0.5 0.8 u0 1.0 1.2 1.4 1.5 m0(u0) −0.5 0.0 −1.5 −1.5 0.6 0.5 1.0 1.5 m0(u0) 0.5 1.0 1.5 1.0 0.5 −0.5 0.0 −1.5 0.4 1.4 2.0 u0 2.0 u0 2.0 u0 −0.5 0.0 m0(u0) −0.5 0.0 −1.5 0.4 m0(u0) N=200, T=20, pA=0.5 2.0 N=200, T=20, pA=0.3 2.0 N=200, T=20, pA=0.1 0.4 0.6 0.8 u0 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 1.4 u0 Figure 3.2 The estimation of function m0 (·) when N=200, 500 and T =20. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. 43 0.6 0.8 1.0 1.2 1.4 1.5 1.0 −1.5 −1.0 −0.5 0.0 m1(u1) 0.5 1.0 −1.5 −1.0 −0.5 0.0 m1(u1) 0.5 1.0 0.5 0.0 0.4 0.6 0.8 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 N=500, T=10, pA=0.1 N=500, T=10, pA=0.3 N=500, T=10, pA=0.5 0.8 u1 1.0 1.2 1.4 1.0 −0.5 −1.5 −1.0 −0.5 −1.0 −1.5 0.6 0.0 m1(u1) 0.5 1.0 m1(u1) 0.5 1.0 0.5 0.0 −0.5 −1.0 −1.5 0.4 1.4 1.5 u1 1.5 u1 1.5 u1 0.0 m1(u1) −0.5 −1.0 −1.5 0.4 m1(u1) N=200, T=10, pA=0.5 1.5 N=200, T=10, pA=0.3 1.5 N=200, T=10, pA=0.1 0.4 0.6 0.8 u1 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 1.4 u1 Figure 3.3 The estimation of function m1 (·) when N=200, 500 and T =10. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. 44 0.6 0.8 1.0 1.2 1.5 1.0 −1.5 −1.0 −0.5 0.0 m1(u1) 0.5 1.0 −1.5 −1.0 −0.5 0.0 m1(u1) 0.5 1.0 0.5 0.0 1.4 0.4 0.6 0.8 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 N=500, T=20, pA=0.1 N=500, T=20, pA=0.3 N=500, T=20, pA=0.5 0.8 1.0 1.2 1.4 1.0 −0.5 −1.5 −1.0 −0.5 −1.0 −1.5 0.6 0.0 m1(u1) 0.5 1.0 m1(u1) 0.5 1.0 0.5 0.0 −0.5 −1.0 −1.5 0.4 1.4 1.5 u1 1.5 u1 1.5 u1 0.0 m1(u1) −0.5 −1.0 −1.5 0.4 m1(u1) N=200, T=20, pA=0.5 1.5 N=200, T=20, pA=0.3 1.5 N=200, T=20, pA=0.1 0.4 u1 0.6 0.8 u1 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 1.4 u1 Figure 3.4 The estimation of function m1 (·) when N=200, 500 and T =20. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash lines. 3.4.2 Performance of hypothesis tests We evaluated the performance of the test for the nonparametric function under the null hypothesis H0 : m1 (·) = m01 (·), where m01 (u1 ) = δ0 + δ1 u1 , δ0 and δ1 are some constants, which corresponds to a linear G×E interaction. Power is evaluated under a sequence of alternative models with different values of τ, which is denoted by H1τ : mτ1 (·) = m01 (·) + τ{m1 (·) − m01 (·)}. When τ = 0, the corresponding power is the false positive rate. Figure 3.5 shows the size (when τ = 0) and power (when τ > 0) at significance level 0.05 based on 500 Monte Carlo simulations for N=200, 500 under different measurement times T =10 (left panel) and T =20 (right panel). The empirical Type I error is large when N = 200, which decreases dramatically when the sample size increases to N=500. The power increases when the 45 sample size increases from 200 to 500. The results indicate that our method can reasonably control the false positive rates and has appropriate power to detect the linearity function under a relatively large sample size. Comparing the results for T =10 and T =20, we can see that the testing power improves when the number of measurement time increases. Power plot when N=200, 500, T=10 Power plot when N=200, 500, T=20 0.85 0.85 0.70 Power Power 0.70 0.50 0.50 0.30 0.30 sample_size sample_size 0.10 200 500 0.10 0.0 0.1 0.2 τ 0.3 0.4 0.5 200 500 0.0 0.1 0.2 τ 0.3 0.4 0.5 Figure 3.5 The empirical size and power of testing the linearity of nonparametric function m1 when N=200, 500 and T =10, 20. To assess how the values of MAF affect the testing performance, we plot the power plot under different MAFs pA =0.1, 0.3, 0.5 when N=500, T =10, which is shown in Figure 3.6. Note that the power increases dramatically when MAF increases from 0.1 to 0.3. The values of power are very close when pA =0.3 and 0.5. 46 N=500, T=10 Power 0.7 0.5 0.3 pA 0.1 0.3 0.5 0.1 0.0 0.1 0.2 τ 0.3 0.4 0.5 Figure 3.6 The empirical size and power of testing the linearity of nonparametric function m1 under different MAFs when N=500 and T =10. 3.5 Real data application We applied the proposed gFVICM model to a data set from a study examining the association of the A118G SNP of OPRM1 to experimental pain sensitivity. A sample of 163 healthy volunteers were evolved in this study. For each volunteer, Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP) were measured at 6 Dobutamine dosage levels: 0 (baseline), 5, 10, 20, 30 and 40mcg/min. Clinically, a person is said to be hypertensive if the individuals SBP is greater than 140mm Hg or DBP is greater than 90mm Hg (Choi et al. 2014). Thus, the response variable Y is a binary variable indicating whether a person has hypertension or not, i.e. Y = 1 for hypertension and Y = 0 for non-hypertension. One longitudinal covariate X1 = dosage, two time-invariant covariates X2 = age and X3 = BMI are included as the environmental factors in the model. The genetic variables are five SNPs located at codon16, codon27, codon49, codon389, and codon492 in the gene. Our purpose is to evaluate how the mixture of age, BMI and dosage modifies the SNP effect on the risk of hypertension. In particular, we test the hypothesis H0 : m1 (u1 ) = δ0 + δ1 u1 with p-value denoted by pm1 in Ta- 47 ble 2.7. We also reported the p-values for testing the significance of three components of index loading coefficients β 1 = (β11 , β12 , β13 ), which are labeled by pβ , pβ and pβ , based on 11 12 13 the asymptotic property of the estimations. We also compared our proposed model to a gener∗ (X ) + β ∗ X + β ∗ X + alized additive varying-coefficient model (gAVCM) E(Y |X, G) = η β01 1 03 3 02 2 ∗ (X ) + β ∗ X + β ∗ X G , where β ∗ (·) and β ∗ (·) are unknown functions of X . To see the β11 1 1 12 2 13 3 01 11 relative gain by integrative analysis, we calculated the objective function QN in both models. The ∗ (·) = β ∗ = β ∗ = 0 for gAVCM are also reported in the tables and p-values for testing H0 : β11 13 12 denoted by pgAVCM . In Table 3.3, pm1 for all the 5 SNPs are smaller than the significance level 0.05, which means the functions capturing the G×E interactions are nonlinear for all these 5 SNPs. The objective function QN in the last two columns shows that gFVICM fits the data better than gAVCM, indicating the benefit of integrative analysis. Besides, the testing results for gAVCM do not show significance of the coefficients for interactions. The results imply that the genetic effects of SNPs are modified by the mixture of environmental variables, rather than separately. Figure 3.7 exhibits the fitted nonlinear curves indicating G×E interactions for each SNP, along with the 95% confidence bands. Table 3.3 List of SNPs with MAF, allele, p-values under different hypothesis testing and values of objective function QN . SNP ID codon16 codon27 codon49 codon389 codon492 MAF Alleles 0.3990 0.4160 0.1387 0.3045 0.4250 A/G G/C G/A G/C T/C p m1 pβ 11 p-value pβ 12 pβ 13 pgAVCM <1.0E-04 <1.0E-04 0.0207 0.3475 0.2960 <1.0E-04 0.2329 <1.0E-04 0.0014 0.6982 <1.0E-04 <1.0E-04 <1.0E-04 0.6325 0.1777 <1.0E-04 <1.0E-04 <1.0E-04 0.3329 0.8436 <1.0E-04 0.6731 <1.0E-04 0.0008 0.5001 QN gFVICM gAVCM 3.9240 6.9502 6.6303 3.3678 6.0766 11.2082 9.3500 12.2648 10.5593 7.4877 Table 3.4 displays the estimated odds for different genotypes at different dosage levels. The changes in the values of odds demonstrate the interaction between SNP and environmental mixtures at different dosage levels. For example, we noted that the odds for genotype AA in SNP codon16 does not change too much as the dosage level increases, which means that the genetic 48 codon16 1.0 m1(u1) 0.5 0.0 −0.5 0.00 0.25 0.50 u1 0.75 1.00 codon27 codon49 0 m1(u1) m1(u1) 5.0 2.5 −2 −4 0.0 0.00 0.25 0.50 u1 0.75 1.00 0.00 0.25 codon389 0.50 u1 0.75 1.00 codon492 0.5 5.0 m1(u1) m1(u1) 0.0 −0.5 2.5 −1.0 0.0 −1.5 0.25 0.50 u1 0.75 0.25 0.50 u1 0.75 Figure 3.7 Plot of the estimate (solid curve) of the nonparametric function m1 (u1 ) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed lines. 49 effect of this genotype remains the same when subjects are exposed to different environments at different doses. While for the other two genotypes, there is an increase in the value of odds until dosage level four, indicating increased blood pressure as dosage level increases from 0mcg/min to 20mcg/min. Table 3.4 Estimated odds for different genotypes at each dosage levels. SNP ID Genotyoe 1 codon16 AA 0.92 GG 1.22 GA 1.35 codon27 GG 1.25 CC 1.06 CG 1.08 codon49 GG 4.19 AA 1.13 GA 1.12 codon389 GG 1.78 CC 1.41 CG 0.91 codon492 TT 1.11 CC 1.12 CT 1.03 3.6 Dosage level 2 3 4 0.92 0.92 0.92 1.89 2.63 3.66 1.66 1.94 2.28 2.07 2.75 2.96 1.75 2.34 2.58 1.63 2.07 2.21 3.56 3.05 2.28 1.50 1.79 1.98 2.17 3.41 4.66 1.78 1.78 1.77 1.87 2.21 2.35 1.57 2.14 2.33 1.99 2.69 2.77 1.97 2.61 2.63 1.67 2.12 2.12 5 0.92 3.17 2.14 2.32 2.06 1.81 1.76 1.79 3.88 1.77 2.09 1.86 2.20 2.05 1.70 6 0.93 1.73 1.63 1.96 1.73 1.56 1.38 1.52 2.80 1.77 1.91 1.65 2.33 2.14 1.78 Discussion In this paper, we proposed a generalized varying index coefficient modeling procedure to assess the interaction effect of multiple environmental factors as a whole with a genetic factor. The model was motivated by empirical evidence and was developed under an longitudinal design with a binary disease response. We developed a profile estimation procedure to estimate the index coefficients and nonparametric interaction functions iteratively. The estimation was conducted under the QIF framework. To estimate the nonparametric functions, we first approximated the function using truncated power spline basis, then estimated the spline coefficients based on QIF. Furthermore, we proposed a nonparametric hypothesis test to assess the linearity of the nonparametric interaction function. Simulation study has been conducted to illustrate the estimation and testing procedures 50 to evaluate the finite sample performance. The results indicate reasonable estimation performance of the method under different sample sizes and measurement times. Our method was proposed to evaluate the joint interaction effect between multiple environmental variables as a whole with genetic variables. Compared to the generalized additive varying coefficient model (gAVCM), which considers the G×E effect for each single environmental factor separately, our model presents two advantages: 1) it is biologically more attractive if there are synergistic effects between multiple exposures; and 2) it can potentially increase the testing power for detecting interactions since it can reduce multiple testing burden by treating multiple exposures as a single index variable. Although our method was motivated by a genetic association study, the developed model and inference procedures can be applied to other disciplines with the purpose to model the synergistic effect of multiple variables as a whole. We applied our method to a real data set from a pain sensitivity study. Testing results indicate that all of the five SNPs are nonlinear moderated, by the synergistic effect of the three variables with dosage as a “time"-varying variable, to affect the risk of high blood pressure. These five SNPs were genotyped from a candidate gene which has been shown to be related to blood pressure changes (Johnson and Terra, 2002). Although the purpose of the data was not generated to evaluate the genetic effect on “hypertension", we simply applied the method to this data set to demonstrate the utility of the method. The estimated odds of different genotypes for a particular SNP at different dosage levels does give insights into the effect of the SNPs nonlinearly modulated by dosages. Of particular interest is SNP condon49 in which individuals carrying genotype GG show a decreasing risk of developing high blood pressure as the dosage level increases, indicating a protective effect of this genotype. For the same SNP, individuals carrying genotype GA show a different pattern of developing high blood pressure as the dosage level increases. Such a dynamic change of genetic effect over different dosage levels cannot be revealed by a cross-sectional study, indicating the relative merit of a longitudinal design. 51 CHAPTER 4 DETECTING GENETIC ASSOCIATIONS WITH MULTIVARIATE PARTIALLY LINEAR VARYING-COEFFICIENTS MODELS FOR MULTIPLE LONGITUDINAL TRAITS 4.1 Introduction Cross-sectional disease traits have been the primary focus in genetic association studies. Given the improved power to identify disease genes with phenotypic data measured over time, longitudinal designs are becoming popular in genetic association studies (Sitlani et al. 2015; Macgregor et al. 2005; Furlotte et al. 2012; Xu et al. 2014). Most statistical methods developed so far focus on a single outcome of interest. When multiple outcomes are measured over time, for example, multiple measures of heart function in a longitudinal study of cardiac function, methods focusing on just a single outcome over time may not provide a complete picture of cardiac function. In genetics, the phenomenon that a single gene or locus influences more than one trait is known as pleiotropy (Wang et al., 2014; Gratten and Visscher, 2016). Genetic pleiotropy plays a crucial role in many complex diseases. One of the most well-known examples is the phenylketonuria (PKU) disease (Lobo, 2008). The conventional approach to identify genetic pleiotropic effects on multiple traits is to test the association between a gene and each trait individually and then determine whether the genetic effect is significantly associated with more than one trait. The disadvantages of this approach, such as the inflation in the family-wise Type I error and incomplete information in individual tests compared to a combined analysis for multiple traits, have been discussed in some studies (e.g. Wang et al., 2014). Therefore, a joint genetic association test on multiple traits is more desirable to control the family-wise Type I error and enhance the power of tests. In real life, timing is a very important factor in the development of a disease. Genetic effects on a disease trait vary during the life span of an individual. The function of a gene depends largely on when it turns on and off, which could show a temporal pattern. In order to capture the dynamic 52 effect of a gene on a disease trait over time, it is natural to model the dynamic effect as a potential nonlinear function over time. Considering multiple longitudinal traits, we proposed the following partially linear varying coefficients model, Yli j = Yli (ti j ) = β0l (ti j ) + β1l (ti j )Gi + α l Zi j + εli j , (4.1) where Yli j is the response variable which measures the l-th phenotype on the i-th subject at the j-th time point; Zi j is a p-dimensional vector of covariates, which can be either dependent or independent of time; Gi denotes the time-invariant genetic variable within subject; β0l (·) and β1l (·) are unknown functions; and εli j is an error term which is assumed to following the following joint distribution,    εi =     ε 1i .. .    ∼ N 0, Σ   ε Li with Σ be some covariance structure. If we use a time-varying environmental factor Xi j instead of ti j in the model, i.e. Yli j = β0l (Xi j ) + β1l (Xi j )Gi + α l Zi j + εli j , then the model can be used for jointly modeling dynamic G×E interactions for multiple longitudinal traits. Models for multivariate longitudinal traits are necessarily complex, because they must consider different types of correlations for each independent subject: correlation between measurements for the same trait at different time points, correlation between measurements at the same time point on different traits, and correlation between measurements at different time points and on different traits. Qu and Li (2006) applied the method of quadratic inference functions (QIF) to the varying coefficients models for longitudinal data. One important advantage is that the QIF method only requires correct specification of the mean structure and does not require any likelihood or approximation of the likelihood in hypothesis testing. In addition, when the working correlation structure 53 is misspecified, the QIF is more efficient than the generalized estimation equation (GEE) approach. Another advantage of QIF approach is that the inference function has an asymptotic form, which provides a model selection criteria similar to AIC and BIC. It also allows us to test whether coefficients are significantly time-varying based on the asymptotic results. The purpose of this paper is to develop a set of hypothesis testing procedure, including joint testing for multiple traits and marginal testing for each individual trait, for model (4.1). We first use penalized splines (Ruppert and Carroll, 2000) to approximate the nonparametric functions in the model. Then we develop a 2-step testing procedure to first jointly test the genetic effect on multiple traits and then separately test marginal genetic effect on each trait based on the QIF approach. Estimation of the parametric coefficients and nonparametric spline coefficients are obtained under the QIF framework. This chapter is organized as follows: we state our proposed model in Section 4.2.1, and define the objective function in a QIF method in Section 4.2.2. Estimation procedure and asymptotical properties of estimators are provided in Section 4.2.3. A model selection criteria using BIC is provided in Section 4.2.4. A theorem for goodness-of-fit test in QIF approach is established in Section 4.2.5 and we propose a 2-step testing procedure based on that in Section 4.2.6. We assess the finite sample performance of the proposed procedure with Monte Carlo simulation in Section 4.3 and illustrate the proposed methodology by the analysis of a pain sensitivity data set in Section 4.4. Conclusions and discussion are made in Section 4.5. Proofs are included in Appendix. 4.2 4.2.1 Joint models and statistical methods Joint multivariate models In multivariate longitudinal studies, suppose yli j is the l-th continuous outcome collected on the i-th observation at the time point ti j , where l = 1, . . . , L, i = 1, . . . , N, j = 1, . . . , ni . The joint 54 partially linear varying coefficient models are defined as yli j = yli (ti j ) = β0l (ti j ) + β1l (ti j )Gi + α l Z(ti j ) + εli j , where Gi is the SNP variable which is not depend on time and type of measurement, Z (ti j ) is the p-dimensional covariate vector, which can be either time-dependent or time-independent. εli j is an error term and    εi =     ε 1i .. .    ∼ N 0, Σ   ε Li with Σ be some covariance structure. β0l (·) and β1l (·) are unknown smooth nonparametric functions. To illustrate the idea, in the following we demonstrate the methods assuming L=2. For the situation where there are more than two traits (L > 2), the technique can be easily extended. For the case when L=2, the joint models can be written as y1i j = y1i (ti j ) = β01 (ti j ) + β11 (ti j )Gi + α 1 Z(ti j ) + ε1i j , y2i j = y2i (ti j ) = β02 (ti j ) + β12 (ti j )Gi + α 2 Z(ti j ) + ε2i j , where      σ12 Σ 11  ρ12 σ1 σ2 Σ 12   ε 1i   0   εi =   ∼ N   ,   ε 2i 0 ρ12 σ1 σ2 Σ 21 σ22 Σ 22 4.2.2 Objective function based on QIF To construct the objective function using the QIF approach, we first approximate the unknown functions β01 , β11 , β02 and β12 by a q-degree truncated power spline basis, i.e. βsl (t) ≈ Bsl (t)T γ sl , for s = 0, 1 and l = 1, 2, q (4.2) q where Bsl (t) = (1,t,t 2 , ...,t qsl , (t − κ1 )+sl , ..., (t − κKsl )+sl )T is a truncated power spline basis with degree qsl and Ksl knots κ1 , ..., κKsl . γ sl is a (qsl + Ksl + 1)-dimensional vector of spline coefficients. 55 In a GEE approach we solve N ∑ µ˙ Ti V−1 i (yi − µ i ) = 0, (4.3) i=1 where yi = (yT1i , yT2i )T , yli = (yli1 , ..., ylini )T ; µ i = E(yi ) is the mean function and µ˙ i is the first derivative of µ i with respect to the parameters; Vi is the covariance matrix of yi and can be de1/2 composed as Vi = Ai 1/2 ρ )Ai R(ρ ρ) with Ai being a diagonal matrix of marginal variances and R(ρ being a correlation matrix with nuisance parameters ρ . QIF approach considers the inverse of the correlation matrix R as a linear combination of several known basis matrices in a form R−1 ≈ a1 M1 + a2 M2 + ... + ah Mh , (4.4) where M1 is the identity matrix and M2 , ..., Mh are symmetric basis matrixes. For exchangeable working correlation, M2 has 0 on the diagonal and 1 elsewhere. If the working correlation is AR(1), we can set M2 to have 1 on its two subdiagonals and 0 elsewhere. Plugging the expression of R−1 (4.4) into the GEE stated in (4.3), we define the estimation function as  −1/2 −1/2 µ˙ Ti Ai M1 Ai (yi − µ i ) ∑N i=1  1 1 N .. g¯N (θθ ) = ∑ gi (θθ ) =  . N i=1 N  ˙ T −1/2 Mh A−1/2 (yi − µ i ) ∑N i i=1 µ i Ai       Using the spline approximation, the mean function µ i can be written as    T T  µ1i1 (θθ )   B01 (ti1 )γγ 01 + B11 (ti1 )γγ 11 Gi + α 1 Z(ti1 )    .. ..    . .            µ1ini (θθ )   BT01 (tini )γγ 01 + BT11 (tini )γγ 11 Gi + α 1 Z(tini )  µ 1i (θθ )   = µ i (θθ ) =  =    T T    µ 2i (θθ )  µ2i1 (θθ )   B02 (ti1 )γγ 02 + B12 (ti1 )γγ 12 Gi + α 2 Z(ti1 )    .. ..    . .       µ2ini (θθ ) BT02 (tini )γγ 02 + BT12 (tini )γγ 12 Gi + α 2 Z(tini ) 56 (4.5)         ,        and the first derivative of µ i is  T T 0 0 0  B01 (ti1 ) B11 (ti1 )Gi Z(ti1 )  .. .. .. .. .. ..  . . . . . .    T 0 0 0  B01 (tini ) BT11 (tini )Gi Z(tini ) µ˙ i =    0 0 0 BT02 (ti1 ) BT12 (ti1 )Gi Z(ti1 )   .. .. .. .. .. ..  . . . . . .   0 0 0 BT02 (tini ) BT12 (tini )Gi Z(tini )         ,        where θ = (γγ T01 , γ T11 , α T1 , γ T02 , γ T12 , α T2 )T . Setting each component in (4.5) to be zero will result in more equations than unknown parameters. Following the idea of generalized method of moments (Hansen, 1982), the QIF is defined as −1 g¯ , QN (θθ ) = N g¯TN C¯N N (4.6) T where C¯N = N1 ∑N i=1 gi gi is a consistent estimator for var(gi ). Minimizing the objective function (4.6) provides the estimations of parameters. 4.2.3 Estimation The estimation of the parameters can be obtained through minimizing the objective function, i.e. θ = arg min QN (θθ ). θ To avoid over-fitting, we can define a penalized QIF in a form N −1 QN (θθ ) + λ θ T Dθθ , (4.7) where D is a diagonal matrix with 1 if the corresponding parameter is the spline coefficient associated with knots and 0 otherwise. Minimizing the penalized QIF provides θ = arg min(N −1 QN (θθ ) + λ θ T Dθθ ). θ 57 (4.8) To estimate the tuning parameter λ , we can extend the generalized cross-validation (Ruppert, 2002; Qu and Li, 2006; Bai et al., 2009) to the penalized QIF and define the generalized cross-validation statistic as GCV(λ ) = N −1 QN (1 − N −1 d f )2 with the effective degree of freedom df = tr (Q¨ N + 2Nλ D)−1 Q¨ N , where Q¨ N is the second derivative of QN . The optimized tuning parameter λ is given as λ = arg min GCV(λ ). λ To establish the asymptotic properties for the penalized quadratic inference function estimators with fixed knots, we assume θ 0 is the parameter satisfying Eθ (gi ) = 0. Similar theoretical results 0 are provided in Qu and Li (2006). Following their idea and extend those results to the estimators √ in our model, we get the strong consistency of the resulting estimators in Theorem 6. The Nconsistency and asymptotic normality of the estimators are given in Theorem 7 . Theorem 6 Suppose conditions (B1)-(B6) in the Appendix hold and the smoothing parameter λN = o(1), then the estimator θ , which is obtained by minimizing the penalized quadratic function in (4.7), exists and converges to θ 0 almost surely. Theorem 7 Suppose conditions (B1)-(B6) in the Appendix hold and the smoothing parameter λN = o(N −1/2 ), then the estimator θ obtained by minimizing the penalized quadratic function in (4.7) is asymptotically normally distributed with the limiting distribution, √ d −1 N(θ − θ 0 ) − → N(0, (GT0 C−1 0 G0 ) ), where the calculation of G0 and C0 can be found in Appendix. 58 4.2.4 Model selection In contrast to the complicated model selection in GEE method due to the lack of an objective function in its estimation procedure, it is natural to extend the BIC method to the QIF approach since the QIF objective function and twice the negative log-likelihood function have similar asymptotic properties. Under the assumption E(g) = 0 and the number of estimating equations is larger than d 2 (Hansen, 1982), where r is the dimension of the number of parameters, we have Q(θ ) − → χr−k g¯N (θθ ), k is the dimension of θ , θ is the estimator by minimizing the QIF when certain order and number of knots are chosen. Based on the asymptotic property of QIF, the BIC criterion for a model with r estimating equations and k parameters is Q(θ ) + (r − k) ln N. The model with minimum BIC would be considered optimal. 4.2.5 Nonparametric goodness-of-fit test Compared to GEE, an advantage of QIF approach is that QIF provides a goodness-of-fit test without estimations for second moment parameters. In Model (4.1), it is of interest to test whether the spline approximations for the varying coefficient functions are appropriate. Qu et al. (2000) constructed a test statistic based on QIF. Suppose that the d-dimension paramψ , ζ ), where ψ is the parameter of interest with dimension d1 , eter vector γ is partitioned into (ψ and ζ is a nuisance parameter with dimension d2 = d − d1 . If we are interested in testing H0 : ψ = ψ 0 , the test statistic ψ 0 , ζ ) − Q(ψ , ζ ) Q(ψ follows an asymptotically chi-square distribution with d1 degrees of freedom. 59 Theorem 8 (Qu et al., 2000) Suppose that all required regularity conditions are satisfied and ψ ψ 0 , ζ ) − QN (ψ , ζ ) is asymptotically chi-square has dimension d1 . Under the null hypothesis, QN (ψ distribution with d1 degrees of freedom, where ψ 0 , ζ ), (ψ , ζ ) = arg min QN (ψ ψ , ζ ). ζ = arg min QN (ψ 4.2.6 (4.9) Two-step hypothesis testing procedure In Model (4.1), it is of interest to test whether the genetic effects on multiple traits are significant or not. Based on Theorem 8, we develop a 2-step testing procedure for testing the significance of the varying coefficient functions. In the first step, the joint test is performed to see whether a genetic factor has significant effect on at least one longitudinal trait. If the testing result in the first step is significant, we will further conduct the marginal tests in the second step to assess if the genetic effect is significant on both traits or just one trait. So the first step is a joint test of significance followed by a marginal test to assess individual significance. 4.2.6.1 Step 1: Joint test First, we are interested in testing whether the genetic factor G has effect on at least one longitudinal trait. The hypothesis is stated as H0 : β11 (·) = β12 (·) = 0 v.s. H1 : β11 (·) = 0 or β12 (·) = 0. This can be handled through the truncated power spline approximation of the nonparametric functions stated in (4.2). In particular, test this hypothesis is equivalent to test the following null hypothesis H0 : γ 11 = γ 12 = 0. According to Theorem 8, we can construct a test statistic TN = QN (θ ) − QN (θ ), 60 where θ = arg min QN γ 01 , γ 11 , α 1 , γ 02 , γ 12 , α 2 | y1 , y2 , G, Z , γ 11 =γγ 12 =0 and θ = arg min QN γ 01 , γ 11 , α 1 , γ 02 , γ 12 , α 2 | y1 , y2 , G, Z . The test statistic TN has an asymptotical χ 2 distribution with degree of freedom equals the number of constraints under H0 , according to Theorem 8. 4.2.6.2 Step 2: Marginal tests From the joint test, if there exist a significant genetic effect on at least one longitudinal trait, then we can further test the marginal effects: H0 : β1l (·) = 0 v.s. H1 : β1l (·) = 0, l = 1, 2. Based on (4.2), this is equivalent to test the following two hypotheses H0 : γ 11 = 0 and H0 : γ 12 = 0 separately. For testing H0 : γ 11 = 0, we use test statistic TN1 = QN (γ 01 , 0, α 1 ) − QN (γ 01 , γ 11 , α 1 ), where (γ 01 , 0, α 1 ) = arg min QN γ 01 , γ 11 , α 1 | y1 , G, Z , γ 11 =0 and (γ 01 , γ 11 , α 1 ) = arg min QN γ 01 , γ 11 , α 1 | y1 , G, Z . 61 We can also construct another test statistic TN2 = QN (γ 02 , 0, α 2 ) − QN (γ 02 , γ 12 , α 2 ) for testing H0 : γ 12 = 0, where (γ 02 , 0, α 2 ) = arg min QN γ 02 , γ 12 , α 2 | y2 , G, Z , γ 12 =0 and (γ 02 , γ 12 , α 2 ) = arg min QN γ 01 , γ 12 , α 2 | y2 , G, Z . The asymptotical distribution of test statistics TN1 and TN2 can be obtained from Theorem 8. 4.3 4.3.1 Simulation study Simulation setup In this section, the finite sample performance of the proposed method is evaluated through Monte Carlo simulation studies. The two continuous variables are generated from the models y1i (ti j ) = β01 (ti j ) + β11 (ti j )Gi + α1 z(ti j ) + ε1i j , y2i (ti j ) = β02 (ti j ) + β12 (ti j )Gi + α2 z(ti j ) + ε2i j , where β01 (ti j ) = 0.5 cos(2πti j ), β11 = sin(π(ti j − 0.2)), β02 (ti j ) = sin(πti j ) − 0.5, β12 (ti j ) = cos(πti j − 0.8), α1 = 0.2 and α2 = 0.3. We generate T time points ti = (ti1 , . . . ,tiT ) from a uniform distribution U(0, 1). The predictor variable z(ti j ) is also generated from U(0, 1). We set the minor allele frequency (MAF) as pA = 0.5 and assume Hardy-Weinberg equilibrium. Three different SNP genotypes AA, Aa and aa are simulated from a multinomial distribution with frequencies p2A , 2pA (1 − pA ) and (1 − pA )2 , respectively. Variable G takes value in the set {0,1,2}, corresponding to genotypes {aa, Aa, AA}. We assume ε1i j and ε2i j are jointly normally distributed with the correlation corr(ε1i j , ε2i j ) = 0.5. Then we generate the error terms from a multivariate 62 normal distribution       σ12 Σ11   0.5σ1 σ2 Σ12    ε 1i    0     ∼ N   ,    ε 2i 0 0.5σ1 σ2 Σ12 σ22 Σ22 We set the marginal variances σ12 = σ22 = 0.1. The true correlation structure of Σ11 , Σ22 , and Σ12 are all exchangeable with ρ1 = 0.5, ρ2 = 0.5 and ρ12 = 0.2, respectively. We draw 1000 data sets with sample size N = 200, 500 and time points ni = T = 10, in order to compare the performances of our proposed method under different sample sizes. We set M1 to be the identity matrix and M2 to has 1 on subdiagonals and 0 elsewhere. The order and number of knots of the splines are chosen by the BIC method. 4.3.2 Performance of estimation Table 4.1 summarizes the parameter estimation for unknown coefficients. In this table, the average bias (Bias), the standard deviation of the 1000 estimates (SD), the average of the estimated standard error (SE) based on the theoretical results, and the estimated coverage probability (CP) at 95% confidence level are reported. In general, the biases for all parameter estimations are very small, the coverage probabilities are very close to the confidence level 95%, which indicate good performance of our proposed estimation procedure. As the sample size increases, the performance of the estimation improves by showing smaller bias, SD and SE. Table 4.1 Estimation results for parameters α1 and α2 with sample size N = 200, 500. N Parameter True 200 α1 0.2 α2 0.3 500 α1 α2 0.2 0.3 Bias SD SE CP 0.0004 0.018 0.018 94.6 0.0006 0.018 0.018 94.2 -5.2E-05 0.012 0.012 95.2 -0.0003 0.012 0.012 94.8 The plots for the estimations of nonparametric functions β01 (·) and β11 (·) under different sample sizes are shown in Figure 4.1. The estimated and true functions are denoted by the solid and dashed lines, respectively. The 95% confidence bands are denoted by the dotted-dash line. The 63 estimated curves almost overlap with the corresponding true curves as shown in the plots, indicating good estimate of the function. Larger sample size leads to tighter confidence bands. Figure 4.2 displays the estimations for functions β02 (·) and β12 (·), which are included in the model corresponding to response variable y2 . The results are similar to those in Figure 4.1 and further demonstrate the good performance of our estimation methods. N=200 0.3 0.25 β01(t) β01(t) N=500 0.50 0.0 0.00 −0.25 −0.3 −0.50 −0.6 0.25 0.50 t 0.75 0.25 0.75 N=500 1.0 1.0 0.5 0.5 β11(t) β11(t) N=200 0.50 t 0.0 0.0 0.25 0.50 t 0.75 0.25 0.50 t 0.75 Figure 4.1 The estimation of nonparametric functions β01 (·) and β11 (·) when N=200, 500. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash line. 64 N=200 N=500 0.4 0.4 0.2 0.2 β02(t) β02(t) 0.6 0.0 0.0 −0.2 −0.2 0.25 0.50 t 0.75 0.25 N=200 0.50 t 0.75 N=500 0.5 β12(t) 0.5 β12(t) 1.0 1.0 0.0 0.0 −0.5 −0.5 0.25 0.50 t 0.75 0.25 0.50 t 0.75 Figure 4.2 The estimation of nonparametric functions β02 (·) and β12 (·) when N=200, 500. The estimated and true functions are denoted by the solid and dashed lines respectively. The 95% confidence bands are denoted by the dotted-dash line. 4.3.3 4.3.3.1 Performance of hypothesis tests Performance for joint test We evaluate the performance of the joint test under the null hypothesis H0 : β11 (·) = β12 (·) = 0. Power is evaluated under a sequence of alternative models with different values of τ, which is τ (·) = τβ (·) and β τ (·) = τβ (·). denoted by H1τ : β11 11 12 12 Figure 4.3 shows the empirical size (when τ = 0) and power function (when τ > 0) at significance level 0.05. We obtain 1000 Monte Carlo simulations to assess the null distribution of test statistic under sample sizes N = 200, 500. The empirical Type I error under both sample sizes are 65 close to the nominal level 0.05 and the power increases dramatically when τ increases from 0 to 0.05. To see the effect of sample size, we compare the performances under N=200 and N=500. As expected, the Type I error is closer to 0.05 and the power increase faster for larger sample size when N=500. For a relatively small sample size N=200, the Type I error is a little inflated and the power function increases slower compared to the case with N = 500. Overall, the results indicate good performance of the proposed joint testing procedure. Power plot of joint test 1.0 Power 0.7 0.5 0.3 sample_size 200 500 0.1 0.00 0.01 0.02 τ 0.03 0.04 0.05 Figure 4.3 The power plot for the joint test under different sample sizes N=200, 500 when T =10. 4.3.3.2 Performance for marginal tests The performance of the marginal tests for the nonparametric functions corresponding to different traits is evaluated through simulations. Two null hypotheses H0 : β11 (·) = 0 and H0 : β12 (·) = 0 were considered separately. For each test, power is evaluated under a sequence of alternative τ (·) = τβ (·), l = 1, 2, correspondingly. models, denoted by Haτ : β1l 1l Figure 4.4 displays the power for both marginal tests under different sample sizes N=200 and 500.The empirical Type I error under both sample sizes are very close to the nominal level 0.05 and the power increases dramatically when τ increases from 0 to 0.05. It is obvious that the power 66 increases more rapidly with larger sample size, while the overall performances under N=500 and N=200 are close, which indicates that our method performances well and does not require very large sample size. Power for testing significance of β11 1.0 Power 0.7 0.5 0.3 sample_size 0.1 200 500 0.000 0.025 0.050 τ 0.075 0.100 Power for testing significance of β12 1.0 Power 0.7 0.5 0.3 sample_size 200 500 0.1 0.000 0.025 0.050 τ 0.075 0.100 Figure 4.4 The power plots for the marginal test for N=200, 500 and T =10. In summary, the simulation results indicate that our proposed estimation method works well. The test results indicate that the asymptotic χ 2 distribution for the proposed joint and marginal test works reasonably well under a finite sample size. 67 4.4 Real data application We applied the proposed multivariate partially linear varying coefficients model and the two-step hypothesis testing procedure to a data set from a study examining the association of the A118G SNP of OPRM1 to experimental pain sensitivity. A sample of 163 healthy volunteers were evolved in this study. For each subject, Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP) were measured at 6 dosage levels of Dobutamine to assess the genetic effect on drug response. The 6 dosage levels are: 0 (baseline), 5, 10, 20, 30 and 40mcg/min. We treat the dosage levels as time in this analysis. We consider the partially linear varying coefficient model with two longitudinal traits, with the form YiSBP = β0SBP (Xi j ) + β1SBP (Xi j )Gi + α SBP Zi + εiSBP j j , YiDBP = β0DBP (Xi j ) + β1DBP (Xi j )Gi + α DBP Zi + εiDBP j j . The two longitudinal traits are SBP and DBP. One time-invariant covariate Z = age are included in the model. Five SNPs codon16, codon27, codon49, codon389, codon492 are considered. Table 4.2 displays the results of the joint and marginal testing for responses SBP and DBP, respectively. We note that condon49 and condon389 show significant result with p-values smaller than the significance level 0.05. Further marginal tests tell us that SNP condon49 has significant association with SBP but not DBP, while SNP condon389 has significant association with DBP but not SBP. The results indicate no pleiotropic effect of the two SNPs. In addition, we note that for SNP condon16, the joint test does not show significant result but the p-value of the marginal test for SBP is smaller than the significance level. If we choose α = 0.1 significance level, then the SNP shows a potential pleiotropic effect on the two traits. 68 Table 4.2 List of SNPs with MAF, allele, p-values under the joint and marginal testing for SBP and DBP. SNP ID codon16 codon27 codon49 codon389 codon492 MAF Alleles 0.3990 A/G 0.4160 G/C 0.1387 G/A 0.3045 G/C 0.4250 T/C β1SBP = β1DBP = 0 0.0866 0.3048 0.0229 0.0343 0.3234 p-values under different null hypotheses β1SBP = 0 β1DBP = 0 β1SBP is linear 0.0428 0.0514 0.8311 0.3819 0.1018 0.5938 0.0410 0.8349 0.7846 0.3550 0.0242 0.2150 0.5779 0.6957 0.2611 β1DBP is linear 0.3094 0.1012 0.4439 0.2938 0.7875 Table 4.3 shows the estimation results for the age effect α1 and α2 . The results show that age does not have significant contribution to the two blood traits in this data set. Table 4.3 List of SNPs with MAF, allele, estimation of coefficients, p-values of significance for coefficients corresponding to SBP and DBP, respectively. SNP ID codon16 codon27 codon49 codon389 codon492 MAF Alleles 0.3990 A/G 0.4160 G/C 0.1387 G/A 0.3045 G/C 0.4250 T/C α SBP α DBP 0.0109 0.0235 0.0356 0.0103 0.0213 -0.0435 -0.0365 -0.0336 -0.0644 -0.0494 p-value H0 : α SBP = 0 0.8562 0.7078 0.5637 0.8743 0.7366 H0 : α DBP = 0 0.5005 0.5902 0.6079 0.3409 0.4679 Figure 4.5 illustrates the estimated shapes for nonparametric coefficient functions for different SNPs for trait SBP. The 95% confidence bands cover the zero line for SNPs condon27, condon389 and condon492, which agrees with the testing results that these SNPs do not show significance. For SNP condon49 which shows significance at α = 0.05 level, the estimated increasing effect function as dosage increases suggests that this SNP positively responds to dosage increase to affect SBP. Figure 4.6 illustrates the estimated shapes for nonparametric coefficient functions for different SNPs for trait DBP. Again, we observe that the 95% confidence bands cover the zero line for SNPs condon27, condon49 and condon429 which agrees with the testing results that these SNPs are not significant. For SNP condon389, the estimated effect function shows a marginal significance, indicating the role of this SNP to drug response on DBP. 69 codon16, SBP 0.12 β1(dose) 0.08 0.04 0.00 0.00 0.25 0.50 dose 0.75 codon27, SBP codon49, SBP 0.05 0.15 β1(dose) β1(dose) 1.00 0.00 0.10 0.05 −0.05 0.00 −0.10 0.00 0.25 0.50 dose 0.75 −0.05 0.00 1.00 codon389, SBP 0.50 dose 0.75 1.00 codon492, SBP 0.05 β1(dose) 0.05 β1(dose) 0.25 0.00 −0.05 0.00 −0.05 −0.10 0.00 0.25 0.50 dose 0.75 −0.10 0.00 1.00 0.25 0.50 dose Figure 4.5 Plot of the estimate (solid curve) of the nonparametric function codon16, codon27, codon49, codon389 and codon492. The 95% confidence by the dashed line. 70 0.75 1.00 β1SBP (·) for SNPs bands are denoted codon16, DBP 0.12 β1(dose) 0.08 0.04 0.00 0.00 0.25 0.50 dose 0.75 codon27, DBP 1.00 codon49, DBP 0.05 0.10 β1(dose) β1(dose) 0.05 0.00 0.00 −0.05 −0.05 −0.10 0.00 0.25 0.50 dose 0.75 1.00 0.00 codon389, DBP 0.25 0.50 dose 0.75 1.00 codon492, DBP 0.05 0.04 β1(dose) β1(dose) 0.00 −0.05 0.00 −0.04 −0.10 0.00 0.25 0.50 dose 0.75 1.00 0.00 0.25 0.50 dose 0.75 1.00 Figure 4.6 Plot of the estimate (solid curve) of the nonparametric function β1DBP (·) for SNPs codon16, codon27, codon49, codon389 and codon492. The 95% confidence bands are denoted by the dashed line. 4.5 Discussion Identification of genetic pleiotropy effects has been an important task in genetic association studies. If one gene is associated with multiple traits, special attention should be paid to such genes 71 when designing drug target on those genes. In this paper, we propose a joint multivariate varying coefficient modeling procedure to accommodate correlated longitudinal traits and propose a testing procedure to find the dynamic genetic association between SNPs and multiple traits. Both simulation and real data analysis demonstrate the utility of the proposed method. One difficulty in jointly modeling multiple longitudinal traits is to model the complex correlation structure. For each subject, we should consider correlation between measurements for the same trait at different time points, correlation between measurements at the same time point on different traits, and correlation between measurements at different time points and on different traits. We implement the QIF approach in estimation and testing procedures. There are several advantages for QIF approach. First, the QIF approach only requires correct specification of the mean structure and does not require any joint likelihood in hypothesis testing. Second, it avoids estimating the nuisance correlation structure parameters by assuming that the inverse of working correlation matrix can be approximated by a linear combination of several known basis matrices. Third, when the working correlation structure is misspecified, the QIF is more efficient than the GEE approach. Forth, the inference function of the QIF approach has an explicit asymptotic form, which provides a model selection criteria and allows us to test whether coefficients are significant or time varying based on the asymptotic results. Our method was demonstrated with the L = 2 case. The proposed method can be extended to multiple longitudinal traits with L > 2, although the computational cost might increase. In the real application, we investigate association of SNPs in a candidate gene with two longitudinal traits SBP and DBP. Although the data were not longitudinal in terms of time measurement, the increasing dosage levels can be treated in a time scale. So we can apply the proposed method. The results indicate a weakly pleiotropic effect for SNP condon16 to affect both SBP and DBP. As the application shows, our method is not restricted to a longitudinal study. It also applies to other studies where a certain trait can be measured in a linear scale. Therefore, our method is directly applicable to neuro-genetics studies in which the purpose is to identify SNPs associated with spatial distribution of neuroimages in brain. 72 CHAPTER 5 CONCLUSION AND FUTURE WORK 5.1 Conclusion The originalities and contributions of our work can be summarized in two respects. From the respect of statistical methodology, we propose a functional varying index coefficient model (FVICM) to capture the nonlinear G×E interactions under a longitudinal design. In the development of the estimation procedure, penalized spline method is implemented to approximate the nonparametric unctions in the model. A profile estimation procedure is proposed to estimate two sets of parameters: the index loading coefficients and spline coefficients. Then the quadratic inference function approach for analysing longitudinal data is extended to estimate the index loading coefficients and spline coefficients in the profile estimation method. To test the linearity of nonparametric G×E interaction function, we apply the pseudo-likelihood ratio test using a linear mixed model representation of our proposed model. Consistency and asymptotic normality of the estimators are established. To deal with binary longitudinal traits, it is a natural extension of the FVICM to a generalized functional varying index coefficient model (gFVICM) for investigating nonlinear G×E interactions. We modify the profile estimation procedure with QIF approach to the gFVICM and establish theoretical results of the estimators. Then we proposed a testing procedure based on the asymptotic χ 2 distribution of the objective function in QIF approach. The testing procedure can be used to assess the linearity of the interaction function. For some complex diseases, there are multiple phenotypes that can be used to quantify the risk of diseases and sometimes they have shared genetic determinations and this phenomenon is termed genetic pleiotropy. A joint modeling for multiple longitudinal traits using varying index coefficient model is proposed in our work to deal with correlated longitudinal traits in G×E interaction problems. One difficulty of the joint model is the specification of the complicated correlation structure. 73 The important advantage of QIF approach is that a misspecified working correlation does not affect the consistency of the regression parameter estimation, and the QIF provides a robust sandwich estimator for the variance of the regression parameter estimator. When the working correlation structure is misspecified, the QIF is more efficient than the GEE. From the application perspective, the varying index coefficient modeling is a powerful tool when we consider the joint effect of environmental mixtures and how they interact with genes to affect disease risk. Our methods are well motivated by epidemiological studies with the hope to identify any synergistic G×E interaction effects. Real data application shows that, compared to the additive varying coefficient model, which consider the G×E for each single environmental factor separately, our models outperform in detecting the significant interaction effect since it can reduce multiple testing burden by treating the serval environmental variables as a single index variable. Also, the assumption of a nonparametric interaction is flexible for possible nonlinear interactions in practice. We also provide a framework to assess the simultaneous genetic effect on multiple phenotypes with longitudinal data. 5.2 Future work In the future, we plan to extend the functional varying index coefficient model to joint modeling of binary and continuous longitudinal traits. This is practically important for some diseases. For example, over-weighted people will have a higher chance to develop hypertension. Both obese and hypertension might share some common genetic determinants. Jointly modeling the binary hypertension and continuous body weight or BMI could shed novel insights into the genetic etiology of the disease. The main difficulty of joint modeling for binary and continuous longitudinal traits is the lack of a joint distribution. To overcome this difficulty, many researchers introduce a continuous latent variable underlying the binary response and assuming a joint normal distribution for the latent variable and the continuous variable. Catalano and Ryan (1992) suggested to decompose the joint distribution into two components: a marginal distribution for the continuous response and a 74 conditional distribution for the binary distribution given the continuous response. The first component can be easily modeled. The conditional distribution for binary response can be modeled using the underlying latent continuous variable through a probit link function. Kürüm et al. (2016) proposed the time-varying coefficient models for joint modeling binary and continuous longitudinal responses based on the above idea. However, they only focus on the estimation part and did not provide a testing method for the nonparametric functions in the model. Motivated by their work, we can extend their method to varying index coefficient models for joint modeling binary and continuous longitudinal traits and develop a testing method for joint testing of the significance or linearity of the interaction functions. This will be investigated in our future work. 75 APPENDIX 76 Regularity conditions To establish the asymptotic properties for the estimator of θ , we need the following regularity conditions. (A1) {ni } is a bounded sequence of integers. (A2) The parameter space Ω is compact and θ ∗0 is an interior point of Ω. (A3) The parameter θ ∗ is identified, that is, there is a unique θ ∗0 ∈ Ω such that the mean zero model assumption E[g(θθ ∗0 )] = 0. (A4) E[g(θθ )] is continuous in θ . ∗ ∗ T ∗ (A5) C¯N (θ ) = N1 ∑N i=1 gi (θ )gi (θ ) converges almost surely to C0 , which is a constant and invertible matrix. ∂ g¯ ∗ (A6) The first derivative of g¯N exists and is continuous. ∂ θN∗ (θ ) converges in probability to G0 ∗ if θ converges in probability to θ ∗0 . (B1) {ni } is a bounded sequence of integers. (B2) The parameter space Ωθ is compact and θ 0 is an interior point of Ωθ . (B3) The parameter θ is identified, that is, there is a unique θ 0 ∈ Ωθ such that the first moment assumption E[gi (θθ 0 )] = 0 holds for i = 1, ..., N, and E[gi (θθ )] is continuous. (B4) E[g(θθ )] is continuous in θ . T (B5) C¯N (θ ) = N1 ∑N i=1 gi (θ )gi (θ ) converges almost surely to C0 , which is a constant and invert- ible matrix. ∂ g¯ (B6) The first derivative of g¯N exists and is continuous. ∂ θN (θ ) converges in probability to G0 if θ converges in probability to θ 0 . 77 Proof of Theorem 1: ∗ If we can prove that θ exist and converges to θ ∗0 almost surely, then we can prove the consistency ∗ of θ directly. θ = arg min(N −1 QN (θθ ∗ ) + λ θ ∗T Dθθ ∗ ) exists because (2.4) has zero as a lower ∗ bound and the global minimum exists. To prove the consistency, first, the estimator θ is obtained by minimizing N −1 QN (θθ ∗ ) + λ θ ∗T Dθθ ∗ , then we have ∗ ∗T ∗ 1 1 θ ∗0 . QN (θ ) + λN θ Dθ ≤ QN (θθ ∗0 ) + λN θ ∗T 0 Dθ N N (1) Since 1 −1 (θ θ ∗0 )g¯N (θθ ∗0 ) = o(1) QN (θθ ∗0 ) = g¯TN (θθ ∗0 )C¯N N by the strong law of large number and (A5), and λN = o(1), 1 a.s. QN (θθ ∗0 ) + λN θ ∗T Dθθ ∗0 −−→ 0. 0 N Thus, we can obtain from (1) that ∗ ∗ −1 ∗ ∗ a.s. 1 QN (θ ) = g¯TN (θ )C¯N (θ )g¯N (θ ) −−→ 0. N (2) Since the parameter space Ω is compact, by Glvenko-Cantelli theorem, a.s. sup g¯N (θθ ∗ ) − E[g(θθ ∗ )] −−→ 0. θ ∗ ∈Ω Hence, by (A5) and the continuous mapping theorem, ∗ −1 ∗ ∗ ∗ ∗ a.s. g¯TN (θ )C¯N (θ )g¯N (θ ) − E[g(θ )]T C−1 E[g( θ )] −−→ 0. 0 Combined with (2), we get ∗ ∗ a.s. E[g(θ )]T C−1 0 E[g(θ )] −−→ 0. (3) ∗ Then we will show that it is impossible that θ remains outside of U, where U is any neighbor∗ hood of the true parameter θ ∗0 . Suppose there exist a neighborhood U such that θ ∈ U c . Since ∗ θ ∗ )] is a continuous function and U c is compact, there exists a point θ ∈ U c E[g(θθ ∗ )]T C−1 0 E[g(θ such that ∗ ∗ E[g(θ )]T C−1 0 E[g(θ )] 78 achieve its minimum in U c . By the identification of θ ∗ in (A3), there is a unique θ ∗0 ∈ Ω satisfying E[g(θθ 0 )] = 0, we have θ ∗ )] > 0, E[g(θθ ∗ )]T C−1 0 E[g(θ ∗ which contradicts (3). Then we can prove that θ converges almost surely to θ ∗ . Thus, θ is a consistent estimator of θ . Proof of Theorem 2 The estimate of θ satisfies 0= ∗ 1 ∂ QN ∗ (θ ) + 2λN Dθ . ∗ N ∂θ By Taylor expansion, we obtain 0= ∗ 1 ∂ QN ∗ ∗ 1 ∂ 2 QN ∗ ( θ ) + 2λ D ( θ − θ ∗0 ), (θθ 0 ) + 2λN Dθθ ∗0 + N ∗2 θ N ∂ N ∂θ ∗ ∗ where θ is some value between θ and θ ∗0 . Thus, we can have −1 1 ∂ QN ∗ 1 ∂ 2 QN ∗ θ − θ ∗0 = − ( θ ) + 2λ D (θθ ∗ ) + 2λN Dθθ ∗0 . N N ∂ θ ∗2 N ∂θ∗ 0 ∗ ∗ ∗ (4) Since θ converges to θ ∗0 in probability and θ is between θ and θ ∗0 , by (A5) and (A6) we can get 1 ∂ 2 QN ∗ (θ ) N ∂ θ ∗2 = 2 ∂ g¯N T ∗ ¯ −1 ∗ ∂ g¯N ∗ (θ )CN (θ ) ∗ (θ ) + o p (1) ∂θ∗ ∂θ p − → 2GT0 C−1 0 G0 When λN = o(N −1/2 ), −1 1 ∂ 2 QN ∗ 1 −1 −1/2 ). ( θ ) + 2λ D = (GT0 C−1 N 0 G0 ) + o p (N ∗2 N ∂θ 2 Similarly, since ∂ g¯N T ∗ ¯ −1 ∗ 1 ∂ QN ∗ θ (θ ) = (θθ 0 )CN (θθ 0 )g¯N (θθ ∗0 ) N ∂θ∗ 0 ∂θ∗ 79 and λN = o(N −1/2 ), we have 1 ∂ QN ∗ (θθ ) + 2λN Dθθ ∗0 = GT0 C0−1 g¯N (θθ ∗0 ) + o(N −1/2 ). N ∂θ∗ 0 Therefore, (4) can be written as √ √ ∗ −1 T −1 θ ∗0 ) + o p (1). N(θ − θ ∗0 ) = − N(GT0 C−1 0 G0 ) G0 C0 g¯N (θ (5) By Central Limit Theorem, √ d N g¯N (θθ ∗0 ) − → N(0, C0 ). Using (5) and (6), we obtain √ ∗ d −1 N(θ − θ ∗0 ) − → N(0, (GT0 C−1 0 G0 ) ), and directly, √ d −1 T N(θ − θ 0 ) − → N(0, J(GT0 C−1 0 G0 ) J ). 80 (6) BIBLIOGRAPHY 81 BIBLIOGRAPHY [1] Bai,Y., Fung W. K. and Zhu, Z. (2009). Penalized quadratic inference functions for singleindex models with longitudinal data. Journal of Multivariate Analysis 100, 152-161. [2] Baker, C. (2004). "Chapter 3. Environment Illustrated". Behavioral Genetics AAAS. ISBN 978-0871686978. [3] Belle, G., Fisher, L. D., Heagerty, P. J. and Lumley, T. (2004). "Chapter 18: Longitudinal data analysis". Biostatistics: A Methodology For the Health Sciences, 2nd Edition. [4] Carpenter, D. O., Arcaro, K., and Spink, D. C. (2002). Understanding the human health effects of chemical mixtures. Environmental Health Perspectives 110(suppl 1), 25-42. [5] Catalano, P. J. and Ryan, L. M. (1992) Bivariate latent variable models for clustered discrete and continuous outcomes. Journal of the American Statistical Association 87(419), 651-658. [6] Choi, Y., Chowdhury, R. and Swaminathan, B. (2014). Prediction of hypertension based on the genetic analysis of longitudinal phenotypes: a comparison of different modeling approaches for the binary trait of hypertension. BMC Proceedings 8(Suppl 1):S78. doi:10.1186/1753-6561-8-S1-S78. [7] Crainiceanu, C. and Ruppert, D. (2004). Likelihood ratio tests in linear mixed models with one variance component. Journal of the Royal Statistical Society, Series B 65, 165-185. [8] Crainiceanu, C., Ruppert, D., Claeskens, G., and Wand, P. (2005). Exact likelihood ratio tests for penalized splines. Biometrica 92, 91-103. [9] Crowder M. (1986). On consistency and inconsistency of estimating equations. Econometric Theory 2, 305-330. [10] Crowder M. (1995). On the use of a working correlation matrix in using generalized linear models for repeated measures. Biometrika 82(2), 407-410. [11] Cui, X., Härdle, W., and Zhu, L. (2011). The EFM approach for single-index models. The Annals of Statistics 39, 16581688. [12] Falconer, D. S. (1952). The problem of environment and selection. The American Naturalist 86, 293-298. [13] Furlotte, N. A., Eskin, E. and Eyheramendy, S. (2014). Genome-wide association mapping with longitudinal data. Genet Epidemiol 36, 463-471. [14] Gratten, J. and Visscher, P. M. (2016). Genetic pleiotropy in complex traits and diseases: implications for genomic medicine. Genome Medicine 8:78, doi: 10.1186/s13073-016-0332x. 82 [15] Greven, S., Crainiceanu, C., Kühenhoff, H., and Peters, A. (2008). Restricted likelihood ratio testing for zero variance components in linear mixed models. Journal of Computational and Graphical Statistics 17, 870-891. [16] Johnson, J. A. and Terra, S. G. (2002). Beta-adrenergic receptor polymorphisms: cardiovascular disease associations and pharmacogenetics. Pharm Res 19, 1779-1787. [17] Kürüm, E., Li, R., Shiffman, S. and Yao, W. (2016). Time-varying coefficient models for joint modeling binary and continuous outcomes in longitudinal data. Statistica Sinica 26, 979-1000. [18] Leisch, F., Weingessel, A., and Hornik, K. (1998). On the generation of correlated artificial binary data. Working Paper Series, SFB Adaptive Information Systems and Modelling in Economics and Management Science, Vienna University of Economics. [19] Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalised linear models. Biometrika 73, 12-22. [20] Liu, X., Cui, Y. and Li, R. (2016). Partial linear varying multi-index coefficient model for integrative gene-environment interactions. Statistica Sinica 26, 1037-1060. [21] Liu, X., Gao, B. and Cui Y. (2017) Generalized partial linear varying multi-index coefficient model for gene-environment interactions. Statistical Applications in Genetics and Molecular Biology 16(1), 59-74. [22] Lobo, I. (2008). Pleiotropy: one gene can affect multiple traits. Nature Education 1, 10. [23] Ma, S. and Song, P. (2015). Varying Index Coefficient Models. Journal of the American Statistical Association 110, 341-356. [24] Macgregor, S., Knott, S. A., White, I. and Visscher, P. M. (2005) Quantitative trait locus analysis of longitudinal quantitative trait data in complex pedigrees. Genetics 171, 13651376. [25] Song, P., Jiang, Z., Park, E. and Qu A. (2009) Quadratic inferance functions in marginal models for longitudinal data. Statistics in Medicine 28, 3683-3696. [26] Qu, A. and Li, R. (2006). Quadratic inference functions for varying coefficient models with longitudinal data. Biometrics 62, 379-391. [27] Qu, A., Lindsay, B. G. and Li, B. (2000). Improving generalised estimation equations using quadratic inference fucntions. Biometrika 87, 823-836. [28] Ross, C. A., and Smith, W. W. (2007). Gene-enviroment interactions in Parkinson’s disease. Parkinsonism and Related Disorders 13, S309-S315. [29] Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11, 735-757. 83 [30] Ruppert, D. and Carroll, R. J. (2000). Spatially-adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 42, 205-223. [31] Self, S. G. and Liang, K. Y. (1987). Assessing cumulative health risks from exposure to environmental mixtures - three fundamental questions. Journal of the American Statistical Association 82, 605-610. [32] Sexton, K. and Hattis, D. (2007). Asymptotic properties of maximum likelihood estimators and likelihood ratio under non-standard conditions. Environmental Health Perspectives 115, 825-832. [33] Sitlani, C. M., Rice, K. M., Lumley, T. et al. (2015). Generalized estimating equations for genome-wide association studies using longitudinal phenotype data, Stat Med 34, 118-130. [34] Stram, D. O. and Lee, J. W. (1994). Variance components testing in the longitudinal mixed effects model. Biometrics 50, 1171-1177. [35] Wang, Y. and Chen, H. (2012). On testing an unspecified function through a linear mixed effects model with multiple variance compnents. Biometrics 68, 1113-1125. [36] Wang, W., Feng, Z., Bull, S. B. and Wang Z. (2014). A 2-step strategy for detecting pleiotropic effects on multiple longitudinal traits. Front. Genet. doi.org/10.3389/fgene.2014.00357. [37] Xu, Z., Shen, X., Pan, W. (2014) Longitudinal analysis is more powerful than cross-sectional analysis in detecting genetic association with neuroimaging phenotypes. PLoS One 9(8), e102312. [38] Yu, Y. and Ruppert, D. (2002). Penalized spline estimation for partially linear single-index models. Journal of the American Statistical Association 97(460), 10421054. [39] Zhang, D. (2004). Genaralized linear mixed models with varying coefficients for longitudinal data. Biometrics 60, 8-15. [40] Zimmet, P., Alberti, K., and Shaw, J. (2001). Global and societal implications of the diabetes epidemic. Nature 414, 782-787. 84