VARIABLE SELECTION IN VARYING MULTI-INDEX COEFFICIENT MODELS WITH APPLICATIONS TO GENE-ENVIRONMENTAL INTERACTIONS By Shunjie Guan A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy 2017 ABSTRACT VARIABLE SELECTION IN VARYING MULTI-INDEX COEFFICIENT MODELS WITH APPLICATIONS TO GENE-ENVIRONMENTAL INTERACTIONS By Shunjie Guan Variable selection is an important topic in modern statistics literature. And varying multi-index coefficient model(VMICM) is a promising tool to study the synergistic interaction effects between genes and multiple environmental exposures. In this dissertation, we proposed a variable selection approach for VMICM, we also generalized such approach to generalized and quantile regression settings. Their theoretical properties, simulation performance and application in genetic research were studied. Complicated diseases have both environmental and genetic risk factors, and large amount of research have been devoted to identify gene-environment (G×E) interaction. Defined as different effect of a genotype on disease risk in persons with different environmental exposures (Ottman (1996)), we can view environmental exposures as the modulating factors in the effect of a gene. Based on this idea, we derived a three stage variable selection approach to estimate different effects of gene variables: varying, constant and zero which respectively correspond to nonlinear G×E effect, no G×E effect and no genetic effect. For multiple environmental exposure variables, we also select and estimate important environmental variables that contribute to the synergistic interaction effect. We theoretically evaluated the oracle property of the three step estimation method. We conducted simulation studies to further evaluate the finite sample performance of the method, considering both continuous and discrete predictors. Application to a real data set demonstrated the utility of the method. In Chapter 3, we generalized such variable selection approach to binary response setting. Instead of minimizing penalized squared error loss, we chose to maximize penalized log-likelihood function. We also theoretically evaluated the oracle property of the proposed selection approach in binary response setting. We demonstrated the performance of the model via simulation. At last, we applied our model to a Type II diabetes data set. Compared to conditional mean regression, conditional quantile regression could provide a more comprehensive understanding of the distribution of the response variable at different quantile. Even if the center of distribution is our only interest, median regression (special case of quantile regression) could offer a more robust estimator. Hence, we extended our three stage variable selection approach to a quantile regression setting in Chapter 4. We demonstrated the finite sample performance of the model via extensive simulation. And we applied our model to a birth weight data set. I delicate this dissertation to my parents, Bishan Lin, Ruixiong Guan and my girlfriend Lingjie Zhou. iv ACKNOWLEDGMENTS Here, I would like to offer my sincere gratitude to my advisor Dr. Yuehua Cui for his patient, support and guidance during my study and research. Dr. Cui is kind, knowledgable and humble, whenever I am stuck with a problem, he could always offer very useful idea and insights. In the genetic journal club hosted by Dr. Cui, I learned a lot from other’s presentation and eventually, I was able to give a talk on my own. I really own my completion of this dissertation to Dr. Cui’s guidance. I also have to thank the members of my PhD committee, Dr. Pingshou Zhong, Dr. Hyokyoung G. Hong, and Dr Qing Lu. Their comments and suggestions are very helpful in my graduate study. I appreciate the help I got from all the professors in the Department of Statistics and Probability. Especially, Dr. Taps Maiti, I took several of his classes, unlike other courses that focus on theory and proof, his course in linear modelling focused on interpretation and how to explain statistical concepts to non-statisticians. It is only when I began to look for a job in the past months that I realize its importance. I really own my success in finding a job after graduation to his insistence in interpreting statistics results. My thanks also goes to other senior students in Dr. Cui’s group, including Dr. Bin Gao, Dr. Honglang Wang, Dr. Tao He, Jingyi Zhang and post-doctoral researcher Xu Liu. In just a few years, many of them become successful researchers in a university or statisticians in big companies. Last but not least, I thank my parents and my girlfriend for their support and confidence in me. They are the beacon for my journey, offered me hope and a sense of direction. v TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 1 Introduction . . . . . 1.1 Overview . . . . . . . . . . . . 1.2 Gene-Environment Interaction 1.3 Non-Parametric Models . . . 1.4 Variable Selection . . . . . . . 1.5 Quantile Regression . . . . . . 1.6 Objective and Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix . . . . . . . 1 1 2 3 5 8 9 Variable Selection with Varying Multi-Index Coefficients Model for G×E Interaction . . . . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Variable Selection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Model Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 Estimation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.3 Iterative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.4 Selection of tuning parameters . . . . . . . . . . . . . . . . . . . . . . 2.2.5 Selection of the order h and the number of internal knots K . . . . . 2.2.6 Selection of Initial values for β . . . . . . . . . . . . . . . . . . . . . Theoretical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Simulation Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.2 The Continuous Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.3 For discrete G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 10 13 13 14 16 18 20 20 21 22 23 23 25 29 32 Chapter 2 2.1 2.2 2.3 2.4 2.5 2.6 Chapter 3 Variable Selection for Generalized VMICM . . 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Variable Selection for gVMICM . . . . . . . . . . . . . . . 3.2.1 Model Setup . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Estimation Method . . . . . . . . . . . . . . . . . . 3.2.3 Computational algorithm . . . . . . . . . . . . . . . 3.2.4 Selection of Parameters . . . . . . . . . . . . . . . . 3.2.4.1 Selection of tuning parameters λ1 , λ2 , λ3 . 3.2.4.2 Selection of order h and number of interior 3.2.5 Choosing the initial values . . . . . . . . . . . . . . 3.3 Theoretical Properties . . . . . . . . . . . . . . . . . . . . vi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . knots K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 35 38 38 38 40 42 42 44 44 45 3.4 . . . . . 46 47 49 54 56 . . . . . . . . . . . . . . . . K . . . . . . . . . . . . . . 59 59 61 62 62 64 66 66 67 67 68 69 69 71 75 79 Chapter 5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 81 82 3.5 3.6 Simulation . . . . . . . . 3.4.1 For continuous G 3.4.2 For discrete G . . Real Data Application . Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 4 Variable Selection for Quantile VMICM . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Variable selection for quantile regression with VMICM . . . . . . . . . . 4.2.1 Model setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Estimation method . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.3 Estimation algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.4 Selection of parameters . . . . . . . . . . . . . . . . . . . . . . . . 4.2.4.1 Selection of the tuning parameters λ1 , λ2 , λ3 . . . . . . . 4.2.4.2 Selection of the order h and the number of interior knots 4.2.5 Selection of the initial values . . . . . . . . . . . . . . . . . . . . . 4.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Simulation Setting . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.2 The continuous case . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.3 The discrete case . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . APPENDICES . . . . . . . . . . . . . Appendix A Real Data Results . . . A.1 Real data results of gVMICM . Appendix B Algorithm . . . . . . . . B.1 Algorithm for VMICM . . . . . B.2 Algorithm for model (2.6) . . . B.3 Algorithm for gVMICM . . . . B.4 Algorithm for quantile VMICM B.5 Algorithm for model (4.6) . . . Appendix C Proof of Theorems . . . C.1 Proof of Theorem 2.3.1 . . . . . C.2 Proof of Theorem 2.3.2 . . . . . C.3 Proof of Theorem 3.3.1 . . . . . C.4 Proof of Theorem 3.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 . 85 . 85 . 87 . 87 . 90 . 90 . 92 . 93 . 94 . 94 . 98 . 99 . 103 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 vii LIST OF TABLES Table 1: Selection and prediction accuracy of mk (·) for continuous G . . . . . . . . 24 Table 2: Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 25 Table 3: Selection and prediction accuracy of mk (·) for discrete G. . . . . . . . . . 27 Table 4: Prediction accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 2 29 Table 5: The estimated loading parameters . . . . . . . . . . . . . . . . . . . . . . 31 Table 6: Selection and prediction accuracy of mk (·) for continuous G . . . . . . . . 48 Table 7: Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 49 Table 8: Setup for mk (u) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Table 9: Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . . 51 2 2 Table 10: Estimation accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 53 2 Table 11: The estimated effect of β for SNP rs6537663. . . . . . . . . . . . . . . . . 56 Table 12: Selection and estimation accuracy of mk (·) for continuous G . . . . . . . 70 Table 13: Selection and estimation accuracy of β . . . . . . . . . . . . . . . . . . . . 71 Table 14: Setup for mk (u) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Table 15: Selection and estimation accuracy for mk (u) with discrete G . . . . . . . 73 Table 16: Selection and estimation accuracy for β with discrete G . . . . . . . . . . 75 Table 17: Effect of SNPs in gene ST3GAL1 . . . . . . . . . . . . . . . . . . . . . . . 77 Table 18: Estimated Loading Parameter for Gene ST 3GAL1 . . . . . . . . . . . . . 78 Table 19: List of SNPs with a varying effect. . . . . . . . . . . . . . . . . . . . . . . 85 Table 20: List of SNPs with a constant effect. . . . . . . . . . . . . . . . . . . . . . 86 viii LIST OF FIGURES Figure 1: Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . . 28 Figure 2: Plot of the varying coefficient effect for gene expression P GGT 1B. . . . 31 Figure 3: Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . . 52 Figure 4: Plot of effects on a log odds scale for SNP rs6537663 . . . . . . . . . . . 56 Figure 5: Selection and estimation accuracy of mk (·) for discrete G . . . . . . . . . 74 Figure 6: Plot of interaction effect effects . . . . . . . . . . . . . . . . . . . . . . . 79 ix Chapter 1 Introduction 1.1 Overview In this dissertation, we studied how genetic and environmental factors interact to affect a disease outcome by developing novel statistical methods. Ever since Gregor Mendel’s famous experiments with his pea plants in the nineteenth century, researchers have been fascinated with the role of genetics played in our lives. With decades of genetics research, we knew more than ever the effect of various genes. For example, we knew mutations in the CFTR gene cause cystic fibrosis, mutations in PAH gene cause phenylketonuria. In fact, scientists have identified more than 10, 000 human disorders that are caused by mutations in single genes. However, complex diseases such as type II diabetes have various risk factors: environmental risk factors such as exercise level, body mass index, genetic risk factors such as mutations in gene TCF7L2 and ABCC8. Hence, it is of great interest to study how gene and environment interact and affect various disease traits. In this chapter, we first provided some background information and a brief review of traditional statistical models used to study gene-environment(G×E) interaction in section 1.2. To address the constrain of traditional G×E interaction models, we discussed non-parametric models in section 1.3. In section 1.4, we discussed the recent advance in variable selection via penalized regression and how we can apply variable selection in our model to select significant risk factors. We 1 offered a brief review of quantile regression and its benefit in section 1.5. At last, the goal and organization of this dissertation is offered in section 1.6 1.2 Gene-Environment Interaction In recent years, more and more research suggested that gene-environment (G×E) interaction plays an important role in complex traits such as type II diabetes and birth weight. G×E interaction was defined by Ottman (1996) as “a different effect of a genotype on disease risk in persons with different environmental exposures”. Traditionally, G×E interaction was investigated via linear model: Y = β0 + β G ∗ G + β E ∗ E k + β G×E ∗ G ∗ E k + (1.1) where G represents genetic factors; E k represents an environmental factor; β G , β E represents the genetic effects and environmental effect respectively; and β G×E represents the G×E effect between genetic factors G and environmental factor E k . However, such model has several drawbacks. Firstly, it assumes the G×E interaction is linear, which is often violated in many cases. Secondly, model (1.1) is only computational feasible with a single environmental factor. With the introduction of multiple environmental factors, model dimension will increase dramatically, resulting in biased and unstable estimation. Nevertheless, many epidemiological studies revealed that disease risks can be modified by simultaneously exposure to several environmental factors (Carpenter et al. (2002); Sexton and Hattis (2007)). These limitations led to the implementation of several non-parametric models in the G×E interaction studies. 2 1.3 Non-Parametric Models In parametric modelling, such as linear model (1.1) or generalized linear model, we assumed we knew the model structure in advance and we estimated the model parameters based on the assumed structure and the data set. However, more often than we would prefer, such assumptions cannot be justified. Which was particularly problematic since all parameter estimation and model interpretation were made based on those assumptions. This predicament led us to consider non-parametric modelling. Non-parametric models make little to no assumptions, it lets the data decide the functional relationships between the response variable and the predictors. Due to its flexibility, non-parametric models could be applied to most data set. For example, to address the limitation of linearity in model (1.1), Ma et al. (2011) proposed to use varying coefficient(VC) model to detect non-linear gene-environmental interaction. Proposed by Hastie and Tibshirani (1993), a varying coefficient model could be of the form p Y = β0 (X) + βk (X)Gk + (1.2) k=1 where Y is the response; Gk , k = 1, . . . , p is the kth genetic factor; and X is a single environmental factor and is the random error. Representing the gene effect of Gk , βk (X) is a smooth non-linear non-parametric functions indexed by X. Under such structure, βk (X) is allowed to vary as a function of the environmental factor X. It could either be a linear or a non-linear function. This is the reason why model (1.2) could be used to detect non-linear gene-environmental interaction between several genetic variants and a single environmental variant. While VC model (1.2) alleviates the linearity constrain of model (1.1), it can only be 3 used to model G×E interaction with a single environmental factor. Similar to linear models, it is computational infeasible to model how a mixture of environmental factors interact with genetic variants due to dramatically increasing model dimension. To address this issue, Liu et al. (2016) proposed we could implement varying multi-index coefficient model (VMICM) p Y = m0 (Xβ) + mk (Xβ)Gk + (1.3) k=1 where Y is the continuous response variable; G = (G1 , . . . , Gp ) is a p dimensional matrix representing genetic factors; X is an environmental factor matrix of dimension q; β is the loading parameter for environment covariates X; mk (u) is a smooth non-linear non-parametric function indexed by Xβ, and it represents the gene effect of Gk . One of the main advantage of model (1.3) is it considers the interaction between genetic variants G and a mixture of environmental variants X without increasing model dimension dramatically. We could interpret mk (Xβ) as the gene effect of Gk modulated by its index Xβ. And βd could be interpreted as the strength of interaction between the d − th environmental factors Xd and G. We could also observe VC model (1.2) is a special case of VMICM (1.3) when q = 1. Due to its unique ability to accommodate non-linear G×E interaction with a mixture of environmental variants, we decide to implement VMICM in this dissertation. Estimation for non-parametric model such as (1.2) and (1.3) could be roughly grouped into three categories: kernel smoothing (Fan and Zhang (1999); Xia and Li (1999) ; Cai et al. (2000)), spline-based methods (Huant et al. (2004); Hoover et al. (1998); Chiang et al. (2001)) and wavelet estimation (Zhou and You (2004)). In this dissertation, we adopted the idea of B-spline approximation to estimate non-parametric function mk (u) for several reasons. Firstly, by some transformation of the B-spline basis function, we would be able to 4 separate the constant effect of Gk from its varying effect. This would be discussed in detail in the following chapters. Secondly, the computation algorithm of B-spline approximation is more efficient compared to kernel based methods. It is essential since we are working with high dimension genetics data. Further, it’s easier to implement variable selection via penalized regression in a B-spline approximation setting. 1.4 Variable Selection To select significant gene environmental combo from a large number of variants, we need to implement some dimension reduction technique, via either variable selection or hypothesis testing. In this dissertation, we focused on variable selection for its efficient algorithm and unique feature of simultaneous model selection and estimation. Traditional model selection techniques include backward/forward selection or information criterion based technique such as AIC or BIC. However, with the rise of big data, such methods are no longer feasible for several reasons: exponentially increasing computation time and unstable estimation due to increasing collinearity. Recently, variable selection via penalized regression has been gaining popularity. Its idea is to add a penalty term to the loss or lieklihood function. It could be of the form p ˆ = arg min Q(β) = arg min ||Y − Xβ||2 + n β pλ (|βj |) 2 β β j=1 (1.4) or p ˆ = arg max Q(β) = arg max l(β, X, Y ) − n β pλ (|βj |) β β j=1 5 (1.5) where Y is the dependent variable and X is the predictor; ||Y − Xβ||22 is the squared error loss function; l(β, X, Y ) is the log-likelihood function; and pλ (|βj |) is the penalty function for the j-th coordinate of β. By adding an appropriate penalty term to the optimization ˆ could be shrunk to 0, therefore, function, some covariates of the penalized estimator β achieving simultaneous model selection and estimation. With different choices of penalty function pλ (·), the penalized estimator of β could possess different properties. Fan and Li (2001) advocated three properties that a penalized estimator should possess: (1) Sparsity: The penalized estimator should automatically set small coefficients to zero, therefore achieving model selection. (2) Approximately Unbiasedness: The penalized estimator should be approximated unbiased, especially when the true coefficient is large. (3) Continuity: The penalized estimator should be continuous in the data, therefore, reducing instability in model prediction. To possess all three properties under squared error loss setting, Antoniadis and Fan (2001) deduced that the penalty function pλ (t) should satisfy: (1) mint≥0 {t + pλ (t)} > 0; (2) pλ (t) = 0 for large t; (3) arg mint≥0 {t + pλ (t)} = 0. Here, we present several popular choice of penalty functions: (1) Ridge Regression: pλ (βj ) = λ|βj |2 (Hoerl and Kennard (1970)); (2) LASSO: pλ (βj ) = λ|βj | (Tibshirani (1996)); (3) Adaptive Lasso: pλ (βj ) = wj λ|βj | (Zou (2006)); 6 (aλ−β )+ j (4) SCAD: pλ (βj ) = λ{I(βj ≤ λ) + (a−1)λ I(βj > λ)} for some a > 2 (Fan and Li (2001)); βj (5) MCP: pλ (βj ) = λ 0 (1 − τsλ )+ ds for some τ > 0 (Zhang (2010)). Among those penalty functions, adaptive LASSO, SCAD and MCP all possess the sparsity, approximately unbiasedness and continuity property. In this dissertation, we focused on MCP as our penalty function. To solve the optimization problem (1.4) or (1.5), there are several algorithms available. Least-angle regression (LARS) (Efron et al. (2004)) could be used to calculate the entire solution path of the LASSO problem very efficiently. Fan and Li (2001) proposed local quadratic approximation (LQA) algorithm to solve the non-concave penalized likelihood problem. It can be implemented with a number of penalty functions. Their idea was to locally approximate the optimization function with a quadratic function. Therefore, transforming problem (1.4) or (1.5) to a least square problem with a closed form solution. Even though its efficient was surpassed by recent algorithms, the idea of LQA is still of great importance. Building on the idea of LQA, Zou and Li (2008) proposed local linear approximation (LLA) algorithm. Their idea was to locally approximate the non-concave penalty function linearly. Transforming the non-concave penalty to a LASSO penalty, which could be solved using LARS. Coordinate descent algorithm (Friedman et al. (2007), Friedman et al. (2010)) is an even more efficient algorithm. With small modification, it could be implemented to a wide range of penalized optimization problems. In this dissertation, we adopted the idea of coordinate descent and LQA to solve our optimization problem. Besides the usual penalized regression that select individual parameters, there is a group analog, call grouped penalized regression (Yuan and Lin (2006)). Instead of penalizing 7 individual parameter, it penalizes the L2 norm of a group of parameters. Assume β is divided into q groups β = (β 1 , . . . , β q ), and the objective function could be of the form q ˆ = arg min Q(β) = arg min ||Y − Xβ||2 + n pλ (||β j ||2 ) . β 2 β β j=1 (1.6) The end result is that we select non-zero parameters as a group, either a group of parameters being all zero or none of them being zero. The grouped penalized regression technique is particularly useful in this dissertation. 1.5 Quantile Regression Another objective of this dissertation was to extend the proposed variable selection for VMICM to quantile regression setting. Quantile regression is a very important alternative to the conventional conditional mean regression. It differs from mean regression in the loss function. Instead of trying to minimize the squared error loss for linear model, quantile regression tries to minimize the quantile loss function n ρτ (Yi − X i β) and ρτ (u) = u{τ − I(u < 0)}. (1.7) i=1 Quantile regression also possess several advantages. First, modelling a dataset at several different quantiles offers a far more comprehensive view of the distribution of the response variable. In many scenarios, the effect of gene Gk varies at different quantile of the distribution. Further, even when we are only interested in the center of the distribution, median regression (quantile regression with τ = 0.5) can provides more robust estimator, therefore, insensitive to outliers. 8 1.6 Objective and Organization To select non-linear gene-environment interaction, we proposed a three stages iterative variable selection approach for Varying Multi-Index Coefficient Model and its generalization. We also extended such approach to a quantile regression setting. One of our goal is to classify genetic variants into three categories: varying, constant and zero. Varying effect gene is the gene that interact with environmental factors. Its effect on the response varies as environmental factors changes. Constant effect gene is the gene that only has a constant effect, not being modulated by environmental factors. Zero effect gene does not have an effect at all. By selecting non-zero loading parameters β in model (1.3), another goal of the proposed approach is to select environmental variants that interact with genetic variants. The rest of the dissertation is organized as follow. In chapter 2, we presented the variable selection approach for VMICM, its estimation method, and theoretical properties. We conducted extensive simulation studies to evaluate the finite sample performance of the proposed method. The utility of our model was demonstrated with a real data analysis. In chapter 3, we generalized our selection approach to a binary response generalized regression setting. We extended the model to a quantile regression setting in chapter 4, followed by conclusion and further works in chapter 5. At last, all the proofs and details of the algorithms were rendered in the Appendices. 9 Chapter 2 Variable Selection with Varying Multi-Index Coefficients Model for G×E Interaction 2.1 Introduction Gene-environment (G×E) interaction study has been gaining popularity. As discussed in chapter 1, varying multi-index coefficient model (VMICM) enjoys the unique ability to model non-linear interaction between genetic variants and a mixture of environmental variants. In this chapter, we propose a variable selection model for VMICM. Consider the varying multi-index coefficient model of the form: p Y = mk (Xβ)Gk + (2.1) k=0 where Y = (Y1 , Y2 , . . . , Yn )T is a continuous response variable that measures certain phenotypic trait of interest; X is a q-dimensional environmental exposure variables; G is a p + 1 dimensional genetic variables; mk (·), k = 0, 1, . . . , p is the unknown non-parametric function and β is a vector of unknown loading parameter of dimension q. One of the main 10 advantage of VMICM is that it models the effects of G on Y as functions of X without suffering the curse of dimensionality. One can interpret mk (Xβ) as the effect of Gk on Y , modified by multiple X variables though the index Xβ. In addition, VMICM is flexible enough to cover a wide range of models. For instance, if q = 1 and β = 1 then it becomes an additive varying coefficient model, and if p = 1 and G = 1 then it becomes a standard additive single index model. Variable selection has been a popular statistical strategy to solve large p small n problem in a regression setup. In the past, researchers often opted for forward/backward selection, and information based criteria such as AIC and BIC for variable selection. Recently, variable selection via penalized regression is gaining more popularity and wider acceptance as it features simultaneous selection and estimation of parameters. Its idea is to add a penalty function to the loss function or log-likelihood function. Bridge regression (Frank and Friedman (1993)), least absolute shrinkage and selection operator (LASSO) (Tibshirani (1996)) and its extensions (adaptive-LASSO Zou (2006)), smoothly clipped absolute deviation (SCAD) (Fan and Li (2001)) and minimax concave penalty (MCP) (Zhang (2010)) are a few examples. To evaluate different penalized functions, Fan and Li (2001) proposed three important criteria: sparsity, unbiasedness, and continuity. They also showed that SCAD penalty possess oracle property, meaning that penalized regression featuring SCAD works as well as if the correct sub-model was known in advance. Adaptive LASSO (Zou (2006)), SCAD (Fan and Li (2001)) and MCP (Zhang (2010)) all possess oracle property. However, for adaptive LASSO, determining weights for parameters might become problematic when p > n. In the current work, we adopt MCP penalty function for its oracle property and fast algorithm. Considering the nonlinear structure about the unknown non-parametric functions mk (·) 11 and its unknown parameter β. We propose a three stage iterative variable selection strategy. Specifically, our goal is: (1) to classify the non-parametric functions mk (·), k = 1, . . . , p into three categories: varying, constant and zero; (2) to select zero and non-zero components of loading parameters β; and (3) to estimate mk (·), k = 0, 1, . . . , p and β. Our approach is motivated by the practical need to separate three different mechanisms in G×E interaction. The zero function of mk (·) indicates no genetic effect at all; the constant function of mk (·) indicates no G×E effect; while the varying function of mk (·) indicates the existence of G×E effect. As shown in Liu et al. (2016), the VMICM model has the advantage to capture the joint interaction of genes with multiple exposures as a whole. Novel insights about the underlying genetic mechanism can be revealed by the proposed model. In addition to the selection of the coefficient functions, we can also select important loading parameters inside each index coefficient function, to further quantify the relative importance of individual exposure variables. Feng and Xue (2013) proposed a variable selection approach based on VMICM by applying a group SCAD penalty on B-spline coefficients γ and loading parameters β. They focused on either zero or nonzero coefficient functions mk (·). We are particularly interested in the constant coefficient since it corresponds to no G×E effect and has important practical implications. Tang et al. (2012) proposed a two step variable selection approach based on an additive varying-coefficient model. They classified the non-parametric function into three categories: varying, constant or zero. However, their model is a special case of our VMICM model with the dimension of the X variable being one. No variable selection approach on VMICM has been proposed to classify unknown non-parametric functions mk (·) into three categories (varying, constant or zero), while at the same time selecting non-zero loading parameter β. Following their previous work, we use B-spline basis functions to approximate 12 unknown non-parametric functions mk (·), then using penalized regression to classify mk (·) into varying, constant or zero. In addition, we select non-zero β via first order approximation and penalized regression. We show that under mild regulatory conditions, our estimators possess the oracle property, indicating that our penalized estimator works as well as if the correct sub-model is known in advance. The rest of the chapter is organized as follows. Section 2.2 introduces our proposed variable selection approach, including estimation method, iteration approach, and how to select various tuning parameters. Method on how to select initial values for β is also discussed. In Section 2.3, we evaluate the theoretical properties of our approach. In Section 2.4, we perform simulations to evaluate the performance of our approach in finite samples, followed by a real data application in Section 2.5 and a discussion. 2.2 Variable Selection Method Throughout the chapter, superscript T is used to denote matrix transpose, ||.||p is used to denote Lp norm, and log(a) is used to denote natural logarithm of a. For the sake of simplicity, we use constant and non-zero constant interchangeably. 2.2.1 Model Setup The varying multi-index coefficient model is set up as follows: p Y = mk (Xβ)Gk + k=0 13 where Y n×1 = (Y1 , Y2 , . . . , Yn )T is a continuous response variable; n is the sample size. mk (·), k = 0, 1, . . . , p are p + 1 unknown continuous functions; X n×q = (X 1 , X 2 , . . . , X q ) are continuous loading covariates; β q×1 = (β1 , . . . , βq )T are the loading parameters; Gn×(p+1) = (G0 , G1 , . . . , Gp ), G0 = (1, . . . , 1)T and Gk = (G1k , G2k , . . . , Gnk )T is a continuous or discrete vector of length n for k = 1, 2, . . . p. In the model, mk (Xβ) is the effect of Gk on Y for k = 0 and m0 (Xβ) is the intercept function which models the marginal effect of multiple X variables on Y . The error term is an unknown random error with mean 0 and variance σ 2 . We further assumed i and j are independent ∀ 1 ≤ i, j ≤ n and i = j. 2.2.2 Estimation Method Our goal was to select and estimate unknown functions {mk (·)}k=0,1,...,p and unknown loading parameter β = (β1 , . . . , βq )T . For identifiability purpose, we assume ||β||2 = 1 and β1 > 0, and mk (·) cannot has the form of mj (u) = αT uβ T u + γ T u + c. We approximated the unknown function {mk (u)}k=0,1,...,p using B-spline basis functions. Without loss of generality, we assumed u ∈ [0, 1]. Let K be the number of internal knots, h be the degree of the B-spline basis function. So h = 1 represents linear splines, h = 2 represents quadratic splines. Denote u1 , u2 , ..., uK be internal knots satisfying 0 = u0 =< u1 < u2 < .... < uK < uK+1 = 1. Denote Int to be left closed, right opened interval [ut−1 , ut ) for 1 ≤ t ≤ K; InK+1 to be closed interval [uK , uK+1 ]. Denote F to be a collection of functions f defined on [0, 1] satisfying: (i) the restriction of f to Int is a polynomial of degree h or less for 1 ≤ j ≤ K + 1; (ii) f is h − 1 times continuous differentiable on [0, 1]. Let L = K + h + 1, by (Schumaker (2007)), we have normalized B-spline basis function B(u) = (B 1 (u), B 2 (u), . . . , B L (u)) for F . And there exists a linear transformation matrix 14 ¯ Π, such that ΠB(u) = (1, B(u)) = (1, B 2 (u), B 3 (u), . . . , B L (u)) = B(u) where each ¯ component of B(u) is a function of u. Then for 0 ≤ k ≤ p, we can estimate mk (u) by ¯ mk (u) ≈ (1, B2 (u), · · · , BL (u)) ∗ (γk1 , γk2 , · · · , γkL )T = B(u)γ k = γk1 + B(u)γ k∗ (2.2) where γ k∗ = (γk2 , γk3 , . . . , γkL )T and γ k = (γk1 , γ Tk∗ )T . With B-spline approximation, (2.1) can be rewritten as p ¯ {γk1 + B(Xβ)γ k∗ }Gk + . Y = (2.3) k=0 Thus, the original estimation problem can be transformed to estimate {γk1 , γ k∗ }k=0,1,...,p and β. Note: the transformation Π enable us to separate the constant effect of Gk on Y from its jointly effect with X on Y . That is: (1) if ||γ k∗ ||2 = ( L 2 1/2 l=2 γkl ) = 0, then there exists interaction between Gk and multiple X; (2) if ||γ k∗ ||2 = 0 and |γk1 | = 0, then Gk has a constant effect on Y , i.e., no G×E interaction effect; and (3) if further ||γ k∗ ||2 = 0 and |γk1 | = 0 then Gk has no effect on Y at all. To select and estimate the parameters {γ k }k=0,1,...,p and β, we adopted the penalized regression idea and minimized the following objective function: p n g(Yi − Q(β, γ) = i=1 p ¯ [γk1 + B(Xβ)γ k∗ ]Gik ) + n k=0 k=1 p q pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) + n +n pλ1 (||γ k∗ ||2 ) k=1 (2.4) pλ3 (|βd |) d=2 where g(·) is a loss function; pλ1 (·), pλ2 (·), pλ3 (·) are penalty function of the corresponding parameters; and I(·) is an indicator function. Remarks: (1) From the construction of the penalty function, we penalized γk1 only if 15 ||γ k∗ ||2 = 0. If ||γ k∗ ||2 = 0, it implies that the function is varying and no need to penalize the constant part. (2) No penalty was applied to the intercept function m0 (·) as both γ 0∗ and γ01 was not involved in the penalty term. There is no practical motivation of penalizing the marginal intercept function. (3) No penalty was applied to the first loading parameter β1 in β due to the constraint. For the penalty function, we used MCP penalty proposed by C. Zhang (Zhang (2010)), p(x, λ) = λ 0x (1 − τsλ )+ ds with regularization parameters τ > 0 and λ > 0. For the loss function, we set g(·) to be squared error loss, (2.4) can be further rewritten as : p n Yi − Q(β, γ) = i=1 ¯ [γk1 + B(Xβ)γ k∗ ]Gik 2 p k=0 k=1 p q pλ2 (|γk1 |)I(||γ k∗ ||1 = 0) + n +n pλ1 (||γ k∗ ||1 ) +n k=1 (2.5) pλ3 (|βd |) d=2 where pλ1 (·), pλ2 (·), pλ3 (·) are the MCP penalty function defined above. 2.2.3 Iterative Approach Our modeling purpose was to separate the mk (·) function into three different categories: varying, constat or zero, denoted by V, C and Z respectively. Following the ideas by Feng and Xue (2013) and Tang et al. (2012), we adopted a three step iterative approach to obtain our penalized estimator. ˆ (0) , we obtained our 1st step estimation Step 1 : For given initial values of β, denoted by β 16 (1) (1)T T ˆ (1) = {ˆ ˆ k∗ }k=0,1,...,p by following a group penalized regression, of γ, denoted by γ γk1 , γ ˆ (0) ) ˆ (1) = arg min Q1 (γ|λ1 , β γ γ where ˆ (0) ) = Q1 (γ|λ1 , β p n {Yi − i=1 p ˆ (0) )γ k∗ ]Gik }2 + n ¯ [γk1 + B(X β k=0 pλ1 (||γ k∗ ||2 ). k=1 Note that instead of penalizing each coordinate of γ k∗ = (γk2 , . . . , γkL )T separately, we penalized the L2 norm of γ k∗ because we want to assess the presence of joint varying effect of X and Gk on Y . No penalty was applied to γ 0∗ (the B-spline coefficients for the intercept function m0 (·)). Step 1 separates mk (·), k = 1, . . . , p into two categories: varying(V) or (1) (1) non-varying(NV), and mk (·) ∈ V if ||ˆ γ k∗ ||2 > 0 and mk (·) ∈ N V if ||ˆ γ k∗ ||2 = 0. (1) ˆ k∗ = 0. From the non-varying Step 2 : The aim of this step was to select γk1 given γ functions obtained from step 1, we would like to further select the variables with constant effects, and classify the non-parametric functions into constant(C) and zero (0). We penalized (1) γk1 only when ||ˆ γ k∗ ||2 = 0, k = 1, 2, . . . , p, and no penalty is applied to γ01 . (2) (2) (2) ˆ (2) = {(ˆ ˆ k∗ )k∈V , (ˆ We obtained our step 2 estimator γ γk1 , γ γk1 )k∈N V } via penalized regression ˆ (2) = arg min Q2 (γ|λ2 , β (0) , γ ˆ (1) ) γ γ 17 where n Q2 (γ|λ2 ˆ (1) ) , β (0) , γ (2) (0) )γ (2) ]G − ¯ [γk1 + B(Xβ k k∗ {Yi − = i=1 k∈V (2) γk1 Gk }2 k∈N V p (2) (1) pλ2 (|γk1 |)I(||ˆ γ k∗ ||2 = 0). +n k=1 After Step 1 and 2, we had obtained the estimator of the B-spline coefficients γ and classify mk (·) k = 1, . . . , p into V, C or 0. The next step is to select loading parameter β ˆ (2) . given γ ˆ via penalized regression Step 3 : We obtained β ˆ = arg min Q3 (β|λ3 , γ ˆ (2) ) β ||β ||2 =1 where n Q3 (β|λ3 ˆ (2) ) ,γ = p (2) (Yi − [ˆ γk1 i=1 k=0 q (2) ¯ + B(Xβ)ˆ γ k∗ ]Gk )2 pλ3 (|βd |). +n d=2 ˆ (0) by β, ˆ and iterate step 1 to 3 until convergence. The algorithms used Then we replaced β in each step would be discussed in Appendices. 2.2.4 Selection of tuning parameters We proposed the following three step tuning parameters selection process based on Bayesian Information Criterion (BIC) (Schwarz (1978)). 18 Step 1: We took λ1 as the minimizer of p n log(n) (λ ) (λ ) ˆ (0) )ˆ ¯ ∗ dfλ1 [ˆ γk11 + B(X β γ k∗1 ]Gk )2 + n (Yi − BIC(λ1 ) = log i=1 k=0 (λ ) (λ1 ) ˆ (0) ) defined above; β ˆ (0) is ˆ k∗ }k=0,1,...,p are the minimizers of Q1 (γ|λ1 , β where {ˆ γk11 , γ chosen as the estimator from previous iteration; and dfλ1 is defined as the total number of non zero coefficients if λ1 is the penalized parameter. Step 2: We took λ2 as the minimizer of p n log(n) (λ ) (λ ) ˆ (0) )ˆ ¯ ∗ dfλ2 [ˆ γk12 + B(X β γ k∗2 ]Gk )2 + n (Yi − BIC(λ2 ) = log i=1 k=0 (λ ) (λ2 ) ˆ (0) ) defined above; β ˆ (0) is ˆ k∗ }k=0,1,...,p are the minimizers of Q2 (γ|λ2 , β where {ˆ γk12 , γ chosen as the estimator from previous iteration; and dfλ2 is defined as the total number of non zero coefficients if λ2 is the penalized parameter. Step 3: We took λ3 as the minimizer of p n (Yi − BIC(λ3 ) = log i=1 logn (λ ) (λ ) ˆ (λ3 ) )T γ ¯ ˆ k∗2 ]Gk )2 + [ˆ γk12 + B(X β ∗ dfλ3 n k=0 (λ ) (λ2 ) ˆ (λ3 ) is the minimizer of Q3 (β|λ3 , γ ˆ (2) ) defined above; {ˆ ˆ k∗ }k=0,1,...,p are where β γk12 , γ minimizer of the B-spline coefficient from Step 2; and dfλ3 is defined as the total number of non zero β if λ3 is the penalized parameter. We searched the optimal of λ1 , λ2 , λ3 over a grid of 100 exponentially decreasing values with the minimum to be 1E-3, and the maximum of λ1 , λ2 , λ3 were set to be the minimum value such that all of the penalized estimators are 0. 19 2.2.5 Selection of the order h and the number of internal knots K Since h is the order of the B-spline basis function, higher degree corresponds to more complicated interactions and is less interpretable in practice. For instance, h = 2 implies quadratic splines while h = 3 implies cubic splines. Hence, we searched optimal h over 1 the set h ∈ {2, 3, 4}. As for the selection of K, since only when K = Op (n 2r+1 ) (n is the number of samples and r is the smoothness of our nonparametric function mk (·) and r > 2), our selection approach possesses oracle properties. So we can search optimal K in 1 the neighborhood of n 2r+1 , which is denoted by K . In our simulation, K = {2, 3, 4, 5}. In theory, we can select the optimal order and the interior knots for all the nonparametric functions. However, this is practically infeasible due to the large search space and the computational cost. Thus, we assumed that all the nonparametric functions share common h and K, and fit the following intercept only model to select the optimal h and K, Y = m0 (Xβ) + . (2.6) We searched the optimal K and h over a grid K ∈ K and h ∈ {2, 3, 4}. The optimal K and h is defined as the K, h that minimize log n ˆ i=1 (Yi − Yi )2 + log(n) n (K + h + 1), where Yˆi is the prediction of the ith subject under model (2.6). 2.2.6 Selection of Initial values for β The algorithm described above needs a reasonable initial value for β (denoted by β initial ) to start the iteration. Based on the optimal K and h selected, we fit the intercept only model (2.6) via B-spline approximation and Newton-Raplson algorithm. β initial was set to be the estimator for β in (2.6). 20 2.3 Theoretical Properties Let β 0 and m0k (·), k = 0, 1, . . . , p be the true value of β and mk (·), respectively, and denote γ 0 be the true value of the B-spline coefficient γ. Without loss of generality, we assumed βl0 = 0 for l = 1, . . . s, βl0 = 0 for l = s + 1, . . . q; m0k (·) is varying for k = 0, 1, . . . , v, m0k (·) is non-zero constant for k = v + 1, . . . , c and m0k (·) is zero for k = c + 1, . . . , p. The following theorems established the consistency of the penalized least square estimators. Theorem 2.3.1. Assume the regulatory conditions (A1)-(A7) in Appendices hold and the number of knots K = Op (n1/(2r+1) ). Then ˆ − β 0 || = Op (n−r/(2r+1) + an ); (i) ||β (ii) ||m ˆ k (·) − m0k (·)|| = Op (n−r/(2r+1) + an ), k = 1, . . . , q where 0 |), p (|β 0 |), γ 0 = 0, γ 0 = 0, β 0 = 0} an = max{pλ (||γ 0k∗ ||2 ), pλ (|γj1 j1 λ3 l k∗ l 1 2 k,j,l and k, j = 1, . . . , p, l = 2, . . . , q, and r is defined in Appendices. Theorem 2.3.2. Assume the regularity conditions (A1)-(A7) the Appendices hold and the number of knots K = Op (n1/(2r+1) ). Let λmax = max{λ1 , λ2 , λ3 }, λmin = min{λ1 , λ2 , λ3 }. Suppose λmax → 0 and nr/(2r+1) λmin → ∞ as n → ∞.Then with probability approaching ˆ and m 1, β ˆ k (·) must satisfy: (i) βˆj = 0 for j = s + 1, . . . , q (ii) m ˆ k (·) = ck for k = v + 1, . . . , c where ck is some non-zero constant 21 (iii) m ˆ k (·) = 0 for k = c + 1, . . . , p Theorem 2.3.1 and 2.3.2 show that our proposed variable selection approach is consistent and possesses oracle property. 2.4 Simulation We conducted extensive simulation to evaluate the performance of our proposed approach. The performance is measured in several ways: (1) classification accuracy for function m(·), denoted as oracle percentage; (2) IMSE of the estimated m(·) function; (3) selection accuracy of β; and (4) estimation accuracy of β(MSE). Denote R as the total number of simulations. Oracle percentage of m(·) is defined as the percentage of correct classification out of a total of R simulations. For example, if mk (·) ∈ V , and out of R simulations, mk (·) is g × 100%. classified as varying for g times, then the oracle percentage of mk (·) is R IMSE of mk (·) is defined as 1 IM SE = R R [ r=1 1 ngrid ngrid (r) (r) ¯ j )ˆ (ˆ γk1 + B(u γ k∗ − mk (uj ))2 ] j=1 where ngrid is the number of points that we want to estimate the MSE of the predicted (r) (r) ˆ k∗ and γˆk1 are the estimators of the B-spline coefficients for the rth simulation function; γ ˆ (r) is the estimator of the loading parameter β using the proposed estimation approach; β for the rth simulation; and uj is taken at the j/ngrid × 100% quantile among the range of ˆ Xβ (r) . For our simulations, ngrid is set to be 100. Oracle percentage of β is defined as the percentage of correct selection of β out of R simulations. For example, if βd = 0 and out of R simulations, βd is selected to be non-zero 22 g for g times, then the oracle percentage of βd is R × 100%. 1 MSE of βd is calculated as R R 2 ˆ(r) r=1 (βd − βd ) (r) where βˆd is the estimator for βd in the rth simulation. 2.4.1 Simulation Setting The simulation data was generated according to the following model, p Y = m0 (Xβ) + mk (Xβ)Gk + k=1 where X was generated from a U nif (0, 1) distribution; β = (β1 , β2 , . . . , βq )T ; β1 = β2 = √1 ; 2 and the rest βj s are zeros. We evaluated the performance of the proposed approach with both continuous and discrete predictors G. For continuous G variables, they can be gene expressions. For discrete G variables, they can be SNP variants. In either case, the dimension of G can be large. 2.4.2 The Continuous Cases In the continuous case, the non-parametric functions mk (u) are defined as follows: m0 (u) = 2sin(2πu), m1 (u) = 2cos(πu) + 2 and m2 (u) = sin(2πu) + cos(πu) + 1 are varying functions; m3 (u) = 2 and m4 (u) = 2.5 are non-zero constants; mk (u) = 0 for k = 5, . . . , p are zeros. The number of loading parameters is set as q = 5 and β1 = β2 = √1 , β3 = β4 = β5 = 0. G 2 was generated randomly from a N (0, 1) distribution, ∼ N (0, 1). We ran 1000 simulations (R = 1000) to evaluate the performance of the proposed model under p = 50, 100. Table 1 demonstrates the selection and estimation accuracy for non-parametric functions with continuous G. The left and right penal corresponds to the case where p = 50 and 23 100 respectively. For all the cases, the selection accuracy (oracle %) is very closed to 100%. IMSE for varying functions (m0 (·), m1 (·) and m2 (·)) is in the order of -2, and the IMSE for constant functions (m3 (·) and m4 (·)) is in the order of -3. All of the model IMSE and oracle IMSE are in the same order. Overall, the differences in oracle percentage and IMSE are negligible between the case p = 50 and the case p = 100. These observations suggest that our proposed variable selection approach possesses reasonable selection and prediction accuracy for non-parametric function mk (·). Table 1: Selection and prediction accuracy of mk (·) for continuous G p = 50 p = 100 Oracle % IMSE Model Oracle Oracle % IMSE Model Oracle m0 (·) m1 (·) m2 (·) m3 (·) m4 (·) Zero 100.0% 99.6% 99.9% 100.0% 100.0% 99.7% 3.87E-02 4.27E-02 1.58E-02 2.42E-02 2.33E-02 2.58E-02 2.09E-03 2.11E-03 2.04E-03 2.06E-03 1.94E-05 0 100.0% 99.9% 99.9% 100.0% 100.0% 99.9% 3.77E-02 4.51E-02 1.57E-02 3.14E-02 2.26E-02 2.96E-02 1.90E-03 1.97E-03 2.07E-03 2.12E-03 1.12E-05 0 m0 (·) m1 (·) m2 (·) n = 1000 m3 (·) m4 (·) Zero 100.0% 100.0% 100.0% 100.0% 100.0% 99.8% 3.23E-02 7.17E-03 1.46E-02 1.02E-03 1.09E-03 8.50E-06 100.0% 100.0% 100.0% 100.0% 100.0% 99.9% 3.31E-02 7.07E-03 1.46E-02 9.60E-04 1.06E-03 3.46E-06 n = 500 3.40E-02 1.21E-02 1.59E-02 1.02E-03 1.09E-03 0 3.47E-02 1.17E-02 1.64E-02 9.55E-04 1.07E-03 0 Table 2 presents the selection and prediction accuracy for loading parameter β. The left and the right penal correspond to the case where p = 50 and 100 respectively. It shows that the selection accuracy for all β is reasonably good (> 98% in all cases). For most of the β, the MSE is in the order of -4 or lower, except for β2 , which was -3 for both p = 50 and p = 100 when n = 500. The order of the model estimation for β is at least the same as that of the oracle model if not lower. And we did not observe a difference in performance between 24 the case p = 50 and the case p = 100. These results indicate that our model possesses good selection and prediction accuracy for loading parameters β. Table 2: Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 2 p = 50 p = 100 Oracle % MSE Model Oracle Oracle % MSE Model Oracle β1 β2 β3 β4 β5 100.0% 100.0% 98.1% 98.8% 98.6% 1.15E-04 1.07E-04 8.04E-03 4.12E-03 9.98E-05 0 2.99E-05 0 1.00E-04 0 100.0% 100.0% 98.2% 99.1% 98.5% 1.17E-04 1.30E-04 2.26E-03 7.62E-03 3.64E-05 0 3.13E-05 0 7.73E-05 0 β1 β2 n = 1000 β3 β4 β5 100.0% 100.0% 98.9% 99.4% 99.1% 5.30E-05 5.52E-05 5.34E-05 1.86E-03 9.36E-06 0 6.30E-06 0 7.17E-06 0 100.0% 100.0% 98.8% 99.5% 99.0% 5.00E-05 5.49E-05 5.04E-05 1.79E-03 1.16E-05 0 5.49E-06 0 6.93E-06 0 n = 500 2.4.3 For discrete G We further evaluated how our proposed model performs with the discrete G. In this simulation, each G variable was simulated from a multinomial distributions with minor allele frequency (MAF) denoted as Pa . The G variable took values 0, 1, and 2 corresponding to the genotype aa, Aa, and AA where a is the minor allele. The frequencies corresponding to the three genotypes (aa, Aa, and AA) are Pa2 , 2Pa (1 − Pa ) and (1 − Pa )2 . We varied the MAF for different G variables to evaluate the impact of the MAF on the selection performance. Specifically, for k = 1, 2, 7, Pa = 0.5; for k = 3, 4, 8, Pa = 0.3; for k = 5, 6, 9, Pa = 0.1, and for k = 10, 11, . . . , p, Pa ∼ U nif (0.05, 0.5). For the non-parametric functions, we set m0 (u) = 2sin(2πu), m1 (u) = m3 (u) = m5 (u) = 2cos(πu) + 2, m2 (u) = m4 (u) = m6 (u) = sin(2πu) + cos(πu) + 1; m7 (u) = m8 (u) = m9 (u) = 2; and mk (u) = 0 for 25 k = 10, 11, . . . , p. Under the setup, we have both varying and constant effect with different minor allele frequencies. For the zero effects, the MAF for Gk ranged uniformly from 0.05 to 0.5. X was generated from U nif (0, 1) and was generated from N (0, 1). Finally, Y was generated according to model (2.1). We evaluated the performance of our proposed model via 1000 simulations under p = 50, 100 and n = 500, 1000. Table 3 and figure 1 present the selection and estimation accuracy of non-parametric function mk (·) for discrete G. We observed that the oracle percentage is very high (> 99%) for all cases, indicating our proposed model can correctly select the coefficient functions with high accuracy. The IMSE for varying functions was of the order −1 or −2, while the IMSE for constant functions was of the order −2 or−3. Moreover, the IMSE of the proposed model was in the same order of the IMSE of the oracle model. With the decrease of minor allele frequency Pa of Gk (from 0.5 to 0.1), we observed an increase in both model IMSE and oracle IMSE. This is consistent with our expectation since SNP with lower MAF provides less information. We did not observe a difference in oracle percentage and IMSE between the case p = 50 and p = 100. This suggests that our model performs reasonably well in both cases. The case n = 1000 performed slightly better than the case n = 500, and it is consistent with the asymptotic property of the model. Overall, the proposed model performs reasonably well in the selection and estimation of non-parametric functions. 26 Table 3: Selection and prediction accuracy of mk (·) for discrete G. Oracle % p = 50 IMSE Model Oracle Oracle % p = 100 IMSE Model Oracle m0 (·) m1 (·) m2 (·) m3 (·) m4 (·) m5 (·) m6 (·) m7 (·) m8 (·) m9 (·) Zero 100.0% 99.7% 99.7% 99.7% 99.6% 99.6% 99.6% 100.0% 100.0% 100.0% 99.6% 4.67E-02 3.71E-02 4.56E-02 4.39E-02 5.13E-02 1.15E-01 1.29E-01 4.53E-03 5.34E-03 1.27E-02 1.82E-04 4.34E-02 3.54E-02 4.26E-02 4.19E-02 4.93E-02 1.53E-01 1.24E-01 4.50E-03 5.30E-03 1.26E-02 0 100.0% 99.6% 99.4% 99.4% 99.4% 99.2% 99.3% 99.9% 99.8% 99.9% 99.6% 5.06E-02 5.22E-02 4.02E-02 4.53E-02 4.92E-02 4.77E-02 4.95E-02 5.44E-02 5.68E-02 5.45E-02 1.33E-01 2.02E-01 1.33E-01 1.30E-01 4.74E-03 4.56E-03 6.26E-03 6.03E-03 1.21E-02 1.24E-02 1.28E-04 0 m0 (·) m1 (·) m2 (·) m3 (·) m4 (·) n = 1000 m5 (·) m6 (·) m7 (·) m8 (·) m9 (·) Zero 100.0% 99.9% 99.9% 99.9% 99.9% 99.9% 99.9% 100.0% 100.0% 100.0% 99.8% 3.11E-02 1.44E-02 2.16E-02 1.69E-02 2.35E-02 4.10E-02 4.62E-02 2.29E-03 2.67E-03 5.61E-03 3.10E-05 3.31E-02 1.96E-02 2.37E-02 2.19E-02 2.50E-02 4.55E-02 5.01E-02 2.34E-03 2.66E-03 5.69E-03 0 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 99.9% 3.11E-02 1.44E-02 2.13E-02 1.74E-02 2.40E-02 4.25E-02 4.65E-02 1.89E-03 2.62E-03 6.03E-03 1.15E-05 n = 500 27 3.20E-02 1.93E-02 2.34E-02 2.20E-02 2.56E-02 4.74E-02 4.85E-02 1.89E-03 2.63E-03 6.08E-03 0 Model IMSE for varying effect Gk 1.0 Oracle percentage for varying effect Gk Orable IMSE for varying effect Gk 0.15 n=500 p=50 n=500 p=100 n=1000 p=50 n=1000 p=100 0.4 0.5 0.05 0.1 0.3 0.4 Oracle percentage for constant effect Gk Model IMSE for constant effect Gk 0.5 0.1 0.2 0.4 0.5 0.5 n=500 p=50 n=500 p=100 n=1000 p=50 n=1000 p=100 orable IMSE 0.004 0.002 0.000 0.002 0.000 0.3 Minor Allele Freq. 0.4 0.010 0.010 IMSE 0.006 0.004 0.1 0.3 Orable IMSE for constant effect Gk 0.008 0.8 0.6 0.4 0.0 n=500 p=50 n=500 p=100 n=1000 p=50 n=1000 p=100 0.2 Minor Allele Freq. n=500 p=50 n=500 p=100 n=1000 p=50 n=1000 p=100 0.012 1.0 Minor Allele Freq. 0.2 Oracle Percentage 0.2 0.012 0.3 Minor Allele Freq. 0.008 0.2 0.006 0.1 0.00 0.0 n=500 p=50 n=500 p=100 n=1000 p=50 n=1000 p=100 0.00 0.2 0.05 0.10 orable IMSE 0.10 IMSE 0.6 0.4 Oracle Percentage 0.8 0.15 n=500 p=50 n=500 p=100 n=1000 p=50 n=1000 p=100 0.1 0.2 0.3 Minor Allele Freq. 0.4 0.5 0.1 0.2 0.3 0.4 0.5 Minor Allele Freq. Figure 1: Selection and estimation accuracy of mk (·) for discrete G Table 4 presents the selection and estimation result of the loading parameters β. The left and right panel represent the case where p = 50 and p = 100, respectively. We observed that the oracle percentage in all the cases is above 97% and the MSE for the estimation of β is of the order −3 or lower in the proposed and oracle model. With the increase of model dimension p (from 50 to 100), we observed a slight increase model MSE. This is expected since model performance usually decreases as complexity increases. Further, the case where n = 1000 performed slightly better than the case n = 500, and it is consistent with the asymptotic theory of the model. Overall, the proposed variable selection approach 28 can correctly select and estimate the loading parameters with high accuracy. Table 4: Prediction accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 2 p = 50 p = 100 Oracle % MSE Model Oracle Oracle % MSE Model Oracle β1 β2 β3 β4 β5 100.0% 99.8% 97.4% 99.1% 97.3% 2.86E-04 1.04E-04 2.84E-03 1.04E-04 9.38E-05 0 4.08E-05 0 7.04E-05 0 100.0% 99.7% 97.0% 98.6% 98.2% 6.36E-04 2.13E-04 6.89E-03 6.13E-03 2.15E-04 0 2.21E-04 0 1.68E-04 0 β1 β2 n = 1000 β3 β4 β5 100.0% 100.0% 99.3% 99.3% 99.2% 5.45E-05 5.03E-05 1.92E-03 2.09E-03 5.19E-06 0 4.21E-06 0 7.51E-06 0 100.0% 100.0% 99.0% 98.9% 99.5% 5.20E-05 6.03E-05 5.20E-05 1.74E-03 1.10E-05 0 9.69E-06 0 2.64E-06 0 n = 500 To summarize, the proposed method possesses reasonable selection and estimation accuracy for non-parametric functions mk (·) and their loading parameters β in all cases. With the increase of the sample size n, we observed a small increase in oracle percentage for both non-parametric functions mk (·) and their loading parameters β. We also observed a decrease in IMSE of mk (·) and MSE of β. These coincide with the asymptotic theory of our model. 2.5 Real Data Application We demonstrated the utility of the model with a human liver cohort (HLC) data set. The data set is consisted of genotype (SNPs), gene expressions, and phenotypes (activity of several liver enzymes). The data set can be downloaded from www.synapse.org using synapse ID: syn4499. For more details regarding the data set, please refer to Schadt et al. (2008) and 29 Yang et al. (2010). In the HLC data set, the phenotypes are enzyme activity measurements of Cytochrom P450. There are a total of nine P450 enzymes (CYP1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, and 3A4). However, only CYP2E1 passed Shapiro-Wilk normality test (p-value > 0.1) after log transformation. Hence, we focused the analysis on CYP2E1 activity (Y ). For the environmental variable (X), we set X 1 = Age, X 2 = Aldehyde Oxydase, and X 3 = Liver Triglyceride, then transform X i , i = 1, 2, 3 to range [0,1] with X −min (X ) ij j=1,2,...,n ij Xij = max . Where Xij denotes the jth observation of X i , (X j=1,2,...,n ij )−minj=1,2,...,n (Xij ) j = 1, . . . n. In this analysis, we focused on gene expressions and treated them as the G variable. After data cleaning, we had n = 394 (sample size) and N = 19, 172 (number of gene expressions). We implemented N-Glycan the biosynthesis. proposed Based selection on approach the to KEGG pathway pathway hsa00510 database (http://www.genome.jp/kegg/pathway.html), pathway hsa00510 is mapped to 44 gene expressions in our data. Due to model constrain of VMICM, the first loading parameter must be a non-zero positive number (β1 > 0). Hence, we fit the proposed model three times with age, aldehyde oxydase, and liver triglyceride being the first loading covariate. Gene expression P GGT 1B was selected as varying coefficient predictor in all three models. Gene expression B4GALT 3 and B3GN T 3 were selected as constant coefficient predictors in two models (age and aldehyde oxydase being the first loading covariate). 30 Effect of PGGT1B 0.4 0.2 −0.2 0.0 Log of CYP2E1 Activity 7.4 7.3 7.2 7.1 −0.4 6.9 7.0 Log of CYP2E1 Activity 7.5 0.6 7.6 0.8 Environmental Effect 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 Xβ 0.6 0.8 1.0 1.2 Xβ Figure 2: Plot of the varying coefficient effect for gene expression P GGT 1B. Table 5: The estimated loading parameters Aldehyde Oxydase Age Liver Triglyceride 0.879 0.477 0 Figure 2 presents the plot of the marginal environmental effect and the effect of gene P GGT 1B on log of CYP2E1 activity with aldehyde oxydase being the first loading covariate. With the increase of loading index Xβ, marginal environmental effect (m0 (·)) first increased from 6.9 to 7.6, then it fluctuated around 7.5 before decreasing sharply to 7.0. For gene expression P GGT 1B, its effect on the log of CY P 2E1 activity first decreased sharply from 0.8 to -0.2, then it fluctuated around -0.1 for the remainder of the index. This suggests that the G×E effect of gene expression P GGT 1B mainly exists in the lower quantile of aldehyde oxydase and age. With aldehyde oxydase being the first loading covariate, the constant effects of gene expression B4GALT 3 and B3GN T 3 were -0.0618 and -0.0624 respectively. These suggest that the effect of gene expression B4GALT 3 and B3GN T 3 on the log of 31 CY P 2E1 activity does not interact with environmental factors. Table 5 presents the selection result of environmental factors with aldehyde oxydase being the first loading covariate. The model selected aldehyde oxydase and age as significant environmental factors, their estimation were 0.879 and 0.477 respectively. Base on their value, aldehyde oxydase have a stronger effect than age. This result demonstrates the unique ability of the proposed method to capture the non-linear interaction between a mixture of environmental variants and genetic variables. However, further biological investigation is needed to confirm this finding. 2.6 Discussion VMICM is a promising candidate to model non-linear interaction between multiple genes and multiple environments as a whole. It combines multiple exposure variables X into a single index Xβ, hence can reduce model dimension and alleviate the curse of dimensionality. In this paper, we developed a three stage variable selection approach for VMICM model. Our goal was to identify varying, constant and zero effect that interacted with a gene. In the meantime, we also selected important exposure variables. Rather than modeling the G×E effect for each X variable separately, our approach can model the joint effect of the environmental factors (X) as a whole, then identify how different genes (G) interacted with the environmental mixture to affect the phenotype Y . Biologically speaking, our approach is attractive since it offers an alternative strategy to look for G×E interaction, and our model is flexible to detect any non-linear interactions. In addition, we theoretically evaluated the selection consistency of the variable selection. Both simulation and real data analysis demonstrated the utility of the proposed method. 32 In our model setup, the covariates X were assumed to be continuous. This is due to the fact that the index Xβ has to be continuous in order to model the nonlinear function. In real applications, environmental variables can be discrete such as smoking status, gender and ethnicity. To accommodate the presence of discrete factors, the model VMICM could be generalized to partial-linear VMICM: Y = Zα + p k=0 mk (Xβ)Gk + where Z is the discrete covariates of dimension r and α is its effect on the response. Our variable selection approach can be modified slightly to perform selection of non-parametric function and the parametric component simultaneously. More specifically, the objected function (2.5) is modified as p n i=1 p 2 ¯ [γk1 + B(Xβ)γ k∗ ]Gik } + n {Yi − Z i α − Q(β, γ) = k=0 k=1 q p r j=1 pλ3 (|βd |). pλ2 (|γk1 |)I(||γ k∗ ||1 = 0) + n pλz (|αj |) + n +n pλ1 (||γ k∗ ||1 ) d=2 k=1 And the objective function in step 1 is modified as ˆ (0) ) = Q1 (γ|λz , λ1 , β p n {Yi − Z i α − i=1 ˆ (0) )γ ]Gik }2 ¯ [γk1 + B(X β k∗ k=0 p r pλ1 (||γ k∗ ||2 ) + n +n pλz (|αj |). j=1 k=1 In this paper, we discussed the variable selection approach for VMICM with a continuous response variable. However, many response variables are measured on a binary scale such as the presence of a certain disease in humans. It is natural to extend the current selection approach to a generalized VIMCM framework, which will be investigated in chapter 3. 33 In our model formulation, we assumed different index coefficients share the common loading parameters, i.e., β 0 = β 1 = · · · = β p = β. From a practical point of view, allowing different loading parameters makes perfect sense. However, such a model imposes theoretical challenges when evaluating the theoretical properties such as the selection consistency. This is because that the loading coefficients for the kth index coefficients are not identifiable when mk (u) ∈ / V . When a coefficient function does not vary, β k does not exists. Thus, the selection consistency for β k does not exists. For this reason, we imposed the same loading parameters for all the index coefficient functions. In addition to the application to G×E study, our model has many applications in other fields where a potential nonlinear varying effect exists. Our method enriches the catalog of variable selection. It contributes to the methodology development of variable selection in theory and to the application of G×E study in practice. 34 Chapter 3 Variable Selection for Generalized VMICM 3.1 Introduction There has been a growing interest in identifying gene-environment (G×E) interaction in scientific literatures. Ottman(1996) defined gene-environment interaction as “a different effect of a genotype on disease risk in persons with different environmental exposures”. Traditionally, G×E interactions were investigated based on a single environmental factor, since introduction of several environmental factors will increase model dimension exponentially, which could potentially lead to biased estimation and large standard error (curse of dimensionality). However, more and more epidemiological studies revealed that disease risk can be modified by simultaneously exposure to several environmental factors (Carpenter et al. (2002), Sexton and Hattis (2007)). Further, little was known about how multiple environmental factors as a whole could interact with genetic factors to affect the response variable. Any investigation in this area could shed some light into the disease etiology and offer prospects for future disease prevention. In chapter 2, we proposed to use varying multi-index coefficient model of the form Y = p k=0 mk (Xβ)Gk + to model the non-linear gene-environmental interaction. However, 35 VMICM can only model continuous phenotype. In this chapter, we generalized VMICM to model data with binary response variables. Consider the generalized varying multi-index coefficient model (gVMICM) P (Y = 1|X, G) = log P (Y = 0|X, G) p mk (Xβ)Gk (3.1) k=0 where mk (·), k = 0, 1, . . . , p represents the gene effect of Gk , and is modeled as a smoothed non-linear non-parametric function modulated by the loading index Xβ. Its unique structure allows us to capture the interaction effect between genetic factors and a mixture of environmental factors. Further, model (3.1) is very flexible to cover a wide range of models. For instance, if q = 1 and β = 1 then it reduces to a generalized varying coefficient mode; if p = 1 and G = 1 then it becomes a standard generalized single index model. Variable selection has been a popular topic in modern statistics literature. In the past, people often implemented hypothesis testing, forward/backward selection combined with AIC or BIC. Nevertheless, with the rise of big data, we were able to collect hundreds of thousands of variables at the same time. Traditional methods are no longer feasible because of exponentially increasing computation time and unstable estimation due to increasing collinearity. Recently, variable selection via penalized regression has become fairly popular. The idea is to add a penalty term to the optimization function. With different penalty functions, the penalized estimator could possess different properties. Fan and Li (2001) proposed three important criteria for penalized estimator: sparsity, unbiasedness and continuity. They also characterized oracle property meaning the model performs as well as if the true sub-model is known in advance. It has become the standard for new penalized estimator. A few examples of the penalized function would be Bridge regression 36 (Frank and Friedman (1993)), least absolute shrinkage and selection operator (LASSO) (Tibshirani (1996)), its extension adaptive LASSO (Zou (2006)), smoothly clipped absolute deviation (SCAD) (Fan and Li (2001)), and minimax concave penalty (MCP) (Zhang (2010)). Although LASSO enjoys simple formulation and efficient algorithm (LARS), it does not possess oracle property. On the other side, Adaptive LASSO, SCAD, MCP all posses oracle property and we decided to implement MCP in our model. Due to the complexity of model (3.1), variable selection presents unique challenge, specifically, the nonlinear and non-parametric structure of function mk (·) and its unknown loading parameter β. In chapter 2, we proposed a three stage variable selection approach for VMICM. The model classified the non-parametric gene effect into varying, constant and zero. It also selected non-zero loading parameters. In this chapter, we extended such approach to gVMICM. In stead of penalizing squared error loss for VMICM, we decided to implement the penalized log-likelihood method in this model. The rest of this chapter is organized as follows: section 3.2 introduced the proposed variable selection method, formulation of the penalized log-likelihood, three step iterative optimization approach, and selection of tuning parameters. In section 3.3, we discussed the asymptotic properties of the proposed method. And we evaluated the performance of our method via several simulations in section 3.4. Utility of our model was demonstrated with a Type II diabetes data set in section 3.5, followed by discussion and future work. 37 3.2 3.2.1 Variable Selection for gVMICM Model Setup Consider the following gVMICM model P (Y = 1|X, G) )= log( P (Y = 0|X, G) p mk (Xβ)Gk k=0 where Y n×1 = (Y1 , Y2 , . . . , Yn )T is a binary response variable measured over n subjects; mk (·), k = 0, 1, . . . , p are p+1 unknown smooth non-linear functions; X = (X 1 , . . . , X q ) is a matrix of dimension n×q representing continuous environmental variables; β = (β1 , . . . , βq )T is the loading parameter of dimension q; Gn×(p+1) = (G0 , G1 , . . . , Gp ), G0 = (1, . . . , 1)T and Gk = (G1k , G2k , . . . , Gnk )T is the k − th genetic factor. For k = 0, mk (Xβ) is the effect of Gk on the response on a log odda scale; m0 (Xβ) is the intercept term measuring the marginal environmental effect. For ease of notation, we denoted µ = p k=0 mk (Xβ)Gk . Thus, model (3.1) can be rewritten as log( 3.2.2 P (Y = 1|X, G) ) = µ. P (Y = 0|X, G) (3.2) Estimation Method Our goal was to select and estimate unknown functions {mk (·)}k=0,1,...,p and their unknown loading parameter β = (β1 , . . . , βq )T . For the sake of identifiability, we assumed ||β||2 = 1 and β1 > 0, and mk (·) cannot has the form of mj (u) = αT uβ T u + γ T u + c. We first approximated the non-parametric function mk (u) with B-spline basis functions. Without loss of generality, we assumed u ∈ [0, 1], and denoted K to be the number of internal 38 knots and h to be the degree of B-spline basis function. So h = 1 represents linear splines; h = 2 represents quadratic splines; and h = 3 represents cubic splines. From standard B-spline theory, we denoted u1 , u2 , ..., uK to be the interior knots satisfying 0 = u0 =< u1 < u2 < .... < uK < uK+1 = 1. Let Int be left closed, right opened interval [ut−1 , ut ) for 1 ≤ t ≤ K, and InK+1 be closed interval [uK , uK+1 ]. Let F to be a collection of functions f defined on [0, 1] satisfying: (i) the restriction of f to Int is a polynomial of degree h or less for 1 ≤ j ≤ K + 1; (ii) f is h − 1 times continuously differentiable on [0, 1]. Let L = K + h + 1, then by Schumaker (1981)(Schumaker (2007)), we normalized B-spline basis function B(u) = (B 1 (u), B 2 (u), . . . , B L (u)) for F . And there exists a linear transformation ¯ matrix Π, such that ΠB(u) = (1, B(u)) = (1, B 2 (u), B 3 (u), . . . , B L (u)) = B(u) where ¯ each component of B(u) is a function of u. Hence, for 0 ≤ k ≤ p, we approximated mk (u) by ¯ mk (u) ≈ (1, B2 (u), · · · , BL (u)) ∗ (γk1 , γk2 , · · · , γkL )T = B(u)γ k = γk1 + B(u)γ k∗ (3.3) where γ k∗ = (γk2 , γk3 , . . . , γkL )T and γ k = (γk1 , γ Tk∗ )T . We approximated µ with µB : p µB = ¯ [γk1 + B(Xβ)γ k∗ ]Gk (3.4) k=0 where γk1 , k ≥ 1 represents the main genetic effect for the kth variant and γ k∗ , k ≥ 1 represents the G×E interaction effect between the kth variant and a mixture of the environmental variables. With this new representation, the selection of the non-parametric function mk (·) was transformed to the selection of its B-splines coefficients γ = {γk1 , γ k∗ }k=0,1,...,p . The transformation Π allows us to separate constant effect of Gk from its varying effect. More 39 specifically, (1) if ||γ k∗ ||2 = 0, then there exists G×E effect; (2) if ||γ k∗ ||2 = 0 and |γk1 | = 0, then there only exists main genetic effect; and (3) if ||γ k∗ ||2 = 0 and |γk1 | = 0, then Gk has no effect at all. This transformation is the key step to separate different genetic effects. Given the binary response, the log-likelihood function is defined as: n B µ (Yi µB i − log(1 + e i )) l(γ, β) = i=1 B where µB i is the ith subject of µ . We defined the following penalized log-likelihood objective function, n (Yi µB i M (β, γ) = B − log(1 + eµi )) − n i=1 p pλ1 (||γ k∗ ||2 ) k=1 p −n (3.5) q pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) − n k=1 pλ3 (|βd |) d=2 where pλ1 (·), pλ2 (·), pλ3 (·) are the penalty functions for γ k∗ , γk1 and β, respectively; I(·) is an indicator function. The construction of the penalty functions in (3.5) implies that there is a natural order when selecting the effect of Gk . First, the model selects Gk with varying effect. If Gk does not have a varying effect, the model penalizes the constant effects of Gk . We did not penalize β1 due to the model constraint. For the penalty function, we adopted the MCP penalty function proposed by Zhang (Zhang (2010)), i.e., p(x, λ) = λ 0x (1 − τsλ )+ ds with regularization parameters τ > 0 and λ > 0. 3.2.3 Computational algorithm To optimize function (3.5), we followed the idea proposed in chapter 2 and adopted a 3 step interactive approach. 40 ˆ (0) , we obtained the 1st step estimation Step 1 : Given an initial value for β, denoted as β (1) (1)T ˆ (1) = {ˆ ˆ k∗ }Tk=0,1,...,p by following a group penalized regression, i.e., of γ, denoted as γ γk1 , γ ˆ (0) ) ˆ (1) = arg max M1 (γ|λ1 , β γ γ where ˆ (0) ) = M1 (γ|λ1 , β n (Yi µB i p B − log(1 + eµi )) − n i=1 pλ1 (||γ k∗ ||2 ). k=1 Step 1 classified mk (·), k = 1, . . . , p into two categories: varying(V) or non-varying(NV). (1) (1) That is, mk (·) ∈ V if ||ˆ γ k∗ ||2 > 0 and mk (·) ∈ N V if ||ˆ γ k∗ ||2 = 0. (1) ˆ k∗ = 0. We further selected Step 2 : In this step, our interest was to select γk1 given γ constant coefficient from the non-varying coefficient obtained in step 1. We penalized γk1 (1) only when ||ˆ γ k∗ ||2 = 0, k = 1, 2, . . . , p, and no penalty was applied to γ01 . We excluded (1) (2) ˆ k∗ = 0. We obtained our step 2 estimator γ k∗ from the model if ||ˆ γ k∗ ||2 = 0 in step 1, i.e. γ (2) (2) (2) ˆ (2) = {(ˆ ˆ k∗ )k∈V , (ˆ γ γk1 , γ γk1 )k∈N V } via penalized regression ˆ (2) = arg max M2 (γ|λ2 , β (0) , γ ˆ (1) ) γ γ where n M2 (γ|λ2 ˆ (1) ) , β (0) , γ = (2) (Yi µB i B (2) − log(1 + eµi )) − n i=1 µB i (2) is the ith element of µB p (2) (1) pλ2 (|γk1 |)I(||ˆ γ k∗ ||2 = 0). k=1 (2) with µB (2) = (2) k∈V [γk1 (0) )γ (2) ]G + ¯ + B(Xβ k k∗ (2) k∈N V γk1 Gk . ˆ (0) After step 2, we obtained our estimators of the B-splines coefficients γ for given β 41 and classified mk (·) k = 1, . . . , p into varying, constant or zero effects. The next step is to ˆ (2) . update and select the loading parameter β given γ ˆ via penalized regression Step 3 : We obtained β ˆ = arg max M3 (β|λ3 , γ ˆ (2) ) β ||β ||2 =1 where n M3 (β|λ3 ˆ (2) ) ,γ = (3) (Yi µB i B (3) − log(1 + eµi )) − n i=1 µB i (3) is the ith element of µB (3) q pλ3 (|βd |). d=2 with µB (3) = (2) p γk1 k=0 [ˆ (2) ¯ + B(Xβ)ˆ γ k∗ ]Gk . ˆ we set β ˆ (0) = β, ˆ then iterate step 1 through 3 until Once we have the updated β, convergence. Remark: For step 1 and 2, we implemented the block coordinate descent algorithm for group penalty. For step 3, we developed our algorithm based on the idea of the local quadratic approximation (LQA) proposed by Fan and Li(2001). For more detail, please refer to the Appendices. Next we discuss details about selecting the tuning parameters λ1 , λ2 , λ3 , order h and the number of interior knots K for the B-spline approximation, as well as a reasonable initial value for β. 3.2.4 Selection of Parameters 3.2.4.1 Selection of tuning parameters λ1 , λ2 , λ3 We proposed to use Bayesian Information Criterion (BIC) (Schwarz (1978)) to select the tuning parameters. 42 Step 1: We selected λ1 as the minimizer of (1) 1 ˆ BIC(λ1 ) = −2l(ˆ γλ , β (0) ) + log(n) ∗ dfλ1 (1) ˆ (0) ) defined above; β ˆ (0) is chosen as the estimator ˆ λ is the minimizer of M1 (γ|λ1 , β where γ 1 from the previous iteration; and dfλ1 is defined as the total number of non-zero coefficients if λ1 is the penalized parameter. Step 2: We selected λ2 as the minimizer of (2) 2 ˆ BIC(λ2 ) = −2l(ˆ γλ , β (0) ) + log(n) ∗ dfλ2 (2) ˆ (0) is chosen as the ˆ λ is the minimizer of M2 (γ|λ2 , β (0) , γ ˆ (1) ) defined above; β where γ 2 estimator from the previous iteration; and dfλ2 is defined as the total number of non-zero coefficients if λ2 is the penalized parameter. Step 3: We selected λ3 as the minimizer of ˆ ) + log(n) ∗ dfλ BIC(λ3 ) = −2l(ˆ γ (2) , β λ3 3 ˆ λ is the minimizer of M3 (β|λ3 , γ ˆ (2) ) defined above; γ ˆ (2) is the minimizer of the where β 3 B-spline coefficient from step 2; and dfλ3 is defined as the total number of non-zero β if λ3 is the penalized parameter. The penalized parameter λ1 , λ2 , λ3 were chosen over a grid of exponentially decreasing values with the minimum to be 1E-3. The maximum of λ1 , λ2 , λ3 were set to be the minimum values such that all of the penalized estimators are zeros. The number of grid to be searched was set as 100. 43 3.2.4.2 Selection of order h and number of interior knots K In theory, it might be appealing to allow different K and h for different non-parametric functions mk (·). In practice, it would be computationally infeasible. For computational purpose, we let all mk (·) share the same K and h. Since h is the order of the B-spline basis function, higher degree implies more complicated interactions between environmental factors and genetic predictors, thus, more difficult to interpret. For this reason, we searched 1 optimal h over the set h ∈ {2, 3, 4}. For the interior knots K, only when K = Op (n 2r+1 ) (n is the sample size and r is the smoothness of the nonparametric function mk (·) and r > 2), our selection approach possesses oracle properties. Thus, we searched optimal K in the 1 neighborhood of n 2r+1 , denoted by K . In our simulation, we chosed K = {2, 3, 4, 5}. The knots K and order h were then selected by fitting the following intercept only model with the B-spline approximation and Newton-Raphson algorithm. log( P (Y = 1|X, G) ) = m0 (Xβ). P (Y = 0|X, G) (3.6) ˆ and ˆ 0∗ ) and the loading parameters as β, Denote the estimated spline coefficients as (ˆ γ01 , γ ˆ = γˆ01 + B(X ˆ γ 0∗ . We searched the optimal K and h over a grid K ∈ K ¯ β)ˆ let m ˆ 0 (X β) and h ∈ {2, 3, 4}. Optimal K and h are defined as the K and h that minimized log Y ∗ ˆ) log(n) ˆ 0 (X β ˆ − log(1 + em m ˆ 0 (X β) ) + n (K + h + 1). 3.2.5 Choosing the initial values To start the iterative optimization process described above, a reasonably good initial value of β, β initial is essential. In many literature, β initial is set to be (1, 0, . . . , 0)T or ( √1q , . . . , √1q )T . However, neither works well in our simulations. Hence, we set the initial value of β as the 44 byproduct of selecting the optimal K and h by fitting the intercept only model in (3.6). 3.3 Theoretical Properties Let β 0 and m0k (·), k = 0, 1, . . . , p be the true value of β and mk (·), respectively. Denote γ 0 to be the true value of the B-spline coefficients γ. Without loss of generality, we assumed βl0 = 0 for l = 1, . . . s, βl0 = 0 for l = s + 1, . . . q; m0k (·) has varying coefficients for k = 0, 1, . . . , v, m0k (·) has non-zero constant coefficients for k = v + 1, . . . , c and m0k (·) is zero for k = c + 1, . . . , p. The following theorems give the consistency of the penalized log-likelihood estimators. Theorem 3.3.1. Assume the regulatory conditions (A1)-(A7) in Appendices hold and the number of knots K = Op (n1/(2r+1) ). Then ˆ − β 0 || = Op (n−r/(2r+1) + an ); (i) ||β (ii) ||m ˆ k (·) − m0k (·)|| = Op (n−r/(2r+1) + an ), k = 1, . . . , q where an = max{pλ (||γ 0k∗ ||2 ), pλ (|γ 0j1 |), pλ (|βl0 |), γ 0k∗ = 0, γj1 = 0, βl0 = 0}. 1 2 3 k,j,l k, j = 1, . . . , p, l = 2, . . . , q. r is defined in condition (A2) in Appendices. pλ (·) denotes the first order derivative of the penalty function pλ (·). Theorem 3.3.2. Assume the regularity conditions(A1)-(A7) in Appendices hold and the number of knots K = Op (n1/(2r+1) ). Let λmax = max{λ1 , λ2 , λ3 }, λmin = min{λ1 , λ2 , λ3 }. 45 Suppose λmax → 0 and nr/(2r+1) λmin → ∞ when n → ∞.Then with probability approaching ˆ and m 1, β ˆ k (·) must satisfy: (i) βˆj = 0 for j = s + 1, . . . , q (ii) m ˆ k (·) = ck for k = v + 1, . . . , c where ck is some non-zero constant (iii) m ˆ k (·) = 0 for k = c + 1, . . . , p Theorem 3.3.1 and 3.3.2 show that our penalized likelihood estimators are consistent and possess oracle properties. 3.4 Simulation We evaluated the finite sample performance of our model via simulations. Its performance was evaluated in several ways: (1) classification accuracy of the m(·) denoted as oracle percentage; (2) IMSE of the estimated m-functions; (3) selection accuracy of β; and (4) estimation accuracy of β (MSE). For all cases, we ran a total of R simulations. The oracle percentage of m(·) is defined as the percentage of correct classification for varying, constant and zero effects. For instance, if mk (·) is a varying function and mk (·) is g classified as varying for g times, then the oracle percentage of mk (·) is calculated as R ×100%. IMSE of mk (·) is defined as 1 R R [ r=1 1 ngrid ngrid (r) (r) ¯ j )ˆ (ˆ γk1 + B(u γ k∗ − mk (uj ))2 ] j=1 (r) (r) ˆ k∗ and γˆk1 are the estimators of the B-spline where ngrid is the number of grid points; γ ˆ (r) is the coefficients for the rth simulation using the proposed estimation approach; β estimator of the loading parameter β for the rth simulation; and uj is taken at the 46 ˆ j/ngrid × 100% quantile among the range of X β (r) . In the simulations, ngrid was set to be 100. The oracle percentage of β is defined as the percentage of correct selection of β out of R simulations. For example, if βd = 0, and βd is selected as non-zero for g times, then the oracle g 1 × 100%. MSE of βd is calculated as R percentage of βd is calculated as R R 2 ˆ(r) r=1 (βd − βd ) , (r) where βˆd is the estimator for βd in the rth simulation. The simulation data were generated according to model (3.1). The index matrix X was generated from a U nif (0, 1) distribution. For the loading parameter β, β1 = β2 = √1 and 2 the rest βj s are zeros. We evaluated the performance of the proposed approach with both continuous and discrete predictors G. The continuous G can be gene expressions and the discrete G can be SNP variants. 3.4.1 For continuous G In the continuous case, the non-parametric functions mk (u) are defined as follows: m0 (u) = 2sin(2πu), m1 (u) = 2cos(πu) + 2 and m2 (u) = sin(2πu) + cos(πu) + 1 are varying coefficient functions. m3 (u) = 2 and m4 (u) = 2.5 are non-zero constant coefficients. mk (u) = 0 for k = 5, . . . , p are zeros. G ∼ N (0, 1) . We conducted 1000 simulations (R = 1000) to evaluate the performance of the proposed model under p = 50, 100, q = 5 and n = 1000, 2000. Table 6 demonstrates the selection and estimation accuracy for mk (·) with continuous predictors. The left and the right penal corresponds to the case where p = 50 and 100, respectively. The upper and lower panel corresponds to the case where n = 1000 and 2000, respectively. In all cases, the selection accuracy was closed to 100% for varying, constant and zero effect coefficients. The IMSE of the proposed model was in the order of −1 or −2 47 for varying and constant effect predictors. With the increase of model dimension p (from 50 to 100), we observed a small increase in model IMSE. With the increase of the sample size n (from 1000 to 2000), there were decreases in both model IMSE and oracle IMSE, which is consistent with the asymptotic property of the proposed model. These suggest that the proposed variable selection approach performs reasonably well in selection and estimation accuracy for the non-parametric functions mk (·). Table 6: Selection and prediction accuracy of mk (·) for continuous G p = 50 Oracle % p = 100 IMSE Model Oracle Oracle % IMSE Model Oracle n = 1000 m0 (.) m1 (.) m2 (.) m3 (.) m4 (.) Zero 100.0% 99.2% 99.4% 100.0% 99.9% 99.1% 1.50E-01 1.44E-01 1.34E-01 3.95E-02 5.58E-02 1.03E-03 1.47E-01 2.14E-01 1.61E-01 3.85E-02 5.53E-02 0 100.0% 99.7% 99.7% 100.0% 100.0% 99.0% 1.48E-01 1.48E-01 1.37E-01 4.19E-02 5.93E-02 1.16E-03 1.31E-01 1.66E-01 1.43E-01 3.33E-02 4.86E-02 0 n = 2000 m0 (.) m1 (.) m2 (.) m3 (.) m4 (.) Zero 100.0% 100.0% 100.0% 100.0% 100.0% 99.4% 6.85E-02 5.17E-02 5.46E-02 1.40E-02 1.79E-02 3.33E-04 6.88E-02 6.00E-02 5.99E-02 1.46E-02 1.93E-02 0 100.0% 100.0% 100.0% 100.0% 100.0% 99.4% 7.00E-02 5.43E-02 5.40E-02 1.46E-02 1.89E-02 3.14E-04 7.05E-02 6.34E-02 5.88E-02 1.48E-02 1.87E-02 0 Table 7 presents the selection and prediction accuracy for loading parameter β. The left and right penal corresponds to the case where p = 50 and 100, respectively. The upper and lower panel corresponds to the case where n = 1000 and 2000, respectively. For all cases, the selection accuracy for non-zero loading parameters (β1 , β2 ) was close to 100%. Their MSE were in order of −2 to −4. For zero loading parameters (β3 , β4 , β5 ), the selection accuracy for the case n = 1000 was around 97%. When the sample size increases to n = 2000, their oracle percentages increased to 99%. Their MSE were in the order of −4 to −5. These 48 suggest that the algorithm could not shrink estimators to 0 in around 4% of the cases when n = 1000, which is a common drawback of the LQA algorithm. Table 7: Prediction accuracy of β for continuous G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 2 p = 50 3.4.2 p = 100 MSE Oracle Oracle % Model n = 1000 β1 β2 β3 β4 β5 100.0% 100.0% 97.3% 97.5% 96.8% 7.26E-04 1.36E-02 2.34E-04 1.90E-04 9.01E-04 n = 2000 β1 β2 β3 β4 β5 100.0% 100.0% 99.1% 99.6% 99.3% 2.76E-04 2.71E-04 5.49E-05 2.68E-05 4.58E-05 MSE Oracle Oracle % Model 5.91E-04 1.11E-02 0 0 0 100.0% 100.0% 96.7% 96.7% 96.6% 6.75E-04 3.28E-03 6.22E-04 2.72E-04 2.95E-04 5.75E-04 5.78E-04 0 0 0 2.68E-04 2.66E-04 0 0 0 100.0% 100.0% 99.6% 99.4% 99.4% 2.77E-04 2.76E-04 1.57E-05 3.16E-05 2.22E-05 2.75E-04 2.74E-04 0 0 0 For discrete G For the discrete case, each G was simulated from a multinomial distribution assuming a minor allele frequency (MAF) Pa . G took values as 0, 1, 2 corresponds to aa, Aa, and AA genotype, where a is the minor allele. G was simulated via the following probability distribution function: P (Gij = 0) = Pa2 , P (Gij = 1) = 2 ∗ Pa (1 − Pa ), P (Gij = 2) = (1 − Pa )2 where Gij is the j-th variable of the i-th subject, for i = 1, . . . , n, j = 1, . . . , p. The following lists the choice of the coefficient functions mk (u). 49 Table 8: Setup for mk (u) Function m0 (u) = 2sin(2πu) m1 (u) = 2cos(πu) + 2 m2 (u) = sin(2πu) + cos(πu) + 1 m3 (u) = 2cos(πu) + 2 m4 (u) = sin(2πu) + cos(πu) + 1 m5 (u) = 2cos(πu) + 2 m6 (u) = sin(2πu) + cos(πu) + 1 m7 (u) = 2 m8 (u) = 2 m9 (u) = 2 mk (u) = 0, k > 9 Pa of Gk NA 0.5 0.5 0.3 0.3 0.1 0.1 0.5 0.3 0.1 U nif (0.05, 0.5) In this setup, there were varying and constant coefficient functions corresponding to different MAF. The purpose of setting varying MAFs is to check the selection and estimation performance under different MAFs in G. For zero coefficients, Pa ranged uniformly from 0.05 to 0.5. X was simulated from a U nif (0, 1) distribution. Y was generated according to model (3.1). We evaluated the performance of our proposed model with 1000 simulations under p = 50, 100, n = 1000, 2000 and q = 5. Table 9 and presents the selection and estimation accuracy of non-parametric function mk (·) for discrete G. We observed sample size n and minor allele frequency (Pa ) of Gk were the determining factors in the performance of the proposed model and we present figure 3 to better visualize their impact. With the increase of sample size (from 1000 to 2000), the performance of the model increased. For example, the oracle percentage for m1 (·), . . . , m4 (·) increased from around 80% to 100%. The corresponding IMSE decreased significantly. These are consistent with the asymptotic theory of the proposed model. With the decrease of minor allele frequency for Gk (from 0.5 to 0.1), we observed a decrease in performance, both in terms of oracle percentage and model IMSE. For instance, in the case where n = 50 1000, the oracle percentages of {m1 (·), m2 (·)}(Pa = 0.5), {m3 (·), m4 (·)}(Pa = 0.3), and {m5 (·), m6 (·)}(Pa = 0.1) were around 85%, 80%, and 23% respectively. Their IMSE also increased from 0.4 (Pa = 0.5) to 0.5 (Pa = 0.3), then to 1.3 (Pa = 0.1). We believed this is due to the fact that SNP with lower minor allele frequency provides less information. Overall, the proposed variable selection approach works better with larger sample and with common variant SNPs. Table 9: Selection and estimation accuracy of mk (·) for discrete G Oracle % p = 50 IMSE Model Oracle Oracle % p = 100 IMSE Model Oracle n = 1000 m0 (.) m1 (.) m2 (.) m3 (.) m4 (.) m5 (.) m6 (.) m7 (.) m8 (.) m9 (.) Zero 100.0% 83.4% 87.7% 76.0% 83.3% 23.2% 25.7% 100.0% 99.9% 100.0% 99.0% 1.63E-01 4.29E-01 3.82E-01 5.43E-01 4.37E-01 1.35E+00 1.27E+00 5.09E-02 6.08E-02 1.05E-01 2.74E-03 2.28E-01 5.61E-01 4.38E-01 5.87E-01 4.19E-01 1.05E+00 8.93E-01 6.22E-02 7.50E-02 1.24E-01 0 100.0% 83.1% 87.9% 75.5% 79.6% 22.1% 23.7% 99.9% 99.9% 99.9% 99.2% 1.75E-01 4.44E-01 3.69E-01 5.50E-01 5.11E-01 1.45E+00 1.31E+00 5.05E-02 5.83E-02 1.13E-01 2.15E-03 2.29E-01 6.08E-01 4.50E-01 6.17E-01 4.50E-01 1.18E+00 9.22E-01 5.86E-02 6.75E-02 1.21E-01 0 n = 2000 m0 (.) m1 (.) m2 (.) m3 (.) m4 (.) m5 (.) m6 (.) m7 (.) m8 (.) m9 (.) Zero 100.0% 100.0% 100.0% 100.0% 99.9% 77.5% 77.5% 100.0% 100.0% 100.0% 99.3% 7.47E-02 9.46E-02 9.52E-02 1.07E-01 1.05E-01 4.84E-01 4.93E-01 1.89E-02 2.17E-02 4.47E-02 9.15E-04 8.03E-02 1.38E-01 1.21E-01 1.55E-01 1.37E-01 2.99E-01 2.63E-01 2.11E-02 2.39E-02 4.95E-02 0 100.0% 99.9% 99.9% 99.8% 99.8% 75.1% 73.5% 100.0% 100.0% 100.0% 99.5% 7.42E-02 1.02E-01 9.47E-02 1.08E-01 1.05E-01 5.15E-01 5.32E-01 1.75E-02 2.08E-02 4.24E-02 6.81E-04 8.35E-02 1.49E-01 1.21E-01 1.50E-01 1.30E-01 3.08E-01 2.53E-01 1.97E-02 2.33E-02 4.56E-02 0 51 n=1000 p=50 n=1000 p=100 n=2000 p=50 n=2000 p=100 0.3 0.4 0.5 1.0 0.8 0.6 orable IMSE 0.4 0.2 0.3 0.4 Oracle percentage for constant effect Gk Model IMSE for constant effect Gk 0.12 1.0 Minor Allele Freq. 0.5 0.1 0.2 0.4 0.5 0.5 0.10 n=1000 p=50 n=1000 p=100 n=2000 p=50 n=2000 p=100 orable IMSE 0.04 0.02 0.00 0.00 0.3 Minor Allele Freq. 0.4 0.08 0.10 IMSE 0.06 0.04 0.02 0.1 0.3 Orable IMSE for constant effect Gk n=1000 p=50 n=1000 p=100 n=2000 p=50 n=2000 p=100 0.08 0.8 0.6 0.0 n=1000 p=50 n=1000 p=100 n=2000 p=50 n=2000 p=100 0.2 Minor Allele Freq. 0.4 0.2 Oracle Percentage 0.2 0.1 Minor Allele Freq. 0.12 0.2 0.06 0.1 0.0 0.0 n=1000 p=50 n=1000 p=100 n=2000 p=50 n=2000 p=100 0.0 0.2 0.2 0.4 0.6 IMSE 0.8 0.6 0.4 Oracle Percentage 1.0 0.8 1.2 n=1000 p=50 n=1000 p=100 n=2000 p=50 n=2000 p=100 1.2 1.0 Orable IMSE for varying effect Gk 1.4 Model IMSE for varying effect Gk 1.4 Oracle percentage for varying effect Gk 0.1 0.2 0.3 Minor Allele Freq. 0.4 0.5 0.1 0.2 0.3 0.4 0.5 Minor Allele Freq. Figure 3: Selection and estimation accuracy of mk (·) for discrete G Table 10 demonstrates the selection and estimation result of the loading parameter β. The left and right panel correspond to the case where p = 50 and p = 100 case respectively and the upper and lower panel correspond to the case where n = 1000 and n = 2000 respectively. We observed sample size n was the determining factor in model performance. When sample size is large (n = 2000), the oracle percentage for non-zero loading covariates (β1 , β2 ) was 100% and the oracle percentage for zero loading covariates (β3 , β4 , β5 ) was around 99%. The MSE for β was in the order of −3 to −5. When sample size is relatively small (n = 1000), although the oracle percentage was 100% for non-zero loading covariates, 52 the oracle percentage for zero loading parameters decreased to around 95%. Comparing between the case p = 50 and p = 100, we detected a small deterioration in selection accuracy for zero loading parameters with n = 1000. This is expected since model performance usually decreases with the increase of complexity. However, we did not detect such difference in performance with larger sample size (n = 2000). Table 10: Estimation accuracy of β for discrete G (β1 = β2 = √1 , β3 = β4 = β5 = 0) 2 Oracle % p = 50 MSE Model Oracle Oracle % p = 100 MSE Model Oracle n = 1000 β1 β2 β3 β4 β5 100.0% 100.0% 95.2% 97.0% 96.1% 6.64E-04 7.37E-03 3.64E-04 1.21E-04 2.04E-04 7.28E-04 7.32E-03 0 0 0 100.0% 100.0% 95.2% 96.1% 94.7% 5.84E-04 2.76E-03 3.73E-04 4.18E-04 5.46E-04 5.13E-04 3.27E-03 0 0 0 n = 2000 β1 β2 β3 β4 β5 100.0% 100.0% 98.9% 98.7% 98.9% 2.34E-04 2.31E-04 5.00E-05 5.44E-05 4.37E-05 2.22E-04 2.20E-04 0 0 0 100.0% 100.0% 98.9% 98.9% 99.0% 2.20E-04 2.30E-03 3.24E-05 4.49E-05 5.07E-05 2.12E-04 2.37E-03 0 0 0 Based on the simulation results with both continuous and discrete G variables, we inferred the following characteristics of the proposed model. (1) The proposed model performs reasonably well with large sample (n = 1000 or 2000). (2) The false positive rate for loading parameter β was around 5% when n = 1000. This is due to the implementation of LQA algorithm since it cannot shrink zero parameters to zero in some cases. (3) Compared to rare variant SNP (Pa = 0.1), the model performs better with common variant SNP (Pa = 0.3 or 0.5). We believe this is due to the fact that SNP with lower minor allele frequency provides less information. 53 3.5 Real Data Application We demonstrated the utility of our model with a type 2 diabetes data set. The data set contains genotypes (SNPs), environments and phenotypic trait of interest for type 2 diabetes. The data set is consisted of two nested case-control cohort studies: the Nurses Health Study (NHS) and the Health Professional Fellow-up Study (HPFS) from the Gene Environmental Association Studies Consortium (GENVEA). Details of these two cohorts can be found from Coldditz and Hankinson (2005) and Rimm et al. (1991). Originally, the data set contained 3,391 females (NHS) and 2,599 males (HPFS). After data cleaning by removing subjects with unmatch genotypes and phenotypes, SNPs with more than 10% missing rate, MAF < 0.05, and deviation from Hardy-Weinberg equilibrium (p-value < 0.001), the data set contains 655,002 SNPs and a total of 5,865 subjects (2,494 males and 3,371 females), of which there were 2,733 cases and 3,132 controls. There are 12 continuous covariates including: height, weight, age, alcohol consumption etc. We fit a marginal logistic regression model for all 12 factors. Based on their marginal p-values and the correlation between those factors, we decided to select 5 covariates as the environmental variables in the model. They were total physical activity (X1 denoted as act), BMI (X2 ), alcohol intake (X3 denoted as alcohol), heme iron intake (X4 denoted as heme), and glycemic load (X5 denoted as gl). Based on the location of the SNPs, we mapped all SNPs to all known genes. Then we selected the genes with more than 30 SNPs. As a result, we obtained 2,178 genes. We applied the proposed variable selection approach by fitting one gene at a time to select significant SNPs and identify their effects. Since the first element of the loading parameter β, β1 has to be a non-zero positive number for identifiability purpose, we fit the proposed model five times, each time with a different variable as the first 54 component in X. If a SNP is selected as varying or constant effect in all five models, this would suggest a convincing signal. We only considered results with a convincing signal, that is, SNPs showing non-zero effect in all the five fitted models by varying the order of the five environmental variables in X. In total, our model identified 13 varying effect SNPs and 26 constant effect SNPs. Here we presented one of the selected varying SNP as an example. Please refer to the supplemental material for the complete list of selected varying and constant SNPs. Fig 4 presents plots of marginal environmental effect and the interaction effect of a SNP rs6537663 with heme iron intake being the first loading covariate. With the increase of the index, we observed that the marginal effect first decreases then increases, followed by a rapid decrease as the total effect of the five environmental factors increases. For the interaction effect, it fluctuated around 0 as the total effect of the five environmental variables increases, indicating that the SNP is not responding (or insensitive) to the changes of the five environmental variables. As the index X β increases, the SNP reacts to the environmental changes, with a dramatic increased risk on type 2 diabetes after a certain threshold. This estimated effect implies that the genetic sensitivity of the SNP to the total effect of the five environmental variables follows a threshold model. This has practical applications as people are mostly OK with the daily environmental changes including dietary changes. However, such changes cannot pass a certain limit. Otherwise, disease may occur. Table 11 presents the selection and estimation results of the loading parameters β with heme iron intake being the first loading covariate. The model selects all loading parameters except the alcohol consumption (alcohol). We also observed that body mass index (bmi) has the largest effect which makes practical sense since BMI is positively associated with type 2 diabetes and is a risk factor for diabetes. 55 Effect of rs6537663 1.0 0.5 −0.5 0.0 Log Odds of Type II Diabetes 1 0 −1 Log Odds of Type II Diabetes 2 1.5 Environmental Effect 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 Xβ 0.4 0.6 0.8 1.0 Xβ Figure 4: Plot of effects on a log odds scale for SNP rs6537663 Table 11: The estimated effect of β for SNP rs6537663. act bmi alcohol heme gl -0.1832 0.954445 0 0.215668 0.094647 Previous study (Sale et al. (2007) and Grant et al. (2006)) suggested that gene TCF7L2 is associated with type 2 diabetes across multiple populations. Sale et al. (2007) reported strong association between type 2 diabetes and SNPs rs7903146 and rs7901695. Our model also selected SNP rs7901695 showing a constant effect (rs7903146 was not in our data set). 3.6 Discussion Gene-environment interaction has been one of the major components in genetic association studies. In this paper, we developed a 3 stage iterative variable selection approach for generalized varying multi-index coefficient model with binary responses. Our goal was to 56 identify varying, constant and zero effects as well as to select non-zero loading parameters. Biologically speaking, our approach is attractive since it offered a novel way to look at G×E interaction from a systems genetics perspective. Our model is flexible to detect non-linear interactions. It should be preferred when the gene effect is non-linearly modified by simultaneous exposure to multiple environmental factors. Statistically speaking, gVMICM treats the effect of multiple variables X as a single index, thus reducing model dimension and alleviating the curse of dimensionality. In a typical G×E study, there are usually hundreds of thousands of genotype (SNPs) and a couple dozens environmental variables. It is important to reduce the dimension of gene predictors first before apply our method. For example, one could fit a marginal model between the response and every genotype, then select a reasonable number of gene predictors to fit the variable selection model. In human genome, a pathway usually contains a wide range of genes and each gene could contain ten to hundred of SNPs. In this chapter, we implemented the proposed method focusing a gene as a unit to select varying and constant effect SNPs. We could implement such method in every pathway to select significant genes or SNPs of importance. Alternatively, we could apply principle component analysis (PCA) (Jolliffe (2002)) or sparse principle component analysis (sPCA) (Zou et al. (2006)) to summarize the SNP information in a gene or pathway to several principle components (PCs), apply the proposed method to select significant PCs. In chapter 2, we proposed a 3 stage variable selection approach for VMICM with continuous responses. Then we generalized such approach to binary responses in chapter 3. The regression model applied in chapter 2 is essentially a mean regression model in which one is interested in the conditional mean. When there are outliers or the nature of interest is not on the mean rather than on different quantiles, a quantile regression model might be 57 a natural choice. For example, when studying the effect of genes on birth weight, people are typically interested in the effect of SNPs on the lower or upper quantile of the birth weight because extremely low or high birth weight may pose potential risk later in life. In such cases, it is essential to extend the proposed method to a quantile regression setup to select important genes modulated by multiple environmental exposures to affect a trait of interest such as birth weight. This will be addressed in chapter 4. 58 Chapter 4 Variable Selection for Quantile VMICM 4.1 Introduction Over the past decades, there has been a growing interest in identifying gene-environment (G×E) interaction in scientific research. Gene-environment interaction was defined as a different effect of a genotype on disease risk under different environmental exposures (Ottman (1996)). Traditionally, G×E interactions has been investigated based on a single environmental exposure model. However, more and more epidemiological studies reveal that disease risk can be modified by simultaneously exposure to multiple environmental factors (Carpenter et al. (2002) and Sexton and Hattis (2007)). When multiple environmental factors are analyzed, the model dimension can increase dramatically with the inclusion of the interaction terms, which lead to estimation instability and large standard errors (curse of dimensionality). To ease such burden, a varying multi-index coefficient model(VMICM) (Liu et al. (2016)) can be applied to model the interaction between genetic factors Gk and a mixture of environmental factors X by p Y = mk (Xβ)Gk + k=0 59 where mk (u), k = 0, 1, . . . , p are continuous smooth functions; β is q-dimensional loading parameter; and is the random error. We have proposed an iterative 3 steps variable selection method for VMICM in chapter 2, which classifies the non-parametric smooth function mk (u) into three categories: varying, constant and zero. The goal of this paper is to generalize such approach to a quantile regression setting. The quantile VMICM is a important alternative to the conditional mean models for analyzing G×E interaction. First, comparing conditional mean regression, modelling conditional quantiles offers a far more comprehensive understanding of the distribution of the response variable. In many applications, the impact of the genetic factor G on the response varies at different quantiles of the distribution. People are more interested in the quantiles rather than the mean. For example, in a study to find genes associated with birth weight, extremely low or high birth weight could be problematic since it could cause complications later in life. In this case, one would be interested in identifying genes or SNPs affecting low or high birth weight with their effect modified by environmental exposures. Second, even when we are interested in the center of the conditional distribution of the response variable, median regression (quantile regression with τ = 0.5) can provides more robust estimators, especially when there are outliers in the trait distribution. With the recent advancement in biotechnoloty, we are now able to collect hundreds of thousands of single nucleotide polymorphisms (SNPs) data, at the same time with relatively low cost. Such advancement renders traditional model selection methods such as forward/backword selection or methods based on AIC/BIC information obsolete. Recently, variable selection via penalized regression has been gaining popularity. The idea is to add a penalty term to the loss(likelihood) function. With different choices of penalty functions, the estimator could possess different properties. Fan and Li (2001) proposed three 60 important criteria for penalized estimator: sparsity, unbiasedness and continuity. They also characterized oracle property, meaning that the model performs as well as if the true model is known in advance. For instance, adaptive LASSO (Zou (2006)), smoothly clipped absolute deviation (SCAD) (Fan and Li (2001)) and minimax concave penalty (MCP) (Zhang (2010)) possess oracle property. In this work, we proposed a quantile regression based variable selection method built upon the VMICM model to identify how genetic effects are modified by simultaneous exposure to multiple environmental factors to affect a disease trait. We adopted the MCP penalty in our modeling framework. We evaluated the method through both simulation and real data applications. The proposed variable selection method for quantile regression enriches the literature about variable selection for quantile regression. The rest of the paper is organized as follows. Section 2 introduces the proposed variable selection method, how to formulate the penalized quantile loss function, how to iteratively optimize the penalized quantile loss function as well as how to select various tuning parameters and initial value for β. In section 3, we evaluated the finite sample performance of our model via monte carlo simulation. We applied our approach to a birth weight data set in section 4, followed by a discussion. 4.2 Variable selection for quantile regression with VMICM Throughout the paper, superscript T is used to denote matrix transpose; ||.||p is used to denote Lp norm; log(a) is used to denote natural logarithm of a. For the sake of simplicity, we use constant and non-zero constant interchangeably. 61 4.2.1 Model setup For a random sample of the data {Y n×1 , X n×q , Gn×(p+1) } with size n, we assume the following model: p Y = mk (Xβ(τ ), τ )Gk + (τ ) (4.1) k=0 where Y = (Y1 , Y2 , . . . , Yn )T is a continuous response variable; X = (X 1 , X 2 , . . . , X q ) is a continuous q-dim environmental variable; Gn×(p+1) = (G0 , G1 , . . . , Gp ) where G0 = (1, . . . , 1)T and Gk is a continuous or discrete genetic vector of length n for k = 1, 2, . . . p. Our parameters of interest {mk (u, τ )}k=0,1,...,p are p + 1 unknown non-parametric functions conditional on quantile τ ; β(τ ) = (β1 (τ ), . . . , βq (τ ))T is the loading parameters conditional on quantile τ . (τ ) is an unknown random error satisfies P ( (τ ) < 0|X, G) = τ for some specific quantile 0 < τ < 1. The case with τ = 0.5 corresponds to median regression. For ease of presentation, all parameters of interest are τ -specific. For instance, we use β and mk (u) to represents β(τ ) and mk (u, τ ), respectively. We denote the quantile VMICM model as τ VMICM. 4.2.2 Estimation method Our goal is to select and estimate unknown functions {mk (·)}k=0,1,...,p and unknown loading parameter β = (β1 , . . . , βq )T . For the sake of identifiability, we assumed ||β||2 = 1 and β1 > 0, and mk (·) cannot has the form of mk (u) = αT uβ T u+γ T u+c. Please refer to Schumaker (2007) for details of the construction of B-splines basis function. Given the number of interior knots K and the degree of the B-spline basis function h, we can approximate mk (u) by ¯ mk (u) ≈ γk1 + B(u)γ k∗ 62 (4.2) where γ k∗ = (γk2 , γk3 , . . . , γkL )T , γ k = (γk1 , γ Tk∗ )T and L = K + h + 1. Hence, (4.1) can be rewritten as : p ¯ [γk1 + B(Xβ)γ k∗ ]Gk + . Y = (4.3) k=0 Here, the estimation of non-parametric function mk (u)k=0,1,...,p and its loading parameter β is transformed to the estimation of {γk1 , γ k∗ }k=0,1,...,p and β. This B-spline approximation (4.2) also enable us to separate the constant effect of Gk on Y from its joint effect with X on Y . (1) If ||γ k∗ ||2 = 0, then Gk and X jointly affect Y ; (2) If ||γ k∗ ||2 = 0 and |γk1 | = 0, then Gk has a constant effect on Y ; (3) If ||γ k∗ ||2 = 0 and |γk1 | = 0 then Gk has no effect on Y at all. Following the idea proposed in previous chapters, we adopted the following penalized regression approach and defined the objective function as Qτ (β, γ): p n ρτ Yi − Qτ (β, γ) = i=1 p ¯ [γk1 + B(Xβ)γ k∗ ]Gik + n k=0 k=1 p q pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) + n +n pλ1 (||γ k∗ ||2 ) k=1 (4.4) pλ3 (|βd |) d=2 where ρτ (u) = u{τ − I(u < 0)} is the quantile loss function; pλ1 (·), pλ2 (·), pλ3 (·) are penalty functions of the corresponding parameters; and I(·) is an indicator function which equals 1 if the condition in the parentheses is satisfied, and 0 otherwise. From the construction of the penalty function, we only penalize γk1 if ||γ k∗ ||2 = 0, meaning that we are interested in whether the non-parametric function mk (·) is zero or a non-zero constant only when it is not varying. No penalty is applied to the intercept function m0 (·) as both γ 0∗ and γ01 are not involved in the penalty term. No penalty is applied to the coefficient β1 due to the constrain: 63 β1 > 0. For the penalty function, we use MCP penalty proposed by Zhang (Zhang (2010)) which is defined as p(x, λ) = λ 0x (1 − τsλ )+ ds with the regularization parameters τ > 0 and λ > 0. 4.2.3 Estimation algorithm In the previous chapters, we proposed a 3 step iterative approach based on the VMICM model for both continuous and binary responses. The methods classify the non-parametric functions mk (·) into 3 categories: varying, constat or zero, denoted by V, C and Z respectively. Here, we generalized the estimation algorithm to a quantile regression setting. Step 1 : (1) ˆ (0) , the step 1 estimator of γ, γ ˆ (1) For given β, denoted by β = (1)T ˆ k∗ }Tk=0,1,...,p can be obtained via optimizing the following grouped penalized {ˆ γk1 , γ regression ˆ (0) ) ˆ (1) = min Q1 (γ|λ1 , β γ γ where ˆ (0) ) = Q1 (γ|λ1 , β p n ρ τ Yi − i=1 p (0) )γ ]G ¯ [γk1 + B(Xβ k∗ ik + n k=0 pλ1 (||γ k∗ ||2 ). k=1 Instead of penalizing each coordinate of γ k∗ = (γk2 , . . . , γkL )T separately, we penalized the L2 norm of γ k∗ since we would like to assess the presence of the varying effect of Gk on the response variable. Step 1 classifies mk (·), k = 1, . . . , p into two categories: varying(V) or (1) (1) non-varying(NV) where mk (·) ∈ V if ||ˆ γ k∗ ||2 > 0 and mk (·) ∈ N V if ||ˆ γ k∗ ||2 = 0. Step 2 : Based on the step 1 estimators of the B-spline coefficients γ, the step 2 estimators (2) (2) (2) ˆ (2) = {(ˆ ˆ k∗ )k∈V , (ˆ γ γk1 , γ γk1 )k∈C } can be obtained via the penalized regression. Note that 64 (2) (1) ˆ k∗ = 0 automatically if γ ˆ k∗ = 0. We obtained the estimator by, γ ˆ (2) = min Q2 (γ|λ2 , β (0) , γ ˆ (1) ) γ γ where n Q2 (γ|λ2 ˆ (1) ) , β (0) , γ i=1 (2) (0) )γ ]G − ¯ [γk1 + B(Xβ k∗ ik ρτ Yi − = k∈V γk1 Gik k∈C p (2) (1) pλ2 (|γk1 |)I(||ˆ γ k∗ ||2 = 0). +n (4.5) k=1 ˆ (0) , we can obtain the estimators of the B-splines Based on the initial estimator of β, β ˆ (2) and classify mk (·) k = 1, . . . , p into V, C or 0. coefficients γ, γ ˆ via the penalized regression by Step 3 : We obtained β ˆ= β ˆ (2) ) min Q3 (β|λ3 , γ ||β ||2 =1 where n Q3 (β|λ3 ˆ (2) ) ,γ = p q (2) ρ τ Yi − [ˆ γk1 i=1 k=0 (2) ¯ + B(Xβ)ˆ γ k∗ ]Gik pλ3 (|βd |). +n d=2 ˆ (0) = β, ˆ then iterate step 1 to 3 until convergence. Denote γ ˆ as the ˆ and β Step 4 : Set β converged estimators. With this iteration approach, we still need to select the tuning parameters λ1 , λ2 , λ3 , order h and number of interior knots K for the B-spline approximation, as well as a reasonable initial value for β. 65 4.2.4 Selection of parameters For a traditional linear mean regression model, Bayesian Information Criterion (BIC) (Schwarz (1978)) has been a popular choice in the selection of shrinkage parameters. Lee et al. (2014) provided theoretical justification in the use of BIC in quantile regression models. Hence, we use BIC as our selection criterion for shrinkage parameters, and the order h and the number of interior knots K of the B-spline basis functions. 4.2.4.1 Selection of the tuning parameters λ1 , λ2 , λ3 Step 1: We took λ1 as the minimizer of p n ρ τ Yi − BIC(λ1 ) = log i=1 k=0 log(n) (λ ) (λ ) ˆ (0) )ˆ ¯ ∗ dfλ1 β γ k∗1 ]Gik + [ˆ γk11 + B(X 2n (λ ) (λ1 ) ˆ (0) ) defined above; β ˆ (0) is ˆ k∗ }k=0,1,...,p are the minimizers of Q1 (γ|λ1 , β where {ˆ γk11 , γ chosen as the estimator from previous iteration; and dfλ1 is defined as the total number of non-zero coefficients if λ1 is the penalized parameter. Step 2: We took λ2 as the minimizer of p n ρ τ Yi − BIC(λ2 ) = log i=1 log(n) (λ ) (λ ) ˆ (0) )ˆ ¯ [ˆ γk12 + B(X β γ k∗2 ]Gik + ∗ dfλ2 2n k=0 (λ ) (λ2 ) ˆ (0) ) defined above and dfλ ˆ k∗ }k=0,1,...,p are the minimizers of Q2 (γ|λ2 , β where {ˆ γk12 , γ 2 is defined as the total number of non-zero coefficients if λ2 is the penalized parameter. Step 3: We took λ3 as the minimizer of p n ρ τ Yi − BIC(λ3 ) = log i=1 log(n) (2) (2) ˆ (λ3 ) )ˆ ¯ [ˆ γk1 + B(X β γ k∗ ]Gik + ∗ dfλ3 2n k=0 66 ˆ (λ3 ) are the minimizers of Q3 (β|λ3 , γ ˆ (2) ) defined above and dfλ3 is defined as the where β total number of non-zero β if λ3 is the penalized parameter. To find the optimal tuning parameters, λ1 , λ2 , λ3 are searched over a grid of exponentially decreasing values with the minimum to be 1E-3, and the maximum of λ1 , λ2 , λ3 are set to be the minimum value such that all of the penalized estimators are 0. For ease of computation, the number of grid points is set to be 100. 4.2.4.2 Selection of the order h and the number of interior knots K As we discussed in previous chapters, higher order of the B-spline basis function implies more complex functions and leads to less interpretable effect. From a practical point of view, the interaction effect is less likely to be highly nonlinear. Thus, we searched optimal h over the set h ∈ {2, 3, 4} to avoid any complications in interpretation. As for the number of interior knots K, we searched the optimal K in K = {2, 3, 4, 5}. For every combination of K and h, we fit the following intercept only model, Y = m0 (Xβ) + (4.6) Again, as we discussed in previous chapters, this is to avoid computational burden. The optimal K and h are those that minimize log n ˆ i=1 ρτ (Yi log(n) − Yi ) + 2n (K + h + 1), where Yˆi is the estimation of the i-th subject under model (4.6). 4.2.5 Selection of the initial values The initial value of β for a single index model is usually set to be (1, 0, . . . , 0)T or ( √1q , . . . , √1q )T . However, neither works well in our simulations. Hence, we get the initial 67 value by fitting the intercept only model (4.6). 4.3 Simulation We conducted extensive simulation to investigate the finite sample performance of the proposed selection approach under the τ VMICM model. The performance is evaluated in several ways: (1) the selection accuracy (oracle percentage) of the non-parametric function mk (u); (2) IMSE of m(u); ˆ (3) the selection accuracy (oracle percentage) of β and (4) MSE of β. The performance is evaluated over 1000 simulation runs. The oracle percentage of mk (u) is defined as the percentage of correct classification of mk (u). For instance, if mk (u) ∈ V , and mk (·) is classified as varying for g times, then the g × 100%. IMSE of mk (·) is defined as oracle percentage of mk (·) is calcualted as 1000 1 1000 1000 r=1 1 [ 100 100 (m ˆ k (uj )r − mk (uj ))2 ] j=1 where m ˆ k (uj )r is the fitted value of mk (uj ) for the r-th simulation and uj is taken at the ˆ j − % quantile among the range of X β (r) . The oracle percentage of β is defined as the percentage of correct selection of β. For example, if βd = 0 and βd is selected to be non-zero g for g times, then the oracle percentage of βd is calculated as 1000 × 100%. MSE of βd 1 is calculated as 1000 1000 ˆ(r) r=1 (βd (r) − βd )2 where βˆd is the estimator for βd in the r − th simulation. 68 4.3.1 Simulation Setting Our data was generated according to the model, p Y = m0 (Xβ) + mk (Xβ)Gk + (τ ). k=1 Five (q = 5) independent environmental factors were generated with each one being generated from a U nif (0, 1) distribution. For the loading parameter β = (β1 , β2 , . . . , βq )T , we set β1 = β2 = √1 and the rest βj s were set as zeros. 2 was generated from a N (0, 1) distribution and (τ ) = − F −1 (τ ) where F denotes the CDF of . F −1 (τ ) was subtracted from to make sure the τ th-quantile of (τ ) is zero. For the genetic factor G, we evaluated the performance of our model with both continuous and discrete variables. 4.3.2 The continuous case We first evaluated the performance of our approach with continuous genetic predictors G which marginally followed a N (0, 1) distribution. The non-parametric functions mk (u) were set to be: m0 (u) = 2sin(2πu), m1 (u) = 2cos(πu) + 2 and m2 (u) = sin(2πu) + cos(πu) + 1; m3 (u) = 2 and m4 (u) = 2.5 which were non-zero constants; mk (u) = 0 for k = 5, . . . , p which were zero effects. We simulated the data under p = 50, 100, τ = 0.25, 0.5, 0.75, and n = 2000. Table 12 presents the selection and estimation result for non-parametric functions mk (u) with continuous predictor G. For varying and constant coefficient function (m1 (·) to m4 (·)), the oracle percentage was at 100% and model IMSE was in the order of -2 for all cases. These suggest that our model could correctly select and estimate non-zero effect predictors at all quantiles. Between the median regression case (τ = 0.5) and the cases where τ = 0.25, 0.75, 69 the false positive rate was much lower for median regression case (around 1%) than that of the case τ = 0.25, 0.75 (around 10%). Also the model IMSE of the case τ = 0.5 was smaller than the model IMSE of the case τ = 0.25, 0.75. These are expected since median regression usually provides the most accurate estimator among different quantiles. Between the case p = 50 and the case p = 100, we did not observe an significant difference in model performance. Overall, the proposed variable selection approach can correctly select non-zero effect predictors at all quantiles. Compared with median regression, although the false positive rate and model IMSE were higher in the case τ = 0.25, 0.75, they were still decent. Table 12: Selection and estimation accuracy of mk (·) for continuous G p = 50 p = 100 m0 (.) m1 (.) m2 (.) 0.25 m3 (.) m4 (.) Zero 100.0% 100.0% 100.0% 100.0% 100.0% 88.8% 2.78E+00 8.16E-02 9.59E-02 2.58E-02 2.56E-02 1.03E-02 2.96E-02 6.16E-03 1.30E-02 8.77E-04 1.02E-03 0 100.0% 100.0% 100.0% 100.0% 99.8% 90.6% 2.67E+00 7.07E-02 7.80E-02 2.19E-02 2.12E-02 7.27E-03 2.97E-02 6.16E-03 1.31E-02 9.71E-04 9.80E-04 0 m0 (.) m1 (.) m2 (.) m3 (.) m4 (.) Zero 100.0% 100.0% 100.0% 99.7% 99.9% 98.7% 7.49E-02 4.98E-02 5.72E-02 1.25E-03 1.50E-03 1.00E-04 2.88E-02 100.0% 5.07E-03 100.0% 1.22E-02 100.0% 7.90E-04 99.9% 7.91E-04 99.8% 0 99.1% 7.70E-02 4.89E-02 5.77E-02 1.30E-03 1.48E-03 6.26E-05 2.91E-02 5.31E-03 1.23E-02 7.90E-04 8.29E-04 0 m0 (.) m1 (.) 0.75 m2 (.) m3 (.) m4 (.) Zero 100.0% 100.0% 100.0% 99.9% 99.9% 88.5% 2.60E+00 6.25E-02 7.12E-02 2.42E-02 2.62E-02 1.05E-02 3.01E-02 6.08E-03 1.35E-02 8.58E-04 9.93E-04 0 2.60E+00 6.17E-02 7.14E-02 2.18E-02 2.20E-02 7.19E-03 3.13E-02 6.18E-03 1.37E-02 9.03E-04 9.63E-04 0 0.5 100.0% 100.0% 100.0% 100.0% 100.0% 90.7% Table 13 presents the selection and estimation results for the loading parameters β. For non-zero loading β’s (β1 and β2 ), the oracle percentage was close to 100% in all cases. This 70 suggests our model could correctly select non-zero loading parameters. For zero loading β’s (β3 , β4 , and β5 ), the oracle percentage for the median regression case was around 98.5%. Compared between the case τ = 0.25 and τ = 0.75, oracle percentage for the case τ = 0.25 (99%) was slightly higher than that of the case τ = 0.75 (96%), while the MSE for the case τ = 0.25 is higher than that of the case τ = 0.75. Between the case p = 50 and p = 100, we did not observed a difference in model performance. Overall, the proposed model can correctively select and estimate loading covariates with reasonably high accuracy at all quantiles. Table 13: Selection and estimation accuracy of β p = 50 τ Model Oracle Oracle % Model Oracle β1 β2 0.25 β3 β4 β5 100.0% 99.8% 99.9% 99.9% 99.9% 1.20E-02 4.76E-05 1.47E-02 4.75E-05 7.15E-05 0 2.09E-04 0 9.60E-05 0 100.0% 100.0% 99.9% 99.9% 99.9% 6.71E-03 4.36E-05 7.69E-03 4.37E-05 5.81E-05 0 2.93E-04 0 9.53E-05 0 β1 β2 β3 β4 β5 100.0% 100.0% 98.6% 98.3% 98.7% 5.29E-05 3.78E-05 5.29E-05 3.78E-05 1.52E-06 0 3.33E-06 0 1.85E-06 0 100.0% 100.0% 98.3% 98.8% 98.9% 5.33E-05 3.54E-05 5.34E-05 3.54E-05 1.23E-06 0 3.24E-06 0 1.45E-06 0 β1 β2 0.75 β3 β4 β5 100.0% 100.0% 95.3% 95.9% 95.8% 4.40E-03 4.81E-05 4.51E-03 4.80E-05 4.14E-05 0 5.31E-05 0 3.85E-05 0 100.0% 100.0% 96.3% 96.2% 95.8% 3.60E-03 4.89E-05 3.49E-03 4.88E-05 1.87E-05 0 4.23E-05 0 2.32E-05 0 0.5 4.3.3 Oracle % p = 100 The discrete case We continued to evaluate the performance of our model with discrete genetic predictors G. One of many applications of our model is to select significant single nucleotide polymorphism 71 (SNP) in a gene or pathway. All SNPs take values of 0, 1 and 2 to represent aa, Aa, and AA genotype under an additive genetic model. We simulated G according to the following probability distribution function, P (Gij = 0) = M AF 2 , P (Gij = 1) = 2 ∗ M AF (1 − M AF ), P (Gij = 2) = (1 − M AF )2 where Gij is the j-th predictor of the i-th subject, i = 1, . . . , n and j = 1, . . . , p. The data were simulated under p = 50, 100, τ = 0.25, 0.5, 0.75, and n = 2000. We set the non-parametric function mk (u) and the corresponding MAF for Gk as follows: Table 14: Setup for mk (u) Function m0 (u) = 2sin(2πu) m1 (u) = 2cos(πu) + 2 m2 (u) = sin(2πu) + cos(πu) + 1 m3 (u) = 2cos(πu) + 2 m4 (u) = sin(2πu) + cos(πu) + 1 m5 (u) = 2cos(πu) + 2 m6 (u) = sin(2πu) + cos(πu) + 1 m7 (u) = 2 m8 (u) = 2 m9 (u) = 2 mk (u) = 0, k > 9 MAF of Gk NA 0.5 0.5 0.3 0.3 0.1 0.1 0.5 0.3 0.1 Unif(0.05,0.5) From the simulation setup, we would be able to evaluate how our proposed variable selection method perform with SNPs of a wide range of MAF. Table 15 presents the selection and estimation result for the non-parametric functions with discrete genetic predictors. Compared to the case where τ = 0.25, 0.75, the median regression case (τ = 0.5) performed better. For instance, the oracle percentage for varying effect Gk was around 99% for median regression, while that of the case τ = 0.25, 0.75 ranged from 90% to 97% (p = 50) and 84% 72 to 91% (p = 100). Further, the model IMSE for the median regression case was smaller than that of the case τ = 0.25, 0.75. Between the case where p = 50 and p = 100, the case p = 50 performed slightly better both in terms of oracle percentage and model IMSE with varying effect predictors. Overall, the model performed reasonably well across different quantiles. In addition, we observed the model performance was associated with minor allele frequency of Gk . We prepared figure 5 to better visualize such trend. With an increase of minor allele frequency of Gk (from 0.1 to 0.5), we observed an increase in oracle percentage as well as a decrease in model IMSE. These suggest that the model performs better with common variant SNPs. It is expected since rare variant SNP has less information. Table 15: Selection and estimation accuracy for mk (u) with discrete G Oracle % τ p = 50 IMSE Model Oracle Oracle % p = 100 IMSE Model Oracle Intercept Varying 0.25 Constant Zero 100.0% 97.5% 94.2% 93.9% 7.54E-01 1.44E-01 1.66E-02 4.71E-03 2.47E-02 2.30E-02 3.19E-03 0 100.0% 91.4% 95.5% 96.0% 7.98E-01 2.23E-01 1.66E-02 3.05E-03 2.44E-02 2.32E-02 3.22E-03 0 Intercept Varying Constant Zero 100.0% 99.9% 95.4% 93.6% 4.84E-02 9.14E-02 7.52E-03 2.87E-03 2.35E-02 2.00E-02 2.68E-03 0 100.0% 99.3% 95.5% 93.8% 4.97E-02 9.84E-02 9.30E-03 2.87E-03 2.30E-02 1.98E-02 2.76E-03 0 Intercept 0.75 Varying Constant Zero 100.0% 90.8% 96.7% 96.8% 1.02E+00 2.29E-01 1.35E-02 2.09E-03 2.53E-02 2.33E-02 3.19E-03 0 100.0% 84.1% 98.6% 98.5% 1.09E+00 2.48E-02 3.35E-01 2.30E-02 1.26E-02 3.14E-03 9.24E-04 0 0.5 73 Model IMSE for varying effect Gk (p=50) 1.0 Oracle percentage for varying effect Gk (p=50) Oracle IMSE for varying effect Gk (p=50) tau = 0.25 tau = 0.5 tau = 0.75 0.1 0.2 0.3 0.4 0.0 0.0 tau = 0.25 tau = 0.5 tau = 0.75 0.02 0.00 0.1 0.2 0.01 0.2 IMSE 0.3 IMSE 0.6 0.4 Oracle Percentage 0.4 0.8 0.03 0.5 tau = 0.25 tau = 0.5 tau = 0.75 0.5 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5 Minor Allele Freq. Minor Allele Freq. Oracle percentage for varying effect Gk (p=100) Model IMSE for varying effect Gk (p=100) Oracle IMSE for varying effect Gk (p=100) 1.0 Minor Allele Freq. tau = 0.25 tau = 0.5 tau = 0.75 0.02 IMSE 0.4 IMSE 0.6 0.1 0.2 0.3 Minor Allele Freq. 0.4 0.5 0.0 0.0 tau = 0.25 tau = 0.5 tau = 0.75 0.00 0.2 0.01 0.4 0.2 Oracle Percentage 0.6 0.8 0.03 0.8 tau = 0.25 tau = 0.5 tau = 0.75 0.1 0.2 0.3 Minor Allele Freq. 0.4 0.5 0.1 0.2 0.3 0.4 0.5 Minor Allele Freq. Figure 5: Selection and estimation accuracy of mk (·) for discrete G Table 16 shows the selection and estimation of loading parameters β with discrete G. The proposed model select and estimate non-zero loading β with high precision (100% in all cases). For zero loading parameters (β3 , β4 , and β5 ), the false positive rate was around 0%, 1.5%, and 6.5% for the case τ = 0.25, τ = 0.5, and τ = 0.75 respectively. The MSE for zero loading β’s was in the order of −5 to −7, suggesting the proposed model shrunk it very close to 0. 74 Table 16: Selection and estimation accuracy for β with discrete G τ Oracle % p = 50 MSE Model Oracle Oracle % p = 100 MSE Model Oracle β1 β2 0.25 β3 β4 β5 100.0% 100.0% 100.0% 100.0% 100.0% 3.98E-04 4.04E-04 0 0 0 4.42E-05 4.43E-05 0 0 0 100.0% 100.0% 100.0% 100.0% 100.0% 4.63E-04 4.69E-04 0 0 0 β1 β2 β3 β4 β5 100.0% 100.0% 99.1% 98.9% 98.9% 5.20E-05 3.96E-05 5.21E-05 3.97E-05 2.27E-07 0 3.43E-07 0 5.45E-07 0 100.0% 100.0% 99.1% 98.4% 98.7% 5.30E-05 3.97E-05 5.30E-05 3.97E-05 1.00E-06 0 8.12E-07 0 5.87E-07 0 β1 β2 0.75 β3 β4 β5 100.0% 100.0% 93.6% 94.3% 93.5% 2.75E-04 5.13E-05 2.84E-04 5.15E-05 2.52E-05 0 2.23E-05 0 3.46E-05 0 100.0% 100.0% 93.0% 93.9% 93.3% 3.70E-04 4.69E-05 3.59E-04 4.69E-05 2.26E-05 0 1.80E-05 0 3.08E-05 0 0.5 4.75E-05 4.74E-05 0 0 0 Based on the simulation results described above, we were able to conclude the followings. (1) The proposed model selects and estimates Gk with reasonably high precision. (2) Compared to rare variant SNPs (Pa = 0.1), the model performed better with common variant SNPs (Pa = 0.3, 0.5). (3) The model could correctly select and estimate non-zero loading β. At last, (4) the false positive rate for zero loading β was low for τ = 0.25, 0.5, and it was slightly higher for τ = 0.75. 4.4 Real Data Application We applied the proposed variable selection approach to a birth weight data set, which is obtained from Gene Environment Association Studies initiative (GENEVA) funded by the 75 Genes, Environment and Health Initiative (GEI). Epidemiological studies often suggested that birth weight is strongly associated with morbidity and mortality risk during the first year, and risk of many diseases in adulthood. Birth weight is affected by fetal genes and maternal environment. We first performed data cleaning, removing SNPs with more that 5% missing, SNPs with minor allele frequency < 0.05 and SNPs deviates from Hardy-Weinberg equilibrium (p-value < 0.001). After this cleaning step, the data set contains 1,126 subjects and 590,913 SNPs. For the environmental factors X, based on the marginal p-value (< 0.05) when regressing birth weight with each X variable, we select 3 environmental factors: mother’s mean OGTT(oral glucose tolerance test) diastolic blood pressure (X1 ), mother’s one hour OGTT glucose level (X2 ) and mother’s mean OGTT systolic blood pressure (X3 ). We first map all SNPs to all known genes based on its location. Then we select the genes which contain ≥ 30 SNPs. As a result, we get 2,076 genes. Then we fit the proposed model to all genes at τ =0.25, 0.5, and 0.75. Since the first element of the loading parameter β has to be a non-zero positive number (β > 0), we fit the proposed model three times by varying the order of the X variable inside the index function. If the SNP is selected as varying or constant in all the three cases, this would suggest a convincing signal. Therefore, we only consider the cases where all three models return the same effect classification result. Our model identified 122 genes with constat effect and no gene with varying effect, indicating that these genes are not sensitive to the changes of those three environmental variables, i.e., no significant G×E effect. Consider gene ST3GAL1 located on chromosome 8 as an example. It contains 39 SNPs in our data set. Table 17 presents the estimated SNP effect in gene ST3GAL1. The left, middle, and right column correspond to the case where τ = 0.25, τ = 0.50, and τ = 0.75, respectively. Among those, SNPs rs13267049, rs6986303, and rs6990329 have effect on response in lower quantile (τ = 0.25). SNP rs2142306 only has 76 effect for the τ = 0.5 quantile, and SNPs rs2736860, rs9643299, and rs7460764 have effect when τ = 0.75. Interestingly, for SNP rs7831227, we observed an increase in negative effect with the increase of the quantile of the response variable. This suggests that the SNP has different effect at different quantile of the birth weight. A genome-wide associations study in 2016 (Horikoshi et al. (2016)) suggested that only 15% of the variance in birth weight was captured by genetic variations. Thus, it is not surprise to see the limited genetic variants with relatively small effect being selected. Table 17: Effect of SNPs in gene ST3GAL1 τ = 0.25 τ = 0.50 τ = 0.75 rs13267049 rs2736860 rs2142306 rs6986303 rs6990329 rs9643299 rs7460764 rs7831227 0.0260 0 0 0.1129 0.1411 0 0 -0.0294 0 0 0.0720 0 0 0 0 -0.0801 0 -0.0290 0 0 0 -0.0489 -0.1077 -0.1007 Table 18 presents the parameter estimates for the marginal effect of environmental factors on birth weight. The first, second, and third row correspond to the case where τ = 0.25, τ = 0.50, and τ = 0.75 respectively. We observed different results at different quantiles, suggesting that the environmental effects are different at different quantiles for the birth weight. For example, at lower quantile (0.25), X2 (mother’s one hour OGTT glucose level) and X3 (mother’s systolic blood pressure) showed strongest effect. At higher quantile (0.75), X1 (mother’s diastolic blood pressure) and X3 showed strongest effect. For median regression case (τ = 0.5), we observed a strong effect for X1 , while X2 was not selected. Figure 6 represents the effect of environmental factors on different quantiles of birth 77 Table 18: Estimated Loading Parameter for Gene ST 3GAL1 β1 β2 τ = 0.25 0.272 τ = 0.5 0.895 τ = 0.75 0.637 β3 0.707 0.653 0.000 -0.445 -0.288 -0.715 weight. The red, blue, and black curve correspond to the effect at quantile 0.25, 0.5, and 0.75, respectively. Since the estimated β is different at different quantile, the span differs. We observed a higher fitted birth weight at a higher quantile. For τ = 0.75, with the increased index Xβ, we first observed a quick decrease in fitted birth weight, then it fluctuates around 3.3. Since the loading coefficient estimates for β2 and β3 are negative, this implies that higher values for mother’s one hour OGTT glucose level and mother’s systolic blood pressure potentially contribute to higher birth weight at the upper quantile. For the median regression, the fitted birth weight first decreases from 3.3 to 3.0, then slowly increases to 3.2. Both larger values in Mother’s diastolic blood pressure and Mother’s systolic blood pressure contribute to relatively higher birth weight. For τ = 0.25, we saw a positive trend in the total effect as the index increases. 78 Environmental Effect on Birthweight 3.2 2.8 3.0 Birth Weight 3.4 3.6 tau=0.25 tau=0.5 tau=0.75 −1.0 −0.5 0.0 0.5 1.0 1.5 X*beta Figure 6: Plot of interaction effect effects 4.5 Discussion Varying multi-index coefficient model is a novel way to model non-linear interaction between genetic variants G and a mixture of environmental factors X. In previous chapters, we proposed a 3 step iterative variable selection approach with the goal of selecting varying and constat effect genetic variants as well as non-zero loading parameters. In this paper, we generalized such approach to a conditional quantile regression setting. Compared to condition mean regression, condition quantile regression possesses several advantages. First, modelling the data at several quantiles offers a more comprehensive way to understand the distribution of the response variable. In many applications, the effect of Gk on the response differs at different quantiles. Second, quantile regression is robust to extreme observations. By looking at different quantiles, novel insights about the underlying genetic mechanism 79 could be revealed. From the setup of τ VMICM, the environmental factors X have to be continuous due to the model constrain. Nevertheless, discrete factors such as gender and smoking status might possess significant effect. In this case, we could easily generalize τ VMICM to a partial linear τ VMICM model as Y = Zα + (ZG)δ + p k=0 mk (Xβ)Gk + (τ ) where Z represent the discrete environmental factors and δ represent the interaction of gene with discrete environmental variables. The proposed variable selection approach could be modified to accommodate these changes. Although varying multi-index coefficient model enjoys many advantages over traditional G×E model, it is not without its limitations. One of which is the constrains on its loading parameters, ||β||2 = 1 and β1 > 0. Potentially, it would lead to different selection results if we vary the order of loading covariates. In the application to real data, we can fit several different models by varying the first loading covariate. If all the models return the same selection results, then we are convinced such finding is valid. The constrains also limit the interpretation of fitted environmental parameters, it is difficult to characterize the effect of a single environmental factor. If environmental effect is our primary interest, we should consider other modelling technique. In the case study, the estimated interaction coefficients are all constants, indicating no G×E interaction for birth weight. As the study is focused on the Thai population, this cannot rule out the possibility that the mother’s conditions may act on fetus’ genome to affect fetal growth in other populations. Further investigation is needed to confirm this. 80 Chapter 5 Conclusion and future work 5.1 Conclusion The main goal of this dissertation is to develop variable selection methods to identify non-linear G×E interactions. We first proposed to use varying multi-index coefficient models since it allows non-linear interaction between genetic factors and multiple environmental factors. In Chapter 2, we proposed a 3 step iterative variable selection approach for VMICM via a penalized regression. It separates gene effects into three categories: varying, constant and zero. It could also select non-zero loading parameters for environmental factors. In Chapter 3, we generalized such approach to a generalized regression setting with binary responses. Following the work of Chapter 2, we extended the proposed variable selection approach to a quantile regression setting in Chapter 4, since it provided a more comprehensive understanding of the data and offered more robust estimators. In conclusion, this dissertation contributes to the literature in two ways. From the methodological perspective, it contributes to the methods development in variable selection under a nonparametric varying index coefficients model framework. For the selection of the nonparametric coefficients, we separated three types of effects rather than just selecting zero vs non-zero functions. This complicates our selection procedure and distinguishes our methods to existing ones in the literature (e.g., Feng and Xue (2013)). Theoretical properties 81 of the selection methods were evaluated. Our methods have practical meanings and enrich the literature of variable selection. From the application perspective, our method development is well motivated by empirical studies to evaluate the joint effect of multiple environmental exposures and how they interact with genes to affect a disease trait. By taking gene or pathway information into account, we were able to select important players in a gene set. The method developed under the quantile regression framework makes much biological sense in certain trait such as birth weight. Novel insights are expected under the proposed models. 5.2 Future work In the simulation studies, we assumed any two genetic variants Gk and Gl are independent. However, it would be more desirable if we consider linkage disequilibrium structure among SNPs. More specifically, we could set the correlation between Gk and Gl to be ρ|k−l| where ρ = 0.3 for low correlation case, ρ = 0.5 for median correlation case, and ρ = 0.9 for high correlation case. To demonstrate the robustness of median regression compared to mean regression, we could consider the case where follows a t distribution with 3 degree of freedom. In Chapter 2 and 3, we theoretically proved that our penalized estimators are consistent in both estimation and selection under a fixed number of parameters. It could be more desirable if we could prove the selection consistency when the number of parameters increases as the sample size increases. Also it could be beneficial if we could demonstrate the consistency of the penalized estimators in the quantile regression setting in Chapter 4. Further, generalizations of the proposed selection approach to other generalized regression setting 82 such as poisson or categorical variable could be done with modification to the likelihood function. At last, extension of the proposed model to longitudinal data could be considered. We will consider the aforementioned mentioned future work and continue to investigate along those line. 83 APPENDICES 84 Appendix A Real Data Results A.1 Real data results of gVMICM Table 19 presented all the varying effect SNPs selected. Similarly, Table 20 presented all the constant effect SNPs selected. Table 19: List of SNPs with a varying effect. GeneID SNP GeneID:440600 GeneID:2590 GeneID:729993 GeneID:54768 GeneID:117532 GeneID:758 GeneID:23395 GeneID:647107 GeneID:8633 GeneID:2185 GeneID:4915 GeneID:19 GeneID:286205 rs6537663 rs6666516 rs1015431 rs4788621 rs7509377 rs5766384 rs4311249 rs2404825 rs3775049 rs6557991 rs6559870 rs4742969 rs2416996 85 Table 20: List of SNPs with a constant effect. GeneID SNP GeneID:114827 GeneID:2899 GeneID:260425 GeneID:9857 GeneID:2590 GeneID:6934 GeneID:55742 GeneID:867 GeneID:10867 GeneID:57494 GeneID:196385 GeneID:64328 GeneID:23348 GeneID:23348 GeneID:57099 GeneID:11060 GeneID:25780 GeneID:100505498 GeneID:117532 GeneID:29780 GeneID:25814 GeneID:9620 GeneID:23429 GeneID:80254 GeneID:8633 GeneID:157680 rs3815792 rs12118788 rs11102660 rs2293990 rs9308482 rs7901695 rs7101596 rs4489755 rs740771 rs11047510 rs11058132 rs1961415 rs7326971 rs7991210 rs16962542 rs16970994 rs6708570 rs6730602 rs11696526 rs5765571 rs713999 rs11090812 rs17009630 rs11710699 rs10516957 rs1788161 86 Appendix B Algorithm B.1 Algorithm for VMICM Here we presents the algorithm used in each steps: ˆ (0) ), we implemented the group Step 1 : To minimize the objective function Q1 (γ|λ1 , β coordinate descent algorithm. The design matrix has the form ˆ (0) )G0 , G1 , B(X ˆ (0) )G1 , . . . , Gp , B(X ˆ (0) )Gp ) ¯ ¯ ¯ D = (G0 , B(X β β β n×(L∗(p+1)) with the corresponding parameters (γ01 , γ 0∗ , γ11 , γ 1∗ , . . . , γ0p , γ p∗ ), where γ k∗ has a length of L−1. We assigned a grouping index for each of the parameters (from 0 to M , where M +1 is the number of groups). Parameters with the same grouping index were in the same group and they were penalized as a group, parameters with grouping index 0 were not penalized. Denote D m as the design matrix for group m, m = 0, 1, . . . M . Given the penalty parameter λ1 and the MCP tuning parameter γ M CP , γ (1) is obtained through the following iteration: (0) Perform Q-R decomposition on all D m , i.e., D m = Qm Rm , m = 0, 1, 2 . . . M , where QTm Qm = I and Rm is an upper triangular matrix. Hence, Qm is the normalized design matrix for group m. 87 ˆ initial = {ˆ (1) For given initial values for γ, denoted as γ γ initial }, m = 0, 1, . . . , M . m ˆ OLS ˆ OLS ˜ −m ) = QTm Y − We obtained OLS estimate γ via γ = QTm (Y − Q−m γ m m ˜ −m where subscript Q−m represents the normalized design matrix without QTm Q−m γ ˜ −m represents the most updated values for γ without group m. group m and γ ˆ0 = γ ˆ OLS (2) For group 0, set γ . 0 ˆ m via (3) For all other groups m = 1, . . . , M , obtain the MCP estimate γ ˆm = γ ˆ OLS γ if ||ˆ γ OLS m m ||2 > λ ∗ τ λ τ S(ˆ ˆ OLS ˆ OLS ˆ m = τ −1 γ OLS ≤ λ ∗ τ where S(ˆ γ OLS γ m , λ) if γ m m , λ) = (1 − ˆ OLS )+ ∗ γ m ||γ m ||2 ˆ initial ˆ m . Iterate (1) through (4) until convergence (4) Updated γ in step (1) by γ m ˆ m by γ ˆ fminal = R−1 ˆ m , for m = 0, 1, . . . , M . (5) We adjusted the converged estimator γ m γ ˆ (2) = minγ Q2 (γ|λ2 , β (0) , γ ˆ (1) ) where Step 2 : To obtained γ n Q2 (γ|λ2 ˆ (1) ) , β (0) , γ (2) (0) )γ (2) ]G − ¯ [γk1 + B(Xβ k k∗ {Yi − = i=1 k∈V (2) γk1 Gk }2 k∈C p (2) (1) pλ2 (|γk1 |)I(||ˆ γ k∗ ||2 = 0) +n k=1 We implement standard coordinate descent algorithm and the design matrix is: D (2) = ˆ (0) )Gj ˆ (0) ¯ ¯ Gj , B(X β j∈V , Gk , B(X β )Gk k∈C ˆ we minimized Step 3 : To obtain β, p Q3 (β|λ3 ˆ (2) ) ,γ = ||Y − q ¯ [ˆ γk1 + B(Xβ)ˆ γ k∗ ]Gk ||2 + n k=0 pλ3 (|βd |) d=2 88 ˜ and doing a local linear approximation of B(Xβ)ˆ ˜ ¯ For an given initial value β γ k∗ Gk at β, we have ˜ γ Gk + B ˜ γ XGk (β − β) ˜ ¯ ¯ β)ˆ ¯ (X β)ˆ B(Xβ)ˆ γ k∗ Gk ≈ B(X k∗ k∗ For the d − th coordinate of β, βd , we have ˜ γ ∗ Gk + B ˜ γ ∗ X d Gk (βd − β˜d ) ¯ ¯ β)ˆ ¯ (X β)ˆ B(Xβ)ˆ γ ∗k Gk ≈ B(X k k Thus we could obtain βˆd by minimizing the following: Qd = ||Y ∗d − X ∗d βd ||22 + npλ3 (|βd |) where Y ∗d = Y − X ∗d = p γk1 Gk k=0 [ˆ p ¯ k=0 B ˜ γ ∗ Gk − B ˜ γ ∗ Gk X d β˜d ] ¯ T β)ˆ ¯ (X β)ˆ + B(X k k ˜ γ ∗ Gk X d (X β)ˆ k ˆ ∗ = (βˆ∗ , . . . , βˆ∗ )T via the following Hence, we obtained the MCP penalized estimator β q 1 ˜ calculate Y ∗ andX ∗ according to the (1) Given an initial estimate of β, denoted as β, d d above formula. (2) Normalize X ∗d by X ∗d = X ∗d /||X ∗d ||2 . (3) Calculate βˆdOLS = X ∗d T Y ∗d . ˆOLS (β −λ)+ (4) Set βˆ1∗ = βˆ1OLS and for d = 1, βˆd∗ = d M CP if |βˆdOLS | ≤ λγ M CP and βˆd∗ = βˆdOLS 1−1/γ if |βˆOLS | > λγ M CP . d adjusted (5) Adjust βˆd∗ by βˆd = βˆd∗ ∗ ||X ∗d ||2 . 89 (6) Repeat steps (1) through (5) for all βd , d = 1, . . . , q. ˆ adjusted , i.e. βˆd = (7) Normalize β adjusted βˆ d ˆ adjusted || ||β 2 adjusted ∗ sign(β1 ). ˜ in step (1) with β ˆ = (βˆ1 , . . . , βˆq )T and iterate step (1) through (6) until (8) Update β convergence. B.2 Algorithm for model (2.6) For the intercept only model, Y = m0 (Xβ)+ , the following contain the steps in estimation. ¯ (0) We approximated m0 (Xβ) with B-spline basis function: m0 (Xβ) ≈ γ01 + B(Xβ)γ 0∗ . ˆ let the design matrix D = (1, B(X ˆ denote ¯ (1) Given initial value for β, denoted as β, β)), ˆ = (D T D)−1 D T Y . γ = (γ01 , γ 0∗ ), we estimated γ with γ ¯ (2) Obtain β updated via minimizing ||Y − γˆ01 − B(Xβ)ˆ γ 0∗ ||22 with Newton-Raphson algorithm. ˆ by β updated in step (1), and iterate until convergence. (3) Replace β B.3 Algorithm for gVMICM Here we showed the algorithm used in gVMICM: Step 1 & Step 2 followed directly from the group coordinate descent algorithm described before, hence the details are omitted. Step 3 : q ˆ = max M3 (β|λ3 , γ ˆ (2) ) = max pλ3 (|βd |) β l(ˆ γ (2) , β) − n ||β ||2 =1 ||β ||2 =1 d=2 90 We implemented the local quadratic approximation technique proposed by Fan and ˜ to be the most updated value of β, by Taylor Li(2001)(Fan and Li (2001)). Denote β ˜ we have expansion at β, ˜ T ˜ T (β − β) ˜ + 1 (β − β) l(ˆ γ (2) , β) 2 ˜ + l(ˆ γ (2) , β) ≈ l(ˆ γ (2) , β) 2 ˜ ˜ l(ˆ γ (2) , β)(β − β) where (2) (2) γ , β) ∂l(ˆ γ , β) T ˜ = ( ∂l(ˆ l(ˆ γ (2) , β) ,..., ) is the gradient ˜ ∂β1 ∂βq β =β 2 l(ˆ ˜ γ (2) , β) = ∂ 2 l(ˆ γ (2) , β) is the hessian matrix ˜ ∂βj ∂βl 1≤j,l≤q β =β and ∂l(ˆ γ (2) , β) = (Y − ∂βj ∂ 2 l(ˆ γ (2) , β) = (Y − ∂βj ∂βl B (3) )T 1 + e−µ (3) ∂µB )T − ∂βj ∂βl B (3) − µ 1+e 1 (3) 1 ∂µB ∂βj B (3) e−µ B (3) 2 ) (1 + e−µ T (3) ∂µB ∂βj with (3) µB p (2) (2) ¯ [ˆ γk1 + B(Xβ)ˆ γ k∗ ] · Gk = k=0 (3) ∂µB ∂βj (3) ∂µB = ∂βj ∂βl p (2) ¯ (Xβ)ˆ B γ k∗ · X j · Gk = k=0 p (2) ¯ (Xβ)ˆ B γ k∗ · X j · X l · Gk k=0 ˆ ˆ |β2 | |βq | p (β ) p ( βq ) ˜ = diag(0, λ3 2 , . . . , λ3 Let Σλ3 (β) ). Then, we updated β with ˆ ˆ ˜− β∗ = β 2 ˜ + nΣλ (β) ˜ l(ˆ γ (2) , β) 3 91 −1 ˜ + nΣλ (β) ˜ β ˜ l(ˆ γ (2) , β) 3 ∂µB · ∂β (3) l β∗ Then we standardized β with β updated = sign(β1∗ ) ∗ . We iterate the above steps until ||β ||2 convergence. B.4 Algorithm for quantile VMICM Here, we presented the algorithm used in the quantile VMICM model. Step 1&2 : The algorithm used in these steps followed directly from Peng and Wang(2015) and we implemented the R-package “rqPen” (Sherwood and Maidman (2016)). ˆ = min ˆ (2) ) where Step 3: Obtain β ||β ||2 =1 Q3 (β|λ3 , γ p n ρτ Yi − ˆ) = Q3 (β|λ3 , γ i=1 q ¯ [ˆ γk1 + B(Xβ)ˆ γ k∗ ]Gik + n k=0 pλ3 (|βd |) d=2 ¯ Since B(Xβ) is not a linear function of β, we adopted the idea of first order approximation ˜ as the most updated value of β. We have ¯ for B(Xβ). Denote β ˜ γ Gk + B ˜ γ XGk (β − β) ˜ ¯ ¯ β)ˆ ¯ (X β)ˆ B(Xβ)ˆ γ k∗ Gk ≈ B(X k∗ k∗ For individual βd , d = 1, . . . q, we have ˜ γ ∗ X d Gk (βd − β˜d ) ˜ γ ∗ Gk + B ¯ ¯ β)ˆ ¯ (X β)ˆ B(Xβ)ˆ γ ∗k Gk ≈ B(X k k Then we could obtain βˆd by minimizing the following: Qd = ρτ (Y ∗d − X ∗d βd ) + npλ3 (|βd |) 92 where Y ∗d = Y − p ¯ k=0 B ˜ γ ∗ Gk X d . (X β)ˆ k p γk1 Gk k=0 [ˆ ˜ γ ∗ Gk − B ˜ γ ∗ Gk X d β˜d ] and X ∗ = ¯ T β)ˆ ¯ (X β)ˆ + B(X k k d ˆ∗ = Hence, we obtained the MCP penalized estimator β (β1∗ , . . . , βq∗ )T via the following steps: ˜ calculate Y ∗ and X ∗ according to the (1) For given initial value for β, denoted as β, d d above formula. ˆ ∗ = (β ∗ , . . . , β ∗ )T via implementing the iterative coordinate descent algorithm (2) Obtain β q 1 for quantile regression. ∗ ˆ ˆ ∗ via β ˆ = sign(βˆ∗ ) β∗ . (3) Standardize β 1 ˆ ||β ||2 (4) Iterate steps (1) to (3) until convergence. B.5 Algorithm for model (4.6) Consider the intercept only model Y = m0 (Xβ) + , we proposed an iterative algorithm to estimate its parameters in the following. (0) Approximate m0 (Xβ) with the B-spline basis function, i.e., m0 (Xβ) ≈ γ01 + ¯ B(Xβ)γ 0∗ . ˆ obtain γˆ01 and γ ˆ 0∗ via the “rq” function (1) For given initial value for β, denoted as β, ˆ being the design matrix. ¯ β) in R (in package “quantreg”), with B(X (2) Obtain β updated via minimizing n i=1 ρτ (Y i ¯ − γˆ01 − B(Xβ)ˆ γ 0∗ ) with the linear approximation method described in Appendix B.4. ˆ by β updated , and iterate until convergence. (3) Update β 93 Appendix C Proof of Theorems Some notations: Denote the space of Lipschitz continuous functions for any fixed constant c as Lip([a, b], c) = {f : |f (x1 ) − f (x2 )| ≤ c|x1 − x2 |, ∀x1 , x2 ∈ [a, b]}. Let C (p) [a, b] = {f : f (p) ∈ C[a, b]} be the space of the pth order smooth functions. C.1 Proof of Theorem 2.3.1 We assume the following regularity conditions: (A1) The density function fU (β ) (·) of random variable U (β) = Xβ is bounded away from 0 on Ω = {Xβ, X ∈ X }, where X is the compact support of X. There exists a constant c1 , such that fU (β ) (·) ∈ Lip([a, b], c1 ). (A2) For k = 0, 1, . . . , p, the non-parametric function mk (·) ∈ C (r) and r ≥ 2. (A3) E(||G||6 ) < ∞ and E(| |6 ) < ∞. (A4) The matrix M (u) = E(GGT |Xβ = u) is positive definite, each element of M (u) ∈ Lip([a, b], c4 ). (A5) Let bn = maxk,l {pλ (||γ 0k∗ ||2 ), pλ (|γ 0k1 |), pλ (|βd0 |), γ 0k∗ = 0, γ 0k1 = 0, βl0 = 0} for 1 2 3 k = 1, . . . , p, d = 2, . . . , q. Then bn → 0 as n → 0 (A6) 1 |p (||γ k∗ ||2 )| > 0 for k = v + 1, . . . , p n→∞ ||γ || →0+ λ1 λ1 k∗ 2 lim inf lim inf 94 1 lim inf lim inf |p (|γ k1 |)| > 0 for k = c + 1, . . . , p n→∞ |γ |→0+ λ2 λ2 k1 lim inf lim inf n→∞ |β |→0+ d 1 |p (|β |)| > 0 for d = s + 1, . . . , q λ3 λ3 d (A7) Let c1 , . . . , cK be the interior knots of [a, b], where a = inf{u : u ∈ Ω}, b = sup{u : u ∈ Ω} and c0 = 1, cK+1 = b, hi = ci − ci−1 , h = max{hi }. Then exists a constant C7 such h that min{h < C7 and max{hi+1 hi } = o(K −1 ). i} Lemma C.1.1. If mk (u), k = 0, 1, . . . , p satisfy condition (A2), then there exists a constant C > 0 such that −r ¯ sup |mk (u) − γk1 − B(u)γ k∗ | ≤ CK . u∈Ω Proof : This result follows directly from a standard B-spline theory. Denote φ = (β2 , . . . βq )T , hence β = ( 1 − ||φ||22 , φT )T and φ0 is the true value of φ. ˆ it is equivalent to show the consistency of φ ˆ = (βˆ2 , · · · , βˆq ). To show the consistency of β, Let αn = n−r/(2r+1) + an , γ = γ 0 + αn τ 1 , φ = φ0 + αn τ 2 and τ = (τ 1 , τ 2 ), where τ 1 = (τ01 , τ 0∗ , . . . , τp1 , τ p∗ ) and {τk1 , τ k∗ } corresponds to the B-spline coefficients γk1 , γ k∗ ; φ φ φ τ 2 = (τ1 , . . . , τq−1 ) and τl corresponds to φl . ˆ we need to show that ∀ > 0, ∃ a large enough C, ˆ and φ, To show the consistency of γ so that: inf {Q1 (γ, φ)} > Q1 (γ 0 , φ0 ) ≥ 1 − P (C.1) ||τ ||=C where p pλ1 (||γ k∗ ||2 ) + n Q1 (γ, φ) = g(γ, φ) + n k=1 and g(γ, φ) = ||Y − p q−1 pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) + n k=1 p 2 ¯ k=0 (γk1 + B(φ)γ k∗ )Gk ||2 95 pλ3 (|φl |) l=1 is the squared error loss. If (C.1) holds, we can say with probability at least 1− , there exists a local minimum in the ball {(γ 0 , φ0 )+αn ∗ ˆ − (γ 0 , φ0 )|| = Op (αn ). τ ; ||τ || ≤ C}. Hence, there exists a local minimizer such that ||(ˆ γ , φ) Let 1 1 {Q1 (γ, φ) − Q1 (γ 0 , φ0 )} = {Q1 (γ 0 + αn τ 1 , φ0 + αn τ 2 ) − Q1 (γ 0 , φ0 )} K K 1 = g(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − g(γ 0 , φ0 ) K Dn (τ ) = p pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 ) +n k=1 p 0 + α τ |)I(||γ 0 + α τ || = 0) − p (|γ 0 |)I(||γ 0 || = 0) pλ2 (|γk1 n k1 n k∗ 2 λ2 k1 k∗ k∗ 2 +n k=1 q−1 φ pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |) +n j=1 Since pλ1 (||γ 0k∗ ||2 )] = 0 for k = v + 1, . . . , p and pλ3 (|φ0j |) = 0 for j = s + 1, . . . , q − 1 and I(||γ 0k∗ ||2 = 0) = 0 for k = 1, . . . , v , we have Dn (τ ) ≥ 1 g(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − g(γ 0 , φ0 ) K v [pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 )] +n k=1 p 0 + α τ |) − p (|γ 0 |) pλ2 (|γk1 n k1 λ2 k1 +n k=v+1 s−1 +n φ [pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |)] j=1 96 Then by Taylor series expansion at (γ 0 , φ0 ), we have Dn (τ ) ≥ αn ∂g ∂g 1 ( 0, nα2 τ I(γ 0 , φ0 )τ T (1 + op (1)) )τ T − 0 K ∂γ ∂φ 2K n n + K + + n K n K v k=1 p γ0 αn pλ (||γ 0k∗ ||2 ) 0k∗ τ Tk∗ + αn2 pλ (||γ 0k∗ ||2 )τ k∗ τ Tk∗ (1 + op (1)) 1 1 ||γ || k∗ 2 0 |)sign(γ 0 )τ + α2 p (|γ 0 |)(τ )2 (1 + o (1)) αn pλ (|γk1 p k1 n λ k1 k1 k1 2 2 k=v+1 s−1 φ φ αn pλ (|φ0j |)sign(φ0j )τj + αn2 pλ (|φ0j |)(τj )2 (1 + op (1)) j=1 3 3 := S1 + S2 + S3 + S4 + S5 where I(γ 0 , φ0 ) is the Fisher’s information matrix. By standard arguments, S1 is of the order Op (1 + nr/(2r+1) αn )||τ ||, S2 is of the order Op (1 + 2nr/(2r+1) αn )||τ ||2 and for large enought C, S2 dominates S1 uniformly in ||τ || = C. Further, by Taylor expansion at γ 0 , we have n S3 ≤ αn an K v γ 0k∗ ||γ 0k∗ ||2 k=1 τ Tk∗ n + αn2 max{pλ (||γ 0k∗ ||2 )} 1 K k v τ k∗ τ Tk∗ k=1 n √ n ≤ αn2 v + αn2 C max{pλ (||γ 0k∗ ||2 )} 1 K K k Since maxk {pλ } → 0, S3 is dominated by S2 . 1 For S4 and S5 , we have n S4 ≤ αn an K p k=v+1 n 0 |)} τk1 + αn2 max{pλ (|γk1 2 K k n 2 n 0 |)} ≤ αn C + αn2 C 2 max{pλ (|γk1 2 K K k 97 p (τk1 )2 k=v+1 and n S5 ≤ αn an K s φ τj j=1 n + αn2 max{pλ (|φ0j |)} 3 K j s φ (τj )2 j=1 n 2 n ≤ αn C + αn2 C 2 max{pλ (|φ0j |)} 3 K K j Similarly, we have S4 and S5 being dominated by S2 . Hence, by choosing a large enough ˆ − (γ 0 , φ0 )|| = Op (αn ). This proves the consistency of the penalized least C, we have ||(ˆ γ , φ) ˆ square estimator (ˆ γ , φ). C.2 Proof of Theorem 2.3.2 Without loss of generality, we denote φ = (φnz , φz ), where φnz = (φ1 , . . . , φs−1 ) and φz = (φs , . . . , φq−1 ). From λmax → 0, it is very easy to see that an = 0 for large n. Then by Theorem 2.3.1, it is sufficient to show for φnz , ||φj − φ0j ||2 = Op (n−r/(2r+1) ), j = 1, . . . , s − 1 and for φz , for some given small = Cn−r/(2r+1) , when n → ∞, with probability approaching 1, for j = s, . . . , q − 1, we have ∂Q1 (γ, φ) ∂Q1 (γ, φ) > 0 when 0 < φj < and < 0 when − < φj < 0 ∂φj ∂φj Since we have ∂Q1 (γ, φ) ∂g(γ, φ) = + npλ3 (|φj |)sign(φj ). ∂φj ∂φj 98 Do Taylor expansion at φ0 for ∂g(γ ,φ) ∂φj only, we have q−1 2 ∂Q1 (γ, φ) ∂g(γ, φ0 ) ∂ g(γ, φ0 ) = + (φl − φ0l ) ∂φj ∂φj ∂φj ∂φl l=1 q−1 q−1 + l=1 k=1 ∂ 3 g(γ, φ∗ ) (φ − φ0l )(φk − φ0k ) + npλ3 (|φj |)sign(φj ) ∂φj ∂φl ∂φk l where φ∗ lies between φ0 and φ. After some calculation, we have 1 1 ∂Q1 (γ, φ) = nλ3 { pλ (|φj |)sign(φj ) + Op ( n−r/(2r+1) )} 3 ∂φj λ3 λ3 Since limn→∞ lim inf φj →0 λ1 pλ (|φj |) > 0 and λ1 n−r/(2r+1) → 0, we can conclude that the 3 3 3 ∂Q1 (γ ,φ) sign of is completely determined by sign of φj . Hence, we have proven βˆj = 0 for ∂φ j j = s + 1, . . . , q For (ii) & (iii), applying similar arguments as in (i), we immediately have, with probability ˆ k∗ = 0 for k = v + 1, . . . , p and γˆk1 = 0 for k = c + 1, . . . , p. Then approaching 1, γ ˆ γ , we have proven m ¯ by supu B(u) = O(1) and m ˆ k (·) = γˆk0 + B(X β)ˆ ˆ k (·) = ck for k∗ k = v + 1, . . . , c where ck is some constant and m ˆ k (·) = 0 for k = c + 1, . . . , p. C.3 Proof of Theorem 3.3.1 We assume the following regularity conditions: (A1) The density function fU (β ) (·) of a random variable U (β) = Xβ is bounded away from 0 on Ω = {Xβ, X ∈ X }, where X is the compact support of X. And there exists a constant c1 , such that fU (β ) (·) ∈ Lip([a, b], c1 ). (A2) For k = 0, 1, . . . , p, the non-parametric function mk (·) ∈ C (r) and r ≥ 2. 99 (A3) E(||G||6 ) < ∞. (A4) The matrix M (u) = E(GGT |Xβ = u) is positive definite, and each element of M (u) ∈ Lip([a, b], c4 ) (A5) Let bn = maxk,l {pλ (||γ 0k∗ ||2 ), pλ (|γ 0k1 |), pλ (|βd0 |), γ 0k∗ = 0, γ 0k1 = 0, βl0 = 0} for 1 2 3 k = 1, . . . , p, d = 2, . . . , q, then bn → 0 as n → 0. (A6) lim inf lim inf n→∞ ||γ || →0+ k∗ 2 1 |p (||γ k∗ ||2 )| > 0 for k = v + 1, . . . , p λ1 λ1 1 lim inf lim inf |p (|γ k1 |)| > 0 for k = c + 1, . . . , p n→∞ |γ |→0+ λ2 λ2 k1 1 |p (|βd |)| > 0 for d = s + 1, . . . , q n→∞ |β |→0+ λ3 λ3 d lim inf lim inf (A7) Let c1 , . . . , cK be the interior knots of [a, b], a = inf{u : u ∈ Ω}, b = sup{u : u ∈ Ω} and c0 = 1, cK+1 = b, hi = ci − ci−1 , h = max{hi }. Then there exists a constant C7 such h < C7 and max{hi+1 hi } = o(K −1 ). that min{h } i Let φ = (β2 , . . . , βq )T , and we have β = ( 1 − ||φ||22 , φT )T , hence the restriction ||β|| = ˆ it is enough to show 1 and β1 > 0 is equivalent to ||φ||2 < 1. To show the consistency of β, ˆ Let αn = n−r/(2r+1) + an , γ = γ 0 + αn τ 1 , φ = φ0 + αn τ 2 and the consistency of φ. τ = (τ 1 , τ 2 ), where τ1 = (τ01 , τ 0∗ , . . . , τp1 , τ p∗ ) and {τk1 , τ k∗ } corresponds to the B-spline φ φ φ coefficients γk1 , γ k∗ ; τ 2 = (τ1 , . . . , τq−1 ) and τl corresponds to φl . ˆ we need to show ∀ > 0, ∃ a large enough C, so that: ˆ and φ, To show consistency of γ P sup {M (γ, φ)} < M (γ 0 , φ0 ) ≥ 1 − ||τ ||=C 100 (C.2) where p M (γ, φ) = l(γ, φ) − n p pλ1 (||γ k∗ ||2 ) − n k=1 q−1 pλ2 (|γk1 |)I(||γ k∗ ||2 = 0) − n k=1 pλ3 (|φl |) l=1 and l(γ, φ) is the log-likelihood function defined above. If (C.2) holds, we can see, with probability at least 1 − , there exists a local maximum in the ball {(γ 0 , φ0 ) + αn ∗ τ ; ||τ || ≤ ˆ − (γ 0 , φ0 )|| = Op (αn ). C}. Hence, there exists a local maximizer such that ||(ˆ γ , φ) Let 1 1 {M (γ, φ) − M (γ 0 , φ0 )} = {M (γ 0 + αn τ 1 , φ0 + αn τ 2 ) − M (γ 0 , φ0 )} K K 1 = l(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − l(γ 0 , φ0 ) K Dn (τ ) = p pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 ) −n k=1 p 0 + α τ |)I(||γ 0 + α τ || = 0) − p (|γ 0 |)I(||γ 0 || = 0) pλ2 (|γk1 n k1 n k∗ 2 λ2 k1 k∗ k∗ 2 −n k=1 q−1 φ pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |) −n j=1 Since pλ1 (||γ 0k∗ ||2 )] = 0 for k = v + 1, . . . , p and pλ3 (|φ0j |) = 0 for j = s + 1, . . . , q − 1 and I(||γ 0k∗ ||2 = 0) = 0 for k = 1, . . . , v , we have 101 Dn (τ ) ≤ 1 l(γ 0 + αn τ 1 , φ0 + αn τ 2 ) − l(γ 0 , φ0 ) K v [pλ1 (||γ 0k∗ + αn τ k∗ ||2 ) − pλ1 (||γ 0k∗ ||2 )] −n k=1 p 0 + α τ |) − p (|γ 0 |) pλ2 (|γk1 n k1 λ2 k1 −n k=v+1 s−1 φ [pλ3 (|φ0j + αn τj |) − pλ3 (|φ0j |)] −n j=1 Then by Taylor expansion at (γ 0 , φ0 ), we have Dn (τ ) ≤ αn ∂l ∂l T − 1 nα2 τ I(γ 0 , φ0 )τ T (1 + o (1)) )τ ( 0, p K ∂γ ∂φ0 2K n n − K − − n K n K v k=1 p αn pλ (||γ 0k∗ ||2 ) 1 γ 0k∗ τ Tk∗ 0 ||γ k∗ ||2 + αn2 pλ (||γ 0k∗ ||2 )τ k∗ τ Tk∗ (1 + op (1)) 1 0 |)sign(γ 0 )τ + α2 p (|γ 0 |)(τ )2 (1 + o (1)) αn pλ (|γk1 p k1 n λ k1 k1 k1 k=v+1 s−1 j=1 2 2 φ φ αn pλ (|φ0j |)sign(φ0j )τj + αn2 pλ (|φ0j |)(τj )2 (1 + op (1)) 3 3 := S1 − S2 − S3 − S4 − S5 where I(γ 0 , φ0 ) is the Fisher’s information matrix. By standard arguments of likelihood theory, S1 is of the order Op (1 + nr/(2r+1) αn )||τ ||, S2 is of the order Op (1 + 2nr/(2r+1) αn )||τ ||2 and for large enough C, S2 dominates S1 uniformly in ||τ || = C . 102 Further, we have n S3 ≤ αn an K ≤ v k=1 γ 0k∗ ||γ 0k∗ ||2 τ Tk∗ n + αn2 max{pλ (||γ 0k∗ ||2 )} 1 K k v τ k∗ τ Tk∗ k=1 n 2√ n αn v||τ || + αn2 ||τ ||2 max{pλ (||γ 0k∗ ||2 )} 1 K K k Since maxk {pλ } → 0, we have S3 dominated by S2 . 1 For S4 and S5 , we have n S4 ≤ αn an K p k=v+1 n 0 |)} τk1 + αn2 max{pλ (|γk1 2 K k p (τk1 )2 k=v+1 n n 2 0 |)} αn ||τ || + αn2 ||τ ||2 max{pλ (|γk1 ≤ 2 K K k and s−1 s−1 n n φ φ S5 ≤ αn an τl + αn2 max{pλ (|φ0l |)} (τl )2 3 K K l l=1 l=1 n 2 n 2 2 0 α ||τ || + αn ||τ || max{pλ (|φl |)} ≤ 3 K n K l Similarly, we have S4 and S5 dominated by S2 . Hence, by choosing a large enough C, ˆ − (γ 0 , φ0 )|| = Op (αn ). Hence the consistency of penalized least squares we have ||(ˆ γ , φ) ˆ is proven. estimator (ˆ γ , φ) C.4 Proof of Theorem 3.3.2 For ease of notation, we denote φ = (φnz , φz ), where φnz = (φ1 , . . . , φs−1 ) and φz = (φs , . . . , φq−1 ). From λmax → 0, it is very easy to see an = 0 for large n. Then by Theorem 103 3.3.1, it is sufficient to show for φnz , it satisfies, ||φl − φ0l ||2 = Op (n−r/(2r+1) ), l = 1, . . . , s − 1 and for φz and for some given small = Cn−r/(2r+1) , when n → ∞, with probability approaching 1, for l = s, . . . , q − 1, it satisfies, ∂M (γ, φ) ∂M (γ, φ) < 0 when 0 < φl < and > 0 when − < φl < 0 ∂φl ∂φl Since we have ∂l(γ, φ) ∂M (γ, φ) = − npλ (|φl |)sign(φl ) 3 ∂φl ∂φl Do Taylor expansion at φ0 for ∂l(γ ,φ) ∂φl only, we have q−1 ∂M (γ, φ) ∂l(γ, φ0 ) ∂ 2 l(γ, φ0 ) = + (φk − φ0k ) ∂φl ∂φl ∂φl ∂φk k=1 q−1 q−1 + k=1 j=1 ∂ 3 l(γ, φ∗ ) (φ − φ0k )(φj − φ0j ) − npλ (|φj |)sign(φj ) 3 ∂φl ∂φk ∂φj k where φ∗ lies between φ0 and φ. After some calculation, we have ∂M (γ, φ) 1 1 = nλ3 − pλ (|φl |)sign(φl ) + Op ( n−r/(2r+1) ) ∂φl λ3 3 λ3 Since limn→∞ lim inf φj →0 λ1 pλ (|φj |) > 0 and λ1 n−r/(2r+1) → 0, we can conclude that the 3 3 3 ∂M (γ ,φ) is completely determined by the sign of φj . Hence, we prove βˆj = 0 for sign of ∂φ j j = s + 1, . . . , q For (ii) & (iii), applying similar arguments as in (i), we immediately have, with probability 104 ˆ k∗ = 0 for k = v + 1, . . . , p and γˆk1 = 0 for k = c + 1, . . . , p. Then by approaching 1, γ ˆ γ , we prove m ¯ supu B(u) = O(1) and m ˆ k (·) = γˆk0 + B(X β)ˆ ˆ k (·) = ck for k = v + 1, . . . , c, k∗ where ck is some constant and m ˆ k (·) = 0 for k = c + 1, . . . , p. 105 BIBLIOGRAPHY 106 BIBLIOGRAPHY [1] Antoniadis, A., & Fan, J. (2001). Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455), 939-967. [2] Ben Sherwood and Adam Maidman (2016). rqPen: Penalized Quantile Regression. R package version 1.5.1. https://CRAN.R-project.org/package=rqPen [3] Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The annals of applied statistics, 5(1), 232. [4] Cai, Z., Fan, J., & Li, R. (2000). Efficient estimation and inferences for varying-coefficient models. Journal of the American Statistical Association, 95(451), 888-902. [5] Carpenter, D. O., Arcaro, K., & Spink, D. C. (2002). Understanding the human health effects of chemical mixtures. Environmental Health Perspectives, 110(Suppl 1), 25. [6] Cheverud, J. M. (2001). A simple correction for multiple comparisons in interval mapping genome scans. Heredity, 87(1), 52-58. [7] Chiang, C. T., Rice, J. A., & Wu, C. O. (2001). Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables. Journal of the American Statistical Association, 96(454), 605-619. [8] Colditz, G. A., & Hankinson, S. E. (2005). The Nurses’ Health Study: lifestyle and health among women. Nature Reviews Cancer, 5(5), 388-396. [9] Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of statistics, 32(2), 407-499. [10] Fan, J., & Zhang, W. (1999). Statistical estimation in varying coefficient models. Annals of Statistics, 1491-1518. [11] Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456), 1348-1360. [12] Feng, S., & Xue, L. (2013). Variable selection for single-index varying-coefficient model. Frontiers of Mathematics in China, 8(3), 541-565. [13] Frank, L. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109-135. 107 [14] Friedman, J., Hastie, T., Hfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302-332. [15] Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1), 1. [16] Grant, S. F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz, J., ... & Styrkarsdottir, U. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature genetics, 38(3), 320-323. [17] Hastie, T., & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society. Series B (Methodological), 757-796. [18] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67. [19] Horikoshi, M., Beaumont, R. N., Day, F. R., Warrington, N. M., Kooijman, M. N., Fernandez-Tajes, J., ... & Bradfield, J. P. (2016). Genome-wide associations for birth weight and correlations with adult disease. Nature, 538(7624), 248-252. [20] Hoover, D. R., Rice, J. A., Wu, C. O., & Yang, L. P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 809-822. [21] Huang, J. Z., Wu, C. O., & Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 763-788. [22] Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd. [23] Lee, E. R., Noh, H., & Park, B. U. (2014). Model selection via Bayesian information criterion for quantile regression models. Journal of the American Statistical Association, 109(505), 216-229. [24] Liu, X., Cui, Y., & Li, R. (2016). Partial linear varying multi-index coefficient model for integrative gene-environment interactions. Statistica Sinica, 26, 1037-1060. [25] Ma, S., Yang, L., Romero, R., & Cui, Y. (2011). Varying coefficient model for geneenvironment interaction: a non-linear look. Bioinformatics, 27(15), 2119-2126. [26] Ma, S., & Song, P. X. K. (2015). Varying index coefficient models. Journal of the American Statistical Association, 110(509), 341-356. [27] Ma, S., Xu, S. (2015). Semiparametric nonlinear regression for detecting gene and environment interactions. Journal of Statistical Planning and Inference, 156, 31-47. 108 [28] Nyholt, D. R. (2004). A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. The American Journal of Human Genetics, 74(4), 765-769. [29] Ottman, R. (1996). Geneenvironment interaction: Preventive medicine, 25(6), 764. definitions and study designs. [30] Peng, B., & Wang, L. (2015). An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 24(3), 676-694. [31] Rimm, E. B., Giovannucci, E. L., Willett, W. C., Colditz, G. A., Ascherio, A., Rosner, B., & Stampfer, M. J. (1991). Prospective study of alcohol consumption and risk of coronary disease in men. The Lancet, 338(8765), 464-468. [32] Sale, M. M., Smith, S. G., Mychaleckyj, J. C., Keene, K. L., Langefeld, C. D., Leak, T. S., ... & Freedman, B. I. (2007). Variants of the transcription factor 7-like 2 (TCF7L2) gene are associated with type 2 diabetes in an African-American population enriched for nephropathy. Diabetes, 56(10), 2638-2642. [33] Schadt, E. E., Molony, C., Chudin, E., Hao, K., Yang, X., Lum, P. Y., ... & Zhu, J. (2008). Mapping the genetic architecture of gene expression in human liver. PLoS Biol, 6(5), e107. Chicago [34] Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press. [35] Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 6(2), 461-464. [36] Sexton, K., & Hattis, D. (2007). Assessing cumulative health risks from exposure to environmental mixturesthree fundamental questions. Environmental Health Perspectives, 825-832. [37] Tang, Y., Wang, H. J., Zhu, Z., & Song, X. (2012). A unified variable selection approach for varying coefficient models. Statistica Sinica, 601-628. [38] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288. [39] Yang, X., Zhang, B., Molony, C., Chudin, E., Hao, K., Zhu, J., ... & Guengerich, F. P. (2010). Systematic genetic and genomic analysis of cytochrome P450 enzyme activities in human liver. Genome research, 20(8), 1020-1036. [40] Yu, Y., & Ruppert, D. (2002). Penalized spline estimation for partially linear single-index models. Journal of the American Statistical Association, 97(460), 1042-1054. 109 [41] Xia, Y., & Li, W. K. (1999). On the estimation and testing of functional-coefficient linear models. Statistica Sinica, 735-757. [42] Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67. [43] Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2), 894-942. [44] Zhou, X., & You, J. (2004). Wavelet estimation in varying-coefficient partially linear regression models. Statistics & probability letters, 68(1), 91-104. [45] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476), 1418-1429. [46] Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of statistics, 36(4), 1509. [47] Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of computational and graphical statistics, 15(2), 265-286. 110