HIGH DIMENSIONAL STATISTICAL METHODS FOR GENE-ENVIRONMENT INTERACTIONS By Cen Wu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy 2013 ABSTRACT HIGH DIMENSIONAL STATISTICAL METHODS FOR GENE-ENVIRONMENT INTERACTIONS By Cen Wu The genetic influences on complex disease traits generally depends on the joint effects of multiple genetic variants, environment factors, as well as their interplays. Gene×environment (G×E) interactions play vital roles in determining an individual’s disease risk, but the underlying genetic machinery is poorly understood. Traditional analysis assuming linear relationship between genetic and environmental factors, along with their interactions, is commonly pursued under the regression-based framework to examine G×E interactions. This assumption, however, could be violated due to nonlinear responses of genetic variants to environmental stimuli. As an extension to our previous work on continuous traits, we proposed a flexible varying-coefficient model for the detection of nonlinear G×E interaction with binary disease traits. Varying coefficients were approximated by a non-parametric regression function through which one can assess the nonlinear response of genetic factors to environmental changes. A group of statistical tests were proposed to elucidate various mechanisms of G×E interaction. The utility of the proposed method was illustrated via simulation and real data analysis with application to Type 2 Diabetes. It has been increasingly recognized the power of genetic variant set based association analysis over the single variant based approach. We develop a variant set based approach to examine how variants in a genetic system mediated by a common environment factor to affect the phenotype response. The problem can be approached from a high dimensional variable selection perspective. In particular, we can select genetic variants with varying, non-zero constant and zero coefficients, which are corresponding to cases of G×E interactions, no G×E interactions and no genetic effects, correspondingly. The procedure was implemented in a two stage iterative framework via Smoothly Clipped Absolute Deviation (SCAD) penalty. With proper regularity conditions, we can establish the consistency in variable selection and effect separation of our two stage iterative estimator, as well as the optimal convergence rates of the estimates for varying effect. In addition, it can be shown that the estimate of non-zero constant coefficient enjoys the oracle property. The utility of our procedure will be demonstrated through extensive simulation study and real data analysis. Due to the drawback of local quadratic approximations in the aforementioned two-stage framework, the approach is not efficient in handling cases when the dimension p is very large. A group coordinate descent (GCD) based approach was proposed within the framework, which is computationally efficient particularly for high dimensional problems where p > n, because the computational complexity increases only linearly with the number of predictor groups after basis expansion. The advantage of our method is demonstrated through extensive simulation study and real data analysis. Copyright by CEN WU 2013 I dedicate this thesis to my parents, Guizhi Ye and Liyang Wu. v ACKNOWLEDGMENTS First and foremost, I am deeply grateful to my advisor, Dr. Yuehua Cui, for his tremendous assistance, constant support and extreme patience. He is always approachable and ready to help in every facet of life. Dr. Cui led me into the area of statistical genetics and inspired me through numerous discussions. His training in the past five years helped me not only grasp the skills needed for the interdisciplinary research in statistics and biology, but also gradually develop the capability of thinking independently. Without his excellent guidance and advisement, this thesis would not have been possible. I would like to thank Dr. Robin Buell, my Quantitative Biology Ph.D program coadvisor, for her valuable suggestions and guidance that significantly improve my background in biology and broaden my training in this interdisciplinary area. My sincere thanks also goes to my thesis committee members: Dr. Lifeng Wang, Dr. Ping-Shou Zhong and Dr. Qing Lu. My transition to the research area of penalized regressions cannot be so smooth without frequent discussions with Dr. Wang. I am also indebted to Dr. Zhong’s help in picking up skills necessary for the technical proof in this thesis. I appreciate all the insightful comments they made on this work. I would also like to extend my sincere gratitude to all the faculty and staff in our department. In particular, my thanks goes to Dr. Dennis Gilliland for his constant support and encouragement, especially for my job hunting. Statistical consulting is a crucial part of my Ph.D training, and I am grateful to Dr. Robert J. Tempelman and Dr. Sasha Kravchenko for accepting me as a consultant at CANR Statistical Consulting Center. I benefited enormously from our weekly group meetings. I also thank their assistance for my job applications. My special thanks goes to Qi Yan, now at Minnesota, for her tremendous help, especially vi during my preparation for the qualify exams. I am also grateful to the assistance and support from Drs. Gengxin Li, Shaoyu li, Xiaoqin Tang, Wei Wang and Ming Gu. So lucky to have Gengxin and Shaoyu as my academic sisters. They offered me generous help and precious suggestions during each stage of my Ph.D life. I appreciate Tao He, Honglang Wang, Bin Gao and Ran Cao, who are the current members in Dr. Cui’s group, for creating the stimulating research atmosphere. The memories of heated discussions we had in our weekly journal clubs will never fade away. I also want to thank Shaoyu and Tao for providing me a cozy office environment. Last, but definitely not least, I would like to express my deepest gratitude to my parents, Guizhi Ye and Liyang Wu, for their endless love and undying support. I was not able to overcome insurmountable obstacles without them, since they never gave me up under any circumstances. Besides, I thank Xin Tan and Ling Gong who made the winter in Michigan no longer tough for me. I spent so many weekends and holidays with them in the last five years, which turned driving on I-96 one of the most pleasant things for me. vii TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Chapter 1 Introduction . . . . . . . . . . . . . . . . 1.1 A review of basic genetics . . . . . . . . . . . . . 1.2 Genetic mapping of complex traits . . . . . . . . 1.3 Gene-environment (G×E) interactions . . . . . . 1.3.1 Basics of G×E interactions . . . . . . . . 1.3.2 Challenges and issues in G×E interactions 1.4 Objectives and organization of the dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 3 4 4 5 7 A Novel Method For Identifying Nonlinear Gene-Environment Interactions In Case-Control Association Studies . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Statistical method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimating β(X) function . . . . . . . . . . . . . . . . . . . . . . . . . . . . Assessing G×E interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 False positive control . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Power evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 12 15 16 17 18 19 24 39 Chapter 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 Chapter 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 High Dimensional Variable Selection In teractions . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . The proposed variable selection method . . . . . . 3.2.1 The penalized estimation via SCAD . . . . 3.2.2 The computational algorithms . . . . . . . 3.2.3 Selection of tuning parameters . . . . . . . Asymptotic results . . . . . . . . . . . . . . . . . Simulation . . . . . . . . . . . . . . . . . . . . . . Real data analysis . . . . . . . . . . . . . . . . . . Discussion . . . . . . . . . . . . . . . . . . . . . . Technical proofs . . . . . . . . . . . . . . . . . . . 3.7.1 Useful notations and lemmas . . . . . . . . 3.7.2 Proofs of Theorem 1. . . . . . . . . . . . . 3.7.3 Proofs of Theorem 2. . . . . . . . . . . . . viii Gene-Environment In. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 44 46 46 49 51 52 54 63 64 66 66 67 73 Chapter 4 4.1 4.2 4.3 4.4 4.5 A Group Coordinate Descent Approach For High Dimensional Variable Selection In Gene-Environment Interactions . . . . . . 76 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Statistical methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.2.1 Basis expansion and penalized regression . . . . . . . . . . . . . . . . 79 4.2.2 Computational algorithms . . . . . . . . . . . . . . . . . . . . . . . . 81 4.2.2.1 Group LASSO and group adaptive LASSO . . . . . . . . . . 82 4.2.2.2 Group TLP and group SCAD . . . . . . . . . . . . . . . . . 84 4.2.2.3 Convergence of the GCD algorithm . . . . . . . . . . . . . . 86 4.2.3 Selection of tuning parameters . . . . . . . . . . . . . . . . . . . . . . 87 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Chapter 5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 105 BIBLIOGRAPHY . . . . . . . . . . . . . ix . . . . . . . . . . . . . . . . . 108 LIST OF TABLES Table2.1 List of SNPs with p-value < 5E-06 in the HPFS (Male) data set . . 26 Table2.2 List of SNPs with p-value < 5E-06 in the NHS (Female) data set . . 37 Table3.1 Simulation results of Example 3.1 . . . . . . . . . . . . . . . . . . . 57 Table3.2 Simulation results of Example 3.2, d = 10 . . . . . . . . . . . . . . . 61 Table3.3 Simulation results of Example 3.2, d = 50 . . . . . . . . . . . . . . . 62 Table3.4 List of SNPs with p-value < 0.001 from the Jak-STAT signaling pathway 64 Table4.1 Simulation results of Example 4.1, N (0,1) error . . . . . . . . . . . . 91 Table4.2 Simulation results of Example 4.1, t(3) error . . . . . . . . . . . . . 92 Table4.3 Simulation results of Example 4.2, p = 10, n = 200, N (0,1) error . . 94 Table4.4 Simulation results of Example 4.2, p = 10, n = 200, t(3) error . . . 95 Table4.5 Simulation results of Example 4.2, p = 100, n = 200, N (0,1) error . 96 Table4.6 Simulation results of Example 4.2, p = 100, n = 200, t(3) error . . . 97 Table4.7 Simulation results of Example 4.2, p = 200, n = 200, N (0,1) error . 98 Table4.8 Simulation results of Example 4.2, p = 200, n = 200, t(3) error . . . 99 Table4.9 Simulation results of Example 4.2, p = 400, n = 200, N (0,1) error . 100 Table4.10 Simulation results of Example 4.2, p = 400, n = 200, t(3) error . . . 101 Table4.11 List of SNPs with p-value < 5E-06 on Chromosome 9 . . . . . . . . 103 x LIST OF FIGURES Figure 2.1 Figure 2.2 Figure 2.3 Figure 2.4 Figure 2.5 Figure 2.6 Figure 2.7 Figure 2.8 Figure 2.9 Figure 2.10 Figure 2.11 Different models of gene-environment interaction: (A) the interaction of gene and environment in discrete environmental conditions; cases with (B) no G×E interaction; and (C) linear and (D) non-linear G×E interactions. AA, Aa and aa represent three different genotypes in a gene, and environment mediator represents a continuous environment variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 The false positive rate of different models at the 0.05 level.(For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.) . . . 19 The power of different models under different MAFs and sample sizes when data were generated with the VC model. . . . . . . . . . . . . 21 The power of different models under different MAFs and sample sizes when data were generated with the LM model. . . . . . . . . . . . . 22 The power of different models under different MAFs and sample sizes when data were generated with the LM-I model. . . . . . . . . . . . 23 The Manhattan plot of -log10(p-values) for testing H0 : β(X) = 0 when fitting the VC model to the male data set. . . . . . . . . . . . 27 The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when fitting the LM model to the male data set. . . . . . . . . . . . . . . 28 The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0 when fitting the LM-I model to the male data set. . . . . . . . . . . 29 The QQ plot of genome-wide p-values for the male data, when data are fitted with the VC model . . . . . . . . . . . . . . . . . . . . . . 30 The QQ plot of genome-wide p-values for the male data, when data are fitted with the LM model . . . . . . . . . . . . . . . . . . . . . . 31 The QQ plot of genome-wide p-values for the male data, when data are fitted with the LM-I model . . . . . . . . . . . . . . . . . . . . . 31 xi Figure 2.12 The QQ plot of genome-wide p-values for the female data, when data are fitted with the VC model . . . . . . . . . . . . . . . . . . . . . . 32 The QQ plot of genome-wide p-values for the female data, when data are fitted with the LM model . . . . . . . . . . . . . . . . . . . . . . 32 The QQ plot of genome-wide p-values for the female data, when data are fitted with the LM-I model . . . . . . . . . . . . . . . . . . . . . 33 The Manhattan plot of -log10(p-values) for testing H0 : β(X) = 0 when fitting the VC model to the female data set. . . . . . . . . . . 34 The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when fitting the LM model to the female data set. . . . . . . . . . . . . . 35 The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0 when fitting the LM-I model to the female data set. . . . . . . . . . 36 The estimated varying-coefficient function and fitted probability of SNP rs13050325 (upper panel) on chromosome 21 of female population and SNP rs4635456 (lower panel) on chromosome 19 of male population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Figure 3.1 The selection ratio of Example 3.1 . . . . . . . . . . . . . . . . . . . 55 Figure 3.2 The selection ratio of Example 3.2, d = 10 . . . . . . . . . . . . . . 59 Figure 3.3 The selection ratio of Example 3.2, d = 50 . . . . . . . . . . . . . . 60 Figure 2.13 Figure 2.14 Figure 2.15 Figure 2.16 Figure 2.17 Figure 2.18 xii Chapter 1 Introduction 1.1 A review of basic genetics Genes are the basic functional units where the biological characteristics can be passed from parents to offspring. The genes of each cell are arranged in linear order on chromosomes. Majority of the multicellular organisms have duplicate copies of each gene, hence they are diploid. The number of paired chromosomes varies across different species. For instance, Brassica Oleracea has 9 pairs of chromosomes, Zea Mays has 10 pairs, Mus musculus has 20 pairs and human beings have 23 pairs. The entire set of chromosomes form the genome of a particular organism and organelles. Each gene resides on certain site, or locus of the chromosomes. For diploid organisms, the genes corresponding to the locus take two forms (say A and a), called alleles. The three genetic compositions of the two alleles, AA, Aa, and aa, are genotypes. The pair of identical alleles (AA or aa) and different alleles (Aa) are called homozygous and heterozygous respectively. One or several loci may determine the observable characteristics or phenotypic traits, such as eye color, body weight, blood pressure, and so on. The traits that can be continuously measured are defined as quantitative traits, like the aforementioned body weight and blood pressure. The investigation of the genetic basis of a quantitative trait is the major task of quantitative genetics. In classical quantitative genetics, 1 the phenotypic value (P) is due to genetic factors (G) and environment factors (E): P =G+E (1.1) By assuming the independence of the terms in (1.1), the phenotypic variance of a quantitative trait (VP ) can be decomposed into its genetic (VG ) and environmental (VE ) variance components: VP = VG + VE (1.2) Based on the three modes of gene actions(additive, dominance and epistasis), we can further partition VG as VG = VGa + VGd + VGe (1.3) where VGa , VGd , VGe are additive, dominance and epistatic(or interaction) genetic variance respectively. VGd and VGe are referred to as nonadditive genetic variance. In quantitative genetics, heritability characterizes the relative importance of the role genetic variance plays in determining the phenotypic variance. The two types of heritability, broad-sense heritability (H 2 ) and narrow-sense heritability (h2 ) , are defined separately as: H2 = VG VG + VE (1.4) h2 = VGa VG + VE (1.5) and The degree of overall genetic control over the quantitative trait can be measured by the two heritability parameters H 2 and h2 [1, 2]. 2 1.2 Genetic mapping of complex traits The past decades witnessed waves of breakthroughs in genetic mapping of complex diseases (traits). The family-based linkage study prevails as conventional means of locating disease genes. Transmission disequilibrium test (TDT)[3], the most popular association test in family based study, was proposed to test the linkage between a genetic marker and disease susceptibility. However, large pedigrees are needed for fine-mapping in linkage study, so the utility of the study is confined when such information is not available. While the linkage study predominated in discovering disease variants with major effects, the population based association study gained advantage in detecting genes with modest effects [4], which is of particular significance for complex human diseases[5]. The rapid progress in the high-throughput genotyping technologies made possible the large scale, highly dense genome-wide association studies (GWAS) for millions of genetic variants, such as single-nucleotide polymorphism (SNP) across the entire human genome[6]. The GWAS has thus identified a large number of susceptibility variants associated with the complex human diseases, including asthma[7], breast cancer [8, 9], coronary heart disease [10, 11] and Type 2 Diabetes[12, 13]. A detailed list is given in [14]. Despite the huge success achieved in GWAS, the disease etiology is still not clearly elucidated since a substantial proportion of the heritability remains obscure. Therefore, there are pressing needs to investigate the part of unexplained heritability. There are a number of potential sources of missing heritability, such as rare variants, structural variation, gene-gene (G×G) interactions and gene-environment (G×E) interactions [15]. For example, the ‘Common Disease, Common Variants’ hypothesis is commonly adopted in GWAS, assuming that majority of the genetic risk of common complex dis- 3 eases can be attributed, at least partially, to common disease variants which is of more than 5% minor allele frequency (MAF) [16, 17]. The rare (MAF<0.5%) and low frequency (0.5%0.05), which suggests that β(X) was a constant and there was no G×E interaction for these SNPs. Hence the LM model assuming no interaction gave the strongest signals. The bottom SNP in the table had the strongest signal when data were fitted with the LM-I model since we rejected constant coefficient 27 Figure 2.7: The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when fitting the LM model to the male data set. (P CON=0.0108) but failed to reject linear coefficient (P LIN=0.6792). Among the SNPs listed in the table, some have been reported in other studies. For example, transcription factor 7-like 2 (TCF7L2) is an intensively examined gene associated with a broad categories of diseases, including Type 2 Diabetes. The causal genetic association between SNPs of the gene and the Type 2 Diabetes was first reported in Grant et al. [59] and 28 Figure 2.8: The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0 when fitting the LM-I model to the male data set. was subsequently replicated in many ethnic groups (Jin and Liu [60]). As the SNPs in this gene are not sensitive to obesity, it is not surprise that they can be identified in other studies by using methods assuming a linear relationship. But our method identified three more that show nonlinear G×E relationship, even though they did not pass the genome-wide Bonferroni threshold. We also did QQ plot of the p-values for data fitted with the three models. The 29 p-values are quite uniformly distributed and only a few showing departure from the expected values (see the QQ plot). This indicates that the models have no serious inflation of false positives and the strong signals are likely to be true. Figure 2.9: The QQ plot of genome-wide p-values for the male data, when data are fitted with the VC model 30 Figure 2.10: The QQ plot of genome-wide p-values for the male data, when data are fitted with the LM model Figure 2.11: The QQ plot of genome-wide p-values for the male data, when data are fitted with the LM-I model 31 Figure 2.12: The QQ plot of genome-wide p-values for the female data, when data are fitted with the VC model Figure 2.13: The QQ plot of genome-wide p-values for the female data, when data are fitted with the LM model 32 Figure 2.14: The QQ plot of genome-wide p-values for the female data, when data are fitted with the LM-I model 33 Figure 2.15: The Manhattan plot of -log10(p-values) for testing H0 : β(X) = 0 when fitting the VC model to the female data set. Figures 2.15–2.17 showed the Manhattan plot of the -log10(p-values) for the female data. Even though no SNPs passed the genome-wide Bonferroni threshold, we did see stronger signals fitted by the VC model. Those SNPs that passed the suggestive threshold are listed in Table 2.2. Again, gene TCF7L2 does not show sign of sensitivity to obesity to affect 34 Figure 2.16: The Manhattan plot of -log10(p-values) for testing H0 : β = 0 when fitting the LM model to the female data set. T2D risk. Gene GLI2 shows sign of interaction with obese to affect T2D risk. Two SNPs in gene NRIP1 located on chromosome 21 show sign of nonlinear interaction with obese to affect T2D risk. In comparison to the male data, it is clear that SNP effects are stronger in the male population than in the female population. Moreover, the genetic effects in females are relatively more sensitive to obesity to affect T2D risk. In summary, strong sex-specific 35 Figure 2.17: The Manhattan plot of -log10(p-values) for testing H0 : β1 = β2 = 0 when fitting the LM-I model to the female data set. genetic effects were observed, for example, those SNPs on chromosome 2, 3, 4, and 21. To further demonstrate the utility of the method, we plotted the dynamic effect of SNP rs13050325 on chromosome 21 from the female data (upper panel) and SNP rs4635456 on chromosome 19 from the male data (lower panel). The two curves on the left side of Figure 2.18 showed the estimated dynamic genetic effect as a function of BMI fitted with the B36 Table 2.2: List of SNPs with p-value < 5E-06 in the NHS (Female) data set SNP ID GeneName Chr P VC P CON P LIN fitted with VC model rs13050325 NRIP1 21 3.79E-07 3.77E-06 0.0016 rs2331061 LANCL2 7 8.60E-07 1.30E-06 5.26E-07 rs1466042 GLI2 2 1.10E-06 3.48E-06 0.0389 rs11145373 VPS13A 9 2.63E-06 8.41E-04 2.56E-04 rs3775043 UNC5C 4 2.95E-06 0.0018 6.96E-04 rs12627409 NRIP1 21 4.00E-06 2.38E-05 0.0441 fitted with LM model rs10519107 RORA 15 4.84E-05 0.8381 rs809736 RORA 15 4.96E-05 0.8145 rs4506565 TCF7L2 10 4.35E-05 0.4953 rs7901695 TCF7L2 10 4.42E-05 0.4895 rs12255372 TCF7L2 10 1.2E-04 0.5576 rs4368343 Unknown 2 1.88E-04 0.7537 fitted with LMI model rs2677528 GLI2 2 2.63E-06 2.62E-05 0.0732 rs7978946 Unknown 12 3.09E-05 0.0117 0.6078 rs887370 TSHZ2 20 3.63E-05 1.45E-05 0.5868 SNP ID GeneName Chr P LM P LMI PI fitted with VC model rs13050325 NRIP1 21 0.0062 1.23E-05 1.0E-04 rs2331061 LANCL2 7 0.0703 0.1456 0.4471 rs1466042 GLI2 2 0.0241 1.61E-06 3.37E-06 rs11145373 VPS13A 9 1.27E-04 6.2E-04 0.7679 rs3775043 UNC5C 4 6.11E-05 2.56E-04 0.4929 rs12627409 NRIP1 21 0.0119 5.60E-06 2.38E-05 fitted with LM model rs10519107 RORA 15 8.52E-07 3.72E-06 0.3802 rs809736 RORA 15 9.22E-07 5.84E-06 0.8961 rs4506565 TCF7L2 10 1.69E-06 4.66E-06 0.2018 rs7901695 TCF7L2 10 1.75E-06 4.30E-06 0.1729 rs12255372 TCF7L2 10 4.47E-06 9.73E-06 0.1543 rs4368343 Unknown 2 4.75E-06 2.53E-05 0.6320 fitted with LMI model rs2677528 GLI2 2 0.0064 2.16E-06 1.55E-05 rs7978946 Unknown 12 1.04E-04 3.62E-06 0.0016 rs887370 TSHZ2 20 0.4492 4.46E-06 9.30E-07 37 spline function. We can see clear nonlinear genetic effects over BMI, which indicates nonlinear interaction between BMI and the variants. The figures in the right panel show the plot of fitted probabilities against individual BMI values corresponding to different genotypes. We coded the heterozygote as 0 in our model. This implies that the green curves in the two plots correspond to the mean fitted probability when G = 0. In general, the risk of T2D increases as BMI increases. This is consistent with our prior knowledge that the disease prevalence is strongly associated with body weight (McCarthy[61]). For SNP rs13050325 on chromosome 21, the allele frequency for the minor allele G is 0.2587. For SNP rs4635456 on chromosome 19, the allele frequency for the minor allele G is 0.3771. In both cases, the overall trend for T2D risk for the baseline (corresponding to genotype AG) increased as BMI level increases (green curve). However, individuals carrying AA genotype had much higher chance to develop T2D than those carrying AG or GG genotype. Man with genotype AA had the lowest risk of conferring T2D susceptibility when BMI level was below 28 in male and below 33 in female. After the transition points, the AA genotype triggers larger effect, resulting in higher risk of T2D. The association signals for both LM and LM-I model are weaker than the one fitted with the VC model, leading to potential mis-identification of these variants. The results offered personalized preventive suggestions based upon our findings fitted with the VC model. For example, man carrying genotype AA at this SNP locus should pay more attention to control their body weight if their BMI level is above 28 to avoid the risk of T2D. 38 1 Fitted Probability 1 ˆ β(X) 0.5 0 −0.5 −1 20 30 40 0.2 0 Fitted Probability −1 30 20 40 0.8 30 40 50 BMI 1 −0.5 20 AA AG GG 0.4 50 0 ˆ β(X) 0.6 X(BMI) 0.5 −1.5 0.8 AA AG GG 0.6 0.4 0.2 X(BMI) 20 30 40 BMI Figure 2.18: The estimated varying-coefficient function and fitted probability of SNP rs13050325 (upper panel) on chromosome 21 of female population and SNP rs4635456 (lower panel) on chromosome 19 of male population 2.7 Discussion It is broadly recognized that naturally occurring variations in most complex disease traits have a genetic basis. However, the degree of variability is believed to have a strong environmental component in addition to genetic causes for many disease traits such as obesity and Type 2 Diabetes (Qi and Cho [62]). Recent efforts on epigenetics study reveals the importance of epigenetic modification on complex diseases (Liu et al. [63]). These epige- 39 netic changes involve major chromatin remodeling processes such as DNA methylation and histone modification. Being the environmentally driven plasticity at the DNA level, these structural changes at the DNA level reveal the interplay of gene-environment interaction in the regulation of phenotype, which is increasingly recognized as the epigenetic basis of many complex diseases (Liu et al. [63]). Large efforts have been devoted to the exploration of epigenetic mechanisms for a better understanding of the molecular machinery underlying complex diseases (Feinberg and Irizarry [64]). However, how environment mediates epigenetic changes to affect phenotypic plasticity is still poorly understood, largely due to the lack of powerful statistical methods to dissect this complicated process. In this chapter, we proposed a novel statistical method by modeling the genotypic effect on disease risk as a dynamic function of environment mediators. Our model is built upon well-studied statistical varying-coefficient model implemented with the nonparametric spline technique to estimate the varying coefficients. The model extends out previously developed method on continuous traits to a case-control population-based design. Simulation studies show dramatically improved power when the underlying genetic penetrance behaves nonlinearly under certain environmental stimulus. Our model can capture the dynamic changes of the gene functions over environmental changes, hence has particular power to tackle longstanding genetic questions regarding gene action and phenotypic plasticity (Feinberg [44]). Our simulation studies indicate that model mis-specification is an issue in G×E study. The power to detect genetic signals is heavily dependent upon the models to fit the data. Simple models are always the first choice due to their simplicity to interpret. However, if they cannot capture the underlying functional mechanism, they suffer tremendously from power loss. For example, if the true genetic effect does vary nonlinear across environmental changes, fitting a simple linear model would loss power (Figure 2.3). On the other hand, 40 complex models always suffer from large degrees of freedom for testing. We proposed a sequential testing procedure to assess if a simpler model fits the data better. The real data analysis confirms that this strategy works. For example, when testing constant shows that there is no G×E interaction, the model with linear predictor and without interaction term gives the smallest testing p-values (see Table 2.1 and Table 2.2). In real data analysis, one should always start by assessing constant coefficient first, then move to test linear or varying coefficients. We applied our model to two Type 2 Diabetes data sets. Cornelis et al. [65] evaluated seven statistical models to dissect G×E interactions using the same data sets. Both Cornelis et al. [65] and our work treated BMI as the environmental factor. Cornelis et al. [65] claimed that specifying BMI as a continuous covariate will lead to inflated type 1 error, which has consequence in detecting increased number of false positives as the true signal. They converted the continuous environment factor BMI into a binary variable prior to further comparisons of all the 7 models. However, this conversion will result in information loss, which might be the reason that no G×E interaction signals passed the genome-wide significance levels for all the seven models in both data sets in their analysis (Cornelis et al. [65]). In our approach, we allowed the nonlinear effect of BMI on Type 2 Diabetes (modeled by function α(X)) rather than treating it as a linear function (i.e., α0 + α1 X). This greatly alleviated the type 1 error inflation compared to a model fitted with a linear function in BMI (data not shown). In our analysis, several signals reached the genome-wide significance level, which is a piece of convincing evidence for keeping the continuous BMI measure as an environmental variable. In the real data analysis, we observed strong sex-specific variants associated with T2D. There were not much overlap between genes identified in both data sets except for SNPs 41 in gene TCF7L2. Identification of SNPs in gene TCF7L2 on the pathogenesis of Type 2 Diabetes has been successfully replicated from different populations (Grant et al. [59]). This information indicates the robustness of our model. In addition, we observed stronger signals in the male data evidenced by seven SNPs from three genes reaching the genomewide significance threshold (cutoff=7.9E-8), as shown in Table (2.1). However, we observed stronger BMI×G interaction to affect T2D in females than in males evidenced by more nonlinear G×E interaction in the female data set (Table 2.2). We could miss these signals if we only focused on linear predictor models. In a recent investigation of a Italian population, Vaccaro et al. [66] found a significantly higher average BMI levels in diabetes women. So possibly certain genes may be sensitive to high BMI level to increase T2D risk. Our model provides a testable framework to identify the underlying genetic blueprint sensitive to obese changes to affect T2D risk. The results obtained by our model can be applied to pathway or gene-set enrichment analysis to identify potential sex-specific pathways for T2D. In this chapter, we generalized the VC model for continuous quantitative response to the case-control binary response. There is several ongoing work worthy of further investigation. First, the model can be easily extended to other types of phenotype data, such as count data or survival data by applying different link functions. Second, more replication studies are needed by applying our approach to Type 2 Diabetes of different ethnic groups to further confirm the robustness of the method. Third, it is worth noting that the interesting result reported by Perry et al. [52] that stratification on the Type 2 Diabetes patients based on BMI might help enrich the significance of potential susceptibility loci. We could also try to carry out analysis to test if this hypothesis leads to any new discoveries based on the VC model. Finally, our model can easily incorporate population stratification (PS) effect by first doing a principle component analysis using software such as EIGENSTRAT (Price et 42 al. [67]), then incorporate those PCs as covariates into the model to account for the effect of PS. 43 Chapter 3 High Dimensional Variable Selection In Gene-Environment Interactions 3.1 Introduction Gene-environment (G×E) interaction has been traditionally examined by assessing genetic responses corresponding to various environmental stimuli, which provides novel insight in elucidating the genetic basis of complex diseases, because the disease risk is not only contingent on genetic risk factors, but also on the environmental pressures, as well as the interplay between them. The environmental pressure could be either discrete or continuous. When it comes to a G×E interaction study related to asthma, the environmental factor could be discrete, such as smoking status (smoking v.s. non-smoking). A much more clear picture on the interaction will be tangible if the environmental factor is evaluated on a continuous scale, since we can trace the varying patterns of genetic effect responsive to changes in environment. Conventional statistical modelling of G×E interaction often requires a linear relationship assumption between genetic and environmental factors, which could be violated in practice, as pointed out in Ma et al [37] and Wu and Cui [68]. Consequently, a varying coefficient (VC) model framework, together with a sequence of goodness-of-fit tests, were proposed in 44 [37] and [68] for continuous and binary responses respectively, to track down the dynamic features of genetic responses to environmental pressures. Because of the particular power and flexibility of VC models to capture the variations in regression coefficients, the framework demonstrated significant advantage over the conventional methods especially in the presence of non-linear G×E interaction Unlike the predominant single genetic variant based approaches dissecting G×E interactions, such as the parametric methods in Guo [34], non-parametric methods in Ma et al [37] and Wu and Cui [68], and semi-parametric methods in Chatterjee et al [69] and Maity et al [70], we propose a variant set based framework to investigate how variants in a set are mediated by a common environment factor to affect the phenotypic response, since it has been increasing acknowledged the merit of set based association analysis, such as in the gene-centric analysis in Cui et al [39] and Wu and Cui [40], gene-set analysis in Schaid et al [71] and Efron and Tibshirani [72], as well as the pathway based analysis in Wang et al [38]. When the number of variants within the genetic system is large, the problem can be approached from the entry point of high dimensional variable selection. In particular, we can select genetic variants with varying, non-zero constant and zero coefficients, which are corresponding to scenarios of G×E interactions, no G×E interactions and no genetic effects, respectively. To the best of our knowledge, this is the first time that the problem is tackled from the angle of high dimension variable, on the contrary to the popular single genetic variant based approaches coined in a hypothesis testing framework. Through B spline basis expansion, the varying coefficient function can be separated into constant and varying portions, respectively. Then the distinction of the 3 effects could be achieved by penalizing the 2 portions in a two-stage iterative framework, as shown in Tang et al.[73]. Though the asymptotic properties of the two stage estimator with adaptive 45 LASSO penalty were established in [73], the finite sample performance still has a large margin to improve. Therefore we proposed a Smoothly Clipped Absolute Deviation (SCAD) based approach to examine the separation of varying, non-zero constant and zero coefficient functions. Our approach has significantly improved percentages of choosing the exact true model and reduced error in parameter estimation. Assuming suitable regularity conditions, we can establish the consistency in variable selection and effect separation of our estimator, as well as the optimal convergence rates of the estimates for varying effect. Furthermore, it can be shown that the estimate of non-zero constant coefficient enjoys the oracle property, that is, the asymptotic distribution of the non-zero constant coefficient function is the same as that when the true model is known in priori. In this chapter, we describe the penalized least square estimation procedure via basis expansion and SCAD penalty, as well as the computational algorithms. Next we present the theoretical results including consistency in variable selection and oracle property. The merit of the proposed approaches were demonstrated through extensive simulation study and real data analysis. Discussions will be given at the end of the chapter. We relegate technical proofs to the Appendix. 3.2 3.2.1 The proposed variable selection method The penalized estimation via SCAD Let (Xi , Yi , Zi ), i = 1, . . . , n be independent and identically distributed (i.i.d.) random vectors, then the varying coefficient (VC) model, proposed by Hastie and Tibshirani [74], 46 has the form d Yi = βj (Zi )Xij + εi (3.1) j=0 where Xij is the jth component of (d+1)-dimensional vector Xi with the first component Xi0 being 1, βj (·)’s are unknown varying-coefficient functions, Zi ’s are the scalar index variable, and εi is the random error such that E(ε|X, Z) = 0 and V ar(ε|X, Z) = σ 2 < ∞ . The smooth functions {βj (·)}d in (3.1) can be approximated by polynomial splines. j=0 Without loss of generality, suppose that Z ∈ [0, 1]. Let wk be a partition of the interval [0,1], with kn uniform interior knots wk = {0 = wk,0 < wk,1 < . . . < wk,kn < wk,kn +1 = 1} Let Fn be the collection of functions on [0,1] satisfying (3.1) the function is a polynomial of degree p or less on subintervals Is = [wk,s , wk,s+1 ), s = 0, . . . , Nn − 1 and INn = [wj,Nn , wj,Nn +1 ). (2) the functions are p − 1 times continuous differentiable on [0,1]. L j ¯ ¯ Let B(·) = {Bjl (·)}l=1 be a set of normalized B spline basis of Fn . Then for j = 0, . . . , d, the VC functions can be approximated by basis functions βj (Z) ≈ Lj ¯ ¯ l=1 γjl Bjl (Z), where Lj is the number of basis functions in approximating the functions βj (Z). By changing of equivalent basis, the basis expansion can be reexpressed as Lj βj (·) ≈ . ˜T γjl Bjl (·) = γj1 + Bj (·)γj,∗ l=1 . T ˜ the spline coefficient vector γj = (γj,1 , γj∗ )T , and Bj (·) = (Bj2 (·), . . . , BjLj (·))T . γj,1 and γj∗ correspond to the constant and varying part of the coefficient function respectively. We treat γj∗ as a group. If γj∗ 2 =0, then the jth predictor only has a non-zero constant effect 47 and moreover, if γj,1 =0, then the predictor is redundant. To carry out variable selection separating the varying, non-zero constant, and zero effects, we minimize the penalized least square function Q(γ) = n 1 n  d Yi − i=1 2 L d γjl Xij Bjl (Zi ) + j=0 l=1 j=1 pλ1 ( γj∗ 2 ) d + j=1 (3.2) pλ2 (|γj1 |)I( γj∗ 2 = 0) where λ1 and λ2 are the penalization parameters, pλ (·) is the SCAD penalty function, defined as    λx if 0 ≤ x ≤ λ     (x2 −2aλx+λ2 ) pλ (x) = if λ < x ≤ aλ − 2(a−1)     (a+1)λ2   if x > aλ 2 (3.3) To express (3.2) by vectors and matrices, we redefine d Q(γ) =(Y − U γ)T (Y − U γ) + n pλ1 ( γj∗ 2 ) j=1 d (3.4) pλ2 (|γj1 |)I( γj∗ 2 = 0) +n j=1 T T where Y = (Y1 , . . . , Yn )T , γ = (γ0 , . . . , γd )T , Ui = (Xi0 B(Zi )T , . . . , Xid B(Zi )T )T , and T T U = (U1 , . . . , Un )T . The function of the 2nd term of Q(γ) is to separate the varying and constant effects by penalizing the L2 norm of the varying part of the coefficient functional. The indicator functions in the 3rd term helps to penalize the variables of the constant effects. Both γj,1 and γj,∗ will be shrunk to zero if the predictor Xj has no effect. 48 3.2.2 The computational algorithms The SCAD penalty function is singular at the origin, and do not have continuous 2nd order derivatives, therefore the regular gradient-based optimization cannot be applied. In this section, we develop an iterative two-stage algorithm to minimize the penalized loss function using local quadratic approximation to the SCAD penalty. Following Fan and Li [75], in a neighbourhood of a given positive x0 ∈ R+ , p (x0 ) 2 pλ (x) ≈ pλ (x0 ) + λ (x − x2 ) 0 2x0 where pλ (x) = λ{I(x (aλ−x) + λ) + (a−1)λ I(x > λ)} for a=3.7 and x >0. Here we use a similar quadratic approximation by substituting x with γj∗ 2 and |γk1 | in LQA, for k = 0, ..., d. Therefore we have pλ ( γj∗ 2 ) ≈ pλ ( pλ ( 0 γj∗ 2 ) + 0 γj∗ 2 ) 0 2 γj∗ 2 0 ( γj∗ 2 − γj∗ 2 ) 2 2 (3.5) and 0 pλ (|γj,1 |) ≈ pλ (|γj,1 |) + 0 pλ (|γj,1 |) 0 2|γj,1 | 0 (|γj,1 |2 − |γj,1 |2 ) (3.6) The sets of predictors with varying, non-zero constant, and zero effects are termed as V, C and Z respectively. We implement the iterative algorithm in the following two-stage procedure. At stage 1, using the LQA (3.5) and dropping the irrelevant constant terms, we minimize Q1 (γ) = (Y − U γ)T (Y − U γ) + n T γ Ωλ1 (γ0 )γ 2 (3.7) where the initial spline vector γ0 is the unpenalized estimator, Ωλ1 (γ0 )=diag{Ω0 , Ω1 , . . . , Ωd }, 49 where Ω0 = 0L , Ωj = 0, 0 0 pT ( γj∗ 2 ) pT ( γj∗ 2 ) λ1 λ1 ,..., 0 0 γj∗ 2 γj∗ 2 for j = 1, . . . , d. Hence the estiL mator can be iteratively obtained as (m) ˆ γVC = U T U + n (m−1) −1 T Ωλ1 (ˆVC γ ) U Y 2 (3.8) Suppose that all the predictors are in V at the beginning. The jth predictor will be moved to C if γj∗ 2 =0, otherwise it will stay in V. ˆ VC At stage 2, using the LQA (3.6) and dropping the irrelevant constant terms, we minimize the penalized loss only for the predictors in C: Q2 (γ) = (Y − U γ)T (Y − U γ) + n T γ Ωλ2 (ˆVC )γ γ 2 (3.9) where Ωλ2 (ˆVC )=diag{Ω0 , Ω1 , . . . , Ωd } with Ω0 = 0L , γ Ωj = VC pT (|ˆj,1 |) λ2 γ I( |ˆj,1 | γ VC ˆ VC γj∗ L2 = 0), 0, . . . , 0 . The estimator can be iteratively obtained L as (m) ˆ γCZ = U T U + n (m−1) −1 T Ωλ2 (ˆCZ ) γ U Y 2 (3.10) If the jth predictor is in C, then it will be moved to Z if |ˆk,1 |=0, otherwise it stays in C. γ CZ ˆ We can obtain the estimator γ at convergence from the iterative procedure between ˆ the above two stages, and the estimated coefficient function in (3.1) as βj (z) = B T (z)ˆj . γ ˆ ˆ βj (z) will be a varying function in z, non-zero constant and zero if γj is in V, C and Z correspondingly. 50 3.2.3 Selection of tuning parameters In this section, we choose the tuning parameters N ,p, λ1 and λ2 from a data driven procedure. N is the number of interior knots uniformly spaced on [0,1], p is the degree of the spline basis. here p and N control the smoothness of the coefficient functions, while λ1 and λ2 determine the threshold for variable selection. At the beginning, we use BIC in Schwarz [76] to choose N and p. The range for N is [max( 1 (2p+3) 0.5n , 1), 1 (2p+3) 1.5n ], where x denotes the integer part of x. The optimal pair of N and p can be achieved via a two-dimensional grid search, according to the following criterion: BICN,p = log(RSSN,p ) + (N + p + 1) log(n) n ˆ ˆ ˆ where RSSN,p = (Y − U γ )T (Y − U γ )/n, γ = (ˆ0 , 0T , . . . , 0T )T . Conditional on the γT selected N and p, λ1 is the minimizer of BICλ1 = log(RSSλ ) + 1 dfλ1 n log(n) ˆ ˆ ˆ where RSSλ1 = (Y − U γλ1 )T (Y − U γλ1 )/n, γλ1 is the minimizer of (3.7), and dfλ1 is the effective degree of freedom, defined as the total number of predictors in V and C. ˆ Conditional on γλ1 , λ2 is the minimizer of BICλ2 = log(RSSλ ) + 2 dfλ2 n log(n) ˆ ˆ ˆ where RSSλ2 = (Y − U γλ2 )T (Y − U γλ2 )/n, , γλ2 is the minimizer of (3.8), and dfλ2 is the effective degree of freedom, defined similarly as dfλ1 . 51 3.3 Asymptotic results Here we establish the asymptotic properties of the penalized least square estimators. Without loss of generality, we assume there are v varying coefficients as βj (·) ≡ βj (z),j = 1, . . . , v, (c − v) non-zero constant coefficients as βj (·) ≡ βj > 0, j = v + 1, . . . , c, and (d − c) zero coefficients as βj (·) ≡ 0, j = (c+1), . . . , d. Our asymptotic results are based on the following assumptions. (A1) Let Hr be the collection of all functions on the compact support [0,1] such that the r1 th order derivatives of the functions are H¨lder of order b with r = r1 + r2 , i.e., o |hr1 (z1 ) − hr1 (z2 )| ≤ C0 |z1 − z2 |r2 where 0 ≤ z1 , z2 ≤ 1 and C0 is a finite positive constant. 3 Then βj (z) ∈ Hr , j = 0, 1, . . . , v, for some r ≥ 2 . (A2) The density function of the index variable Z, f (z), is continuous and bounded away from 0 and infinity on [0, 1], i.e., there exist finite positive constants C1 and C2 such that C1 ≤ f (z) ≤ C2 for all z ∈ [0, 1]. (A3) Let λ0 ≤ . . . ≤ λd be the eigenvalues of E[XXT |Z = z]. Then λj (k = 0, . . . , d) are uniformly bounded away from 0 and infinity in probability. In addition, the random design vector are bounded in probability. (A4) For wj , the partition of the compact interval [0,1] defined as {0 = wj,0 < wj,1 < . . . < wj,kn < wj,kn +1 = 1}, j = 0, . . . , d, there exists finite positive constant C3 such that max(wj,k+1 − wj,k , k = 0, . . . , kn ) ≤ C3 min(wj,k+1 − wj,k , k = 0, . . . , kn ) (A5) The tuning parameters satisfy 1 2 max{λ , λ } kn 1 2 1 −1 → 0 and n 2 kn min{λ1 , λ2 } → ∞. (A6) maxj {|pλ (|γj∗ |)| : γj∗ = 0} → 0 as n → ∞ and maxj {|pλ (|γj1 |)| : γj1 = 0} → 0 1 2 52 as n → ∞ (A7) lim infn→∞ lim infθ→0+ λ−1 pλ (θ) > 0 and lim infn→∞ lim infθ→0+ λ−1 pλ (θ) > 0 1 2 1 2 The above assumptions are commonly used in literature of polynomial splines and variable selections. The assumption similar to (A1) could be found in Kim [77] and Tang et al [73]. (A1) guarantees certain degrees of smoothness of the true coefficient function in order to improve goodness of approximation. (A2) and (A3) are similar to those in Huang et al [50, 51] and Wang et al [78]. (A4) suggests that the knot sequence is quasi-uniform on [0,1], by Schumaker [79]. (A5-A7) are conditions on tuning parameters, of which (A5) could be found in Tang et al [73]; (A6) and (A7) are similar to those in Fan and Li [75] and Wang et al [78]. 1 Theorem 1. Under the assumptions (A1-A7) and suppose kn = Op n 2r+1 , then we have ˆ ˆ (1) βj (z) are nonzero constant, j = v + 1, . . . , c and βj (z) = 0, j = c + 1, . . . , d, with probability approaching 1; −r ˆ (2) βj − βj 2 = Op (n 2r+1 ), j = 0, . . . , v. Denote β ∗ = (βv+1 , . . . , βc )T as the vector of true nonzero constant coefficients. 1 Theorem 2. Under the assumptions (A1-A7) and suppose kn = Op (n 2r+1 ), then with n → ∞, √ d ˆ n(β ∗ − β ∗ ) − − → N (0, σ 2 Σ−1 ) −− where Σ is defined in the Appendix. 53 3.4 Simulation The performance of our proposed approach is demonstrated through extensive simulation study in this section. We use the percentage of choosing the true model out of total R replicates, or oracle percentage, to evaluate the accuracy of variable selection by identifying varying, non-zero constant and zero effects. The precision of estimation is assessed by integrated mean squared error (IMSE). ˆ(r) Let βj be the estimator of a nonparametric function βj in the rth (1 r R) replica- n grid ˆ(r) tion, and {zm }m=1 be the grid points where βj is evaluated. We use the integrated mean R 1 r=1 ngrid 1 ˆ ˆ squared error (IMSE) of βk (x), defined as IMSE(βj (z))= R ngrid (r) ˆ m=1 {βk (zm ) − βj (zm )}2 , to evaluate the estimation accuracy of coefficient βj , and the total integrated mean squared error (IMSE) of all the d coefficients (TIMSE), defined as TIMSE= d ˆ j=1 βj (z), is ˆ used to evaluate the overall estimation accuracy. Note that IMSE(βj ) will be reduced to ˆ ˆ MSE(βj ) when βj is a constant. Example 3.1. We simulate data from the following VC model d Yi = β0 (Zi ) + βj (Zi )Xij + εi j=1 where the index variable Zi ∼Uniform(0,1), and the predictors Xi are generated from a multivariate normal distribution with mean 0 and Cov(Zij , Z ij ) = 0.5|j−j | for 0 ≤ j, j ≤ d. j The performance is evaluated under both d=10 and 50. Xi , j = 0, 1, 2 are of varying j effects, Xi , j = 3, 4 are of non-zero constant effects, and the rest variables are redundant. The random error εi were generated from standard normal distributions and t distribution with 3 degrees of freedom respectively. The coefficients were set as: β0 (z) = sin(2πz), β1 (z) = 2 − 3 cos{(6z − 5)π/3}, β2 (z) = 3(2z − 1)3 , β3 (z) = 2, β4 (z) = 2.5, and βj (z) = 0 54 for j = 5, . . . , 10. The results are listed in Figure 3.1 and Table 3.1. N(0,1) error t(3) error 1 0.8 0.6 0.6 0.4 0.4 0.2 Selection Ratio 0.8 0.2 0 1 1 2 3 4 5 6 7 8 9 10 11 1 0.8 Selection Ratio ALASSO SCAD 0.6 0.4 0.4 0.2 ALASSO SCAD 0.8 0.6 1 2 3 4 5 6 7 8 9 10 11 0.2 0 10 20 30 0 40 Predictors 10 20 30 Predictors Figure 3.1: The selection ratio of Example 3.1 55 40 Figure 3.1 shows the selection ratio for predictors under different groups of d and error distributions. The height of bars on the top panel of Figure 3.1 denote the selection ratio for true positives for the first 5 predictors, and false positives for the rest predictors. Under both standard normal and and t(3) error, compared with the method based on adaptive LASSO, our method is capable of correctly identifying significant effect with high percentages, and has kept a very small percentages of choosing false positive predictors. In addition, our method has relative stable performance in terms of correct selection ratio when dimension grows. The oracle percentage and parameter estimation results are summarized in Table 3.1. In Tang et al [73], MSE was computed for constant coefficients only when the corresponding predictor is chosen with non-zero constant effect. To reflect the overall estimation precision, we compute IMSEs for all predictors, including β4 and β5 . When βj (j = 4, 5) is selected as non-zero constant, IMSE reduces to MSE. The IMSEs will be calculated if βj (j = 4, 5) incorrectly identified as varying. In all the pairs of d and error distribution type, the SCAD approach demonstrates superior performance over the adaptive LASSO approach. Example 3.2. Now we consider the simulation in genetics settings from the following VC model d Yi = β0 (Zi ) + βj (Zi )Xij + εi j=1 where the SNP Xi was coded with 3 categories (1,0,-1) for genotypes (AA,Aa,aa) respectively. We simulate the SNP genotype data based on the pairwise linkage disequilibrium(LD) structure. Suppose the two risk alleles A and B of two adjacent SNPs have the minor allele frequencies (MAFs) qA and qB , respectively, with LD denoted as δ. Then the frequencies of four haplotypes can be expressed as qab = (1 − qA )(1 − qB ) + δ, qAb = qA (1 − qB ) − δ, 56 Table 3.1: Simulation results of Example 3.1 d=10 N (0,1) error t(3) error SCAD ALASSO Oracle SCAD ALASSO Oracle Perc. 0.972 0.82 1 0.92 0.315 IMSE β0 (u) 0.0214 0.0243 0.0216 0.0398 0.0448 β1 (u) 0.0902 0.0930 0.0951 0.1166 0.1254 β2 (u) 0.0365 0.1018 0.0431 0.0764 0.2211 0.6248 β3 (u) 0.0122 0.2405 0.0032 0.0753 β4 (u) 0.0045 0.0405 0.0031 0.0183 0.1713 TIMSE 0.1648 0.5075 0.1661 0.3282 1.3000 d=50 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle 1 0.1929 0.3392 0.5859 0.1775 0.1100 0.4017 0.945 0.635 1 0.8 0.012 1 0.0221 0.0878 0.0404 0.0478 0.0101 0.2086 0.0230 0.0896 0.0551 0.0776 0.0165 0.2966 0.0219 0.0927 0.0428 0.0027 0.0029 0.1631 0.0431 0.1230 0.1042 0.1727 0.0239 0.5146 0.0612 0.1477 0.0969 0.0771 0.0608 2.4926 0.0426 0.1253 0.0751 0.0105 0.0083 0.2619 qaB = (1 − qA )qB − δ, and qAB = pA pB + δ. With the Hardy-Weinberg equilibrium assumption, the SNP genotype at locus A can be simulated assuming a multi-nomial distribution with frequencies p2 ,2pA (1 − pA ) and (1 − pA )2 for genotypes (AA,Aa,aa) correspondingly. A We can subsequently generate the SNP2 genotypes conditional on SNP1 can be simulated based on the conditional probability matrix in Cui et al. [39]. The non-zero coefficients of the model are the same as those in Example 1. The simulation with sample size 500 were performed 500 replicates. Figure 3.2 show the selection ratio when d=10, under different combinations of MAF and error distributions. The height of bars is defined similarly as that in Example 1. When the random error is standard normal, our approach has higher proportions to choose true positive SNPs, especially those with no effect on G×E interactions, and lower proportions to choose 57 false positive SNPs. When MAF increases, both approaches lead to higher selection ratios for true positive SNPs and lower selection ratios for false positive SNPs. A similar pattern can be observed when the error distribution follows a t(3) distribution. The performance of both approaches is better when the error is normal. An analogous conclusion can be reached in Figure 3.3 when d=50. Table 3.2 presents the oracle proportions and estimation results for d=10. Under the standard normal error, we observe the superior performance of our approach over the adaptive LASSO based approach in terms of both oracle percentage and estimation precision. The estimation accuracy of our approach is pretty close to that of the true model. The accuracy of all the 3 methods improves as MAF increases from 0.1 to 0.5. The performance of our method under t(3) error are still comparable to that under the standard normal error, while the ALASSO method did much worse for the t(3) error. A similar pattern can be observed for the high dimensional case (d=50) in Table 3.3. Our approach is still powerful in high dimensional scenario, especially under the N (0,1) random error, while the ALASSO approach barely select the correct model. 58 p=0.1 p=0.3 p=0.5 Selection Ratio,t(3) error Selection Ratio,N(0,1) error 1 0.8 0.8 ALASSO SCAD 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0 1 2 4 6 8 10 2 4 6 8 10 2 0.8 0.8 0.6 0.4 0.2 2 4 6 8 10 0.4 0.2 8 10 0.6 0.4 6 0.8 0.6 4 0.2 0 2 4 6 8 10 2 4 6 8 10 SNP Figure 3.2: The selection ratio of Example 3.2, d = 10 59 p=0.1 Selection Ratio,N(0,1) error 1 p=0.5 1 ALASSO SCAD 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0 1 Selection Ratio,t(3) error p=0.3 1 10 20 30 40 0 1 10 20 30 40 0.8 0 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 10 20 30 40 0.2 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 SNP Figure 3.3: The selection ratio of Example 3.2, d = 50 60 Table 3.2: Simulation results of Example 3.2, d = 10 pA=0.1 pA=0.3 pA=0.5 N (0,1) error t(3) error SCAD ALASSO Oracle SCAD ALASSO Oracle Perc. 0.976 0.784 1 0.72 0.268 IMSE β0 (u) 0.0863 0.1250 0.0891 0.3078 1.6608 β1 (u) 0.1611 0.1601 0.1667 0.3285 0.3947 β2 (u) 0.1264 0.1358 0.1238 0.4890 1.2776 2.8155 β3 (u) 0.0270 0.1183 0.0192 1.3307 β4 (u) 0.0191 0.0433 0.0174 0.2943 2.1633 TIMSE 0.4205 0.6106 0.4162 2.9342 9.2044 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle 1 0.2247 0.3557 0.2932 0.0643 0.0475 0.9855 0.992 0.84 1 0.91 0.33 1 0.0268 0.1071 0.0561 0.0086 0.0066 0.2007 0.0297 0.1074 0.0551 0.0271 0.0118 0.2404 0.0273 0.1174 0.0637 0.0084 0.0065 0.2233 0.0607 0.1600 0.1360 0.1111 0.0443 0.5311 0.0975 0.2065 0.1373 0.1216 0.1125 1.3069 0.0601 0.1746 0.1320 0.0237 0.0222 0.4126 0.98 0.846 1 0.894 0.34 1 0.0213 0.1044 0.0497 0.0077 0.0063 0.1895 0.0214 0.1043 0.0507 0.0210 0.0103 0.2177 0.0214 0.1106 0.0604 0.0077 0.0063 0.2065 0.0431 0.1581 0.1101 0.0439 0.0240 0.4072 0.0485 0.1721 0.3270 0.7984 0.3082 1.8768 0.0451 0.1725 0.1170 0.0192 0.0135 0.3673 61 Table 3.3: Simulation results of Example 3.2, d = 50 pA=0.1 pA=0.3 pA=0.5 N (0,1) error t(3) error SCAD ALASSO Oracle SCAD ALASSO Oracle Perc. 0.908 0.542 1 0.435 0.025 IMSE β0 (u) 0.1929 0.9911 0.0884 0.5687 1.2335 β1 (u) 0.2064 0.1988 0.1684 0.3851 0.3484 β2 (u) 0.5235 0.8382 0.1218 0.6934 0.4432 0.7892 β3 (u) 2.0918 2.0345 0.0196 2.4522 β4 (u) 0.3475 0.4798 0.0158 0.5996 0.4671 TIMSE 3.3644 4.7239 0.4140 5.7021 8.9145 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle 1 0.2209 0.3340 0.2614 0.0484 0.0445 0.9092 0.986 0.642 1 0.745 0.06 1 0.0289 0.1107 0.0817 0.1083 0.0229 0.3526 0.0732 0.1124 0.1834 0.4072 0.0748 0.9334 0.0278 0.1137 0.0646 0.0075 0.0068 0.2204 0.0860 0.1858 0.2205 0.3865 0.0840 1.2288 0.1970 0.1974 0.1768 0.2018 0.1099 3.3013 0.0599 0.1742 0.1301 0.0254 0.0220 0.4117 0.988 0.706 1 0.8 0.07 1 0.0215 0.1048 0.0608 0.0470 0.0120 0.2461 0.0232 0.1073 0.1269 0.2846 0.0444 0.6426 0.0216 0.1123 0.0579 0.0078 0.0053 0.2050 0.0450 0.1551 0.1754 0.1681 0.0480 0.6492 0.0560 0.1716 0.1525 0.1501 0.0889 2.8755 0.0434 0.1608 0.1085 0.0167 0.0190 0.3484 62 3.5 Real data analysis We applied the method to a real dataset from a study conducted at Department of Obstetrics and Gynecology at Sotero del Rio Hospital in Puente Alto, Chile. The initial objective of the study was to pinpoint genetic variants associated with a binary response indicating large for gestational age (LGA) or small for gestational age (SGA) depending on new born babies’ weight and mother’s gestational age. After data cleaning by removing SNPs with MAF less than 0.05 or deviation from Hardy-Weinberg equilibrium, the dataset contains 1536 new born babies with 189 genes covered by 660 single nucleotide polymorphisms (SNPs). Mother’s body mass index (MBMI), defined as mother’s body mass (kg) divided by the square of their height (m2 ), is a measure for mothers’ body shape and obesity condition. The environment factor for a baby inside mother’s body is defined through the mother, such as mother’s obesity condition (MBMI) or age. Due to the complicated interaction between fetus’s genes and mother’s obesity level, the birth weight might be different for a fetus with the same gene but under different environment conditions. The phenomenon of regular variation in birth weight could be explained by corresponding genetic variants and how they respond to different MBMI. We applied both methods to the Janus kinase/signal transducers and activators of transcription (JAK/STAT) signaling pathway, which has 68 SNPs covering 24 genes in our real data. JAK/STAT signaling pathway is the main signaling mechanism for a broad range of cytokines and growth factors in mammals [80]. Our method select the model Y = β0 (z) + βk Xk + ε where Xk corresponds to SNP 2069762, and βk is a constant, while the ALASOO method 63 only identifies the varying intercept. SNP 2069762 is of non-zero constant effect, therefore this one is a genetic risk factor associated with birthright but not sensitive to MBMI to influence birth weight. To further validate our result, we conducted the single SNP based analysis in Ma et al [37] and tabulate SNPs with p-value less than 0.001 when fitting the candidate models (LM, LMI and VC). The p-values for the overall genetic association test with the LM, LMI and VC model are denoted as P CON, P LIN and P VC. It follows from the test on constant coefficient that SNP 2069885 does not vary across MBMI (PP CON¡ 0.05). Consequently, the p-value obtained from test with LM model is less than those obtained from VC and LM model. Table 3.4: List of SNPs with p-value < 0.001 from the Jak-STAT signaling pathway SNP ID GeneName Location P VC P CON P LIN P LM P LMI PI LM model 2069885 3.6 IL9 exon 0.0014 0.0913 - 7.32E-05 8.93E-05 0.0875 Discussion The significance of G×E interactions in complex traits has stimulated waves of discussion. A diversity of statistical models have been proposed to assess the gene effect under different environmental exposures, as reviewed in Cornelis et al [65]. The success of genetic variant set based association analysis, as shown in Wang et al [38], Cui et al [39], Wu and Cui [40] and Schaid et al [71], motivates us to propose a high dimensional variable selection approach to understand the mechanism of G×E interactions associated with complex traits. We adopt a penalized regression method within the VC model framework to investigate how 64 multiple variants within a genetic system, like the pathway, were mediated by a common environmental factor to influence the phenotypic response. Variable selection and parameter estimation can be achieved simultaneously within the framework. The varying coefficient function are divided into 2 parts after B spline basis expansion, for the non-zero constant and varying effect. We can determine if a particular genetic variant is sensitive to environmental stimuli by examining the status of the coefficient function. Specifically, the presence of G×E interactions, no G×E interactions and no association with the phenotype are corresponding to varying, non zero constant and zero effects of the coefficients. A two-stage iterative procedure was developed in Tang et al [73] to distinguish different effects. Here, we adopted the framework and carried out the separation with SCAD penalty. Asymptotic properties of the two-stage estimator were established under suitable regularity conditions. A comprehensive comparison between our method and that in Tang et al [73] was conducted in the simulation session, in terms of two criteria, the percentage of choosing the exact true model and the precision in parameter estimation. The estimation accuracy was calculated as IMSE for varying coefficients and MSE for constant coefficients. In Tang et al [73], for the predictors with non-zero constant coefficients, MSE was calculated when the predictor is corrected identified. However, this won’t reveal the error caused by failure to classify the coefficient as non-zero constant. Instead, we suggest calculating IMSE for all the predictors, since IMSE reduces to MSE when the coefficient is a constant. A much more accurate assessment on assessing the performance of the model can thus be achieved. The advantage of our approach has been endorsed by the extensive simulation study and real data analysis. Both algorithms are based on local quadratic approximations (LQA) to the penalty func65 tions, which suffers from the efficiency loss caused by repeated factorizations of large matrices. LQA limits the power of the framework to dissect G×E interactions when the dimension is large, especially in cases where p > n. We will integrate group coordinate descent (GCD) approach into the current framework and demonstrate the merit of the new scheme in next chapter. 3.7 3.7.1 Technical proofs Useful notations and lemmas For convenience, the following notations are adopted : ¯ ¯ ¯ ¯ ¯ Y = E(Y |X, T ), γ = (U T U )−1 U T Y , β = B γ T T T T T T γ(v) = (γ0 , . . . , γv )T , γ(c) = (γv+1 , . . . , γc )T , γ(d) = (γv+1,1 , . . . , γd,1 )T , γ(v) = (˜0 , . . . , γv )T , γ(c) = (˜v+1 , . . . , γc )T , γ(d) = (γv+1,1 , . . . , γd,1 )T , ˜ γT ˜T ˜ γT ˜T ˜ Gn = (B(z1 ), . . . , B(zn ))(B(z1 ), . . . , B(zn )T , ε = (ε1 , . . . , εn )T Φn = n−1 n T i=1 U(v)i U(v)i , Ψn = n−1 n T i=1 U(v)i U(c)i , Λi = U(c)i − ΨT Φ−1 U(c)i n n First we provide several lemmas to facilitate the proofs of Theorems 1 and 2. Lemma A.1. Under assumptions (A1-A3), there exists finite positive constants C1 and C2 such that all the eigenvalues of (kn /n)Gn fall between C1 and C2 , and therefore, Gn is invertible. Lemma A.2. Under assumptions (A1-A3), for some finite constant C0 , there exists γ = (˜0 , . . . , γd )T satisfying ˜ γT ˜T (1) γj∗ L2 > C0 , j = 0, . . . , v; γj1 = βj , γj∗ L2 = 0, j = v + 1, . . . , c; γj = 0, ˜ ˜ ˜ ˜ j = c + 1, . . . , d −r (2) supt∈[0,1] |βj (z) − B(z)T γj | = Op (kn ), j = 0, . . . , d, where γj = (˜j,1 , γj∗ )T ˜ ˜ γ ˜T 66 −r (3) sup(t,x)∈[0,1]×Rd+1 |X T β(z) − U (X) γ | = Op (kn ) ˜ 3.7.2 Proofs of Theorem 1. (A) Proof of Theorem 1(1) (Part 1) ˆ Here we first show βj (z) is constant for j = v + 1, . . . , d in probability, which amounts to demonstrating γj∗ j = 0, j = v + 1, . . . , d with probability tending to 1, as n → ∞. For ˆ vc n Yi − UiT γ Q1 (γ) = 2 d +n i=1 j=1 pλ1 ( γj∗ ) (3.11) 1 ˆ ˜ Let αn = n− 2 kn + an and γ vc = γ + αn δ. We want to show that for any given ε > 0, there exists a large constant C such that P inf δ =C Q1 (ˆ vc ) ≥ Q1 (˜ ) ≥ 1 − ε γ γ (3.12) This suggests that with probability at least 1 − ε there exists a local minimum in the ball ˆ ˜ {˜ + αn δ : δ ≤ C}. Hence, there exists a local minimizer such that γ vc − γ = Op (αn ). γ A direct computation yields Dn (δ) = Q1 (ˆ vc ) − Q1 (˜ ) γ γ n n T εi + X1 r(zi ) = −2αn δ UiT 2 + αn δ 2 i=1 i=1 d +n j=1 UiT Ui pλ1 ( γj∗ ) − pλ1 ( γj∗ ) ˆ vc ˜ ≡ ∆1 + ∆2 + ∆3 67 (3.13) where rj (z) = B(z)T γj − βj (z), j = 1, . . . , d and r(z) = (r1 (z), . . . , rd (z))T . By the fact ˜ E(εi |U , zi ) = 0, we obtain that 1 √ n n εi UiT δ = Op ( δ ) (3.14) i=1 Recall Lemma A.1, then 1 n n T −r Xi r(zi )U δ = Op (kn δ ) (3.15) i=1 Therefore √ −r −r ∆1 = Op ( nαn δ ) + Op (nkn αn δ ) = Op (nkn αn ) δ 2 We can also show that ∆2 = Op (nαn ) δ 2 . Then, by choosing a sufficiently large C, ∆1 is dominated by ∆2 uniformly in δ = C. It follows from Taylor expansion that d ∆3 ≤ n αn pλ1 ( γj∗ ) ˜ j=1 γj∗ ˜ γj∗ ˜ 2 δj∗ + αn pλ2 ( γj∗ ) δj∗ 2 (1 + op (1)) ˜ √ 2 ≤ n dαn fn δ + bn αn δ 2 where fn = maxj {|˜j∗ | : γj∗ = 0}. With assumption (A6), we can prove that ∆2 dominates γ ˜ ∆3 uniformly in δ = C. Therefore, (3.12) holds for sufficiently large C, and we have ˆ ˜ γ vc − γ = Op (αn ). ˆ In order to prove βj (z) = 0 for j = v+1, . . . , d in probability, it is sufficient to demonstrate ˆ vc that γj∗ = 0,j = v + 1, . . . , d. It follows from the definition that when max(λ1 , λ2 ) → 0, an = 0 for large n. Then we need to show that with probability approaching 1 as n → ∞, 68 1 1 ˆ ˆ ˜ for any γ vc satisfying γ vc − γ = Op (n− 2 kn ) and some small εn = Cn− 2 kn , we have ∂Q1 (γ) < 0, ∂γj,∗ for − εn < γj,∗ < 0, j = v + 1, . . . , d (3.16) > 0, for 0 < γj,∗ < εn , j = v + 1, . . . , d where γj,∗ denotes the individual component of γj∗ . It’s not hard to show that ∂Q1 (ˆ vc ) γ = −2 vc ∂ˆj,∗ γ n ˆ Uij Yi − UiT γ vc + npλ (|ˆj,∗ |)sgn(ˆj,∗ ) γ γ 1 i=1 n n Uij UiT [˜ − γ vc ] γ ˆ T Uij [εi + Xi r(zi )] − 2 = −2 i=1 i=1 (3.17) + npλ (|ˆj,∗ |)sgn(ˆj,∗ ) γ γ vc 1 −r+1/2 = nλ1 Op (λ−1 n 2r+1 ) + λ−1 pλ (|ˆj,∗ |)sgn(ˆj,∗ ) γ γ vc 1 1 −r+1/2 −1 By assumption (A5), λ1 n 2r+1 → 0. Then it follows from assumption (A7) that the sign ˆ of the derivative is completely determined by that of γj,∗ . Therefore, γ vc , the minimizer of ˆ vc ˆ vc Q1 , is achieved at γj∗ = 0, j = v + 1, . . . , d. This completes the proof of Theorem 1(1), part 1. (B) Proof of Theorem 1 (2) 1 Next we establish the consistency of the varying coefficient estimator. Let αn = n− 2 kn +an , T T ˆ ˜ ˆ ˜ γ(v) = γ(v) + αn δv , γ(d) = γ(d) + αn δd , and δ = (δv , δd )T n T T Yi − U(v)i γ(v) − U(d)i γ(d) Q2 (γ(v) , γ(d) ) = i=1 2 d +n j=v+1 69 pλ2 (|γj,1 |) (3.18) We need to show that for any given ε > 0, there exists a large constant Cε such that ˆ ˜ P inf δ =C Q2 (ˆ(v) , γ(d) ) ≥ Q2 (˜(v) , γ(d) ) ≥ 1 − ε γ γ (3.19) which implies that with probability at least 1 − ε there exists a local minimum in the ball {˜(v) + αn δv : δv ≤ C} and {˜(d) + αn δd : δd ≤ C} respectively. Therefore, there exists γ γ ˆ ˜ ˆ ˜ local minimizers such that γ(v) − γ(v) = Op (αn ) and γ(d) − γ(d) = Op (αn ). We have ˆ ˜ Dn (δv , δd ) = Q2 (ˆ(v) , γ(d) ) − Q2 (˜(v) , γ(d) ) γ γ n T εi + X1 R(Zi ) = −2αn i=1 n 2 + αn T U(v)i δ(v) T T U(v)i δ(v) + U(d)i δ(d) 2 T + U(d)i δ(d) d +n j=v+1 i=1 γ γ pλ2 (|ˆj,1 |) − pλ2 (|˜j,1 )| ≡ ∆1 + ∆2 + ∆3 (3.20) where r(z) = (r1 (z), . . . , rd (z))T and rj (z) = B(z)T γj − βj (z), j = 1, . . . , d. ˜ Since E(εi |U(v) , U(d) , zi ) = 0, we have 1 √ n n T T εi [U(v)i δ(v) + U(d)i δ(d) ] = Op ( δ ) (3.21) i=1 With Lemma A.1 we can show 1 n n T T T −r Xi r(zi ) U(v)i δ(v) + U(d)i δ(d) = Op kn δ i=1 70 (3.22) Combine the above two equations, we can obtain that 1 −r −r ∆1 = Op (n 2 αn δ ) + Op (nkn αn δ ) = Op (nkn αn ) δ 2 Since ∆2 = Op (nαn ) δ 2 , it’s easy to show that by choosing a sufficiently large C, ∆1 is dominated by ∆2 uniformly in δ = C. By Taylor expansion, d 2 2 αn pλ2 (|˜j,1 |)sgn(˜j,1 )|δdj | + αn pλ2 (|˜j,1 |)δdj (1 + o(1)) γ γ γ ∆3 ≤ n j=v+1 1 2 ≤ (p − v) 2 nαn fn δ + bn αn δ 2 where fn = maxj {|˜j,1 | : γj,1 = 0}. Recall (A6), then it follows that, by choosing an enough γ ˜ large C, ∆2 dominates ∆1 uniformly in δ = C. Consequently (3.19) holds for sufficiently ˆ ˜ ˆ ˜ large C, and we have γv − γv = Op (αn ) and γd − γd = Op (αn ). By the definition of ˆ cz ˜ γ cz , we have γ(d) − γ(d) = Op (αn ). Then for j = 0, . . . , d ˆ βj (zi ) − βj (z) 2 = 1 0 1 ≤ 0 = ˆ βj (z) − βj (z) 2 dt ˆ cz B(z)T γj (z) − B(z)T γj + rj (z) ˜ 2 dt 1 2 cz ˜ rj (z)2 dt (ˆj − γj )T Gn (ˆj − γj ) + 2 γ γ cz ˜ n 0 = ∆1 + ∆2 1 −1 2 Recall Lemma A.1, A.2 and kn = Op n 2r+1 , we can demonstrate that ∆1 = Op kn αn , −2r ∆2 = Op kn . ∆1 is dominated by ∆2 , thus we finish the proof of Theorem 1(b). (C) Proof of Theorem 1(1) (Part 2) 71 ˆ ˆ cz To show βj (z) = 0 for j = c + 1, . . . , d, it is sufficient to demonstrate that γj,1 = 0, since the constancy of βj (z), j = v + 1, . . . , d was already established in (A). It follows from the definition that when max(λ1 , λ2 ) → 0, an = 0 for large n. Then we need to ˆ ˆ prove that with probability approaching 1 as n → ∞, for any γ(v) and γ(d) satisfying 1 1 ˆ ˜ ˆ ˜ γ(v) − γ(v) = Op (n− 2 kn ), γ(d) − γ(d) = Op (n− 2 kn ) respectively, as well as some small 1 εn = Cn− 2 kn , we have ∂Q2 (γ(v) , γ(d) ) ∂γj,1 < 0, for − εn < γj,1 < 0, j = c + 1, . . . , d (3.23) > 0, for 0 < γj,1 < εn , j = c + 1, . . . , d We can prove that ˆ ∂Q2 (ˆ(v) , γ(d) ) γ ∂ˆj,1 γ n T ˆ T ˆ U(d)ij Yi − U(v)i γ(v) − U(d)i γ(d) + npλ (|ˆj,1 |)sgn(ˆj,1 ) γ γ = −2 i=1 n n T U(d)ij εi + Xi r(zi ) = −2 i=1 n T γ ˆ U(d)ij U(v)i [˜v − γv ] −2 i=1 (3.24) T γ ˆ U(d)ij U(d)i [˜d − γd ] + npλ (|ˆj,1 |)sgn(ˆj,1 ) γ γ −2 i=1 −r+1/2 = nλ2 Op λ−1 n 2r+1 2 + λ−1 pλ (|ˆj,1 |)sgn(ˆj,1 ) γ γ 2 −r+1/2 By assumption (A5), λ−1 n 2r+1 → 0. Then it follows from assumption (A7) that the sign 2 ˆ of the derivative is completely determined by that of γj,1 . Therefore, γ cz , the minimizer of ˆ ˆ cz Q2 , is achieved at γj,1 = 0, j = c + 1, . . . , d. This completes the proof of Theorem 1(1). 72 3.7.3 Proofs of Theorem 2. ˆ In Theorem 1, we showed that both γj∗ = 0, j = v + 1, . . . , c and γj = 0, j = c + 1, . . . , d, ˆ hold in probability. Then Q2 reduces to n T Yi − U(v)i γ(v) Q2 (γ(v) , γ(d) ) = 2 T − U(c)i γ(c) i=1 c +n j=v+1 pλ2 (|γj,1 |) (3.25) ≡ Q2 (γ(v) , γ(c) ) ˆ Since (ˆ(v) , γ(c) ) is the minimal value of Q2 (γ(v) , γ(c) ), we obtain γ ˆ ∂Q2 (ˆ(v) , γ(c) ) γ ˆ ∂ γ(v) ˆ ∂Q2 (ˆ(v) , γ(c) ) γ ˆ ∂ γ(c) n T ˆ T ˆ U(v)i Yi − U(v)i γ(v) − U(d)i γ(d) = 0 = −2 (3.26) i=1 n = −2 T ˆ T ˆ U(c)i Yi − U(v)i γ(v) − U(c)i γ(c) i=1 c (3.27) pλ2 (|ˆj,1 |)sgn(ˆj,1 ) = 0 γ γ +n j=v+1 By applying Taylor expansion on pλ2 (|ˆj,1 |) in (3.27), we have γ pλ2 (|ˆj,1 |) = pλ2 (|γj,1 |) + pλ2 (|γj,1 |)(ˆj,1 − γj,1 )[1 + op (1)] γ γ By the fact that pλ2 (|ˆj,1 |) = 0 as λ2 → 0, and pλ2 (|γj,1 |) = op (1) from the assumption, it γ follows that c γ γ j=v+1 pλ2 (|ˆj,1 |)sgn(ˆj,1 ) = op (ˆj,1 − γj,1 ) = op (ˆ(c) − γ(c) ). Consequently, γ γ we have 1 n n T ˆ T ˆ U(c)i Yi − U(v)i γ(v) − U(c)i γ(c) + op (ˆ(c) − γ(c) ) = 0 γ i=1 73 Following similar lines of arguments in Theorem 1, we can show 1 n n T T T ˆ ˆ γ U(c)i εi + Xi r(zi ) + U(v)i (γ(v) − γ(v) ) + U(c)i (γ(c) − γ(c) ) + op (ˆ(c) − γ(c) ) = 0 i=1 (3.28) Meanwhile, a straightforward calculation yields 1 n n T T T ˆ ˆ U(v)i εi + Xi r(ui ) + U(v)i (γ(v) − γ(v) ) + U(c)i (γ(c) − γ(c) ) = 0 (3.29) i=1 Recall the definition of Φn and Ψn , (3.29) is equivalent to ˆ γ(v) − γ(v) = Φ−1 n 1 n n T ˆ U(v)i εi + Xi r(zi ) + Ψn [γ(c) − γ(c) ] (3.30) i=1 Plugging (3.30) into (3.28) results in 1 n n U(c)i i=1 1 = n 1 T T εi + Xi r(zi ) − U(v)i Φ−1 n n n U(c)i U(c)i − ΨT Φ−1 U(v)i n n T n T U(v)i εi + Xi r(zi ) i=1 (3.31) (ˆ(c) − γ(c) ) + op (ˆ(c) − γ(c) ) γ γ i=1 Together with the facts that 1 n  n T T ΨT Φ−1 U(v)i εi + Xi r(zi ) − U(v)i Φ−1 n n n i=1 1 n  n T U(v)k [εk + Xk r(tk )] = 0 j=1 and 1 n n T ΨT Φ−1 U(v)i U(c)i − ΨT Φ−1 U(v)i n n n n i=1 74 T =0 and recall the definition of Λi , a direct computation from (3.31) leads to 1 n n Λi ΛT + op (1) i i=1 √ 1 ˆ n(γ(c) − γ(c) ) = √ n n 1 Λ i εi + √ n i=1 n 1 +√ n i=1 n T Λi Xi r(zi ) i=1 1 T Λi U(v)i Φ−1 n n n T U(v)k εk + Xk r(tk ) j=1 = ∆1 + ∆2 + ∆3 It follows from law of large numbers that 1 n n p Λi ΛT − − → N (0, Σ) i −− i=1 T where Σ = E U(c) U(c) − E E(ΨT |T )E(Φn |T )−1 E(Ψn |T ) . Consequently, n d ∆2 − − → N (0, σ 2 Σ) −− follows from central limit theorem. Because Xi is bounded and r(z) = op (1), we have ∆2 = op (1).Besides, n T i=1 Λi U(v)i =0 implies that ∆3 = 0. Therefore, by Slutsky theorem, we complete the proof of Theorem 2. 75 Chapter 4 A Group Coordinate Descent Approach For High Dimensional Variable Selection In Gene-Environment Interactions 4.1 Introduction High dimensional data arises in a diversity of scientific areas, especially in the study of human genetics as tons of data covering the entire human genome are brought by the advancement of high-throughput genotyping technologies. Gene-environment (G×E) interaction draws our interest due to the crucial roles it plays in elucidating the etiology of complex disease and tracking down the disease variants. The risk of complex diseases is triggered not merely by genetic factors, but also by the environmental exposures, as well as their interactions. The varying coefficient (VC) model framework, initially proposed in Hastie and Tibshirani [74], lends us significant flexibility in investigating genetic responses to environmental stimuli and how gene expressions are mediated by environmental influences to increase disease predispositions, for both continuous phenotype response in Ma et al [37] and binary 76 disease response in Wu and Cui [68]. The merit of the framework is especially prominent when the penetrance effect of genetic variants are non-linear, as the VC model is powerful to capture the dynamic fluctuations of regression coefficients. Current methodologies on G×E interactions are mainly developed within single variantbased framework, using either parametric methods as reviewed in Mukherjee et al [81], semi-parametric methods as in N. Chatterjee et al [35] and Maity et al [36], non-parametric methods as in Ma et al [37] and Wu and Cui [68], or data mining approaches such as multifactor dimension reduction (MDR) in Hahn et al [33]. Accumulation of evidence showing the advantage of set based association analysis, such as in Neal and Sham [82], Cui et al [39], Wu and Cui [40] and Wang et al [38], motivated us to consider the joint modeling of a number of variants (p) within the genetic system given a common environment mediator. When the dimension p is large, the problem can be approached from a high dimensional variable selection perspective, within the VC model framework. In recent years, much progress has been made on penalized regression methods for VC models. Wang et al [78] developed SCAD penalty based method for longitudinal response. Wang and Xia [83] considered variable selection for VC model via local constant kernel estimation with LASSO penalty, while the penalization method in Leng [84] was proposed in the framework of smoothing spline ANOVA models with component selection and smoothing operator (COSSO). The number of candidate predictors for selection in those models is finite and less than the sample size. A diversity of penalized group coordinate approaches have been developed for high dimensional case where the dimension of model p significantly exceeds the sample size n. See Yuan and Lin [85] for group LASSO approach, Wei et al [86] for group adaptive LASSO approach and Breheny and Huang [87] for group SCAD approach. In this chapter, we develop a general framework, based on the group coordinate descent 77 (GCD) algorithm, to carry out variable selection for set-based G×E interactions in VC models. GCD was generalized to group case from coordinate-wise descent algorithms which are demonstrated to be effective in fitting penalized models such as in Friedman et al [88], Wu and Lange [89] and Friedman et al [90]. Our framework is implemented through a two-stage iterative procedure to separate the varying, non-zero constant and zero effect of the predictors. It is computationally efficient especially for high dimensional problems where p > n, since the computational complexity increases only linearly with the number of predictor groups. The rest of the chapter is organized as follows. First, we describe the B spline basis expansion to the VC model. The computation algorithm via GCD with both convex and non-convex penalties will be proposed and the convergence of the algorithm will be discussed. Selection of the tuning parameters are examined subsequently. We demonstrate the utility of our approach through extensive simulation studies and real data analysis. Discussions and concluding remarks are given at the end of this chapter. 4.2 Statistical methods Let (Xi , Yi , Zi ), i = 1, . . . , n be random vectors which are independent and identically distributed (i.i.d.). Consider a varying coefficient (VC) model with p predictors p Yi = βj (Zi )Xij + εi (4.1) j=0 where βj (·) is the smooth varying-coefficient function, Xij is the jth component of (p+1)dimensional vector Xi with the first component Xi0 being 1, Zi ’s are the scalar index variable, 78 and εi is the model error. The varying coefficient model helps us gain particular power to evaluate the nonlinear responses of genetic variants to environmental incentives in gene-environment (G×E) interactions [37, 68].We are interested in dissecting the penetrance of multiple genetic variants within a system, such as gene set or pathway, under various environmental stimuli, especially when those variants are mediated by a common mediator Z to affect phenotype. The effect of the variants can be determined by investigating the status of the coefficient function βj (·) in (4.1). If βj (·) is a varying function of the environmental factor Z, then the G×E interaction exists. The nonzero constancy of βj (·) indicates that the G×E interaction is not present. The genetic variant is not associated with the response (phenotype) if βj (·) = 0. In such a set-based G×E interaction study design, the total number of variants in the system (p) can far exceed the sample size (n). 4.2.1 Basis expansion and penalized regression Supposed that the coefficient function βj (z) (j = 0, . . . , p) in (4.1) can be approximated by basis expansion such that L βj (Z) ≈ γjl Bjl (Z) l=1 where L is the number of basis functions to approximate the coefficient function, B(·) = {Bjl (·)}L is a set of normalized B spline basis, and γj is the corresponding spline coefficient l=1 vector. It follows from change of basis [79] that the above basis expansion is equivalent to L βj (·) ≈ . ˜T γjl Bjl (·) = γj1 + Bj (·)γj∗ l=1 79 . T ˜ where the spline coefficient vector γj = (γj,1 , γj∗ )T , and Bj (·) = (Bk2 (·), . . . , BjL (·))T . γj,1 and γj∗ correspond to the constant and varying part of the coefficient functional respectively. 1 T Denote γj 2 = γj Rj γj , where Rj = n j n i=1 T T Bj (Zi )Xij Xij Bj (Zi ) . Note that Rj is a L × L positive definite matrix.If γj∗ j =0, then the jth predictor only has a constant effect. Furthermore, if γj,1 =0, then the predictor is not associated with the response. Therefore γj∗ can be treated as a group. To separate the varying, constant and zero effect in the procedure of simultaneous variable selection and parameter estimation, we minimized the penalized loss function 1 Q(γ) = n n  p Yi − i=1 p + j=1 2 L p γjl Xij Bjl (Zi ) + j=1 j=0 l=1 pλ1 ( γj∗ j ) (4.2) pλ2 (|γj1 |)I( γj∗ j = 0) where λ1 and λ2 are the penalization parameters. The penalty function pλ (k=1,2) in k (4.2) can be concave, such as in LASSO [91] or adaptive LASSO [92], or non-cave, such as the smoothly clipped absolute deviation (SCAD) penalty function [75]. Tang et al [73] first proposed the framework and established the asymptotic results for adaptive LASSO penalty function, while Wu et al [93] demonstrated improved finite sample performance of SCAD penalty over adaptive LASSO as well as the corresponding oracle property. Both approaches are based on local quadratic approximations (LQA) to the penalty function pλ (k=1,2). k However, LQA leads to a ridge-type solution dependent on repeated large matrix inversions, which renders the algorithm not efficient for large scale regression problems. Furthermore, as pointed out in Breheny and Huang [87], the quadratic approximation cannot benefit from the sparsity since this approach will not yield naturally sparse solutions. 80 Zou and Li [94] developed local linear approximation (LLA) to the non-convex penalty function and demonstrated its advantage over LQA. Breheny and Huang [95] further enhanced the LLA by showing penalized models with non-convex penalty can be fitted effectively by coordinate descent method. The main idea of GCD for (4.2) is to minimize the penalized loss function Q with respect to an individual predictor group after basis expansion at each step, and then cycle through all the predictor groups till convergence. 4.2.2 Computational algorithms In this section, we extend the two-step iterative framework in [73, 93] with GCD approach. (4.2) can be rewritten using matrix notations as p Q(γ) =(Y − W γ)T (Y − W γ) + n pλ1 ( γj∗ j ) j=1 (4.3) p pλ2 (|γj1 |)I( γj∗ j = 0) +n j=1 T T where Y = (Y1 , . . . , Yn )T , Wi = (Xi0 B(Zi )T , . . . , Xip B(Zi )T )T , W = (W1 , . . . , Wn )T and T T γ = (γ0 , . . . , γp )T . We denote the subsets of predictors of varying, non-zero constant and zero effects by V, C, and Z respectively. The iterative algorithm is carried out in the following two steps. At step 1, to separate the varying and nonzero constant effects, we minimize the following penalized loss: p T Q1 (γ) = (Y − W γ) (Y − W γ) + n j=1 pλ1 ( γj∗ j ) (4.4) to obtain γ VC . All the predictors are assumed to be in the subset of varying effects V initially. 81 The penalization is conducted in a group manner. The jth predictor will be moved from subset V to C if γj∗ j =0, or it will stay in V. ˆ VC At step 2, the penalized criterion p Q2 (γ) = (Y − W γ)T (Y − W γ) + n j=1 pλ2 (|γj1 |)I( γj∗ j = 0) (4.5) is minimized only with regard to the predictors in set C. If Xj is in C, then it will be moved to Z if |ˆj,1 |=0, otherwise it will stay in C. γ CZ The above two steps will be iterated till convergence and we can obtain the estimator ˆ γ at convergence. The coefficient function βj (z) (j = 0, . . . , p) in (4.1) can be estimated as ˆ ˆ βj (z) = B T (z)ˆj . βj (z) will be a varying function in z, non-zero constant and zero if γj is γ ˆ in V, C and Z respectively. 4.2.2.1 Group LASSO and group adaptive LASSO In this section, we investigate the two-step iterative algorithm based on GCD approach using the convex penalty functions, such as LASSO and adaptive LASSO. Yuan and Lin [85] extended LASSO to the selection of variables group-wisely. By setting pλi (x) = λi Dij x for i = 1, 2 in (4.3), where Dij is the corresponding dimension of predictor group j. Let 1 1 − 2 γj = Rj γj and Wij = Rj 2 Wij . Then (4.4) and (4.5) can be reexpressed as Q1 (γ) = Y − W γ T p Y − Wγ + n λ1 D1j γj∗ 2 j=1 and Q2 (γ) = Y − W γ T p Y − Wγ + n λ1 j=1 82 D2j |γj1 |I( γj∗ 2 = 0) respectively. From now on, we drop from the formula for simplification of notation. The dimension of predictor group, Dij , was included in the optimization, as in [85], to guarantee that the amount of penalization is consistent with the group size, so the small groups won’t be dominated by large groups. However, note that the number of basis functions is the same for all the predictors in approximating the coefficient function, and the size of all the groups in Q2 is 1, Dij can be dropped from both Q1 and Q2 . By setting the partial derivative of Q1 with respective to γj∗ to zero, we have ˆ VC γj∗ = 1− λ1 S S1j 2 + 1j (4.6) T T T T T where S1j = Wj∗ Y − W γ−j∗ with γ−j∗ = γ0 , . . . , γj−1 , (γj1 , 0L−1 )T , γj+1 , . . . , γp T , Wj∗ is the part of Wj corresponding to γj∗ and (x)+ = xI{x > 0}. Similarly, setting the partial derivative of Q2 with respective to γj1 for those j such that γj∗ 2 = 0 will lead to ˆ CZ γj1 = 1− λ2 S S2j 2 + 2j T where S2j = Wj1 Y − W γ−j1 with γ−j1 = (4.7) T T T T γ0 , . . . , γj−1 , (0, γj∗ )T , γj+1 , . . . , γp T , and Wj1 is the part of Wj corresponding to γj1 . The group adaptive LASSO estimator could be obtained by simply replacing λ1 and λ2 with λ1 S1j −1 and λ2 S2j −1 , in (4.6) and (4.7) respectively. (4.6) and (4.7) are essentially 2 2 the multivariate version of the soft-thresholding operator in [96]. They will be updated solely ˆ VC for predictors with varying and constant effect, respectively. All the rest components of γj ˆ CZ and γj will remain as the updates in previous iteration. Both solutions have closed forms and are exempt from any sort of approximations. Furthermore, the GCD algorithm is 83 piecewise linear and therefore exceptionally well suited for high dimensional scenarios. 4.2.2.2 Group TLP and group SCAD In group settings, the closed form solutions (4.6) and (4.7) exist not only for convex penalties such as LASSO and adaptive LASSO, but also for non-convex penalties like truncated L1 function penalty, TLP, in Shen et al [97], and smoothly clipped absolute deviation penalty, SCAD, as in Fan and Li [75]. The TLP penalty function is given as pτ,λ (x) = min( |x| , 1)λ τ (4.8) where positive tuning parameters λ and τ control adaptive model selection and the degree of sparsity, correspondingly. As pointed out in Shen et al [97], with properly tuned τ , the approximation error of TLP to L0 function reduces to 0, and low resolution coefficients can be taken good care of. By resorting to difference convex (DC) approach, Shen et al [97] turned non-convex optimization problems into its convex counterpart, thus significantly improved the computational efficiency and stability. Following the similar lines of derivations in Shen et al [97] and Xue and Qu [98], the closed form solutions for (4.4) and (4.5) with group TLP penalty can be obtained by replacing λ1 λ and λ2 in (4.6) and (4.7) with τ 1 I 1 λ ˆ γj∗ 2 ≤ τ1 and τ 2 I 2 ˆ γj1 2 ≤ τ2 , respectively, where ˆ ˆ γj∗ and γj1 are taken as their most recent updates. Before finding the solutions to (4.4) and (4.5) under group SCAD penalty, we briefly 84 revisit the SCAD penalty function, defined in Fan and Li [75] as    λx if 0 ≤ x ≤ λ     (x2 −2aλx+λ2 ) pλ,a (x) = if λ < x ≤ aλ − 2(a−1)     (a+1)λ2   if x > aλ 2 (4.9) where λ > 0 and a > 2, and its derivative    λ if 0 ≤ x ≤ λ     (x−aλ) pλ,a (x) = if λ < x ≤ aλ  − a−1      0 if x > aλ (4.10) The spirit of SCAD penalty could be explicitly conveyed by (4.10). SCAD starts its penalization with the same rate as LASSO till the point x = λ. Then the rate continuously reduces to 0 and will stay as 0 from the point x = aλ. Therefore SCAD is capable of correcting the bias introduced by LASSO while still performing variable selection. The group SCAD estimator for λ > 0 and ai > 2 (i=1,2) can be derived as ˆ VC γj∗ = and ˆ CZ γj1 =          λ 1− S 1 S1j 1j 2 + a1 −1  a1 −2       S  1j          if S1j 2 ≤ 2λ a1 λ1 1 − (a −1) S S1j if 2λ < S1j 2 ≤ a1 λ 1 1j 2 + if S1j 2 > a1 λ λ 1− S 1 S2j 2j 2 + a2 −1  a2 −2       S  2j (4.11) if S2j 2 ≤ 2λ a2 λ1 1 − (a −1) S S2j if 2λ < S2j 2 ≤ a2 λ 2 2j 2 + if S2j 2 > a2 λ 85 (4.12) with notations defined the same as in (4.6) and (4.7), correspondingly. (4.11) and (4.12) are of the soft-thresholding property as a1 −→ ∞ and a2 −→ ∞ respectively. 4.2.2.3 Convergence of the GCD algorithm In the previous two sections, we obtained the closed form updates for all the 4 penalty functions (LASSO, adaptive LASSO, TLP and SCAD) within the group settings. At each cycle of the GCD algorithm, the minimization of Q with respect to γj (j = 1, . . . , p) is achieved by iterating the minimization of Q1 with respect to γj∗ , and Q2 with respect to γj1 correspondingly, where Q, Q1 and Q2 are defined in (4.3), (4.4) and (4.5) respectively. Hence, the two-stage iterative algorithm is of the descent property. Meanwhile, the value of Q at each cycle is nonnegative. So we have the following proposition: ˆ Proposition 1. Let γ (k) be the estimated spline coefficient vector at the convergence of kth iteration. Then for the group penalties of LASSO, Adaptive LASSO, TLP and SCAD, GCD has the property such that Q(ˆ (k) ) ≤ Q(ˆ (k−1) ) γ γ ˆ γ Moreover, given an initial value γ (0) , the sequence {Q(ˆ (k) ), k ≥ 1} converges to a local minimal of Q. It follows from the above proposition that the two-stage iterative algorithm always converges. The penalized loss function Q is not convex in general, otherwise the convergence to the global optimum will be guaranteed by Proposition 1. Given that dimension p is fixed and p < n, the asymptotic properties of the two-stage estimator were established in Tang et al [73] with adaptive LASSO penalty, and Wu and Cui [93] with SCAD penalty. The global 86 optimality of the two stage estimator for large p small n is beyond the scope of this chapter and will not be discussed here. 4.2.3 Selection of tuning parameters Here we propose a data-driven procedure to choose the proper tuning parameters. For group LASSO and group adaptive LASSO, there are 4 tuning parameters, O, L, λ1 and λ2 , where the degree of the B spline basis O and the number of interior knots L uniformly spaced on [0,1] govern the smoothness of the varying coefficient functions, while λ1 and λ2 control the sparsity of the estimator. At the beginning, we apply BIC in Schwarz [79] to choose O and L. The range for L can 1 1 be determined by [max( 0.5n (2O+3) , 1), 1.5n (2O+3) ], with x denoting the integer part of x. The optimal combination of O and L can be reached by searching the corresponding two-dimensional grid, according to the following criterion ˆ ˆ BICO,L = log (Y − W γ )T (Y − W γ ) + (O + L + 1) log(n) n ˆ for γ = (ˆ0 , 0T , . . . , 0T )T . Conditional on the chosen N and p, we can determine optimal γT λ1 by minimizing ˆ BICλ1 = log Y − W γλ1 T ˆ Y − W γλ 1 + dfλ1 n log(n) (4.13) ˆ where γλ1 is the minimizer of (4.4) given λ1 , and dfλ1 is the effective degree of freedom, defined as the total number of varying and non-zero constant predictors. Once we selected 87 N , p and λ1 , the optimal λ2 can be determined by minimizing ˆ BICλ2 = log Y − W γλ2 T ˆ Y − W γλ 2 + dfλ2 n log(n) (4.14) ˆ where γλ2 is the minimizer of (4.5) given λ2 , and dfλ2 is defined similarly as dfλ2 . We need to choose additional tuning parameters τ1 and τ2 for group TLP algorithm. Slight modifications can be made to (4.13) so we can carry out a two-dimensional search for the best pair of λ1 and τ1 . Optimal λ2 and τ2 can be taken similarly. Note that in the group SCAD algorithm, we also have two more tuning parameters a1 and a2 than those in group LASSO and group adaptive LASSO. A search for the best combination of λi and ai (i=1,2) over a two dimensional grid will be computationally intensive. It was pointed out in Fan and Li [75] that as a function of a in (4.9), the Bayesian risks are not sensitive to the choice of a. Here a is fixed as 2.2 in the subsequent analysis. We can tune λ1 and λ2 according to (4.13) and (4.14) correspondingly. 4.3 Simulation study In this section, we carry out Monte Carlo simulations to assess the finite sample performance of group LASSO, group adaptive LASSO, group TLP and group SCAD in our framework through the GCD algorithm. The accuracy of variable selection and successful separation of the varying, non-zero constant and zero effects can be assessed by the percentage of choosing the correct model out of the total R replicates. Integrated mean squared error (IMSE), is adopted for evaluation of the estimation precision. ˆ(r) Let βj be the estimator of the coefficient function βj in the rth (1 ≤ r ≤ R) replication, 88 n grid ˆ(r) and {zk }k=1 be the grid points where βj is evaluated. The integrated mean squared error ˆ (IMSE) of βj (z) is defined as 1 ˆ IMSE(βj (z)) = R R ngrid 1 ngrid r=1 (r) ˆ {βj (zk ) − βj (zk )}2 (4.15) k=1 (4.15) can be used to measure the estimation precision for the jth predictor. The overall estimation accuracy is reflected by the total integrated mean squared error (TIMSE) of all the coefficients, defined as TIMSE= p ˆ j=0 βj (z). ˆ ˆ Note that IMSE(βj ) reduces to MSE(βj ) ˆ when βj is a constant. We rewrite model (4.1) here as p Yi = βj (Zi )Xij + εi j=0 (i = 1, . . . , n) for describing our simulation schemes. The predictors were generated from both continuous and discrete distributions. Two types of distribution for εi , a standard normal distribution and a t distribution with 3 degrees of freedom, were taken. Without loss of generality, assume the first 3 coefficients are varying, next two are constant and all the rest coefficients are redundant. The coefficients were set as: β0 (z) = 5 + 3 sin(2πz), β1 (z) = 2−3 cos{(6z −5)π/3}, β2 (z) = 7−7z, β3 (z) = 4.5, β4 (z) = 3, and βj (z) = 0 for j = 5, . . . , p. All the simulations were carried out with sample size n = 200, p =10,100,200,400 and a total of R =200 replicates. All the approaches use LASSO as the initial estimate. Example 4.1. We simulate the responses from model (4.1), where the index variable Xi ∼ Uniform(0,1), and the predictors were generated from a multivariate normal distribution with mean vector 0 and Cov(Xij , X ij ) = 0.5|j−j | for 0 ≤ j, j ≤ p. The IMSE of the 89 estimator obtained from all the 4 approaches and the true model were calculated. Under the standard normal error, as p increases, group LASSO has a systematic less satisfactory performance than the others which have relative stable and comparable performances, with group TLP gradually establishing its advantage over the rest in terms of both oracle percentage and estimation precision, as shown in Table 4.1. A pretty much similar trend can be observed in Table 4.2 when the random error was generated from t(3) distribution, though the performance of all the procedures are slightly worse than that under the standard normal error. 90 Table 4.1: Simulation results of Example 4.1, N (0,1) error p=10 p=100 p=200 p=400 gLASSO Oracle Perc. 0.615 IMSE β0 (u) 0.1393 β1 (u) 0.3199 β2 (u) 0.4777 β3 (u) 0.1771 β4 (u) 0.0784 TIMSE 1.2022 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE N (0,1) error gALASSO gSCAD 0.955 0.97 gTLP 1 Oracle 1 0.1380 0.1678 0.1539 0.0175 0.0231 0.5019 0.1388 0.1427 0.1213 0.0122 0.0103 0.4339 0.1492 0.1641 0.0666 0.0101 0.0081 0.3980 0.1492 0.1669 0.0674 0.0101 0.0081 0.4017 0.515 0.995 0.995 1 1 0.1650 0.2302 0.7810 0.0792 0.0711 1.3308 0.1657 0.1541 0.6678 0.0140 0.0216 1.0232 0.1648 0.1477 0.6515 0.0176 0.0128 0.9945 0.1587 0.1607 0.0713 0.0096 0.0082 0.4085 0.1506 0.1374 0.0855 0.0093 0.0081 0.3909 0.355 0.985 1 1 1 0.1771 0.2259 0.8094 0.0812 0.0804 1.3825 0.1761 0.1599 0.6731 0.0193 0.0271 1.0554 0.1781 0.1595 0.6484 0.0202 0.0144 1.0206 0.1612 0.1720 0.0764 0.0087 0.0081 0.4265 0.1523 0.1480 0.0845 0.0089 0.0085 0.4022 0.27 0.99 1 1 1 0.2104 0.2514 0.8967 0.0705 0.0749 1.5188 0.2102 0.1744 0.7930 0.0256 0.0257 1.2290 0.2099 0.1697 0.7840 0.0273 0.0124 1.2033 0.1654 0.1729 0.0746 0.0108 0.0072 0.4310 0.1654 0.1748 0.0756 0.0108 0.0072 0.4339 91 Table 4.2: Simulation results of Example 4.1, t(3) error p=10 p=100 p=200 p=400 gLASSO Oracle Perc. 0.29 IMSE β0 (u) 0.1929 β1 (u) 0.3392 β2 (u) 0.5859 β3 (u) 0.1775 β4 (u) 0.1100 TIMSE 1.4520 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE t(3) error gALASSO gSCAD 0.83 0.945 gTLP 0.97 Oracle 1 0.1921 0.2198 0.3676 0.0457 0.0463 0.8856 0.1913 0.2050 0.3358 0.0400 0.0358 0.8170 0.2063 0.2343 0.1445 0.0333 0.0233 0.6526 0.2066 0.2386 0.1464 0.0294 0.0211 0.6421 0.01 0.74 0.965 0.98 1 0.2557 0.2834 0.9739 0.1044 0.1036 1.8882 0.2537 0.2173 0.8729 0.0538 0.0549 1.5170 0.2551 0.2220 0.8721 0.0510 0.0402 1.4614 0.2109 0.2370 0.1386 0.0273 0.0260 0.7184 0.2002 0.2134 0.1544 0.0293 0.0230 0.6204 0.005 0.67 0.915 0.955 1 0.3046 0.2857 1.1327 0.0833 0.1156 2.2144 0.2963 0.2077 1.0282 0.0484 0.0501 1.8028 0.3026 0.2103 1.0327 0.0523 0.0340 1.6943 0.2004 0.2405 0.1441 0.0256 0.0237 0.7474 0.1999 0.2515 0.1577 0.0261 0.0212 0.6564 0 0.57 0.905 0.96 1 0.3481 0.2902 1.2438 0.0886 0.1170 2.3943 0.3428 0.2219 1.1509 0.0513 0.0482 1.9293 0.3398 0.2279 1.1543 0.0491 0.0346 1.8288 0.2097 0.2522 0.1555 0.0282 0.0227 0.7677 0.1955 0.2433 0.1369 0.0239 0.0223 0.6220 92 Example 4.2. Now the simulation in a genetic setup is considered. The responses were generated from model (4.1) with SNP Xj (j = 0, . . . , p) coded as 3 categories (1,0,-1) for genotypes (AA,Aa,aa) respectively. The LD based simulation scheme was adopted to generate the genotype data. Let pA and pB be the minor allele frequencies (MAFs) of two risk alleles A and B for two adjacent SNPs, with linkage disequilibrium denoted as δ. The frequencies of four haplotypes are pab = (1−pA )(1−pB )+δ, pAb = pA (1−pB )−δ, paB = (1−pA )pB −δ, and pAB = pA pB +δ. Assuming the Hardy-Weinberg equilibrium, we can simulate the SNP genotype at locus A from a multi-nomial distribution with frequencies p2 ,2pA (1−pA ) and (1−pA )2 for genotypes A (AA,Aa,aa) respectively. SNP genotype at locus B conditional on that at locus A can be simulated based on the conditional probability matrix in Cui et al. (2008)[39]. Table ??–?? summarize the estimation results in example 2. For all the four approaches, as the minor allele frequency (MAF) pA goes up from 0.1 to 0.5, the proportion of choosing the correct model generally increases, and the estimation error reduces. Group LASSO consistently performed worse than its counterparts in terms of the two criteria. Under the standard normal error, when p =10, the difference in performance among the 3 different MAFs, especially between MAF 0.3 and 0.5, is not significant for all the four methods. However, as p increases, dramatic differences are observed. Starting from p=100, all the approaches can barely choose the exact true model as pA=0.1, and the corresponding TIMSEs are quite large. Given MAF 0.3, the performance of all the approaches except group TLP in very high dimensions,such as p=200 and 400, fall way behind that given MAF 0.5. The performance of group TLP is relative stable compared to the other 3 methods, especially when the dimension is extremely high, as p=400. We observe similar patterns when the procedures are assessed under t error with 3 de93 grees of freedom. In general, the performances of all the procedures under standard normal error are superior. Group TLP outperforms the others in all the scenarios. Table 4.3: Simulation results of Example 4.2, p = 10, n = 200, N (0,1) error pA=0.1 pA=0.3 pA=0.5 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE gLASSO 0.44 N (0,1) error gALASSO gSCAD 0.795 0.935 gTLP 0.96 Oracle 1 0.7266 1.1823 1.8650 0.4954 0.2002 4.5192 0.7252 1.5831 1.3619 0.2529 0.2767 4.2036 0.7574 0.6841 0.7610 0.1010 0.0650 2.3704 0.3981 0.4124 0.3528 0.0615 0.0415 1.2934 0.3895 0.4861 0.3777 0.0573 0.0416 1.3521 0.615 0.98 0.99 0.985 1 0.2152 0.4975 1.0013 0.4007 0.1559 2.2862 0.2131 0.2814 0.3706 0.0467 0.0665 0.9808 0.2122 0.2064 0.2660 0.0268 0.0254 0.7424 0.1785 0.2204 0.1237 0.0245 0.0195 0.5699 0.1783 0.2277 0.1259 0.0237 0.0195 0.5751 0.73 0.97 0.99 0.99 1 0.2152 0.4975 1.0013 0.4007 0.1559 2.2862 0.2131 0.2814 0.3706 0.0467 0.0665 0.9808 0.2122 0.2064 0.2660 0.0268 0.0254 0.7424 0.1785 0.2204 0.1237 0.0245 0.0195 0.5699 0.1783 0.2277 0.1259 0.0237 0.0195 0.5751 94 Table 4.4: Simulation results of Example 4.2, p = 10, n = 200, t(3) error t(3) error pA=0.1 pA=0.3 pA=0.5 gLASSO Oracle Perc. 0.33 IMSE β0 (u) 1.6609 β1 (u) 1.4901 β2 (u) 1.9043 β3 (u) 0.5425 β4 (u) 0.4238 TIMSE 6.1169 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE gALASSO 0.54 gSCAD 0.705 gTLP 0.79 Oracle 1 1.6954 2.3887 2.2271 0.6404 0.6674 7.6370 1.7427 1.7687 1.6040 0.3444 0.2963 5.8026 1.0722 0.9836 0.9273 0.2340 0.1649 3.5783 0.9165 1.0244 0.9033 0.1560 0.1261 3.1263 0.25 0.895 0.935 0.955 1 0.3432 0.5782 1.1033 0.3668 0.2144 2.6777 0.3406 0.4111 0.6824 0.0949 0.1224 1.6856 0.3432 0.4861 0.7795 0.0828 0.0648 1.7235 0.2534 0.3823 0.3340 0.0825 0.0592 1.1496 0.2491 0.3835 0.3098 0.0706 0.0522 1.0652 0.23 0.845 0.96 0.965 1 0.3432 0.5782 1.1033 0.3668 0.2144 2.6777 0.3406 0.4111 0.6824 0.0949 0.1224 1.6856 0.3432 0.4861 0.7795 0.0828 0.0648 1.7235 0.2534 0.3823 0.3340 0.0825 0.0592 1.1496 0.2491 0.3835 0.3098 0.0706 0.0522 1.0652 95 Table 4.5: Simulation results of Example 4.2, p = 100, n = 200, N (0,1) error pA=0.1 pA=0.3 pA=0.5 gLASSO Oracle Perc. 0 IMSE β0 (u) 5.9486 β1 (u) 2.1664 β2 (u) 2.5871 β3 (u) 7.4511 β4 (u) 4.1242 TIMSE 11.978 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE N (0,1) error gALASSO gSCAD 0 0.005 gTLP 0.015 Oracle 1 5.9577 4.2821 2.6322 0.5889 0.6425 14.4393 5.9433 5.0012 2.6478 3.4381 2.0292 14.2111 4.8575 1.0952 0.9247 0.0965 0.0693 8.7643 0.4180 0.5239 0.5210 0.0624 0.0508 1.5761 0.01 0.565 0.82 0.99 1 1.0107 0.5794 1.1485 0.2246 0.1854 3.2486 1.0204 0.4570 0.9477 0.0596 0.1118 2.6329 1.0199 0.5103 1.5749 0.0674 0.0457 3.2207 0.1690 0.2204 0.1567 0.0233 0.0153 0.5847 0.1666 0.2202 0.1237 0.0232 0.0153 0.5489 0.55 1 1 1 1 0.1593 0.3242 0.9660 0.1357 0.1328 1.7231 0.1588 0.2071 0.8267 0.0330 0.0588 1.2843 0.1606 0.2637 1.1570 0.0360 0.0198 1.6370 0.1561 0.1953 0.1088 0.0197 0.0142 0.4942 0.1561 0.2017 0.1145 0.0197 0.0143 0.5063 96 Table 4.6: Simulation results of Example 4.2, p = 100, n = 200, t(3) error t(3) error pA=0.1 pA=0.3 pA=0.5 gLASSO Oracle Perc. 0 IMSE β0 (u) 6.4423 β1 (u) 2.2632 β2 (u) 2.6220 β3 (u) 0.6904 β4 (u) 0.6988 TIMSE 13.0771 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE gALASSO 0 gSCAD 0 gTLP 0 Oracle 1 6.4941 3.5782 3.3151 1.0206 1.5284 16.6350 6.4759 4.2976 3.2927 0.3533 0.6974 15.6337 5.3221 1.9710 1.9261 0.2403 0.1799 12.0716 0.8603 0.9942 0.9143 0.1557 0.1369 3.0614 0 0.065 0.27 0.86 1 1.9184 0.6734 1.3878 0.2408 0.2472 5.1465 1.9154 0.6233 1.4093 0.1414 0.1723 5.2540 1.9258 0.8530 2.1560 0.1447 0.0857 5.9097 0.3391 0.4326 0.5030 0.0698 0.0474 2.0213 0.2511 0.3638 0.3218 0.0650 0.0439 1.0456 0.035 0.725 0.935 0.985 1 0.2407 0.3715 1.3128 0.1391 0.1893 2.5354 0.2443 0.2827 1.2550 0.1130 0.1120 2.1107 0.2407 0.3359 1.4448 0.1130 0.0783 2.2240 0.2057 0.3258 0.2307 0.0507 0.0361 0.9181 0.2088 0.3332 0.2403 0.0523 0.0346 0.8692 97 Table 4.7: Simulation results of Example 4.2, p = 200, n = 200, N (0,1) error pA=0.1 pA=0.3 pA=0.5 gLASSO Oracle Perc. 0 IMSE β0 (u) 6.4035 β1 (u) 2.1184 β2 (u) 2.3654 β3 (u) 0.7732 β4 (u) 0.4501 TIMSE 12.2412 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE N (0,1) error gALASSO gSCAD 0 0 gTLP 0 Oracle 1 6.4500 4.3209 2.6648 0.4838 0.5315 14.8546 6.4518 4.9639 2.8046 0.3425 0.1985 14.7897 5.0816 1.1121 0.8625 0.0938 0.5025 8.9683 0.3961 0.4109 0.3500 0.0586 0.0426 1.2582 0 0.075 0.345 0.98 1 1.8631 0.6342 1.3076 0.2365 0.2280 4.4824 1.8954 5.5156 1.2343 0.0549 0.1338 4.0975 1.8833 0.8991 2.3473 0.0628 0.0575 5.2837 0.1987 0.2307 0.1570 0.0238 0.0192 0.6483 0.1703 0.2113 0.1374 0.0230 0.0193 0.5613 0.44 0.985 1 1 1 0.1588 0.2979 1.0129 0.1180 0.1310 1.7286 0.1594 0.2000 0.9003 0.0343 0.0676 1.3615 0.1583 0.2639 1.2411 0.0433 0.0216 1.7282 0.1525 0.2003 0.1141 0.0178 0.0158 0.5006 0.1525 0.2064 0.1157 0.0178 0.0159 0.5083 98 Table 4.8: Simulation results of Example 4.2, p = 200, n = 200, t(3) error t(3) error pA=0.1 pA=0.3 pA=0.5 gLASSO Oracle Perc. 0 IMSE β0 (u) 7.8770 β1 (u) 2.237 β2 (u) 2.542 β3 (u) 0.8399 β4 (u) 0.7117 TIMSE 14.7035 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE gALASSO 0 gSCAD 0 gTLP 0 Oracle 1 7.8787 3.5923 3.4581 0.8507 1.4850 18.344 7.8986 4.4315 3.4735 0.5609 0.5707 17.356 5.628 2.0438 2.2584 0.2691 0.1665 12.8456 0.8906 1.0194 0.8541 0.1614 0.1282 3.0536 0 0 0.015 0.705 1 3.0900 0.7765 1.4769 0.2164 0.3116 6.5667 3.1150 0.7070 1.6380 0.1412 0.2076 6.8139 3.1170 1.0440 2.6080 0.1220 0.1040 7.2260 0.7914 0.5144 0.5512 0.0745 0.0536 2.5457 0.2705 0.4026 0.3126 0.0715 0.0480 1.1053 0 0.695 0.915 0.96 1 0.2872 0.4168 1.4403 0.1521 0.2092 2.9052 0.2928 0.3450 1.4106 0.0960 0.1303 2.5833 0.3075 0.3626 1.5519 0.1116 0.0670 2.4056 0.2178 0.3954 0.2830 0.0578 0.0557 1.0997 0.2089 0.3450 0.3179 0.0553 0.0461 0.9732 99 Table 4.9: Simulation results of Example 4.2, p = 400, n = 200, N (0,1) error pA=0.1 pA=0.3 pA=0.5 gLASSO Oracle Perc. 0 IMSE β0 (u) 6.9606 β1 (u) 2.1276 β2 (u) 2.4348 β3 (u) 0.6920 β4 (u) 0.5384 TIMSE 12.8890 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE N (0,1) error gALASSO gSCAD 0 0 gTLP 0 Oracle 1 6.8309 4.0791 2.8612 0.6841 0.9365 15.8640 6.8046 4.8938 2.8865 0.4083 0.2659 15.3025 5.1098 1.2151 1.1964 0.1016 0.0713 9.3504 0.3685 0.4558 0.3100 0.0510 0.0392 1.2246 0 0 0.095 0.88 1 2.783 0.675 1.422 0.236 0.278 5.713 2.792 0.572 1.467 0.081 1.763 5.596 2.793 0.810 2.753 0.090 0.090 6.643 0.421 0.313 0.364 0.030 0.021 1.258 0.181 0.243 0.135 0.023 0.017 0.600 0.29 0.985 1 0.995 1 0.1817 0.2926 1.1169 0.1110 0.1392 1.8554 0.1825 0.2092 1.0209 0.0401 0.0676 1.5204 0.1831 0.2928 1.3567 0.0426 0.0250 1.9003 0.1558 0.2169 0.1325 0.0205 0.0152 0.5409 0.1551 0.2232 0.1162 0.0204 0.0152 0.5300 100 Table 4.10: Simulation results of Example 4.2, p = 400, n = 200, t(3) error pA=0.1 pA=0.3 pA=0.5 gLASSO Oracle Perc. 0 IMSE β0 (u) 8.3537 β1 (u) 2.1540 β2 (u) 2.5676 β3 (u) 0.8431 β4 (u) 0.7683 TIMSE 15.1398 Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE Oracle Perc. IMSE β0 (u) β1 (u) β2 (u) β3 (u) β4 (u) TIMSE t(3) error gALASSO gSCAD 0 0 gTLP 0 Oracle 1 8.1152 3.5756 3.3866 0.6975 1.3577 18.0344 8.2099 4.5341 3.4152 0.4919 0.6356 17.5993 6.0230 2.1497 2.5080 0.2443 0.1642 13.2892 0.9324 0.9725 0.9093 0.1773 0.1549 3.1463 0 0 0 0.425 1 4.127 0.812 1.562 0.264 0.359 7.973 4.114 0.681 1.815 0.187 0.244 8.427 4.123 0.940 2.985 0.162 0.143 8.711 1.968 0.701 0.823 0.114 0.076 5.132 0.279 0.427 0.305 0.071 0.057 1.146 0 0.61 0.935 0.97 1 0.3306 0.4178 1.4987 0.1294 0.2179 3.084 0.3289 0.3265 1.4649 0.0916 0.1178 2.5707 0.3317 0.3698 1.6647 0.0888 0.0518 2.5263 0.1991 0.3515 0.3334 0.0485 0.0457 1.1916 0.2016 0.3564 0.2950 0.0468 0.2438 0.9435 101 4.4 Real data analysis We used the genome-wide association data from Genetic Analysis Workshop (GAW) 18 for 142 unrelated subjects to illustrate the utility of our approach. It is widely recognized that individual’s blood pressure is related to age. The aim of our study is to track down the genetic variants that can interpret the variation in Diastolic Blood Pressure (DBP) caused by non-linear penetrance effect over time (age). The environment condition is defined as age, and the blood pressure, measured as DBP, might not be the same for an individual with the same gene but of different environmental exposures. This is triggered by the complex interplay between the age and gene effects. The dataset was cleaned by removing SNPs with MAF less than 0.05 or departure from Hardy-Weinberg equilibrium. Subjects with missing DBP or age, as well as with more than 1/3 missing genotypes, were excluded from the dataset before final analysis. We take a subset of 250 SNPs from chromosome 9 and imputed the missing value before the final analysis. We applied our approach to the data and select the model Y = β0 (z) + βj1 Xj1 + βj2 Xj2 + ε where Xj1 and Xj2 correspond to SNP rs723877 and rs10972462, respectively. Both coefficients βj1 and βj1 are varying. Thus we identified two genetic risk variants that are sensitive to age to affect blood pressure. A cross validation examination, the single SNP based analysis in Ma et al [37], was performed for SNPs on chromosome 9. From the over all association test corresponding to LM, LMI and VC models, we calculated p-values denoted as P CON, P LIN and P VC. SNPs with at least one of the three p-values less than 5E-06 were tabulated in Table 4.11. For SNP 102 rs10972462, testing on the constant coefficients implies that the coefficient function of this SNP does change across age, as P CON¡ 0.05. Further test on linear structure indicates that rs10972462 has a linear interaction with age (P LIN¿ 0.05). Hence it makes sense that the p-value obtained from the test corresponding to LMI model is smaller than its counterpart with LM and VC model. A similar observation can be concluded for SNP rs723877. Table 4.11: List of SNPs with p-value < 5E-06 on Chromosome 9 SNP ID GeneName Location P VC P CON P LIN P LM P LMI PI 2.83E-05 3.01E-05 2.24E-06 1.24E-06 0.00357 0.00175 LMI model rs723877 rs10972462 4.5 UNC13B UNC13B Chr9 Chr9 6.52E-06 2.13E-06 0.00649 0.00208 0.12311 0.07052 Discussion A growing number of pieces of evidence have demonstrated the importance of G×E interactions in complex traits, as the responses of genetic factors to environmental exposures play a crucial role in affecting disease risks and variations of quantitative traits. Many statistical methods have been developed to explore G×E interactions. The linear assumption of gene effect under environmental stimuli on which these methods rely is often violated in practice. Ma et al [37] and Wu and Cui [68] relaxed the linear assumption to allow for a non-linear genetic penetrance effect for continuous and binary disease phenotype, respectively. The true effect of G×E interactions is captured by the model itself, through a sequence of likelihood ratio tests. A common feature of these methods, including Ma et al [37] and Wu and Cui [68], and those reviewed in Cornelis et al [65], is that the identification of the presence of G×E interactions is casted in a single genetic variant based hypothesis testing framework. To 103 our best knowledge, Wu and Cui [93] for the first approached the problem from a high dimensional variable selection perspective. Specifically, within the VC model framework, the presence of G×E interactions, no G×E interactions and no association with the phenotype can be determined by separating the VC coefficient functions after B spline basis expansion approximations as varying, non-zero constant and zero, correspondingly. The effect separation and variable selection in Wu and Cui [93] is attained by penalized estimation on the model parameters so some varying coefficients are continuously shrunk to a non-zero constant or zero. The procedure is dependent on local quadratic approximations (LQA) to SCAD penalty function. However, LQA leads to loss in efficiency and accuracy, especially when dimension p is large, due to frequent factorizations of large matrices. The local linear approximation (LLA) to penalty functions proposed in Zou and Li [94] improved LQA but then was shown outperformed by coordinate descent method (CD), as in Breheny and Huang [87]. In this work, we integrate the group version of CD, or GCD, in the two-stage iterative procedure to exploit G×E interactions in a high dimensional setting. The most prominent character of GCD is the penalized objective function is optimized over one individual predictor group at a time, so the computational complexity only increase linearly with the dimension p. This attribute ensures the superior performance of our framework. We examined both convex (LASSO, adaptive LASSO) and non-convex (TLP, SCAD) penalty functions. Extensive simulation results manifest that all the penalty except LASSO perform satisfactorily, and TLP excels in all the scenarios. As p grows from low to very high dimensions, the performance of approaches with all penalty functions remain relative stable, though slight drops was observed. The phenomenon indicates the advantage of our framework from another perspective. 104 Chapter 5 Concluding Remarks It has been widely recognized that the naturally occurring variations in most complex disease traits are not merely explained by genetic factors, but also can be understood from the mechanism of genetic responses to environmental mediators. G×E interactions, the genetic control of the pattern to environmental stimuli, shed novel light on examining the trait variations. This dissertation focuses mainly on developing powerful statistical methods to tackle the challenges originated from G×E interactions. In chapter 2, we developed a new statistical approach to extend the varying coefficient model for continuous quantitative response in our previous work to the binary disease response. The varying coefficients were estimated by the nonparametric B spline method due to its computational expediency and nice asymptotic properties. Our scheme has particular advantage in hunting down the fluctuation in gene functions across environmental changes. A significant boost in power were indicated in the simulation study when underlying penetrance effect of genetic variants is non-linear. The simulation results also show that when the model for underlying mechanism of G×E interactions is misspecified, VC model may not have the higher power than LM and LMI models, since it is much more complex and needs large degrees of freedom for hypothesis testing. To determine which model fits the data more appropriately, we developed a sequential testing scheme. Our scheme is applied to two Type 2 Diabetes studies by first evaluating 105 constant coefficients, then linear and varying coefficients. The novel disease signals captured by VC model in our framework could be missed if our focus is restricted to linear model only. A broad spectrum of available methodologies in exploring G×E interactions are coined within single genetic variant based hypothesis testing framework. Set based association study, such as the gene-centric, gene-set and pathway based analysis, has continued to soar in popularity as its advantage has increasingly been acknowledged. In chapter 3, we proposed a variant set based framework to examine how variants in a genetic system are mediated by a common environment factor to influence quantitative phenotypic response. We tackle the issue from a high dimensional variable selection standpoint. Specifically, we can identify the sensitivity of genetic variants to environment stimuli, which is tantamount to determine the coefficient function as varying, non-zero constant and zero, corresponding to cases of existence of G×E interactions, no G×E interactions and no genetic effects. The procedure was implemented in a two stage iterative framework. We established the selection consistency and oracle properties of the penalized estimator with SCAD penalty, and demonstrated dramatic improvement in finite sample performance over the adaptive LASSO in simulation study, in terms of oracle percentage and estimation accuracy. The application of our approach to the JAK/STAT signaling pathway in LGA/SGA study correctly select the risk SNP without G×E effect. This framework is critically dependent on local quadratic approximations (LQA). Because of repeated factorizations of matrices, significance loss in computational efficiency and estimation precision will be caused when dimension is large, especially in scenarios of p > n. To overcome this issue, we developed the group coordinate descent (GCD) approach within the two-stage iterative framework in chapter 4. After basis expansion, the penalized 106 loss function is minimized with regard to single predictor group at a time, and all the predictor groups are cycled until convergence of the algorithm. Therefore, the computational complexity only grows linearly with dimension p. Through extensive simulation study with different penalty functions (LASSO, adaptive LASSO, TLP and SCAD), we demonstrated the merit of this framework. Our approach yields high oracle percentages and estimation precision even when dimension p is much larger than sample size n. The main objective of this dissertation is to develop novel statistical methodology for the elucidation of complicated machinery in G×E interactions. By implementing different link functions, our framework on investigating the non-linear G×E interactions can be readily extended to different types of phenotypic responses such as count or survival outcomes. We can also try to test if the novel hypothesis in Perry et al [52] that the significance of potential risk loci could be enhanced by the stratification of patients with Type 2 Diabetes based on BMI will result in any new findings within our framework. To the best of our knowledge, this dissertation first proposed the scheme of approaching G×E interactions from a high dimensional variable selection perspective. Generalizations of our framework to binary disease response in case control association study and survival response are currently undergoing. The selection consistency and oracle property when dimension of predictors p exceeds the sample size n for the procedure will also be examined. It is worth noting that to explore G×E interactions in ultra-high dimensional feature space, we can integrate the sure independence screen (SIS) procedure in Fan and Lv [99] into our framework. Relevant investigations will be carried out in the near future. 107 BIBLIOGRAPHY 108 BIBLIOGRAPHY [1] M. Lynch and B. Walsh. Genetics and Analysis of Quantitative Traits. Sinauer Associates,Sunderland, MA, 1998. [2] R. Wu, C.X. Ma and G. Casella. Statistical Genetics of Quantitative traits– Linkage, Maps and QTL. Springer, 2007. [3] R.S. Spielman, R.E. McGinnis, W.J. Ewens. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 52 (3): 506-516, 1993. [4] N. Risch and K. Merikangas. The future of genetics studies of complex human diseases. Science 273: 1516–1517, 1996. [5] J. Altmuller, L.J. Palmer, G. Fischer, H. Scherb and M. Wjst. Genome-wide scans of complex human diseases: true linkage is hard to find. Am J Hum Genet. 69(3): 936–950, 2001. [6] International HapMap Consortium. The haplotype map of the human genome. Nature 437: 1299-1320, 2005. [7] M.F. Moffatt, M. Kabesch, L. Liang, A.L. Dixon, D. Strachan, et al. Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448(7152):470–473, 2007. [8] D.F. Easton, K.A. Pooley, A.M. Dunning, P.D.P. Pharoah, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447: 1087–1093, 2007. [9] D.J. Hunter, P. Kraft, K.B. Jacobs, D.G. Cox, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet. 39(7): 870–874, 2007. [10] A. Helgadottir, G. Thorleifsson, A. Manolescu, S. Gretarsdottir, et al. A Common Variant on Chromosome 9p21 Affects the Risk of Myocardial Infarction. Science 316(5830): 1491–1493, 2007. 109 [11] R. McPherson, A. Pertsemlidis, N. Kavaslar, A. Stewart, et al. A common allele on chromosome 9 associated with coronary heart disease. Science 316(5830): 1488–1491, 2007. [12] R. Sladek, G. Rocheleau, J.Rung, C. Dina, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445: 881–885, 2007. [13] L.J. Scott, K.L. Mohlke, L.L. Bonnycastle, C.J. Willer, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316(5829): 1341–1345, 2007. [14] M.I. McCarthy, G.R. Abecasis, L.R. Cardon, D.B. Goldstein et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9(5): 356–369, 2008. [15] T.A. Manolio, F.S. Collins, N.J. Cox, D.B. Goldstein, et al. Finding the missing heritability of complex diseases. Nature 461(7265): 747–753,2009. [16] D.E. Reich and E.S. Lander. On the allelic spectrum of human disease. Trends Genet. 17(9): 502–510, 2001. [17] J.K. Pritchard and N.J. Cox. The allelic architecture of human disease genes: common disease-common variant...or not? Hum. Mol. Genet. 11(20): 2417–2423, 2002. [18] E. Zeggini, W. Rayner, A.P. Morris, A.T. Hattersley, et al. An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets. Nat. Genet. 37: 1320–1322, 2005. [19] E.R. Mardis. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9:387–402, 2008. [20] E.R. Mardis. The impact of next-generation sequencing technology on genetics. Trends Genet. 24(3): 133–141, 2008. [21] J. Shendure and H. L. Lee. Next-generation DNA sequencing. Nature Biotech. 26: 1135–1145, 2008. [22] M.L. Metzker. Sequencing technologies - the next generation. Nat Rev Genet. 11(1):31–46 ,2009 110 [23] M.I. McCarthy and J.N. Hirschhorn. Genome-wide association studies: potential next steps on a genetic journey. Hum. Mol. Genet. 17(R2): R156–65, 2008. [24] J.B.S. Haldane. Heredity and Politics. New York, NY: W W Norton, 1938. [25] J.B.S. Haldane. The interaction of nature and nurture. Ann. Eugen. 13: 197–205, 1947. [26] D.S. Falconer. The problem of environment and selection. Amer. Natural. 86: 293– 298, 1952. [27] D.J. Hunter. Gene-environment interactions in human diseases. Nat. Rev. Genet. 6(4): 287–298, 2005. [28] C.M. Ulrich, E. Kampman, J. Bigler, C. Chen, et al. Colorectal adenomas and the C677T MTHFR polymorphism: evidence for gene-environment interaction? Cancer Epidemiol. Biomarkers Prev. 8(8): 659–668, 1999. [29] L.A. Mai. Genetic and exposure risks for chronic beryllium disease. Clin. Chest Med. 23: 827–839, 2002. [30] J.L. Ree. The genetics of sun sensivity in humances. Am. J. Hum. Genet. 75: 739– 751, 2004. [31] L.G. Costa and D.L. Eaton. Gene-Environment Interactions: Fundamentals of Ecogenetics. Hoboken, NJ: John Wiley & Sons, 2006. [32] A. Caspi and T.E. Moffitt. Gene-environment interactions in psychiatry: joining forces with neuroscience. Nat. Rev. Neurosci. 7(7): 583–590, 2006. [33] L.W. Hahn, M.D. Ritchie and J.H. Moore. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 19(3): 376–382, 2003. [34] S.W. Guo. Gene-environment interaction and the mapping of complex traits: some statistical models and their implications. Hum. Hered. 50(5): 286–303, 2000. [35] N. Chatterjee and R.J. Carroll. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92: 399– 418, 2005. 111 [36] A. Maity, R.J. Carroll, E. Mammen and N. Chatterjee. Testing in semiparametric models with interaction, with applications to gene-environment interactions. J. R. Statist. Soc. B 71: 75–96, 2009. [37] S.J. Ma, L.J. Yang, R. Romero and Y.H. Cui. Varying coefficient models for geneenvironment interaction: a non-linear look. Bioinformatics 27(15): 2119–2126, 2011. [38] K. Wang, M.Y. Li and H. Hakonarson. Analysing biological pathways in genome-wide association studies. Nat. Rev. Genet. 11, 843–854, 2010. [39] Y.H. Cui, G.L. Kang, K.L. Sun, R. Romero, M. Qian, and W.J. Fu. Gene-centric genomewide association study via entropy. Genetics 179: 637–650, 2008. [40] C. Wu and Y.H. Cui. Boosting signals in gene-based association studies via efficient SNP selection. Brief. Bioinform. in press, 2013. [41] J.Z. Liu, A.F. McRae, D.R. Nyholt, S.E. Medland, et al. A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet. 87(1):139–145, 2010. [42] P. Zimmet, K. G. M. M. Alberti and J. Shaw. Global and societal implications of the diabetes epidemic. Nature 414: 782–787, 2001. [43] C.J. Patel, R. Chen, K. Kodama, J.P. Ioannidis and A.J. Butte. Systematic identification of interaction effects between genome- and environment-wide associations in type 2 diabetes mellitus. Hum. Genet. 132: 495–508, 2013. [44] A.P. Feinberg. Phenotypic plasticity and the epigenetics of human disease. Nature 447: 433–440, 2004. [45] M. Peacock, C.H. Turner, M.J. Econs and T. Foroud. Genetics of osteoporosis. Endocr. Rev. 23:303-326. 2002 [46] D.B. Sparrow, G. Chapman, A.J.Smith, M.Z. Matter, J.A. Major, et al. A mechanism for gene-environment interaction in the etiology of congenital scoliosis. Cell 149: 295306, 2012. [47] V.S. Laitala, J. Kaprio and K. Silventoinen. Genetics of coffee consumption and its stability. Addiction 103: 2054–2061, 2008. 112 [48] J.A. Martinez, M.S. Corbalan, A. Sanchez-Villegas, L. Forga, et al. Obesity risk is associated with carbohydrate intake in women carrying the Gln27Glu beta2adrenoceptor polymorphism. J. Nutr. 133: 2549–2554, 2003. [48] B. Mukherjee, J. Ahn, S.B. Gruber and N. Chatterjee. Testing gene-environment interaction in large-scale case-control association studies: possible choices and comparisons. Am. J. Epidemiol. 175: 177–190, 2012. [49] J.Q.Fan and W. Zhang. Statistical methods with varying coefficient models. Stat. Interface 1: 179–195, 2008. [50] J.H. Huang, C. Wu and L. Zhou. Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica 14: 763–788, 2004. [51] J.Z. Huang, C.O. Wu and L. Zhou. Varying-coefficient models and basis function approximation for the analysis of repeated measurements. Biometrika 89: 111–128, 2002. [52] J.R.B. Perry, B.F. Voight, L. Yengo, N. Amin, J. Dupuis, et al. Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases. PLoS Genet. 8(5):e1002741. doi:10.1371/journal.pgen.1002741. 2012. [53] M.C. Cornelis, A. Agrawal, J.W. Cole, N.N. Hansel, K.C. Barnes, et al. The Gene, Environment Association Studies consortium (GENEVA): maximizing the knowledge obtained from GWAS by collaboration across studies of multiple conditions. Genet. Epidemiol. 34: 364–372, 2010. [54] G.A. Colditz and S.E. Hankinson. The Nurse’s Health Study: lifestyle and health among women. Nat. Rev. Cancer 5: 388–396, 2005. [55] E.B. Rimm, E.L. Giovannucci, W.C. Willett, G.A. Colditz, A. Ascherio, B. Rosner and M.J. Stampfer. Prospective study of alcohol consumption and risk of coronary disease in men. Lancet 338: 464–468, 1991. [56] T.L. Holbrook, E. Barrett-Connor and D.L. Wingard. The association of lifetime weight and weight control patterns with diabetes among men and women in an adult community. Int. J. Obes. 13: 723–729, 1989. [57] V.J. Carey, E.E. Walters, G.A. Colditz, C.G. Solomon et al. Body fat distribution and risk of non-insulin-dependent diabetes mellitus in women. The Nurses’ Health Study. Am. J. Epidemiol. 145: 614–619, 1997. 113 [58] J.M. Chan, E.B. Rimm, G.A. Colditz, M.J. Stampfer, W.C. Willett. Obesity, fat distribution, and weight gain as risk factors for clinical diabetes in men. Diabetes Care 17: 961–969, 1994. [59] S.F. Grant, G. Thorleifsson, I. Reynisdottir, R. Benediktsson, A. Manolescu, et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38: 320–323, 2006. [60] T. Jin and L. Liu. The Wnt signaling pathway effector TCF7L2 and type 2 diabetes mellitus. Mol. Endocrinol. 22: 2383–2392, 2008. [61] M. I. McCarthy. Genomics, Type 2 Diabetes, and Obesity.N. Engl. J. Med. 363: 2339–2350, 2010. [62] L. Qi and Y.A. Cho. Gene-environment interaction and obesity. Nutr. Rev. 66: 684– 694, 2008. [63] L. Liu, Y. Li, T.O. Tollefsbol. Gene-environment interactions and epigenetic basis of human diseases. Curr. Issues Mol. Biol. 10: 25–36, 2008. [64] A.P. Feinberg and R.A. Irizarry. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc. Natl. Acad. Sci. USA 107: 1757–1764, 2010. [65] M.C. Cornelis, E.J. Tchetgen,L.M. Liang, L. Qi,N. Chatterjee, F.B. Hu, and P. Kraft. Gene-environment interactions in genome-wide association studies: A comparative study of tests applied to empirical studies of type 2 diabetes. Am. J. Epidemiol. 175(3):191–202, 2011. [66] O. Vaccaro, M. Boemi, F. Cavalot, P. De Feo, R. Miccoli, et al. The clinical reality of guidelines for primary prevention of cardiovascular disease in type 2 diabetes in Italy. Atherosclerosis 198: 396–402, 2008. [67] A.L. Price, N.J. Patterson, R.M. Plenge, M.E. Weinblatt, et al. Principal components analysis corrects for stratification in genome-wide association. Nat. Genet. 38: 904– 909, 2006. [68] C. Wu and Y.H. Cui. A novel method for identifying nonlinear gene-environment interactions in case-control association studies. (Under review) 2013. 114 [69] N. Chatterjee and R.J. Carroll. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92:399– 418, 2005. [70] A. Maity, R.J. Carrol, E. Mammen and N. Chatterjee. Testing in semiparametric models with interaction, with applications to gene-environment interactions. J. R. Statist. Soc. B 71: 75–96, 2009. [71] D.J. Schaid, J.P. Sinnwell, G.D. Jenkins, et al. Using the gene ontology to scan multilevel gene sets for associations in genome wide association studies. Genet. Epidemiol. 36: 3-16, 2012. [72] B. Efron, R. Tibshirani. On testing the significance of sets of genes. Ann. Appl. Stat. 1: 107-129, 2007. [73] Y.L. Tang, H.X. Wang, Z.Y. Zhu and X.Y. Song. A unified variable selection approach for varying coefficient models. Statist. Sinica 22: 601–628, 2012. [74] T. Hastie and R. Tibshirani. Varying-coefficient models. J. R. Statist. Soc. B 55: 757–796, 1993. [75] J.Q. Fan and R.Z. Li. Variable selection via nonconcave penzlied likelihood and its oracle properties. J. Amer. Stat. Assoc. 96: 1348–1360, 2001. [76] G. Schwarz. Estimating the dimension of a model. Ann. Stat. 6: 461–464, 1978. [77] M.O. Kim. Quantile regression with varying coefficients. Ann. Stat. 35: 92-108, 2007. [78] L.F. Wang, H.Z. Li and J.Z. Huang. Variable selection in nonparametric varyingcoefficient models for analysis of repeated measurements. J. Amer. Stat. Assoc. 103: 1556–1569, 2008. [79] L.L. Schumaker. Spline Functions: basic theory. Wiley, New York. 1981. [80] J. S. Rawlings, K. M. Rosler and D.A. Harrison. The JAK/STAT signaling pathway. J. Cell Sci. 117: 1281–1283, 2004. [81] B. Mukherjee, J. Ahn, S.B. Gruber, and N. Chatterjee. Testing gene-environment interaction in large-scale case-control association studies: possible choices and comparisons. Am. J. Epidemiol. 175: 177–190, 2012. 115 [82] B.M. Neale and P.C.Sham. The future of association studies: Gene-based analysis and replication. Am. J. Hum. Genet. 75: 353-362, 2004. [83] H.S. Wang and Y.C. Xia. Shrinkage Estimation of the Varying Coefficient Model. J. Amer. Stat. Assoc. 104: 747–757, 2009. [84] C.L. Leng. A simple approach for varying-coefficient model selection. J. Stat. Plan. Inf., 139: 2138–2146, 2138 [85] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B 68: 49–67, 2006. [86] F.R. Wei, J. Huang and H.Z. Li. Variable selection and estimation in high-dimensional varying coefficient models. Stat. Sinica 21: 1515–1540, 2011. [87] P. Breheny and J. Huang. Group descent algorithms for nonconvex penalized linear and losistic regression models with grouped predictors. (Under review). 2013 [88] J. Friedman, T. Hastie, H. H¨fling and R. Tibshirani. Pathwise Coordinate Optimizao tion. Ann. Appl. Stat., 1: 302–332, 2007. [89] T.T. Wu and K. Lange. Coordinate Descent Algorithms for Lasso Penalized Regression. Ann. Appl. Stat. 1: 224–244. 2008. [90] J. Friedman, T. Hastie and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22. 2010 [91] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B. 58(1): 267–288, 1996. [92] H. Zou. The adaptive lasso and its oracle properties. J. Amer. Stat. Assoc. 101(476): 1418–1429, 2006. [93] C. Wu and Y.H. Cui. High dimensional variable selection in gene-environment interactions. (Manuscript) 2013. [94] H. Zou and R.Z. Li. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models.Ann. Stat. 36(4): 1509–1533, 2008. 116 [95] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5(1): 232–253, 2011. [96] D. L. Donoho and I. M. Johnstone. Ideal denoising in an orthonormal basis chosen from a library of bases. Biometrika 81: 425–455, 1994. [97] X. T. Shen, W. Pan and Y.Z. Zhu. Likelihood-based selection and sharp parameter estimation. J. Amer. Stat. Assoc. 107: 223–232, 2012. [98] L. Xue and A. Qu. Variable selection in high-dimensional varying-coefficient models with global optimality. J. Mach. Learn. Res. 13: 1973–1998, 2012. [99] J.Q. Fan and J.C. Lv. Sure independence screening for ultra-high dimensional feature space. (with discussion) J. R. Statist. Soc. B, 70: 849–911, 2008. 117