High dimensional statistical methods for gene-environment interactions

The genetic influence on complex disease traits generally depends on the joint effects of multiple genetic variants, environmental factors, and their interplay. Gene$\times$environment (G$\times$E) interactions play vital roles in determining an individual's disease risk, but the underlying genetic machinery is poorly understood. Traditional analysis, which assumes a linear relationship between genetic and environmental factors along with their interactions, is commonly pursued under a regression-based framework to examine G$\times$E interactions. This assumption, however, can be violated when genetic variants respond nonlinearly to environmental stimuli. As an extension of our previous work on continuous traits, we proposed a flexible varying-coefficient model for detecting nonlinear G$\times$E interactions with binary disease traits. The varying coefficients were approximated by nonparametric regression functions, through which one can assess the nonlinear response of genetic factors to environmental changes. A group of statistical tests was proposed to elucidate various mechanisms of G$\times$E interaction. The utility of the proposed method was illustrated via simulation and real data analysis with application to type 2 diabetes.

The power of genetic variant set-based association analysis over the single-variant approach has been increasingly recognized. We develop a variant set-based approach to examine how variants in a genetic system are mediated by a common environmental factor to affect the phenotypic response. The problem can be approached from a high-dimensional variable selection perspective. In particular, we can select genetic variants with varying, non-zero constant, and zero coefficients, which correspond to G$\times$E interactions, no G$\times$E interactions, and no genetic effects, respectively. The procedure was implemented in a two-stage iterative framework via the Smoothly Clipped Absolute Deviation (SCAD) penalty. Under proper regularity conditions, we can establish the consistency in variable selection and effect separation of our two-stage iterative estimator, as well as the optimal convergence rates of the estimates of the varying effects. In addition, it can be shown that the estimates of the non-zero constant coefficients enjoy the oracle property. The utility of our procedure will be demonstrated through extensive simulation study and real data analysis.

Because of the drawbacks of local quadratic approximation in the aforementioned two-stage framework, the approach is not efficient when the dimension $p$ is very large. A group coordinate descent (GCD) based approach was proposed within the framework. It is computationally efficient, particularly for high-dimensional problems where $p>n$, because the computational complexity increases only linearly with the number of predictor groups after basis expansion. The advantage of our method is demonstrated through extensive simulation study and real data analysis.
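
For readers unfamiliar with the SCAD penalty named in the abstract, the following is a minimal sketch of its standard textbook form for a single coefficient. The function name `scad_penalty` and the default `a = 3.7` (the value commonly suggested in the literature) are illustrative assumptions, not taken from the thesis. The thesis applies the penalty inside a two-stage iterative procedure that this sketch does not attempt to reproduce.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """Value of the standard SCAD penalty for one coefficient.

    Illustrative sketch only: `lam` is the tuning parameter (must be > 0)
    and `a` is the shape parameter (must be > 2).
    """
    if lam <= 0 or a <= 2:
        raise ValueError("need lam > 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the lasso.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic piece that tapers the penalty.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


if __name__ == "__main__":
    for theta in (0.0, 0.5, 2.0, 10.0):
        print(theta, scad_penalty(theta, lam=1.0))
```

The penalty behaves like the lasso near zero, which encourages exact zeros, and levels off for large coefficients, which is what allows nearly unbiased estimates of large effects.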

In Collections: Electronic Theses & Dissertations

Copyright Status: In Copyright

Material Type: Theses

Authors: Wu, Cen

Thesis Advisors: Cui, Yuehua

Committee Members: Wang, Lifeng
Zhong, Pingshou
Lu, Qing

Date Published: 2013

Subjects: Medical genetics--Statistical methods
Genotype-environment interaction
Phenotype
Research--Statistical methods

Program of Study: Statistics - Doctor of Philosophy

Degree Level: Doctoral

Language: English

Pages: xii, 117 pages

ISBN: 9781303310836
130331083X
