Kernel-based nonparametric testing in high-dimensional data with applications to gene set analysis
The ultimate goal of genome-wide association studies (GWAS) is to understand the underlying relationship between genetic variants and phenotype. While much of the heritability is missed by the univariate analysis of traditional GWAS, it is believed that the joint analysis of variants that function interactively in a biological pathway (gene set) is more effective at detecting association signals. With the rapid development of sequencing techniques, more detailed human genome variation will be observed, and hence the dimension of variants in a pathway could be extremely high. To model the systematic mechanism and the potential nonlinear interactions among the variants, in this dissertation we propose to model the set effect through a flexible nonparametric function under a high-dimensional setup, which allows the dimension to go to infinity as the sample size goes to infinity.

Chapter 2 considers testing a nonparametric function of high-dimensional variates in a reproducing kernel Hilbert space (RKHS), which is a function space generated by a positive definite or semidefinite kernel function. We propose a test statistic to test the nonparametric function under the high-dimensional setting. The asymptotic distributions of the test statistic are derived under the null hypothesis and under a series of local alternative hypotheses, for which explicit power formulas are also provided. We also develop a novel kernel selection procedure to maximize the power of the proposed test, as well as a kernel regularization procedure to further improve power. Extensive simulation studies and a real data analysis were conducted to evaluate the performance of the proposed method.

Chapter 3 is a theoretical investigation of the statistical optimality of the kernel-based test statistic under the high-dimensional setup, from the minimax point of view. In particular, we consider a high-dimensional linear model as the initial study. Unlike the sparsity or independence assumptions in the related literature, we discuss the minimax properties under a structure-free setting. We characterize the boundary that separates the testable region from the non-testable region, and show the rate-optimality of the kernel-based test statistic under certain conditions on the covariance matrix and the growth rate of the dimension.

Our work in Chapter 4 fills a gap in kernel-based testing with multiple candidate kernels under the high-dimensional setting. First, we extend the test statistic proposed in Chapter 2 to a more general form that allows adjustment for covariates. The asymptotic distribution of the new test statistic under the null hypothesis is then provided. Two practical and efficient strategies are developed to incorporate multiple kernel candidates into the testing procedure. Through comprehensive simulation studies we show that both strategies can control the type I error rate and improve power over a poor choice of candidate kernel in the set. In particular, the maximum method, one of the two strategies, is shown to have the potential to boost the power close to that obtained with the best candidate kernel. An application to Thai baby birth weight data further demonstrates the merits of our proposed methods.
- In Collections
-
Electronic Theses & Dissertations
- Copyright Status
- In Copyright
- Material Type
-
Theses
- Authors
-
He, Tao, Ph. D.
- Thesis Advisors
-
Cui, Yuehua
Zhong, Ping-Shou
- Committee Members
-
Jiang, Ning
Mandrekar, Vidyadhar
Steibel, Juan P.
- Date Published
-
2015
- Program of Study
-
Statistics - Doctor of Philosophy
- Degree Level
-
Doctoral
- Language
-
English
- Pages
- xi, 104 pages
- ISBN
-
9781321823554
132182355X