Comprehensive Approaches to High-Dimensional Heterogeneous Data: Semiparametric Methods and Feature Selection
The advent of new technologies has introduced a broad spectrum of data types, necessitating methodologies that go beyond traditional linear models. When some covariates relate linearly to the response variable and others relate to it in unknown or non-linear ways, relying solely on either parametric or nonparametric methods can be inadequate. This dissertation addresses these complexities with a semiparametric approach called the partial linear trend filtering model. Instead of classical nonparametric methods, a more recent technique, trend filtering, is integrated into the high-dimensional partial linear model. Rigorous theoretical results, including convergence rates for the estimates, are established, and simulation studies indicate that these models handle heterogeneous data more effectively than traditional approaches.

Additionally, this dissertation explores high-dimensional partial linear quantile regression to assess the heterogeneous effects of covariates on different quantiles of the response variable. Applying trend filtering to partial linear quantile regression combines the strengths of both quantile regression and trend filtering, and the combination is supported by rigorous theoretical and simulation results. For practical validation, the partial linear quantile trend filtering model is applied to data from the Environment and Genetics in Lung Cancer Etiology (EAGLE) study, showcasing its applicability and effectiveness in real-world data analysis.

Furthermore, only a small number of works have addressed feature selection and False Discovery Rate (FDR) control within the high-dimensional quantile regression framework. Inspired by the model-X knockoff procedure [9], a new method is introduced for simultaneously controlling the FDR and detecting important covariates via the regional quantile regression approach. This three-step procedure identifies signals within the quantile region of interest rather than at a specific quantile level, effectively controlling the FDR. Simulation studies demonstrate the utility of this method, which is also applied to National Health and Nutrition Examination Survey (NHANES) data to identify risk factors associated with high body mass index.
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Lee, Sang Kyu
- Thesis Advisors: Hong, Hyokyoung G.; Weng, Haolei
- Committee Members: Galvao, Antonio; Cui, Yuehua
- Date Published: 2024
- Subjects: Statistics
- Program of Study: Statistics - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 118 pages
- Permalink: https://doi.org/doi:10.25335/gxvk-tb82