Simultaneous model selection and estimation of generalized linear models with high dimensional predictors

Over the past couple of decades, the growing use of technology has made enormous amounts of data, in many formats, available and easily accessible. The size and volume of available data sets have grown rapidly, and the world's technological capacity to store information has almost doubled every 40 months since the 1980s. As of 2020, 2.5 quintillion bytes of data are generated every day. An International Data Group (IDG) report predicted that the global data volume would grow exponentially, reaching 163 zettabytes by 2025. Such data are often high dimensional. Well-known statistical methods frequently fail on such data because of their limitations (e.g., in high-dimensional settings they often encounter issues such as non-unique solutions for the model parameters, inflated standard errors, overfitted models, and multicollinearity). This has led to a resurgence of interest in algorithms that are capable of handling massive quantities of data, extracting and analyzing information from it, and uncovering key insights that subsequently lead to decision making. The techniques used by these algorithms tend to speed up and improve the quality of predictive analysis; thus, they have found applications in various fields such as banking, financial markets, insurance, media, education, medicine, and information technology. For instance, medicine is becoming increasingly individualized, and drugs or treatments can be designed to target small groups, rather than large populations, based on characteristics such as medical history and genetic makeup. This kind of treatment is referred to as precision medicine. In the era of precision medicine, constructing interpretable and accurate predictive models, based on patients' demographic characteristics, clinical conditions, and molecular biomarkers, has been crucial for disease prevention, early diagnosis, and targeted therapy.
The models, for example, can be used to predict patients' susceptibility to disease, identify high-risk groups, schedule earlier or more frequent screening, and guide behavioral changes. Predictive models therefore play a central role in decision making. Several well-known approaches can be used to build such models. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after variable screening methods have been applied to reduce the number of variables. In this dissertation, we propose a procedure for fitting generalized linear models with ultrahigh-dimensional predictors. Our procedure can provide a final model, control both false negatives and false positives, and yield consistent estimates, which are useful for gauging the actual effect size of risk factors. In addition, under a sparsity assumption on the true model, the proposed approach can discover all of the relevant predictors within a finite number of steps. The thesis is organized as follows. Chapter 1 highlights the importance of predictive models and gives several examples where these models can be applied. Chapter 2 describes the well-known existing methods that have attempted to solve the aforementioned problems, along with their shortcomings. Chapter 3 proposes the STEPWISE algorithm and introduces the model setup with a detailed description, followed by its theoretical properties and the proofs of the theorems and lemmas used throughout the thesis. Additional lemmas used to construct the theory of the STEPWISE method are also stated. The chapter then presents results from numerical studies, including simulations and real data analyses.
The simulation studies comprise seven examples; they compare the STEPWISE algorithm with competing methods and provide numerical evidence of its superiority. The real data analyses involve studies of gene regulation in the mammalian eye, esophageal squamous cell carcinoma, and neurobehavioral impairment from total sleep deprivation, and demonstrate the utility of the proposed method in real-life scenarios. Chapter 4 proposes a multi-stage hybrid machine learning ensemble method that aims to enhance STEPWISE's performance. It also introduces a web application that employs the method. Finally, Chapter 5 concludes the thesis with final conclusions and discussion. The appendices include some tables and figures used throughout the thesis.


In Collections: Electronic Theses & Dissertations

Copyright Status: In Copyright

Material Type: Theses

Authors: Pijyan, Alex

Thesis Advisors: Hong, Hyokyoung

Date Published: 2022

Subjects: Statistics
Information technology
Linear models (Statistics)

Program of Study: Statistics - Doctor of Philosophy

Degree Level: Doctoral

Language: English

Pages: ix, 119 pages

ISBN: 9798426817975

Permalink: https://doi.org/doi:10.25335/nyxh-g161
