SIMULTANEOUS MODEL SELECTION AND ESTIMATION OF GENERALIZED LINEAR MODELS WITH HIGH DIMENSIONAL PREDICTORS By Alex Pijyan A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy 2022 ABSTRACT SIMULTANEOUS MODEL SELECTION AND ESTIMATION OF GENERALIZED LINEAR MODELS WITH HIGH DIMENSIONAL PREDICTORS By Alex Pijyan In the past couple of decades the progressive use of technology made the enormous amount of data in different formats available and easily accessible. The size and volume of available data sets have grown rapidly and the technological capacity of the world to store information has almost doubled every 40 months since the 1980s [1]. As of 2020, every day 2.5 quintillion of data are generated. Based on an International Data Group (IDG) report, the global data volume was predicted to grow exponentially and by 2025, IDG predicts there will be 163 zettabytes of data [2]. This enormous amount of data is often characterized by its high dimensionality. Quite often, well- known statistical methods fail to manage such data due to their limitations (e.g., in high-dimensional settings they often encounter various issues such as no unique solution for the model parameters, inflated standard errors, overfitted models, multicollinearity). This resulted in resurging interest in the algorithms that are capable of handling massive quantities of data, extracting and analysing information from it, and uncovering key insights that subsequently will lead to decision making. Techniques used by these algorithms are tend to speed up and improve the quality of predictive analysis, thus, they found their application in various fields. For instance, medicine becomes more and more individualized nowadays and drugs or treatments can be designed to target small groups, rather than big populations, based on characteristics such as medical history, genetic makeup etc. This kind of treatment is referred to as precision medicine. In the era of precision medicine, constructing interpretable and accurate predictive models, based on patients’ demographic characteristics, clinical conditions, and molecular biomarkers, has been crucial for disease prevention, early diagnosis and targeted therapy [3]. The models, for example, can be used to predict patients’ susceptibility to disease [4], identify high risk groups [5], and guide behavioral changes [6]. Therefore, predictive models play a central role in decision making. Several well-known approaches can be used to solve the problem mentioned above. Penalized regression approaches, such as least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after applying variable screening methods to downsize variables. In this dissertation, we would like to propose a procedure for fitting generalized linear models with ultrahigh-dimensional predictors. Our procedure can provide a final model, control both false negatives and false positives, and yield consistent estimates, which are useful to gauge the actual effect size of risk factors. In addition, under a sparsity assumption of the true model, the proposed approach can discover all of the relevant predictors within a finite number of steps. The thesis work is organized as follows. 
Chapter 1 highlights an importance of predictive models and names several examples where these models can be implemented. The main focus of Chapter 2 is to describe all well-known and already existing in the theory methods that attempted to solve the aforementioned problems, along with their shortcomings and disadvantages. Chapter 3 proposes STEPWISE algorithm and introduces the model setup and its detailed description, followed by its theoretical properties and proof of theorems and lemmas used throughout the thesis. Additional lemmas used to construct the theory of the STEPWISE method are also stated. Later it presents results obtained from various numerical studies such as simulations and real data analysis. Simulation studies comprise seven examples and are aimed to compare STEPWISE algorithm to other competing methods, and provide numerical evidence of its superiority. Real data analysis involves studies of gene regulation in the mammalian eye, esophageal squamous cell carcinoma, and neurobehavioral impairment from total sleep deprivation, and demonstrates the utility of the proposed method in real life scenarios. Chapter 4 proposes a multi-stage hybrid machine learning ensemble method that is aimed to enhance STEPWISE’s performance. It also introduces a web application that employs the method. Finally, Chapter 5 completes the thesis with final conclusion and discussions. Appendices include some tables and figures used throughout the thesis. ACKNOWLEDGEMENTS First and foremost, it is my genuine pleasure to express my sincere gratitude and the deepest appreciation to my academic advisor at Michigan State University, Dr. Hyokyoung Hong. This dis- sertation would not have been possible without her dedication, advice, continuous encouragement, priceless support, and persistent help. I have been extremely fortunate to have Dr. Hong as my advisor as her insistence on perfection and careful supervision have given me a lot of confidence and inspiration to achieve my goals. I would like to thank our collaborators, Dr. Qi Zheng and Dr. Yi Li, for their significant contribution to the work presented in this thesis. My sincere thanks also go to my committee members, Dr. Taps Maiti, Dr. Lyudmila Sakhanenko, and Dr. Chenxi Li, for generously giving their time to offer me valuable and insightful comments toward improving my work. Finally, I would like to express special thanks to my friends and family for overwhelming love and support, for being by my side when I was going through hard times, and for celebrating each accomplishment. Through the struggles and challenges of writing this dissertation, you have been a constant source of joy and motivation. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Predictive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Gene Regulation in the Mammalian Eye . . . . . . . . . . . . . . . . . . . 4 1.1.2 An Esophageal Squamous Cell Carcinoma Study . . . . . . . . . . . . . . 6 1.1.3 Bladder Cancer Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.1.4 Neurobehavioral Impairment from Total Sleep Deprivation . . . . . . . . . 10 CHAPTER 2 LITERATURE REVIEW . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 12 2.1 Sequential Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 Feature Selection using BIC criteria . . . . . . . . . . . . . . . . . . . . . 14 2.1.2 Forward Regression with High-Dimensional Predictors . . . . . . . . . . . 17 2.1.3 A Stepwise Regression Algorithm for High-Dimensional Variable Selection 18 2.1.4 Generalized Linear Models with High-Dimensional Predictors via For- ward Regression: offset approach . . . . . . . . . . . . . . . . . . . . . . 20 2.1.5 Cox Models with High-Dimensional Predictors via Forward Regression . . 22 2.1.6 A Stepwise Regression Method and Consistent Model Selection for High- Dimensional Sparse Linear Models . . . . . . . . . . . . . . . . . . . . . . 24 2.2 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 CHAPTER 3 STEPWISE METHOD: THEORY AND APPLICATIONS . . . . . . . . . . 30 3.1 Model Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Theoretical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 Proof of the Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.4 Additional Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.6 Applications: Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.6.1 A Study of Gene Regulation in the Mammalian Eye . . . . . . . . . . . . . 66 3.6.2 An Esophageal Squamous Cell Carcinoma Study . . . . . . . . . . . . . . 68 3.6.3 Neurobehavioral Impairment from Total Sleep Deprivation . . . . . . . . . 73 CHAPTER 4 MULTI-STAGE HYBRID MACHINE LEARNING METHOD . . . . . . . 76 4.1 Machine Learning Ensemble Methods: Categories and Types . . . . . . . . . . . . 76 4.2 A Review on Existing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.2.1 Random Forest (RF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.2.2 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . . . . . . 79 4.2.3 Gradient Boosting Machine (GBM) . . . . . . . . . . . . . . . . . . . . . 80 v 4.2.4 Artificial Neural Network (ANN) . . . . . . . . . . . . . . . . . . . . . . . 81 4.2.5 Least Absolute Shrinkage and Selection Operator (LASSO) . . . . . . . . . 83 4.2.6 STEPWISE Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3 Multi-Stage Hybrid Machine Learning Method . . . . . . . . . . . . . . . . . . . 85 4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.4 Application: Bladder Cancer Prediction . . . . . . . . . . . . . . . . . . . . . . . 88 4.4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.5 Web Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 CHAPTER 5 CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 APPENDIX A SUPPLEMENT MATERIALS . . . . . . . . . . . . . . . . . . . . 100 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 106 vi LIST OF TABLES Table 3.1: The values of 𝜂1 and 𝜂2 used in the simulation studies . . . . . . . . . . . . . . . 52 Table 3.2: Normal model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Table 3.3: Binomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Table 3.4: Poisson model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Table 3.5: Comparisons of MSPE between competing methods using the mammalian eye data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 3.6: Comparisons of competing methods over 100 independent splits of the ESCC data into training and testing sets . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Table 3.7: Comparisons of competing methods over 100 independent splits of the Total Sleep Deprivation data into training and testing sets . . . . . . . . . . . . . . . . 75 Table 4.1: Results of the 5-fold cross-validation procedure for the STEPWISE method . . . 90 Table 4.2: Assessment of the proposed STEPWISE procedure using the bladder cancer data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Table 4.3: Comparison of base-learner methods included in the multi-stage hybrid ma- chine learning model over 100 independent splits of the bladder cancer data into training and testing sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Table 4.4: Evaluation of the proposed multi-stage hybrid machine learning model with the bladder cancer data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Table 4.5: Comparison of various model configurations included in the sensitivity analysis . 93 Table A.1: Comparison of genes selected by each competing method from the mammalian eye data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Table A.2: Selected miRNAs for ESCC training dataset . . . . . . . . . . . . . . . . . . . . 101 Table A.3: Clinicopathologic characteristics of participants in bladder cancer study . . . . . 101 Table A.4: Clinicopathologic characteristics of study participants of the ESCC data set . . . 102 vii LIST OF FIGURES Figure 3.1: Box plot of model sizes for each method over 120 different training samples from the mammalian eye data set. STEPWISE was performed with 𝜂1 = 1 and 𝜂2 = 4, and FR and SC were conducted with 𝛾 = 1. . . . . . . . . . . . . . 67 Figure 3.2: Box plot of model sizes for each method based on 100 ESCC training datasets. Performance of STEPWISE is reported with 𝜂1 = 0 and 𝜂2 = 3.5. Performance of SC and FR are reported with 𝛾 = 0. . . . . . . . . . . . . . . . . . . . . . . 71 Figure 3.3: Comparisons of ROC curves for the selected models in the ESCC data set by the sequentially selected order. Model 1 includes Age and Gender feature, and the following features are sequnatially added to the model: miR-4783-3p, miR-320b, miR-1225-3p, miR-6789-5p. . . . . . . . . . . . . . . . . . . . . . . 72 Figure 3.4: Box plot of model sizes for each method based on 100 total sleep deprivation training datasets. Performance of STEPWISE is reported with 𝜂1 = 0.5 and 𝜂2 = 3. Performance of SC and FR are reported with 𝛾 = 0.5 . . . . . . . . . . 74 Figure A.1: R-Shiny Web Application for solving classification problems. The plot illus- trates uploading and splitting a dataset into training and testing sets. . . . . . . . 103 Figure A.2: R-Shiny Web Application for solving classification problems. 
The plot depicts a tuning parameters step for a Random Forest method. . . . . . . . . . . . . . . 104 Figure A.3: R-Shiny Web Application for solving classification problems. The plot depicts an output of the final predictive model developed by the web application. . . . . 105 viii LIST OF ALGORITHMS Algorithm 1 MULTI-STAGE HYBRID MACHINE LEARNING METHOD . . . . . . . . . 86 ix CHAPTER 1 INTRODUCTION 1.1 Predictive Models In biomedical research and clinical studies predictive model are utilized for several purposes such as risk management and prognosis. Consequently, the reliability of clinical data is directly related to the quality of predictive analysis. Nowadays, electronic health records became more available and contain rich information, which enables researchers to develop and deploy highly efficient clin- ical predictive methods. These methods have potential to be key components in making decisions related to patient treatments, drug development, and so on. Over the last decade, the technological advances and explosion of information profounded the un- derstanding of the molecular basis of tumor progression and identified numerous tumor biomarkers [7]. A certain type of biomarkers, which posses predictive power, are capable of assessing the benefit from clinical interventions and has a significant impact on clinical research. For instance, a cancer screening biomarker is a prognostic biomarker that can be used to predict the development of symptomatic cancer even in asymptomatic persons. In practice, such screening biomarkers can be used as a cancer prediction model [8]. The main purpose of building these models is discovering new cancer screening biomarkers and assessing their effect on the disease. Insights and information obtained from these models can potentially lead to early detection of disease in patients, early intervention and prevention from its further development. Further, these predictive models can be feasibly used as cancer screening tests for patients, including ones with no symptoms [9]. Technological advances have also made possible detailed genetic characterization of biological specimens. High-throughput genomic technologies, including gene expression microarray, mi- croRNA (micro Ribonucleic acid) array, RNA-seq, ChIP-seq (chromatin immunoprecipitation se- quencing), and whole genome sequencing, have become powerful tools and dramatically changed 1 the landscape of biological research. For instance, a gene expression profile can be extracted for a specimen by simultaneously evaluating expression levels of thousands of genes on that single specimen using complementary DNA (cDNA) microarray technology [10]. Single nucleotide polymorphisms (SNPs) are one form of natural sequence variation common to all genomes [11]. These SNPs are highly abundant, and are estimated to occur at 1 out of every 1,000 bases in the human genome [12, 13]. SNPs are particularly useful as DNA markers for mapping susceptibility genes for complex diseases and population genetics since they demonstrate the high density and mutational stability [14, 15]. SNPs in the coding regions of genes that alter the function or structure of the encoded proteins can be a necessary and sufficient cause of most of the known recessively or dominantly inherited mono- genic disorders [16], and are analyzed for diagnostic purposes. Moreover, SNPs can be analysed to assess the risk of an individual for a particular disease. 
For instance, the identification of SNPs made possible to screen somatic (non-tumor) DNA for mutations that alter treatment response or predispose to cancer [7]. In addition, a large number of profiles based on the abundance of micro-RNAs (miRNAs) have been used to predict prognosis or treatment response in cancer [7]. For example, Genome-wide associa- tion (GWA) have identified cancer-causing mutations in breast [17] and colon [18] tumors, somatic genetic screens can also identify predictors of radiation sensitivity [19] and the pharmacodynamics of anticancer drugs [20]. Moreover, sets of genes identified through mRNA profiling have been used to classify tumors into oncogenic subtypes of breast cancer [21, 22]; many individual miRNAs have been associated with patient survival and drug treatment response in a number of different cancers [23, 24]. Clearly, the enormous interest in genomic data is determined by the hope of finding candidate biomarkers and using them to identify genes that predispose individuals to common diseases. Al- though genome data analysis has already made a significant impact on biological and biomedical research, it is still accompanied by certain challenges that yet have to be overcome. Specifi- cally, complex genomic data introduce substantial challenges for statistical data analysis as its 2 high-dimensionality makes the classical statistical model framework no longer implementable. As opposed to low-dimensional data when the number of observations is greater than the number of explanatory features (also known as predictors), high- and ultrahigh-dimensional settings are comprised of data in which number of predictors is greater than or is in the exponential order of the sample size, respectively. Most of the traditional statistical methods are developed around the concept of low-dimensional data and are not aimed to accommodate high- or ultrahigh-dimensional data. Thus, high- dimensionality has significantly challenged traditional statistical theory. Applying these meth- ods to high-dimensional data leads to unstable, unreliable, biased, and inconsistent results which demolishes the main purpose of predictive model development. The problems that arise while analyzing such data are typically referred to as the ’curse of dimensionality’, a term introduced by mathematician Richard Bellman. Some aspects of it are discussed further. First, classical statistical models applied to high-dimensional data have no unique solution for their parameters. In fact, these models will have infinitely many solutions. This is mainly induced by ill-defined, uninvertible, and singular matrices involved in the computation of parameter estimates, making the estimation process ill-posed. These models are also know as unidentifiable models. Consequently, effect size estimation in predictive models will become meaningless. Second, as the number of predictors increases and surpasses the number of observations in the model, variances of the parameter estimates will become large (even infinitely large in some cases), resulting in inflated standard errors. In other words, a wide range of values of parameter estimates will be consistent with data, making the confidence intervals uncommonly wide. Hence, validating a significance of the predictors included in the predictive model will be nearly impossible. 
Third, employment of classical statistical models in high-dimensional settings can provide incon- sistent estimates as a small corruption of data can result in very different estimated parameters. Furthermore, these models tend to capture the artificial trends of measurement noise, also know as, overfitting. Overfitted models fit training data too closely and normally capture trends in data that are applicable to this particular data set only. This decays their ability to generalized results with 3 new unseen data and results in poor predictive capability. Lastly, using classical statistical methods with high-dimensional data often introduces multi- collinearity issues that violate the underlying assumption of independent predictors in the model. Multicollinearity implies the existence of highly correlated predictors among predictor features. These structures are commonly observed in genomic data. Multicollinearity can create inaccurate estimates of the model parameters; make insignificant predictors significant and vice versa, that is, imposing false positives and false negatives in the predictive model; and, finally, it can degrade the predictability of the model. As it was shown, the traditional methods that perform well in low-dimensional settings run into severe problems in analyzing high- or ultrahigh-dimensional data. They cannot cope with the explosive growth of dimensionality of data. Therefore, in order to face the problem of high- dimensionality, we must reshape the classical statistical thinking. These problems create signifi- cant challenges, but, on the other hand, they create great opportunities for the development of new statistical methodologies. It is worth mentioning that developing predictive models along with feature selection and estima- tion play crucial and fundamental role in knowledge discovery. As more amount of massive and complex data become available, there is no doubt that high-dimensional data analysis will be one of the most important and demanding research topics in our field. In this thesis we propose a new method (introduced and described in Chapter 3) for model selection and estimation that will overcome aforementioned limitations in high-dimensional problems. The remaining sections of this chapter discuss a few problems from various research areas that will illustrate challenges of high-dimensional data and to which the proposed method could be applied. 1.1.1 Gene Regulation in the Mammalian Eye Human genetics has sparked a revolution in medical research on the basis of the seemingly unthink- able notion that one can systematically discover the genes causing inherited diseases without any prior biological knowledge as to how they function [25]. Most characteristics of medical pertinence 4 do not follow simple Mendelian monogenic inheritance. Such complex traits include vulnerability to heart disease, hypertension, diabetes, cancer, and infection. The genetic dissection of complex traits is attracting many investigators with the promise of solving old problems and is generating a variety of analytical methods. Recent advancement in microarray technology and bioinformatics made it possible to examine the expression of numerous genes in a large number of individuals and enabled researcher to identify genetic elements that cause the gene expression to vary among individuals [26, 27, 28, 29]. Dis- covering specific disease mechanics is a big challenge that biomedical researchers face nowadays. 
These mechanics might potentially underlie heritable disorders that reveal complex inheritance, for instance, Mendelian disorders [25, 30, 31]. In addition, these approaches can help identify genes related to development of Mendelian forms of complex diseases such as obesity [32, 33, 34], macular disease [35, 36, 37], hypertension [38], and glaucoma [39, 40]. Mutations that alter gene expression might play a significant role in complex disease. Transgenic animal studies revealed that gene dosage of mutant genes can have a keen effect on phenotype [41]. It was shown that the cause of disease can become an improper regulation of structurally normal genes and alterations in gene dosage [39]. For example, overexpression and haploinsufficiency of the FOXC1 gene can lead to developmental defects of the anterior chamber of the eye [39]. Scheetz et al. [41] used expression quantitative trait locus mapping in the laboratory rat to gain a broad perspective of gene regulation in the mammalian eye and to identify genetic variation relevant to human eye disease. They analyzed data obtained from Rat Genome Database by using analy- sis of variance (ANOVA) technique and identified significant genes based on their corresponding p-values. Certainly, Scheetz’s results provide meaningful insights on how genetic variation can be associated with specific diseases, but they do not estimate the magnitude of effects these genes are having on the disease. In addition, they have not built a predictive model that will enable researchers to link genes and assess their contribution toward developing diseases, and have not evaluate its predictive power. We adopted their data and aimed to improve results achieved by Scheetz et al. [41]. Data contained 120 observation profiling 31042 probes of genes, but due to a 5 small variation in many of these probes, the number of probes was reduced to 5000. A gene TRIM32 that has been found to cause Bardet-Biedl syndrome [42] was treated a response variable, and the expression of 5000 genes as the predictors. Our predictive model has identified three probes of genes (1376747_at, 1381902_at, 1382673_at) that can be potentially linked to TRIM32. We achieved a high accuracy with the mean squared prediction error (MSPE) as low as 0.0012. Detailed description of the results can be found in Chapter 3. 1.1.2 An Esophageal Squamous Cell Carcinoma Study Esophageal cancer is the 7th most common cancer among males and among both sexes combined in the world and ranks 6th in terms of mortality overall because of the poor survival rate it confers [43, 44]. Additionaly, incidence and mortality rates in males are 2- to 3-fold higher than the rates in females [43]. Compared with more developed geographic regions, overall incidence rates are 2-fold higher in less-developed countries, with the highest rates occurring in Asia [43]. Esophageal squamous cell carcinoma (ESCC) is the predominant histologic type with the highest incidence rate in populations within Southeastern and Central Asia [44]. There are two major histological types of esophageal carcinoma: esophageal squamous cell carcinoma (ESCC) and adenocarcinoma [45]. ESCC is the major type in China, where it accounts for more than 90% of cases of esophageal carcinoma; whereas adenocarcinoma is more common in the United States and in European countries [46]. ESCC is often diagnosed at a locally advanced stage and the outcomes for affected patients are poor [45]. 
With various treatment methods employed in clinical practice after extensive research, the diagno- sis and treatment of ESCC have been greatly improved [47]. Esophagectomy, chemotherapy, and radiotherapy are currently the main treatments for ESCC [45]. However, the prognosis remains poor, with 5-year survival proportions of 21% and 14% (2005–2011) in the United States for whites and blacks, respectively, and 12% (2000–2007) in Europe [44], which is far below the estimated effectiveness of the therapy [47]. An accurate clinical staging and prognostic information is essential to direct appropriate treatment 6 strategies [45]. Accumulating evidence suggests that the prognosis is affected by several factors, including the delayed diagnosis, high recurrence, and metastasis rate [47, 48]. Thus, identifying the diagnostic and prognostic tumor markers and further elucidating their clinical implications are urgently needed. To develop new diagnostic methods and treatment strategies, investigators have focused on the potential of a particular class of microRNAs (miRNAs) to provide additional information about the characteristics and survival prospects of patients with ESCC. miRNAs are small (22-24 nucleotides), noncoding RNA molecules that play important roles in regulating cell differentiation, proliferation, migration and apoptosis [44]. Altered miRNA expres- sion in cancer tissue has been reported in most tumor types [49, 50]. There is increasing evidence that miRNA expression in cancer tissue is a useful prognostic marker [51, 52, 53]. In addition, the application of miRNA expression levels as a blood biomarker has been explored in various types of cancer, including gastric, hepatocellular, and non-small cell lung cancer [54, 55, 56]. However, whether miRNA levels in plasma are a useful biomarker for patients with ESCC remains largely unexplored [45]. Sudo et al. [57] explored ways of developing a detection model for ESCC based on large-scale miRNA profiling. For these purposes, they analyzed data submitted to the National Center for Biotechnology Gene Expression Omnibus (NCBI GEO) database, available under accession num- ber GSE122497. To establish a diagnostic model, they developed a model based on the observations obtained from 566 patients (283 with ESCC and 283 healthy controls) profiling 2565 miRNAs by carrying out Fisher’s linear discriminant analysis with a greedy algorithm. Although their model has achieved high predictive accuracy, it has some drawbacks. Given the nature of the algorithm that has been employed, they developed a predictive model for the pre- determined model size: they built models with model sizes ranging 2-8 and selected the one that achieved higher accuracy with fewer variables (model size = 6). The disadvantage of this method is that it might lead to false negatives and false positive in the final model. Moreover, the importance of the miRNAs included in the model will also be determined by the model size. This might lead to a wrong assessment of the effect sizes identified in the model. 7 We adopted this dataset and demonstrated the utility of our proposed method (introduced and de- scribed in Chapter 3) and its superiority over other methods. Our model achieved similar accuracy by recruiting fewer variables (3 miRNAs were selected: miR - 4783 - 3p, miR - 320b, miR - 1225 - 3p). It is worth mentioning that our model overcomes the issues associated with the model introduced by Sudo et al. 
as our model size was defined by scanning the entire feature space and selecting features based on their significance. Detailed description of our results and methodology is presented in Chapter 3. 1.1.3 Bladder Cancer Study Bladder cancer is any of several types of cancer arising from the tissues of the urinary bladder and has high prevalence and recurrence rates [58, 59, 60]. According to American Society of Clinical Oncology, among men bladder cancer is the fourth most common cancer and men are 4 times more likely to be diagnosed with the disease. In addition, incidence in white men is twice more than that in black men. The earlier bladder cancer is found, the better the chance for successful treatment and cure. Prog- nosis varies inversely with higher tumor stage and lymph node involvement [61]. Typically, the 5- and 10-year survival rates for patients with lymph node involvement are 31% and 23%, respectively [62]. Combination platinum-based chemotherapy is an potion for patients with metastatic disease, but the survival is only 15 months, with a 5-year survival rate of 15% [63]. Since there is not yet an accurate test to screen the general population for bladder cancer, most people are diagnosed with bladder cancer once they have developed symptoms. As a result, some people have more advanced (later stage) disease when the cancer is found. In the year 2000, the total expenditure for lower tract urothelial cancers in the United States sur- passed one billion dollars [64]. Bladder cancer affected about 1.6 million people globally in 2020 with 549,000 new cases and 200,000 deaths, and the late stage disease is associated with poor survival. The cost of bladder cancer per patient from diagnosis to death is the highest of all cancers [65]. 8 Identifying the related biomarkers and predicting the disease at its early stage is crucial for better prognosis. Discovery of diagnostic, prognostic, and predictive biomarkers in bladder cancer made molecular markers an area of research. Potential biomarkers include the overexpression of mu- tated genes, whole genome-wide array signatures, and microRNAs. For instance, microarray gene expression profiling is studied in the blood of cancer patients in order to detect gene expression patterns representing the cancer itself or a host’s reaction to the tumor [66, 67]. Recent studies have suggested that 70% of bladder cancer involve a specific mutation in genes [68], therefore it can be potentially used as a biomarker in early detections of the disease and preventing it from further development. Gene changes can also assist doctors in choosing the best treatment possible or be useful in finding bladder cancers that can potentially come back after treatment. Although recent progress made by scientists is significant, classification of bladder cancer patients using gene expression data with regular statistical tools can become complicated and sometimes be even impossible due to incapability of these methods to process data with a large scale, also known as, high-dimensional data. Usuba et al. [69] attempted to develop a predictive model for an early detection in bladder cancer. They applied similar technique as described in Sudo et al. [57] via Fisher’s linear discriminant analysis and recruited seven miRNAs in their final model. They achieved high accuracy in predic- tion, but their model suffered from the same issues mentioned in Sudo et al. [57]. 
Specifically, the method they employed might lead to false negatives and false positives in the final model, and won’t be able to assess the effect sizes correctly due to pre-determined model size. We aimed to improve given results and adopted data utilized in Usuba’s model. These data were submitted to the NCBI GEO under accession number GSE113486. The predictive model was built based on observations obtained from 768 patients (310 patients with bladder cancer and 468 healthy controls) profiling 2565 miRNAs. We demonstrated that our proposed multi-stage hybrid machine learning method (introduced and described in Chapter 4) has achieved high prediction accuracy with sensitivity, specificity, and area under the receiver operating curve (AUC) of 0.98, 0.98, and 0.99, respectively, and outperformed 9 Usuba’s model. Detailed description of our results can be found in Chapter 4. 1.1.4 Neurobehavioral Impairment from Total Sleep Deprivation Sleep plays a key role in health, performance, and cognition [70]. Sleep deprivation is common- place in modern society and its effects on neurobehavioral function (e.g., vigilance and cognition) are well studied and documented. Sleep deprivation can induce giddiness, child-like behaviors, and silliness [71], as well as more widely recognized negative effects including dysphoria, increased irritability, and lowered frustration tolerance. The increased irritability that often accompanies sleep deprivation hints that sleep-deprived in- dividuals are highly reactive to emotional signals. These effects on mood can lead to negative consequences and impact functioning abilities [72]. For example, sleep duration is inversely asso- ciated with interpersonal difficulties and even violence has been observed in medical residents [73], and sleeping less than 8 hours is associated with increased risk for adolescent suicidal behavior [74]. Interestingly, alertness and vigilance also appear to be the cognitive capacities most consistently and dramatically impacted by insufficient sleep [75]. When the envelope of continuous wakeful- ness is pushed beyond about 16 hours, most individuals begin to show a substantial slowing of reaction time (RT) and worsening of performance accuracy on tests of psychomotor vigilance [76]. Moreover, neurobehavioral tests have revealed assorted forms of performance deficits from sleep loss, including impairment of learning and of responses to feedback in decision making [77]. The Psychomotor Vigilance Test (PVT) is one of the most commonly applied neurobehavioral assays of performance impairment due to sleep loss [75]. This test assays stimulus-response time, with failure to respond within 500 ms recorded as a lapse. Sleep deprivation is associated with increased variability in stimulus-response times, and more lapses, on the PVT [78]. Besides neurobehavioral testing, efforts have been made to identify molecular biomarkers such as differentially expressed genes or metabolites affected by sleep loss [79, 80]. A biomarker has been defined as “a characteristic that is objectively measured and evaluated as an indicator of normal 10 biological processes, pathogenic processes, or pharmacologic responses to a therapeutic interven- tion” [81]. Many biomarkers such as differentially expressed genes can provide meaningful insights including identification of a process or response. 
Humans are known to differ in their sensitivity to sleep loss [82], and recent work has sought to identify biomarkers distinguishing individuals as susceptible or resistant to sleep deprivation [83]. Yet surprisingly little effort has been made to research molecular biomarkers with results from neurobehavioral assays. Microarrays and bioinformatics analyses can be employed to explore candidate gene expression biomarkers associated with total sleep deprivation (TSD), and more specifically, the phenotype of neurobehavioral impairment from TSD. Uyhelji et al. [70] explored gene expression biomarker candidates for neurobehavioral impairment from total sleep deprivation. They employed Weighted Gene Co-expression Network Analysis (WGCNA) using data obtained from the NCBI GEO under accession number GSE98582. Data contain 555 samples profiling 8284 gene features. In the treatment effect analysis, they identified 212 genes that exhibited a significant difference between TSD and control group, and 91 of them passed human blood biomarker filter. Although Uyhelji et al. have done a great job in identifying important gene biomarkers associated with TSD, effect sizes of these genes have not been estimated. Moreover, neither predictive power of the diagnostic model was assessed. Thus, we employed our proposed method (discussed in Chapter 3) to overcome the issues introduced in Uyhelji’s model. We have built a model based on 389 observations profiling 8284 gene features. Our model recruited five genes (PF4V1, USP32P1, EMR1, NBR2, and DUSP23) and achieved high accuracy with sensitivity, specificity, and AUC of 0.99, 0.97, and 0.99, respectively. 11 CHAPTER 2 LITERATURE REVIEW When the number of predictors is moderate, penalized regression approaches such as least absolute shrinkage and selection operator (LASSO) by Tibshirani [84] have been used to construct predictive models and explain the impacts of the selected predictors. LASSO minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. However, in ultrahigh-dimensional settings where a number of predictors 𝑝 is in the exponential order of the sample size 𝑛, penalized methods may incur computational challenges [85], may not reach globally optimal solutions, and often generate biased estimates [86]. Sure independence screening (SIS) proposed by Fan and Lv [87] has emerged as a powerful tool for modeling ultrahigh dimensional data. This method is based on correlation learning, which filters out the features that have weak correlation with the response and reduces dimensionality from high to a moderate that is below the sample size. Specifically, such correlation learning ranks the importance of features according to their marginal correlation with the response variable and eliminates the ones with weak marginal correlations. However, the method relies on a partial faithfulness assumption, which stipulates that jointly im- portant variables must be marginally important, an assumption that may not be always realistic. To relieve this condition, some iterative procedures, such as ISIS [87], have been adopted to repeatedly screen variables based on the residuals from the previous iterations, but with heavy computation and unclear theoretical properties. Conditional screening approaches (see, e.g. [88]) have, to some extent, addressed the challenge. 
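To make the marginal screening idea just described concrete, here is a minimal Python sketch of SIS-style correlation ranking; it is an illustration only, not the implementation used in the cited work, and the helper name sis_screen, the n/log(n) retention size (a common rule of thumb), and the toy data are my own assumptions.

```python
import numpy as np

def sis_screen(X, y, keep=None):
    """Marginal (sure independence) screening: rank predictors by absolute
    sample correlation with the response and retain only the strongest ones.
    The default retained-set size follows the common n/log(n) rule of thumb."""
    n, p = X.shape
    keep = keep or int(n / np.log(n))
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
    yc = (y - y.mean()) / y.std()               # standardized response
    corr = np.abs(Xc.T @ yc) / n                # marginal correlations
    return np.argsort(corr)[::-1][:keep]        # indices of top predictors

# Toy example: p = 2000 predictors, n = 100 observations, three true signals.
rng = np.random.default_rng(5)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 3 * X[:, 10] - 2 * X[:, 500] + 1.5 * X[:, 1500] + rng.standard_normal(n)
kept = sis_screen(X, y)
print(len(kept), "predictors retained;",
      {10, 500, 1500} <= set(kept.tolist()))   # were the signals retained?
```

After screening, a moderate number of predictors remains, and a conventional selection or penalization step can be applied to them.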
However, screening methods do not directly generate a final model, and post-screening regularization methods, such as LASSO, are recommended by Fan and Lv [87] to produce a final model. Closely related to forward selection is least angle regression (LARS) by Efron et al. [89], a widely 12 used model selection algorithm for high-dimensional models. In the LARS method, a multivariate solution path is defined by using the geometrical theory of the linear regression model. The result- ing method defines a continuous solution path for Generalized Linear Models (GLMs), with on the extreme of the path the maximum likelihood estimate of the coefficient vector and on the other side the intercept-only estimate. The LARS method is based on a recursive procedure selecting, at each step, the covariates having largest absolute correlation with the response feature [89]. It is worth mentioning that a simple modification of the LARS algorithm implements the LASSO method and calculates all possible LASSO estimates for a given problem. In addition, an approx- imation for the degrees of freedom of a LARS estimate is available, from which Mallows’s 𝐶 𝑝 estimate of prediction error can be derived; this allows a principled choice among the range of possible LARS estimates. Though LARS achieves impressive results in its performance, some researches raised concerns in the following regards. Ishwaran (see discussion section in Efron et al. [89]) suggests that the use of 𝐶 𝑝 coupled with LARS forward optimization procedure might raise some potential flags. Specifically, the use of 𝐶 𝑝 will encourage large models in LARS, especially in high-dimensional orthogonal problems, and will have a negative impact on variable selection performance. The claim was supported with the high-dimensional simulation examples. Moreover, Weisberg (see discussion section in Efron et al. [89]) believes that multicollinearity problem among independent features and presence of noise in the dependent variable will affect the performance of LARS in regards of variable selection, specifically, reducing chances of selecting significant variables in the model. Examples supporting the claim were provided. In the GLM setting, Augugliaro et al. [90] and Pazira et al. [91] proposed differential geometrical LARS (dgLARS) based on a differential geometrical extension of LARS. The dgLARS estimator follows naturally from a differential geometric interpretation of a GLM, generalizing the LARS method. The subsequent section discusses sequential model selection techniques known to the literature. Sequential model selection assumes including features into the final model sequentially with the 13 entry order determined by their relative importance based on certain criteria. Although the methods described in this section have achieved significant results and have been implemented in different scenarios, lack of certain properties make them less reliable and more vulnerable against challenges introduced by the size and volume of data available nowadays. 2.1 Sequential Model Selection For generating a final predictive model in ultrahigh-dimensional settings, recent years have seen a surging interest of performing forward regression, an old technique for model selection that has been widely used for model building when the number of covariates is relatively low. But due to its complicated computations and unknown theoretical properties, forward regression technique is rarely used in high-dimensional settings. 
Under some regularity conditions and with proper stopping criteria, forward regression can achieve screening consistency, selecting variables sequentially according to metrics such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or $R^2$. The methods listed below attempt to overcome the limitations of forward regression and adapt it to models with high-dimensional predictors.

2.1.1 Feature Selection using BIC criteria

Variable selection with ultrahigh-dimensional predictors is a problem of fundamental importance. The traditional method of best subset selection is computationally infeasible for high-dimensional data, and as a result various shrinkage methods have gained popularity. These methods can be formulated as penalized optimization problems and can be selection consistent when the sample size is much larger than the predictor dimension. However, when the predictor dimension is much larger than the sample size, the story changes drastically. One frequently used assumption is the so-called sparsity condition, which assumes that the effective contribution to the dependent variable rests on a much smaller number of regressors than the sample size. The challenge then is to find those "true" regressors among a much larger number of candidate variables. This has led to a surging interest in new methods and theory for regression model selection with $p \gg n$.

An et al. [92] revisited the classical forward and backward stepwise regression methods for model selection and adapted them to the case where the number of candidate variables $p$ exceeds the number of observations $n$. In the noiseless case, they gave definite upper bounds on the number of forward search steps needed to recover all relevant variables, provided each step of the forward search is approximately optimal, up to a fraction, in reducing the residual sum of squares. In the presence of noise, they proposed two information criteria, BICP (BIC modified for a large number of predictors) and BICC (BIC with an added constant), that overcome the difficulties of employing the regular BIC and AIC. These criteria serve as stopping rules in the stepwise search: BICP increases the penalty to prevent overfitting, while BICC controls the residuals in the sense that it stops the search before the residuals diminish to 0 as the number of selected variables increases to $n$. In addition, they proved that BICP stops the forward search as soon as all relevant variables are recovered and that all extra variables are removed in the backward deletion, which leads to selection consistency of the estimated models.

The algorithm can be summarized as follows. Consider a linear regression model
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad (2.1)$$
where $\mathbf{y} = (y_1, \ldots, y_n)^{\mathrm{T}}$ is an $n$-vector of random responses, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_p)$ is an $n \times p$ design matrix, $\boldsymbol{\beta}$ is a $p$-vector of regression coefficients, and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^{\mathrm{T}} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, where $\sigma^2 > 0$ is an unknown but fixed constant and $\mathbf{I}$ denotes an identity matrix. Let $I_n = \{1 \le i \le p : \beta_{n,i} \ne 0\}$ and let $d_n = |I_n|$ denote the number of elements in $I_n$. Further, for any subset $J \subset \{1, \ldots, p\}$, let $\mathbf{X}_J$ denote the $n \times |J|$ matrix consisting of the columns of $\mathbf{X}$ corresponding to the indices in $J$, and let $\boldsymbol{\beta}_J$ be the $|J|$-vector consisting of the components of $\boldsymbol{\beta}$ corresponding to the indices in $J$.
Put
$$\mathbf{P}_J = \mathbf{X}_J(\mathbf{X}_J^{\mathrm{T}}\mathbf{X}_J)^{-}\mathbf{X}_J^{\mathrm{T}}, \qquad \mathbf{P}_J^{\perp} = \mathbf{I}_n - \mathbf{P}_J, \qquad L_{u,v}(J) = u^{\mathrm{T}}\mathbf{P}_J^{\perp}v, \quad u, v \in \mathbb{R}^n, \qquad (2.2)$$
where $\mathbf{P}_J$ is the projection matrix onto the linear space spanned by the columns of $\mathbf{X}_J$, and $L_{y,y}(J)$ is the sum of squared residuals resulting from the least squares fit $\hat{\mathbf{y}} = \mathbf{X}_J\hat{\boldsymbol{\beta}}_J = \mathbf{P}_J\mathbf{y}$. The algorithm combines standard stepwise addition and deletion with adjusted information criteria and can be described in the following steps.

Stage I - Forward Addition:
1. Let $J_1 = \{j_1\}$, where $j_1 = \arg\min_{1 \le i \le p} L_{y,y}(\{i\})$. Put $\mathrm{BICP}_1 = \log\{L_{y,y}(J_1)/n\} + 2\log p/n$.
2. Continue with $k = 1, 2, 3, \ldots$, provided $\mathrm{BICP}_k < \mathrm{BICP}_{k-1}$, where $\mathrm{BICP}_k = \log\{L_{y,y}(J_k)/n\} + 2k\log p/n$ and $J_k = J_{k-1} \cup \{j_k\}$.
3. When $\mathrm{BICP}_k \ge \mathrm{BICP}_{k-1}$, let $\tilde{k} = k - 1$ and $\hat{I}_{n,1} = J_{\tilde{k}}$.

Stage II - Backward Deletion:
1. Let $\mathrm{BICP}^*_{\tilde{k}} = \mathrm{BICP}_{\tilde{k}}$ and $J^*_{\tilde{k}} = \hat{I}_{n,1}$.
2. Continue with $k = \tilde{k} - 1, \tilde{k} - 2, \ldots$, provided $\mathrm{BICP}^*_k \le \mathrm{BICP}^*_{k+1}$, where $\mathrm{BICP}^*_k = \log\{L_{y,y}(J^*_k)/n\} + 2k\log p/n$ and $J^*_k = J^*_{k+1} \setminus \{j_k\}$.
3. When $\mathrm{BICP}^*_k > \mathrm{BICP}^*_{k+1}$, let $\tilde{k} = k + 1$ and $\hat{I}_{n,2} = J^*_{\tilde{k}}$.

The drawback of the method is that it is unclear whether its results carry over to high-dimensional GLMs.
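To make the two-stage search concrete, the following is a minimal Python sketch of forward addition with the BICP stop followed by backward deletion. It is an illustration under stated assumptions, not An et al.'s implementation: the function names, the toy data, and the simplified backward rule (re-examining all current variables each pass) are my own choices.

```python
import numpy as np

def rss(y, X, J):
    """Residual sum of squares L_{y,y}(J) from least squares on the columns in J."""
    if not J:
        return float(y @ y)
    XJ = X[:, sorted(J)]
    beta, *_ = np.linalg.lstsq(XJ, y, rcond=None)
    resid = y - XJ @ beta
    return float(resid @ resid)

def bicp(y, X, J):
    """BICP = log{L_{y,y}(J)/n} + 2|J| log(p)/n, as reconstructed above."""
    n, p = X.shape
    return np.log(rss(y, X, J) / n) + 2 * len(J) * np.log(p) / n

def stepwise_bicp(y, X):
    n, p = X.shape
    # Stage I: forward addition, stop when BICP no longer decreases.
    J, crit = set(), np.inf
    while len(J) < min(n - 1, p):
        cand = min((j for j in range(p) if j not in J),
                   key=lambda j: rss(y, X, J | {j}))
        new_crit = bicp(y, X, J | {cand})
        if new_crit >= crit:
            break
        J, crit = J | {cand}, new_crit
    # Stage II: backward deletion, drop variables while BICP does not increase.
    improved = True
    while improved and len(J) > 1:
        improved = False
        drop = min(J, key=lambda j: bicp(y, X, J - {j}))
        if bicp(y, X, J - {drop}) <= crit:
            J, crit = J - {drop}, bicp(y, X, J - {drop})
            improved = True
    return sorted(J)

# Toy example: n = 50, p = 200, true model uses the first three predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + 0.1 * rng.standard_normal(50)
print(stepwise_bicp(y, X))
```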
2.1.2 Forward Regression with High-Dimensional Predictors

Consider, for example, those useful methods with non-convex objective functions (e.g., bridge regression, the SCAD, etc.). With the predictor dimension much larger than the sample size, how to optimize such non-convex objective functions computationally remains a nontrivial task. Efficient algorithms (e.g., LARS) do exist for LASSO-type methods, where the objective functions are strictly convex; however, those methods are not selection consistent under a general design condition. Another reasonable solution is variable screening, such as the popular yet classical forward regression. Motivated by the SIS method [87], Wang [93] proposed a Forward Regression (FR) method for ultrahigh-dimensional variable selection. It was shown that the FR method can identify all relevant predictors consistently, even if the number of predictors is significantly larger than the sample size. In particular, FR is capable of discovering all relevant predictors within a finite number of steps, provided the dimension of the true model is finite. To select the final model from the set of candidate models, Wang [93] makes use of the BIC criterion introduced by Chen and Chen [94]. The resulting model can then serve as an excellent starting point from which many existing variable selection methods can be applied directly.

The FR algorithm can be summarized as follows. Suppose $(\mathbf{X}_i, Y_i)$ are observations from the $i$th subject ($1 \le i \le n$), $Y_i \in \mathbb{R}^1$ is the response variable, and $\mathbf{X}_i = (X_{i1}, \ldots, X_{id})^{\mathrm{T}} \in \mathbb{R}^d$ is an ultrahigh-dimensional predictor with $d \gg n$. The response and predictors are linked as $Y_i = \mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta} + \varepsilon_i$, where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)^{\mathrm{T}} \in \mathbb{R}^d$ and $\varepsilon_i$ is random noise. Let $\mathcal{M} = \{j_1, \ldots, j_{d^*}\}$ denote an arbitrary model with $X_{j_1}, \ldots, X_{j_{d^*}}$ as relevant predictors. The full model is defined as $\boldsymbol{\Gamma} = \{1, \ldots, d\}$ and the true model as $\boldsymbol{\tau} = \{j : \beta_j \ne 0\}$. Additionally, $\mathbf{X}_i^{(\mathcal{M})} = \{X_{ij} : j \in \mathcal{M}\}$ denotes the subvector of $\mathbf{X}_i$ corresponding to $\mathcal{M}$, $\mathbf{Y} = (Y_1, \ldots, Y_n)^{\mathrm{T}} \in \mathbb{R}^n$ is the response vector, and $\boldsymbol{\xi}_{\mathcal{M}} = (\mathbf{X}_1^{(\mathcal{M})}, \ldots, \mathbf{X}_n^{(\mathcal{M})})^{\mathrm{T}}$ is the sub-design matrix corresponding to $\mathcal{M}$. The FR algorithm is implemented in three major steps:

1. (Initialization) It starts by setting $S^{(0)} = \emptyset$.
2. (Forward Regression) At the $k$th step, $S^{(k-1)}$ is given. For every $j \in \boldsymbol{\Gamma} \setminus S^{(k-1)}$, it constructs a candidate model $\mathcal{M}_j^{(k-1)} = S^{(k-1)} \cup \{j\}$ and computes $\mathrm{RSS}_j^{(k-1)} = \mathbf{Y}^{\mathrm{T}}\{\mathbf{I} - \hat{\mathbf{H}}_j^{(k-1)}\}\mathbf{Y}$, where $\hat{\mathbf{H}}_j^{(k-1)} = \boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}\bigl(\boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}^{\mathrm{T}}\boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}\bigr)^{-1}\boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}^{\mathrm{T}}$ is a projection matrix. It finds $a_k = \arg\min_{j \in \boldsymbol{\Gamma} \setminus S^{(k-1)}} \mathrm{RSS}_j^{(k-1)}$ and updates $S^{(k)} = S^{(k-1)} \cup \{a_k\}$ accordingly.
3. (Solution Path) The FR algorithm iterates step 2 a total of $n$ times, generating $n$ nested candidate models, and collects them in a solution path $\mathbb{S} = \{S^{(k)} : 1 \le k \le n\}$ with $S^{(k)} = \{a_1, \ldots, a_k\}$.

The authors showed both theoretically and numerically that FR can discover all relevant predictors consistently, even if the predictor dimension is substantially larger than the sample size. However, the proposed method is limited to linear regression models in high-dimensional settings.
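The FR path, together with a BIC-type choice along it, can be sketched in a few lines of Python. This is only an illustration: the particular EBIC form, the gamma value, the ten-step cap, and the toy data are my own assumptions rather than the exact criterion and settings used by Wang [93].

```python
import numpy as np

def forward_regression_path(y, X, n_steps=None):
    """Forward regression solution path: at each step add the predictor that
    minimizes the residual sum of squares of the enlarged model."""
    n, d = X.shape
    n_steps = n_steps or min(n - 1, d)
    S, path = [], []
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(d):
            if j in S:
                continue
            XM = X[:, S + [j]]
            beta, *_ = np.linalg.lstsq(XM, y, rcond=None)
            r = float(np.sum((y - XM @ beta) ** 2))
            if r < best_rss:
                best_j, best_rss = j, r
        S = S + [best_j]
        path.append(list(S))
    return path

def ebic(y, X, S, gamma=1.0):
    """An extended-BIC-style score used here to pick a model on the path;
    this particular form is an illustration, not the thesis's exact formula."""
    n, d = X.shape
    XM = X[:, S]
    beta, *_ = np.linalg.lstsq(XM, y, rcond=None)
    r = float(np.sum((y - XM @ beta) ** 2))
    return n * np.log(r / n) + len(S) * (np.log(n) + 2 * gamma * np.log(d))

# Toy data: d >> n with three true signals.
rng = np.random.default_rng(1)
n, d = 60, 500
X = rng.standard_normal((n, d))
y = 2 * X[:, 5] - 1.5 * X[:, 17] + X[:, 42] + 0.2 * rng.standard_normal(n)
path = forward_regression_path(y, X, n_steps=10)
best = min(path, key=lambda S: ebic(y, X, S))
print("selected indices:", sorted(best))
```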
2.1.3 A Stepwise Regression Algorithm for High-Dimensional Variable Selection

Hwang et al. [95] proposed a stepwise regression algorithm with a simple stopping rule for identifying significant predictors and interactions among a huge number of variables in various statistical models. It improves on an earlier forward regression method, the paring-down variation (SPV) algorithm proposed by Hwang and Hu [96], which was limited to the analysis of the variation model for continuous responses and required independence between factor predictors. Like ordinary stepwise regression, the new algorithm includes a variable in the current model at each forward selection step if the test statistic of the enlarged model against the current model has the minimum p-value among all candidates and that p-value is smaller than a predetermined threshold. Instead of using conventional information criteria, the threshold is determined by a lower percentile of a beta distribution. The proposed stopping rule rests on the well-known properties that (1) the p-values of the test statistics are Unif(0, 1) distributed if the predictors are irrelevant to the response, and (2) the minimum of $m$ independent Unif(0, 1) random variables is approximately beta distributed with parameters 1 and $m$ [97, 98].

The algorithm can be summarized as follows. Suppose $\mathbf{Y}$ is an $n$-vector of responses and $\mathbf{X} = (X_1, \ldots, X_p)$ is an $n \times p$ design matrix of variables with $p \gg n$. Let $S$ be a subset of $\{0, 1, \ldots, p\}$ and denote by $\mathbf{X}_S$ the sub-matrix of $\mathbf{X}$ obtained by extracting the columns corresponding to the indices in $S$. In addition, let $M_S$ be the model relating the distribution of $\mathbf{Y}$ to the predictors $\mathbf{X}_S$ through a function of the linear predictor $\mathbf{X}_S\boldsymbol{\beta}_S$, with parameter vector $\boldsymbol{\beta}_S$ of size $|S|$. Finally, the vector of residuals from the fitted model $M_S$ is denoted by $\mathbf{R}_S$. The algorithm then proceeds in the following steps.

Forward Selection:
Step 1: Start with the null model $M_S$, where $S = \emptyset$.
Step 2: Let the step count be $l = |S| + 1$.
Step 3: Calculate the correlation between $\mathbf{R}_S$ and $X_j$, denoted $r_j$, for $j = 1, \ldots, p$. Set $D = \{1 \le j \le p : |r_j| \text{ is among the first } d \text{ largest of all}\}$, where $d = [n/\log\{n\}]$.
Step 4: Test the difference in goodness-of-fit between each $M_{S \cup \{j\}}$ and $M_S$ for all $j \in D \setminus S$.
Step 5: Replace $S$ with $S \cup \{j\}$ when the test statistic of $M_{S \cup \{j\}}$ against $M_S$ has the minimum p-value, denoted $p_l$, among the $|D \setminus S|$ competing models.
Step 6: Stop forward selection and go to Step 7 when $p_h$ exceeds the 10th percentile of the Beta$(1, p - h + 1)$ distribution for $h = l, l - 1, \ldots, l - 9$; otherwise, go to Step 2.

Backward Selection:
Step 7: Set $\pi = \exp(\mu - z \times v)$, where $\mu$ and $v$ are the sample mean and standard deviation of $\log\{p_h\}$.
Step 8: Test the difference in goodness-of-fit between $M_S$ and $M_{S \setminus \{j\}}$ for each $j \in S$.
Step 9: Replace $S$ with $S \setminus \{j_S\}$ and go to Step 8 if the test statistic of $M_S$ against $M_{S \setminus \{j_S\}}$ has the largest p-value among all the reduced models and that p-value is larger than $\pi$; otherwise, stop and report the set of remaining predictors $\{X_j, j \in S\}$ as the final influential predictors.

The main drawback of this method is that it was not supported by theoretical properties on model selection.
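A minimal sketch of the forward stage of this rule is given below, assuming a Gaussian linear model and partial F-tests. It only illustrates the Beta(1, m) percentile threshold: it tests all remaining candidates rather than the top-d correlation screen of Step 3, and it omits the ten-step look-back of Step 6 and the backward stage. The function names and toy data are my own, not Hwang et al.'s code.

```python
import numpy as np
from scipy import stats

def beta_threshold(m, q=0.10):
    """q-th percentile of Beta(1, m): the stopping threshold for the minimum
    p-value among m candidate tests."""
    return stats.beta.ppf(q, 1, m)

def forward_step(y, X, selected, candidates):
    """One forward step: for each candidate j, compare the enlarged linear
    model with the current one via a partial F-test; return the best (j, p)."""
    n = len(y)
    Xs = np.column_stack([np.ones(n)] + [X[:, k] for k in selected])
    rss0 = float(np.sum((y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]) ** 2))
    best = (None, 1.0)
    for j in candidates:
        Xj = np.column_stack([Xs, X[:, j]])
        rssj = float(np.sum((y - Xj @ np.linalg.lstsq(Xj, y, rcond=None)[0]) ** 2))
        df2 = n - Xj.shape[1]
        F = (rss0 - rssj) / (rssj / df2)
        pval = stats.f.sf(F, 1, df2)
        if pval < best[1]:
            best = (j, pval)
    return best

# Keep adding variables while the minimum p-value beats the Beta(1, m)
# percentile, m being the number of candidates tested at that step.
rng = np.random.default_rng(2)
n, p = 80, 300
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 3] - X[:, 7] + 0.3 * rng.standard_normal(n)
selected = []
while len(selected) < 10:
    candidates = [j for j in range(p) if j not in selected]
    j_star, p_min = forward_step(y, X, selected, candidates)
    if p_min > beta_threshold(len(candidates)):
        break
    selected.append(j_star)
print("selected:", selected)
```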
This model with a canonical link function and a unit dispersion parameter, belongs to a larger exponential family [100]. It also assumes that 𝑏(·) is twice continuously differentiable with a non-negative second derivative 𝑏00 (·). In addition, it uses 𝜇(·) and 𝜎(·) to denote 𝑏0 (·) and 𝑏00 (·), i.e. the mean and variance functions, respectively. For example, 𝑏(𝜃) = log(1 + exp(𝜃)) in a logistic distribution and 𝑏(𝜃) = exp(𝜃) in a Poisson distribution. Let E𝑛 { 𝑓 (𝜉)} = 𝑛−1 𝑖=1 Í𝑛 𝑓 (𝜉𝑖 ) denote the mean of { 𝑓 (𝜉𝑖 )}𝑖=1 𝑛 for a sequence of i.i.d. random variables 𝜉 (𝑖 = 1, . . . , 𝑛) and a 𝑖 non-random function 𝑓 (·). The loglikelihood function, apart from an additive constant is Õ𝑛 −1 𝑛 𝐿 (X𝑖T β, 𝑌𝑖 ) = E𝑛 {𝐿 (XT β, 𝑌 )} (2.4) 𝑖=1 It uses β∗ = (𝛽∗0 , 𝛽∗1 , . . . , 𝛽∗𝑝 ) T to denote the true values of β. Then the true model is M = { 𝑗 : 𝛽∗ 𝑗 ≠ 0, 𝑗 ≥ 1} ∪ {0}, which consists of the intercept and all variables with nonzero effects. The estimate of M is denoted as M̂. It elaborate on the idea of building model with the proposed SC approach. The key is to include an offset term which summarizes the information acquired from the previous selection steps and to search for a new candidate variable that maximizes the likelihood with such an offset. An SC approach algorithm starts with initial index set, S0 , and initial offset, O0 . Having S0 = {0}, O0 = 𝛽ˆS0 , where 𝛽ˆS0 is estimated intercept without any covariates. First, with such O0 , it computes 𝛽ˆ (1) 𝑗 = arg maxβ 𝑙 O0 , 𝑗 (𝛽) for j ∈ {0, 1, . . . , 𝑝}, where 𝑙 O0 , 𝑗 (β) is the average log-likelihood of the regression model. Then, 𝑗1 = arg max 𝑗 ∈{0,1,...,𝑝} 𝑙O0 , 𝑗 ( 𝛽ˆ (1) T 𝑗 ), S1 = {0, 𝑗 1 }, and O1 = XS1 β̂S1 . Iteratively, for k ≥ 1, given S 𝑘 and O 𝑘 , it computes 𝛽ˆ (𝑘+1) 𝑗 = arg max 𝛽 𝑙O𝑘 , 𝑗 (β) for 𝑗 ∈ 𝑆 𝑐𝑘 . Then, 𝑗 𝑘+1 = arg max 𝑗 ∈𝑆 𝑐 𝑙 O𝑘 , 𝑗 ( β̂ (𝑘+1) 𝑗 ), 𝑆 𝑘+1 = 𝑆 𝑘 ∪ { 𝑗 𝑘+1 }, and O 𝑘+1 = XTS𝑘+1 β̂S𝑘+1 . To decide whether 𝑘 it recruits another variable 𝑗 𝑘+1 or stops procedure at 𝑘th step, it computed EBIC on set 𝑆 𝑘+1 , where EBIC(𝑆 𝑘+1 ) = −2ℓ𝑆 𝑘+1 ( β̂𝑆 𝑘+1 ) + (𝑘 + 1)(log 𝑛 + 2𝜂 log 𝑝)/𝑛 21 If EBIC(𝑆 𝑘+1 )≥EBIC(𝑆 𝑘 ), it stops and declares 𝑆 𝑘 the final model. The SC approach is computationally efficient as it maximizes the likelihood with respect to only one covariate at each step given the offset, the property that was not observed in other methods. The main drawback of SC approach is that it may be suboptimal compared to a full scale forward optimization approach. Additionally, the consistency of the estimated model parameters has not been addressed in related literature. 2.1.5 Cox Models with High-Dimensional Predictors via Forward Regression As mentioned, forward regression can consistently identify all relevant predictors in high-dimensional linear regression settings by using EBIC stopping rule. However, existing results from recent works are based on the sum of residual squares from linear models and it is not certain whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models since the results are based on the sum of residual squares from linear models. There has been active research in developing high-dimensional screening tools for survival data. These works include principled sure screening [101], feature aberration at survival times screening [102] and conditional screening [103], quantile adaptive sure independence screening [104], a censored rank independence screening procedure [105], and integrated powered density screening [106]. 
However, the screening methods require a threshold to dictate how many variables to retain, and unfortunately there are no clear rules for choosing it. Zhao and Li [101] did tie the threshold to false discoveries, but their approach still requires prespecifying the number of false positives that users are willing to tolerate. Recently, Li et al. [107] designed a model-free measure, the survival impact index, that sensibly captures the overall influence of a covariate on the survival outcome and can help guide the selection of important variables. However, even this method, like the other screening methods, does not directly lead to a final model, for which extra modeling steps have to be implemented.
Hong et al. [108] introduced a new forward variable selection procedure for survival data based on the partial likelihood. It selects important variables sequentially according to the increment of the partial likelihood, with a stopping rule based on EBIC. The algorithm for the proposed method is the following. Suppose $n$ subjects with $p$ covariates are observed, where $p \gg n$. For subject $i$, denote by $X_{ij}$ the $j$th covariate, write $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})^{\mathrm{T}}$, and let $T_i$ and $C_i$ be the underlying survival and censoring times. We only observe $Y_i = \min(T_i, C_i)$ and the event indicator $\delta_i = I(T_i \le C_i)$, where $I$ is the indicator function. We assume random censoring such that $C_i$ and $T_i$ are independent given $\mathbf{X}$. To link $T_i$ to $\mathbf{X}_i$, for each $i \in \{1, \ldots, n\}$, we consider the Cox proportional hazards model
$$\lambda(t \mid \mathbf{X}_i) = \lambda_0(t) \exp\{\boldsymbol{\beta}_0^{\mathrm{T}} \mathbf{X}_i\}, \qquad (2.5)$$
where $\lambda_0$ is the unspecified baseline hazard function and $\boldsymbol{\beta}_0 = (\beta_{01}, \ldots, \beta_{0p})^{\mathrm{T}}$ is the vector of regression coefficients. Additionally, let $S \subset \{1, 2, \ldots, p\}$ be an index set and $|S|$ its cardinality.
First, we initialize $S_0 = \emptyset$ and sequentially select nested sets of covariates $S_0 \subset S_1 \subset \cdots \subset S_k$. At the $(k+1)$th step the algorithm selects a new covariate not contained in $S_k$ and then decides whether to include it and proceed to the next step or to stop at the $k$th step. The selection criterion is based on the partial likelihood. Given $S_k$, for every $j \in S_k^c$, it fits a Cox model on the variables indexed by $S_{k,j} = S_k \cup \{j\}$. It then computes the increment of the log partial likelihood for each $j \in S_k^c$, that is, $l_{S_{k,j}}(\hat{\boldsymbol{\beta}}_{S_{k,j}}) - l_{S_k}(\hat{\boldsymbol{\beta}}_{S_k})$, where $l_S(\boldsymbol{\beta}_S)$ is the log partial likelihood given $\mathbf{X}_S$:
$$l_S(\boldsymbol{\beta}_S) = \sum_{i=1}^{n} \int_0^{\tau} \left[ \boldsymbol{\beta}_S^{\mathrm{T}} \mathbf{X}_{iS} - \ln\left\{ \sum_{l=1}^{n} \bar{Y}_l(t) \exp\left( \boldsymbol{\beta}_S^{\mathrm{T}} \mathbf{X}_{lS} \right) \right\} \right] dN_i(t), \qquad (2.6)$$
where $N_i(t) = I(Y_i \le t, \delta_i = 1)$ is the counting process, $\bar{Y}_l(t) = I(Y_l \ge t)$ is the at-risk process, and $\tau > 0$ is the study duration such that $P(Y \ge \tau) > 0$. The candidate index is chosen as $j^* = \arg\max_{j \notin S_k} \, l_{S_{k,j}}(\hat{\boldsymbol{\beta}}_{S_{k,j}}) - l_{S_k}(\hat{\boldsymbol{\beta}}_{S_k})$ and the index set is updated as $S_{k+1} = S_k \cup \{j^*\}$. To decide whether to stop at the $k$th step or to include $j^*$ and proceed to the next step, the algorithm uses the following EBIC criterion:
$$\mathrm{EBIC}(S_{k+1}) = -2\, l_{S_{k+1}}(\hat{\boldsymbol{\beta}}_{S_{k+1}}) + (k+1)\left( \ln d + 2\eta \ln p \right), \qquad (2.7)$$
where $d = \delta_1 + \cdots + \delta_n$ is the number of events and $\eta$ is some positive constant. If $\mathrm{EBIC}(S_{k+1}) > \mathrm{EBIC}(S_k)$, the algorithm stops and declares $S_k$ the final model; otherwise it proceeds to the next step. They showed that if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps, with the entry order determined by the size of the likelihood increment.
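To illustrate the search just described, the following sketch implements partial-likelihood forward selection with the EBIC stopping rule (2.7). It is a minimal illustration rather than the authors' implementation: it assumes the data sit in a pandas DataFrame with hypothetical columns `time` and `status`, it uses the `lifelines` package (whose fitted `CoxPHFitter` exposes the maximized log partial likelihood as `log_likelihood_`), and for brevity it omits the EBIC of the null model, so the first variable is always admitted.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def ebic_cox(loglik, k, d, p, eta=1.0):
    # EBIC (2.7) for a Cox model with k selected covariates, d events, p candidates
    return -2.0 * loglik + k * (np.log(d) + 2.0 * eta * np.log(p))

def forward_cox_ebic(df, covariates, eta=1.0):
    """Greedy partial-likelihood forward selection with an EBIC stopping rule."""
    d = int(df["status"].sum())            # number of observed events
    p = len(covariates)
    selected, best_ebic = [], np.inf       # null-model EBIC omitted in this sketch
    while True:
        best_j, best_loglik = None, -np.inf
        for j in set(covariates) - set(selected):
            cols = selected + [j] + ["time", "status"]
            fit = CoxPHFitter().fit(df[cols], duration_col="time", event_col="status")
            if fit.log_likelihood_ > best_loglik:
                best_j, best_loglik = j, fit.log_likelihood_
        cand_ebic = ebic_cox(best_loglik, len(selected) + 1, d, p, eta)
        if cand_ebic > best_ebic:          # EBIC no longer improves: stop and report
            return selected
        selected.append(best_j)
        best_ebic = cand_ebic
```

Each forward step refits one small Cox model per remaining candidate, so the per-step cost grows linearly in the number of candidate covariates.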
The proposed method is potentially the first work to investigate partial likelihood-based forward regression in survival models with high-dimensional predictors. Moreover, it represents technical advances and a broadened scope compared to the existing forward regression literature (e.g., [109], [110], [93]), and it improves upon the partial likelihood-based variable selection developed by [111], [112] for survival data in low-dimensional settings. The disadvantage of the proposed work is that it does not address parameter estimation, which limits its usage in building predictive models.
2.1.6 A Stepwise Regression Method and Consistent Model Selection for High-Dimensional Sparse Linear Models
Stepwise least squares regression is widely used in applied regression analysis to handle a large number of input variables. It consists of forward selection of input variables in a "greedy" manner, so that the selected variable at each step minimizes the residual sum of squares after least squares regression is performed on it together with the previously selected variables, a stopping rule to terminate the forward inclusion of variables, and stepwise backward elimination of variables according to some criterion. Ing et al. [113] developed an asymptotic theory for a version of stepwise regression in the context of high-dimensional regression under certain sparsity assumptions. They introduced a fast stepwise regression method, called the orthogonal greedy algorithm (OGA), that selects input variables to enter a $p$-dimensional linear regression model (with $p \gg n$, the sample size) sequentially, so that the selected variable at each step minimizes the residual sum of squares. They derived the convergence rate of OGA and developed a consistent model selection procedure along the OGA path that can adjust for potential spuriousness of the greedily chosen regressors among a large number of candidate variables. The resulting regression estimate is shown to have the oracle property of being equivalent to least squares regression on an asymptotically minimal set of relevant regressors under a strong sparsity condition. The forward stepwise component of the procedure is known as an orthogonal greedy algorithm in compressed sensing and approximation theory, where the focus is on approximations in noiseless models. They also developed a fast iterative procedure for updating OGA that uses componentwise linear regression, similar to the $L_2$-boosting procedure of Buhlmann and Yu [114], and does not require matrix inversion.
Consider the linear regression model
$$y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (2.8)$$
with $p$ predictors $x_{i1}, x_{i2}, \ldots, x_{ip}$ that are uncorrelated with the mean-zero random disturbances $\epsilon_i$. As mentioned, $L_2$-boosting is an iterative procedure that generates a sequence of linear approximations $\hat{y}_k(x)$ of the regression function (with $\alpha = 0$) by applying componentwise linear least squares to the residuals obtained at each iteration. Initializing with $\hat{y}_0(\cdot) = 0$, it computes the residuals $U_i^{(k)} := y_i - \hat{y}_k(x_i)$, $1 \le i \le n$, at the end of the $k$th iteration and chooses the predictor $x_{\hat{j}_{k+1}}$ on which the pseudo-responses $U_i^{(k)}$ are regressed, such that
$$\hat{j}_{k+1} = \arg\min_{1 \le j \le p} \sum_{i=1}^{n} \left( U_i^{(k)} - \tilde{\beta}_j^{(k)} x_{ij} \right)^2, \qquad (2.9)$$
where
$$\tilde{\beta}_j^{(k)} = \sum_{i=1}^{n} U_i^{(k)} x_{ij} \Big/ \sum_{i=1}^{n} x_{ij}^2 .$$
This yields the update
$$\hat{y}_{k+1}(x) = \hat{y}_k(x) + \tilde{\beta}_{\hat{j}_{k+1}}^{(k)} x_{\hat{j}_{k+1}} . \qquad (2.10)$$
The procedure is then repeated until a pre-specified upper bound $m$ on the number of iterations is reached. When the procedure stops at the $m$th iteration, $y(x)$ is approximated by $\hat{y}_m(x)$.
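As a concrete illustration of the selector (2.9) and the update (2.10), the short numpy sketch below implements plain componentwise $L_2$-boosting with a fixed number of iterations $m$. It is our own schematic (the function name is ours), not code from [113] or [114].

```python
import numpy as np

def l2_boost(X, y, m):
    """Componentwise L2-boosting: m greedy least-squares updates of the fit."""
    n, p = X.shape
    coef = np.zeros(p)                   # coefficients accumulated in y_hat_m(x)
    resid = y.copy()                     # U_i^(0) = y_i, since y_hat_0 = 0 and alpha = 0
    for _ in range(m):
        slopes = X.T @ resid / (X ** 2).sum(axis=0)        # beta_tilde_j^(k)
        rss = ((resid[:, None] - X * slopes) ** 2).sum(axis=0)
        j = int(np.argmin(rss))          # selector (2.9): best single predictor
        coef[j] += slopes[j]             # update (2.10); the same j may re-enter later
        resid -= slopes[j] * X[:, j]     # new pseudo-responses U_i^(k+1)
    return coef
```

OGA modifies only the update step: after selecting $\hat{j}_{k+1}$ it orthogonalizes $\mathbf{X}_{\hat{j}_{k+1}}$ against the previously chosen columns, as described next.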
(𝑈𝑖(𝑘) − 𝛽˜ (𝑘) 2 Í𝑛 (𝑘) 2 2 Í𝑛 OGA uses the variable selector (2.9). Since 𝑖=1 𝑗 𝑥𝑖 𝑗 ) / 𝑖=1 (𝑈𝑖 ) = 1−𝑟 𝑗 , where 𝑟 𝑗 is the correlation coefficient between 𝑥𝑡 𝑗 and 𝑈𝑖(𝑘) , (2.9) chooses the predictor that is most correlated with 𝑈𝑖(𝑘) at the 𝑘th stage. However,the implementation of OGA updates (2.10) in another way and also carries out an additional linear transformation of the vector X 𝑗ˆ𝑘+1 to form X⊥𝑗ˆ , where 𝑘+1 X 𝑗 = (𝑥𝑖 𝑗 , . . . , 𝑥 𝑛 𝑗 )T. The idea is to orthogonalize the predictor variables sequentially so that OLS can be computed by componentwise linear regression, thereby circumventing difficulties with inverting high-dimensional matrices in the usual implementation of OLS. With the orthogonal vectors X 𝑗ˆ1 , X⊥𝑗ˆ , . . . , X⊥𝑗ˆ already computed in the previous stages, it can 2 𝑘 compute the projection X̂ 𝑗ˆ𝑘+1 of X 𝑗ˆ𝑘+1 into the linear space by adding the 𝑘 projections into the respective one-dimensional linear spaces. This also yields the residual vector X⊥𝑗ˆ = X 𝑗ˆ𝑘+1 − X̂ 𝑗ˆ𝑘+1 . 𝑘+1 OGA uses the following updates in lieu of (2.10): 𝑦ˆ 𝑘+1 (𝑥) = 𝑦ˆ 𝑘 (𝑥) + 𝛽ˆ (𝑘) ⊥ ˆ 𝑥 𝑗ˆ , (2.11) 𝑗 𝑘+1 𝑘+1 where 𝑛 , 𝑛 Õ Õ 𝛽ˆ (𝑘) ˆ = 𝑈𝑖(𝑘) 𝑥𝑖,⊥𝑗ˆ (𝑥𝑖,⊥𝑗ˆ ) 2 . 𝑗 𝑘+1 𝑘+1 𝑘+1 𝑖=1 𝑖=1 By sequentially orthogonalizing the input variables, OGA preserves the attractive computational features of componentwise linear regression in PGA. However, unlike PGA for which the same predictor variable can be entered repeatedly, OGA excludes variables that are already precluded from further consideration in (2.9). In addition, they developed a consistent model selection procedure along an OGA path under a ”strong sparsity” condition that the nonzero regression coefficients satisfying the weak sparsity condition are not too small. Applying the convergence rate of OGA, they proved that, with probability approaching 1 as 𝑛 → ∞, the OGA path includes all relevant regressors when the 26 number of iterations is large enough. They modified the model selection criteria like BIC and called it high-dimensional information criterion (HDIC), which is defined as:  2  HDIC(𝐽) = 𝑛log 𝜎 ˆ 𝐽 + |𝐽 |𝑤 𝑛 log 𝑝 , (2.12) ˆ 𝐽2 = 𝑛−1 − 𝑦ˆ 𝑖;𝐽 ) 2 . OGA+HDIC is shown Í𝑛 where 𝐽 is a non-empty subset of {1, . . . , 𝑝} and 𝜎 𝑖=1 (𝑦 𝑖 to select the smallest set of all relevant variables along the OGA path with probability approaching 1 (and is therefore variable-selection consistent). 2.2 Our Contribution Although methods described in the previous sections brought novelty to the research area and made a significant contribution to it, they still have some drawbacks. First, once a variable is identified by the forward selection, it is not removable from the list of selected variables. Hence, false positives are unavoidable without a systematic elimination procedure. Second, most of the existing works focus on variable selection and are silent with respect to estimation accuracy. To address the first issue, some works have been proposed to add backward elimination steps once forward selection is accomplished, as backward elimination may further eliminate false positives from the variables selected by forward selection. For example, An et al. [92] and Ing et al. [113] proposed a stepwise selection for linear regression models in high-dimensional settings and proved model selection consistency. However, it is unclear whether the results hold for high-dimensional GLMs; Hwang et al. [95] proposed a similar stepwise algorithm in high-dimensional GLM settings, but with no theoretical properties on model selection. 
Moreover, none of the relevant works has touched upon the accuracy of estimation.
We extend a stepwise regression method to accommodate GLMs with high-dimensional predictors. Our method, termed STEPWISE hereafter and introduced in Chapter 3, embraces both model selection and estimation. It starts with an empty model or a set of pre-specified predictors, scans all features and selects them sequentially, and conducts backward elimination once forward selection is completed. Our proposal controls both false negatives and false positives in high-dimensional settings: the forward selection steps recruit variables inclusively, allowing some false positives for the sake of avoiding false negatives, while the backward elimination steps remove the potential false positives from the recruited variables. We use different stopping criteria in the forward and backward selection steps to control the numbers of false positives and false negatives. This is achieved by adding flexibility through two tuning parameters in the stopping criteria, which strengthens our algorithm. The values of these parameters determine how many variables are likely to be included in the model. For instance, a small value of the tuning parameter in the forward selection will include more variables, whereas a large value will recruit too few features. Similarly, in the backward elimination step, a large value of the parameter will eliminate more variables, and vice versa for a small value. It is worth mentioning that our method includes forward selection as a special case when the tuning parameter equals 0, making the algorithm more flexible. Moreover, we prove that, under a sparsity assumption on the true model, the proposed approach can discover all of the relevant predictors within a finite number of steps, and the estimated coefficients are consistent, a property not previously established in the literature. Finally, our GLM framework enables our work to accommodate a wide range of data types, such as binary, categorical and count data.
Extensive numerical studies have been conducted to compare the STEPWISE procedure with the competing methods mentioned in the previous subsections. Specifically, we compared our algorithm with LASSO, dgLARS, forward regression (FR), the SC approach, and screening methods such as SIS. Our numerical studies included simulations: we compared the proposed method with the other methods in comprehensive simulated examples covering different model structures and variable dependencies. All of these examples were tested under various model types, including Normal, Binomial, and Poisson models. The results indicate that the STEPWISE algorithm was able to detect all the true signals with a nearly zero false positive rate. In addition, it outperformed the other methods by selecting more true positives with fewer false positives. Our numerical studies also included real data analyses, which aimed to demonstrate the utility of our method in real-life scenarios. We analyzed data from studies of gene regulation in the mammalian eye, esophageal squamous cell carcinoma, and neurobehavioral impairment from total sleep deprivation. We demonstrated that STEPWISE achieved comparable prediction accuracy, specificity, sensitivity, and AUC while selecting fewer variables than the other variable selection methods. Finally, to enhance the predictive power of the proposed method, we developed a multi-stage hybrid machine learning method.
It incorporates a stacking technique and includes both model- free and model-based methods. Specifically, it comprises Random Forest (RF), Extreme Gradient Boosting Machine (XGBoost), Support Vector Machine (SVM), Artificial Neural Network (ANN), and LASSO models along with the STEPWISE procedure. The numerical study has shown an improvement in the predictive power of our method. Furthermore, we developed a web application that enables users to utilize the aforementioned method in practice. To recap, our proposed method distinguishes from the existing stepwise approaches in high- dimensional settings. For example, it improves An et al. [92] and Ing et al. [113] by extending the work to a more broad GLM setting and Hwang et al. [95] by establishing the theoretical properties. Compared with the other variable selection and screening works, our method produces a final model in ultrahigh-dimensional settings, without applying a pre-screening step which may produce unintended false negatives. Under some regularity conditions, the method identifies or includes the true model with probability going to 1. Moreover, unlike the penalized approaches such as LASSO, the coefficients estimated by our STEPWISE selection procedure in the final model will be consistent, which are useful for gauging the real effect sizes of risk factors. 29 CHAPTER 3 STEPWISE METHOD: THEORY AND APPLICATIONS 3.1 Model Setup Let (X𝑖 , 𝑌𝑖 ), 𝑖 = 1, . . . , 𝑛, denote 𝑛 independent and identically distributed (i.i.d.) copies of (X, 𝑌 ). Here, X = (1, 𝑋1 , . . . , 𝑋 𝑝 ) T is a ( 𝑝 + 1)-dimensional predictor vector with 𝑋0 = 1 corresponding to the intercept term, and 𝑌 is an outcome. Suppose that the conditional density of 𝑌 , given X, belongs to a linear exponential family: 𝜋(𝑌 | X) = exp{𝑌 XT β − 𝑏(XT β) + A (𝑌 )}, (3.1) where β = (𝛽0 , 𝛽1 , . . . , 𝛽 𝑝 ) T is the vector of coefficients, 𝛽0 is the intercept, and A (·) and 𝑏(·) are known functions. Model (3.1), with a canonical link function and a unit dispersion parameter, belongs to a larger exponential family [100]. Further, 𝑏(·) is assumed twice continuously differentiable with a non-negative second derivative 𝑏00 (·). We use 𝜇(·) and 𝜎(·) to denote 𝑏0 (·) and 𝑏00 (·), i.e. the mean and variance functions, respectively. For example, 𝑏(𝜃) = log(1 + exp(𝜃)) in a logistic distribution and 𝑏(𝜃) = exp(𝜃) in a Poisson distribution. Let 𝐿 (𝑢, 𝑣) = 𝑢𝑣 − 𝑏(𝑢) and E𝑛 { 𝑓 (𝜉)} = 𝑛−1 𝑖=1 Í𝑛 𝑓 (𝜉𝑖 ) denote the mean of { 𝑓 (𝜉𝑖 )}𝑖=1 𝑛 for a sequence of i.i.d. random variables 𝜉𝑖 (𝑖 = 1, . . . , 𝑛) and a non-random function 𝑓 (·). Based on the i.i.d. observations, the log-likelihood function is Õ𝑛 −1 ℓ(β) = 𝑛 𝐿 (X𝑖T β, 𝑌𝑖 ) = E𝑛 {𝐿 (XT β, 𝑌 )}. (3.2) 𝑖=1 We use β∗ = (𝛽∗0 , 𝛽∗1 , . . . , 𝛽∗𝑝 ) T to denote the true values of β. Then the true model is M = { 𝑗 : 𝛽∗ 𝑗 ≠ 0, 𝑗 ≥ 1} ∪ {0}, which consists of the intercept and all variables with nonzero effects. Overarching goals of ultrahigh-dimensional data analysis are to identify M and estimate 𝛽∗ 𝑗 for 𝑗 ∈ M. While most of the relevant literature [99, 108] is on estimating M, this work is to accomplish both identification of M and estimation of 𝛽∗ 𝑗 . 30 When 𝑝 is in the exponential order of 𝑛, we aim to generate a predictive model that contains the true model with high probability, and provide consistent estimates of regression coefficients. We further introduce the following notation. For a generic index set 𝑆 ⊂ {0, 1, . . . 
, 𝑝} and a ( 𝑝+1)-dimensional vector A, we use 𝑆 𝑐 to denote the complement of a set 𝑆 and A𝑆 = {𝐴 𝑗 : 𝑗 ∈ 𝑆} to denote the subvector of A corresponding to 𝑆. For instance, if 𝑆 = {2, 3, 4}, then X𝑖𝑆 = (𝑋𝑖2 , 𝑋𝑖3 , 𝑋𝑖4 ) T . Moreover, denote by ℓ𝑆 (β𝑆 ) = E𝑛 {𝐿(XT𝑆 β𝑆 , 𝑌 )} the log-likelihood of the regression model of 𝑌 on X𝑆 and denote by β̂𝑆 the maximizer of ℓ𝑆 (β𝑆 ). Under model (3.1), we elaborate on the idea of stepwise selection, consisting of the forward and backward stages. Forward stage: We start with 𝐹0 , a set of variables that need to be included according to some a priori knowledge, such as clinically important factors and conditions. If no such information is available, 𝐹0 is set to be {0}, corresponding to a null model. We sequentially add covariates as follows: 𝐹0 ⊂ 𝐹1 ⊂ 𝐹2 ⊂ · · · ⊂ 𝐹𝑘 , where 𝐹𝑘 ⊂ {0, 1, . . . , 𝑝} is the index set of the selected covariates upon completion of the 𝑘th step, with 𝑘 ≥ 0. At the (𝑘 + 1)th step, we append new variables to 𝐹𝑘 one at a time and refit GLMs: for every 𝑗 ∈ 𝐹𝑘𝑐 , we let 𝐹𝑘, 𝑗 = 𝐹𝑘 ∪ { 𝑗 }, obtain β̂𝐹𝑘, 𝑗 by maximizing ℓ𝐹𝑘, 𝑗 (β𝐹𝑘, 𝑗 ), and compute the increment of log-likelihood, ℓ𝐹𝑘, 𝑗 ( β̂𝐹𝑘, 𝑗 ) − ℓ𝐹𝑘 ( β̂𝐹𝑘 ). Then the index of a new candidate variable is determined to be 𝑗 𝑘+1 = arg max ℓ𝐹𝑘, 𝑗 ( β̂𝐹𝑘, 𝑗 ) − ℓ𝐹𝑘 ( β̂𝐹𝑘 ). 𝑗 ∈𝐹𝑘𝑐 And we update 𝐹𝑘+1 = 𝐹𝑘 ∪ { 𝑗 𝑘+1 }. We then need to decide whether to stop at the 𝑘th step or move on to the (𝑘 + 1)th step with 𝐹𝑘+1 . To do so, we use the following EBIC criterion: EBIC(𝐹𝑘+1 ) = −2ℓ𝐹𝑘+1 ( β̂𝐹𝑘+1 ) + |𝐹𝑘+1 |𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝), (3.3) where the second term is motivated by Chen and Chen [115] and |𝐹 | denotes the cardinality of a set 𝐹. 31 The forward selection stops if EBIC(𝐹𝑘+1 ) > EBIC(𝐹𝑘 ). We denote the stopping step by 𝑘 ∗ and the set of variables selected so far by 𝐹𝑘 ∗ . Backward stage: Upon the completion of forward stage, backward elimination, starting with 𝐵0 = 𝐹𝑘 ∗ , sequentially drops covariates as follows: 𝐵0 ⊃ 𝐵1 ⊃ 𝐵2 ⊃ · · · ⊃ 𝐵 𝑘 , where 𝐵 𝑘 is the index set of the remaining covariates upon the completion of the 𝑘th step of the backward stage, with 𝑘 ≥ 0. At the (𝑘 + 1)th backward step and for every 𝑗 ∈ 𝐵 𝑘 , we let 𝐵 𝑘/ 𝑗 = 𝐵 𝑘 \ { 𝑗 }, obtain β̂𝐵 𝑘/ 𝑗 by maximizing ℓ(β𝐵 𝑘/ 𝑗 ), and calculate the difference of the log-likelihoods between these two nested models, ℓ𝐵 𝑘 ( β̂𝐵 𝑘 ) − ℓ𝐵 𝑘/ 𝑗 ( β̂𝐵 𝑘/ 𝑗 ). The variable that can be removed from the current set of variables is indexed by 𝑗 𝑘+1 = arg min ℓ𝐵 𝑘 ( β̂𝐵 𝑘 ) − ℓ𝐵 𝑘/ 𝑗 ( β̂𝐵 𝑘/ 𝑗 ). 𝑗 ∈𝐵 𝑘 Let 𝐵 𝑘+1 = 𝐵 𝑘 \ { 𝑗 𝑘+1 }. We determine whether to stop at the 𝑘th step or move on to the (𝑘 + 1)th step of the backward stage according to the following BIC criterion: BIC(𝐵 𝑘+1 ) = −2ℓ𝐵 𝑘+1 ( β̂𝐵 𝑘+1 ) + 𝜂2 𝑛−1 |𝐵 𝑘+1 | log 𝑛. (3.4) If BIC(𝐵 𝑘+1 ) > BIC(𝐵 𝑘 ), we end the backward stage at the 𝑘th step. Let 𝑘 ∗∗ denote the stopping step and we declare the selected model 𝐵 𝑘 ∗∗ to be the final model. Thus, M̂ = 𝐵 𝑘 ∗∗ is the estimate of M. As the backward stage starts with the 𝑘 ∗ variables selected by forward selection, 𝑘 ∗∗ cannot exceed 𝑘 ∗ . A strength of our algorithm is the added flexibility with 𝜂1 and 𝜂2 in the stopping criteria for controlling the false negatives and positives. For example, a smaller value of 𝜂1 close to zero in the forward selection step will likely include more variables, thus incur more false positives and less false negatives, whereas a larger value of 𝜂1 will recruit too few variables and cause too many false negatives. 
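To make the two stages and the roles of 𝜂1 and 𝜂2 concrete, the following schematic sketch implements the forward EBIC search (3.3) and the backward BIC elimination (3.4) for a logistic GLM. It is an illustration of the procedure, not the software used in this thesis; it assumes the design matrix X already contains the intercept as column 0, relies on statsmodels (whose fitted GLM exposes the maximized log-likelihood as `llf`), and uses purely illustrative default values for the tuning parameters.

```python
import numpy as np
import statsmodels.api as sm

def stepwise_glm(y, X, eta1=0.5, eta2=3.0):
    """Schematic STEPWISE selection for a logistic GLM.
    X must contain the intercept as column 0; returns the selected column indices."""
    n, p1 = X.shape
    p = p1 - 1

    def avg_loglik(cols):
        # average log-likelihood (3.2) of the GLM using the columns in `cols`
        return sm.GLM(y, X[:, cols], family=sm.families.Binomial()).fit().llf / n

    def ebic(ll, k):          # forward criterion (3.3)
        return -2.0 * ll + k * (np.log(n) + 2.0 * eta1 * np.log(p)) / n

    def bic(ll, k):           # backward criterion (3.4)
        return -2.0 * ll + eta2 * k * np.log(n) / n

    # Forward stage: add the variable giving the largest log-likelihood increment
    F = [0]
    crit = ebic(avg_loglik(F), len(F))
    while len(F) < p1:
        gains = {j: avg_loglik(F + [j]) for j in range(1, p1) if j not in F}
        j_new = max(gains, key=gains.get)
        new_crit = ebic(gains[j_new], len(F) + 1)
        if new_crit > crit:                    # EBIC increases: stop the forward stage
            break
        F, crit = F + [j_new], new_crit

    # Backward stage: drop the variable whose removal loses the least log-likelihood
    B = list(F)
    crit = bic(avg_loglik(B), len(B))
    while len(B) > 1:
        fits = {j: avg_loglik([c for c in B if c != j]) for j in B if j != 0}
        j_out = max(fits, key=fits.get)
        new_crit = bic(fits[j_out], len(B) - 1)
        if new_crit > crit:                    # BIC increases: stop the backward stage
            break
        B, crit = [c for c in B if c != j_out], new_crit
    return B
```

In this sketch 𝜂1 and 𝜂2 enter only through the two information criteria, which is precisely where their effect on false positives and false negatives arises.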
Similarly, in the backward selection step, a large 𝜂2 would eliminate more variables and 32 therefore further reduce more false positives, and vice versa for a small 𝜂2 . While finding optimal 𝜂1 and 𝜂2 is not trivial, our numerical experiences suggest a small 𝜂1 and a large 𝜂2 may well balance the false negatives and positives. When 𝜂2 = 0, no variables can be dropped after forward selection; hence, our proposal includes forward selection as a special case. Moreover, Zheng et al. [99] proposed a sequentially conditioning approach based on offset terms that absorb the prior information. However, our numerical experiments indicate that the offset approach may be suboptimal compared to our full stepwise optimization approach, which will be demonstrated in the simulation studies. 3.2 Theoretical Properties With a column vector v, let kvk 𝑞 denote the 𝐿 𝑞 -norm for any 𝑞 ≥ 1. For simplicity, we denote the 𝐿 2 -norm of v by kvk, and denote vvT by v⊗2 . We use 𝐶1 , 𝐶2 , . . . , to denote some generic constants that do not depend on 𝑛 and may change from line to line. The following regularity conditions are set. 1. There exist a positive integer 𝑞 satisfying |M| ≤ 𝑞 and 𝑞 log 𝑝 = 𝑜(𝑛1/3 ) and a constant 𝐾 > 0 such that sup|𝑆|≤𝑞 kβ𝑆∗ k 1 ≤ 𝐾, where β𝑆∗ = arg maxβ𝑆 𝐸 [ℓ𝑆 (β𝑆 )] is termed the least false value of model 𝑆. 2. kXk ∞ ≤ 𝐾. In addition, 𝐸 (𝑋 𝑗 ) = 0 and 𝐸 (𝑋 2𝑗 ) = 1 for 𝑗 ≥ 1. 3. Let 𝜖𝑖 = 𝑌𝑖 − 𝜇(β∗T X𝑖 ). There exists a positive constant 𝑀 such that the Cramer condition holds, i.e., 𝐸 [|𝜖𝑖 | 𝑚 ] ≤ 𝑚!𝑀 𝑚 for all 𝑚 ≥ 1. 4. |𝜎(𝑎) − 𝜎(𝑏)| ≤ 𝐾 |𝑎 − 𝑏| and 𝜎min := inf |𝑡|≤𝐾 3 |𝑏00 (𝑡)| is bounded below.    5. There exist two positive constants, 𝜅 min and 𝜅 max such that 0 < 𝜅min < Λ 𝐸 X𝑆⊗2 < 𝜅max < ∞, uniformly in 𝑆 ⊂ {0, 1, . . . , 𝑝} satisfying |𝑆| ≤ 𝑞, where Λ(A) is the collection of all eigenvalues of a square matrix A. 33 6. min𝑆:M*𝑆,|𝑆|≤𝑞 𝐷 𝑆 > 𝐶𝑛−𝛼 for some constants 𝐶 > 0 and 𝛼 > 0 that satisfies 𝑞𝑛−1+4𝛼 log 𝑝 → h  i 0, where 𝐷 𝑆 = max 𝑗 ∈S 𝑐 ∩M 𝐸 𝜇(β∗T X) − 𝜇(β𝑆∗T X𝑆 ) 𝑋 𝑗 . Condition (1), as assumed in Buhlmann et al. [116] and Zheng et al. [99], is an alternative to the Lipschitz assumption [117, 87]. The bound of the model size allowed in the selection procedure or 𝑞 is often required in model-based screening methods (see, e.g. [94, 118, 109, 99]). The bound should be large enough so that the correct model can be included, but not too large; otherwise, excessive noise variables would be included, leading to unstable and inconsistent estimates. In- deed, Conditions (1) and (6) reveal that the range of 𝑞 depends on the true model size |M|, the minimum signal strength, 𝑛−𝛼 , and the total number of covariates, 𝑝. The upper bound of 𝑞 is 𝑜((𝑛1−4𝛼 /log 𝑝) ∧ (𝑛1/3 /log 𝑝)), ensuring the consistency of EBIC [115]. Condition (1) also implies that the parameter space under consideration can be restricted to B := {β ∈ R 𝑝+1 : kβk 1 ≤ 𝐾 2 }, for any model 𝑆 with |𝑆| ≤ 𝑞. Condition (2), as assumed in Zhao et al. [101] and Kwemou et al. [119], reflects that data are often standardized at the pre-processing stage. Condition (3) ensures that 𝑌 has a light tail, and is satisfied by Gaussian and discrete data, such as binary and count data [120]. Condition (4) is satisfied by common GLM models, such as Gaussian, Binomial, Poisson and Gamma distributions. Condition (5) represents the sparse Riesz condition [121] and Condition (6) is a strong "irrepre- sentable" condition, suggesting that M cannot be represented by a set of variables that does not include the true model. 
It further implies that adding a signal variable to a mis-specified model will increase the log-likelihood by a certain lower bound [99]. The signal rate is comparable to the conditions required by the other sequential methods (see, e.g. [93, 109]). Theorem 3.2.1 develops a lower bound of the increment of the log-likelihood if the true model M is not yet included in a selected model 𝑆. Theorem 3.2.1. Suppose Conditions (1) – (6) hold. There exists some constant 𝐶1 such that with probability at least 1 − 6 exp(−6𝑞 log 𝑝),   min max𝑐 ℓ𝑆∪{ 𝑗 } ( β̂𝑆∪{ 𝑗 } ) − ℓ𝑆 ( β̂𝑆 ) ≥ 𝐶1 𝑛−2𝛼 . 𝑆:M*𝑆,|𝑆|<𝑞 𝑗 ∈𝑆 34 Theorem 3.2.1 shows that, before the true model is included in the selected model, we can append a variable which will increase the log-likelihood by at least 𝐶1 𝑛−2𝛼 with probability tending to 1. This ensures that in the forward stage, our proposed STEPWISE approach will keep searching for signal variables until the true model is contained. To see this, suppose at the 𝑘th step of the forward stage that 𝐹𝑘 satisfies 𝑀 * 𝐹𝑘 and |𝐹𝑘 | < 𝑞, and let 𝑟 be the index selected by Stepwise. By Theorem 3.2.1, we obtain that, for any 𝜂1 > 0, when 𝑛 is sufficiently large, EBIC(𝐹𝑘,𝑟 ) − EBIC(𝐹𝑘 ) = −2ℓ𝐹𝑘,𝑟 ( β̂𝐹𝑘,𝑟 ) + (|𝐹𝑘 | + 1)𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝) − −2ℓ𝐹𝑘 ( β̂𝐹𝑘 ) + |𝐹𝑘 |𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝)   ≤ −2𝐶1 𝑛−2𝛼 + 𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝) < 0, with probability at least 1 − 6 exp(−6𝑞 log 𝑝), where the last inequality is due to Condition (6). Therefore, with high probability the forward stage of STEPWISE continues as long as M * 𝐹𝑘 and |𝐹𝑘 | < 𝑞. We next establish an upper bound of the number of steps in the forward stage needed to include the true model. Theorem 3.2.2. Under the same conditions as in Theorem 3.2.1 and if   𝐸 𝑌 − 𝜇(β𝑆∗T X𝑆 ) 𝑋 𝑗 = 𝑜(𝑛−𝛼 ),   max max 𝑆:|𝑆|≤𝑞 𝑗 ∈S 𝑐 ∩M 𝑐 then there exists some constant 𝐶2 > 2 such that M ⊂ 𝐹𝑘 , for some 𝐹𝑘 in the forward stage of Stepwise and 𝑘 ≤ 𝐶2 |M|, with probability at least 1 − 18 exp(−4𝑞 log 𝑝). The "max" condition, as assumed in Section 5.3 of Fan et al. [122], relaxes the partial orthogonality assumption that XM 𝑐 are independent of XM , and ensures that with probability tending to 1, appending a signal variable increases log-likelihood more than adding a noise variable does, uniformly over all possible models 𝑆 satisfying M * 𝑆, |𝑆| < 𝑞. This entails that the proposed procedure is much more likely to select a signal variable, in lieu of a noise variable, at each step. Since EBIC is a consistent model selection criterion [110, 123], the following theorem guarantees termination of the proposed procedure with M ⊂ 𝐹𝑘 for some 𝑘. 35 Theorem 3.2.3. Under the same conditions as in Theorem 3.2.2 and if M ⊄ 𝐹𝑘−1 and M ⊂ 𝐹𝑘 , the forward stage stops at the 𝑘th step with probability going to 1 − exp(−3𝑞 log 𝑝). Theorem 3.2.3 ensures that the forward stage of STEPWISE will stop within a finite number of steps and will cover the true model with probability at least 1 − 𝑞 exp(−3𝑞 log 𝑝) ≥ 1 − exp(−2𝑞 log 𝑝). We next consider the backward stage and provide a probability bound of removing a signal from a set in which the set of true signals M is contained. Theorem 3.2.4. Under the same conditions as in Theorem 3.2.2, BIC(𝑆\{𝑟}) − BIC(𝑆) > 0 uniformly over 𝑟 ∈ M and 𝑆 satisfying M ⊂ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − 6 exp(−6𝑞 log 𝑝). Theorem 3.2.4 indicates that with probability at 1 − 6 exp(−6𝑞 log 𝑝), BIC would decrease when removing a signal variable from a model that contains the true model. 
That is, with high probability, back elimination is to reduce false positives. Recall that 𝐹𝑘 ∗ denotes the model selected at the end of the forward selection stage. By Theorem 3.2.2, M ⊂ 𝐹𝑘 ∗ with probability at least 1 − 18 exp(−4𝑞 log 𝑝). Then Theorem 3.2.4 implies that at each step of the backward stage, a signal variable will not be removed from the model with probability at least 1 − 6 exp(−6𝑞 log 𝑝). By Theorem 3.2.2, |𝐹𝑘 ∗ | ≤ 𝐶2 |M|. Thus, the backward elimination will carry out at most (𝐶2 − 1)|M| steps. Combining results from Theorems 3.2.2 and 3.2.3 yields that M ⊂ M̂ with probability at least 1 − 18 exp(−4𝑞 log 𝑝) − 6(𝐶2 − 1)|M | exp(−6𝑞 log 𝑝). Let β̂ be the estimate of β∗ in model (3.1) at the termination of STEPWISE. By convention, the estimates of the coefficients of the unselected covariates are 0. Theorem 3.2.5. Under the same conditions as in Theorem 3.2.2, we have that M ⊆ M̂ and k β̂ − β∗ k → 0 in probability. 36 The theorem warrants that the proposed STEPWISE yields consistent estimates, a property not shared by many regularized methods, including LASSO. Our later simulations verified this. Proof of main theorems and lemmas are provided in the following Chapter. 3.3 Proof of the Theorems Since 𝑏(·) is twice continuously differentiable with a nonnegative second derivative 𝑏00 (·), 𝑏 max := max|𝑡|≤𝐾 3 |𝑏(𝑡)|, 𝜇max := max|𝑡|≤𝐾 3 |𝑏0 (𝑡)|, and 𝜎max := sup|𝑡|≤𝐾 3 |𝑏00 (𝑡)| are bounded above, where 𝐿 and 𝐾 are some constants from Conditions (1) and (2), respectively. Let G𝑛 { 𝑓 (𝜉)} = 𝑛−1/2 𝑖=1 Í𝑛 ( 𝑓 (𝜉𝑖 ) − 𝐸 [ 𝑓 (𝜉𝑖 )]) for a sequence of i.i.d. random variables 𝜉𝑖 (𝑖 = 1, . . . , 𝑛) and a non-random function 𝑓 (·). Given any β𝑆 , when a variable 𝑋𝑟 , 𝑟 ∈ 𝑆 𝑐 is added into the model 𝑆, we define the augmented log-likelihood as n  o ℓ𝑆∪{𝑟} (β𝑆+𝑟 ) := E𝑛 𝐿 β𝑆T X𝑆 + 𝛽𝑟 𝑋𝑟 , 𝑌 . (3.1) We use β̂𝑆+𝑟 to denote the maximizer of (3.1). Thus, β̂𝑆+𝑟 = β̂𝑆∪{𝑟} . In addition, denote the maximizer of 𝐸 [ℓ𝑆∪{𝑟} (β𝑆+𝑟 )] by β𝑆+𝑟 ∗ . Due to the concavity of the log-likelihood in GLMs with the canonical link, β𝑆+𝑟 ∗ is unique. Proof of Theorem 3.2.1. Given an index set 𝑆 and 𝑟 ∈ 𝑆 𝑐 , let B𝑆0 (𝑑) = {β𝑆 : kβ𝑆 − β𝑆∗ k ≤ p p 𝑑/(𝐾 |𝑆|)} where 𝑑 = 𝐴2 𝑞 3 log 𝑝/𝑛 with 𝐴2 defined in Lemma 3.4.6. Let Ω be the event that n h    i p T ∗T sup G𝑛 𝐿 β 𝑆 X 𝑆 , 𝑌 − 𝐿 β 𝑆 X 𝑆 , 𝑌 ≤ 20𝐴1 𝑑 𝑞 log 𝑝 and |𝑆|≤𝑞,β𝑆 ∈B𝑆0 (𝑑) p o max |G𝑛 𝐿 (β𝑆∗T X𝑆 , 𝑌 ) 2   | ≤ 10( 𝐴1 𝐾 + 𝑏 max ) 𝑞 log 𝑝 , |𝑆|≤𝑞 where 𝐴1 is some constant defined in Lemma 3.4.4. By Lemma 3.4.4, 𝑃(Ω) ≥ 1−6 exp(−6𝑞 log 𝑝). Thus in the rest of the proof, we only consider the sample points in Ω. In the proof of Lemma 3.4.6, we show that max|𝑆|≤𝑞 k β̂𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 under Ω. Then given an index set 𝑆 and β𝑆 such that |𝑆| < 𝑞, kβ𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 , and for 37 any 𝑗 ∈ 𝑆 𝑐 , ∗ ∗ ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 ( β̂𝑆 ) ≥ inf ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 (β𝑆 ) = k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i −1/2 ∗T − 𝑛−1/2 G𝑛 𝐿(β𝑆∗T X𝑆 , 𝑌 ) −   𝑛 G𝑛 𝐿(β𝑆+ 𝑗 X𝑆∪{ 𝑗 } , 𝑌 ) 𝑛−1/2 G𝑛 𝐿 (β𝑆T X𝑆 , 𝑌 ) − 𝐿 (β𝑆∗T X𝑆 , 𝑌 )   sup k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i ∗T − 𝐸 𝐿(β𝑆∗T X𝑆 , 𝑌 ) ≥   + 𝐸 𝐿(β𝑆+ X 𝑗 𝑆∪{ 𝑗 } , 𝑌 ) p − 20( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝/𝑛 − 20𝐴1 𝐴2 𝑞 2 log 𝑝/𝑛+ 𝜎min 𝜅 min ∗ kβ𝑆+ 𝑗 − (β𝑆∗T , 0) T k 2 , 2 where the second inequality follows from the event Ω and Lemma 3.4.5. By Lemma 3.4.1, if M * 𝑆, there exists 𝑟 ∈ 𝑆 𝑐 ∩ M, such that ∗T kβ𝑆+𝑟 − (β𝑆∗T , 0)k ≥ 𝐶𝜎max −1 −1 −𝛼 𝜅max 𝑛 . 
Thus, there exists some constant 𝐶1 that does not depend on 𝑛 such that max 𝑗 ∈𝑆 𝑐 ℓ𝑆∪{ 𝑗 } ( β̂𝑆+ 𝑗 ) − ℓ𝑆 ( β̂𝑆 ) ≥ max 𝑗 ∈𝑆 𝑐 ℓ𝑆∪{ 𝑗 } (β𝑆+ ∗ ) − ℓ ( β̂ ) ≥ 𝑗 𝑆 𝑆 ∗ ) − ℓ ( β̂ ) ≥ −20( 𝐴 𝐾 2 + 𝑏 p ℓ𝑆∪{𝑟} (β𝑆+𝑟 𝑆 𝑆 1 max ) 𝑞 log 𝑝/𝑛 − 𝐶 2 𝜎min 𝜅 min 𝑛−2𝛼 20𝐴1 𝐴2 𝑞 2 log 𝑝/𝑛 + 2 𝜅2 2𝜎max ≥ 𝐶1 𝑛−2𝛼 , (3.2) max where the first inequality follows from β̂𝑆+ 𝑗 being the maximizer of (3.1) and the second inequality follows from Conditions (1) and (6). Withdrawing the restriction to Ω, we obtain that   −2𝛼 𝑃 min max𝑐 ℓ𝑆∪{ 𝑗 } ( β̂𝑆∪{ 𝑗 } ) − ℓ𝑆 ( β̂𝑆 ) ≥ 𝐶1 𝑛 ≥ 1 − 6 exp(−6𝑞 log 𝑝). |𝑆|<𝑞,M*𝑆 𝑗 ∈𝑆 38  Proof of Theorem 3.2.2. We have shown that our forward stage will not stop when M * 𝑆 and |𝑆| < 𝑞 with probability converging to 1. ∗ is the unique solution to the equation 𝐸 𝑌 −𝜇 β T X    For any 𝑟 ∈ 𝑆 𝑐 ∩M 𝑐 , β𝑆+𝑟 𝑆+𝑟 𝑆∪{𝑟 } X 𝑆∪{𝑟} = 0. By the mean value theorem, 𝐸 𝑌 − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 = 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟        𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 − 𝐸 𝜇 β∗T X − 𝜇 β𝑆+𝑟 ∗T         =𝐸 X𝑆∪{𝑟 } 𝑋𝑟 ∗T − (β𝑆∗T , 0) 𝐸 𝜎 β̃𝑆+𝑟   T  ⊗2  = β𝑆+𝑟 X𝑆∪{𝑟} X𝑆∪{𝑟} e𝑟 , where β̃𝑆+𝑟 is some point between β𝑆+𝑟 and (β𝑆∗T , 0) T and e𝑟 is a vector of length (|𝑆| + 1) with the 𝑟th element being 1. T X Since | β̃𝑆+𝑟 ∗T ∗T 2 𝑆∪{𝑟} | ≤ |β𝑆+𝑟 X𝑆∪{𝑟} | + |(β𝑆 , 0)X𝑆∪{𝑟 } | ≤ 2𝐾 by Conditions (1) and (2), T |𝜎( β̃𝑆+𝑟 X𝑆∪{𝑟} )| ≥ 𝜎min and hn o i 𝑜(𝑛−𝛼 ) = 𝐸 𝑌 − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 ≥ 𝜎min 𝜅 min kβ𝑆+𝑟 ∗T − (β𝑆∗T , 0)k. Therefore, ∗T max kβ𝑆+𝑟 − (β𝑆∗T , 0)k = 𝑜(𝑛−𝛼 ). 𝑆:|𝑆|≤𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 Under Ω that is defined in Theorem 3.2.1, max|𝑆|≤𝑞 k β̂𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 . 39 For any 𝑗 ∈ 𝑆 𝑐 , ∗ ∗ ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 ( β̂𝑆 ) ≤ sup ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 (β𝑆 ) k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i ≤ 𝑛−1/2 G𝑛 𝐿 (β𝑆+ ∗T + 𝑛−1/2 G𝑛 𝐿(β𝑆∗T X𝑆 , 𝑌 )   X 𝑗 𝑆∪{ 𝑗 } , 𝑌 ) 𝑛−1/2 G𝑛 𝐿(β𝑆T X𝑆 , 𝑌 ) − 𝐿(β𝑆∗T X𝑆 , 𝑌 )   + sup k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i ∗T − 𝐸 𝐿 (β𝑆∗T X𝑆 , 𝑌 ) ≤   + 𝐸 𝐿 (β𝑆+ X 𝑗 𝑆∪{ 𝑗 } , 𝑌 ) q 20( 𝐴1 𝐾 + 𝑏 max ) 𝑞𝑛−1 log 𝑝 + 20𝐴1 𝐴2 𝑞 2 𝑛−1 log 𝑝+ 2 ∗ ∗T T 2 𝜎max 𝜅 max kβ𝑆+ 𝑗 − (β𝑆 , 0) k /2, where the second inequality follows from the event Ω and Lemma 3.4.5. Since ∗ max𝑐 kβ𝑆+𝑟 − (β𝑆∗T , 0) T k = 𝑜(𝑛−𝛼 ) and 𝑞𝑛−1+4𝛼 log 𝑝 → 0, 𝑆:|𝑆|<𝑞,𝑟∈𝑆 ∩M 𝑐 we have q ∗ 2 max ℓ𝑆∪{𝑟} (β𝑆+𝑟 ) − ℓ𝑆 ( β̂𝑆 ) ≤ 20( 𝐴1 𝐾 + 𝑏 max ) 𝑞𝑛−1 log 𝑝+ 𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 20𝐴1 𝐴2 𝑞 2 𝑛−1 log 𝑝 + 𝜎max 𝜅max kβ𝑆+ ∗ ∗T T 2 𝑗 − (β𝑆 , 0) k /2 = 𝑜(𝑛 −2𝛼 ), with probability at least 1 − 6 exp(−6𝑞 log 𝑝). Then by Lemma 3.4.6, max𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 ℓ𝑆∪{𝑟} ( β̂𝑆+𝑟 ) − ℓ𝑆 ( β̂𝑆 ) ≤ max𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 |ℓ𝑆∪{𝑟} ( β̂𝑆+𝑟 ) − ℓ𝑆∪{𝑟} (β𝑆+𝑟 ∗ )| + max ∗ 𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 ℓ𝑆∪{𝑟} (β𝑆+𝑟 ) − ℓ𝑆 ( β̂𝑆 ) ≤ 𝐴3 𝑞 2 𝑛−1 log 𝑝 + 𝑜(𝑛−2𝛼 ) = 𝑜(𝑛−2𝛼 ), (3.3) 40 with probability at least 1 − 12 exp(−6𝑞 log 𝑝). By Theorem 3.2.1, if M * 𝑆, the forward stage would select a noise variable with probability less than 18 exp(−6𝑞 log 𝑝). For 𝑘 > |M|, M * 𝑆 𝑘 implies that at least 𝑘 − |M| noise variables are selected within the 𝑘 steps. Then for 𝑘 = 𝐶2 |M| with 𝐶2 > 2, 𝑘   Õ 𝑘  𝑘−|M | ≤ |M|𝑘 |M | 18 exp(−6𝑞 log 𝑝) 𝑗  𝑃 (M * 𝑆 𝑘 ) ≤ 18 exp(−6𝑞 log 𝑝) 𝑗 𝑗=𝑘−|M | ≤ 18 exp(−6𝑞 log 𝑝 + log |M| + |M| log 𝑘) ≤ 18 exp(−4𝑞 log 𝑝). Therefore, M ⊂ 𝑆𝐶2 |M | with probability at least 1 − 18 exp(−4𝑞 log 𝑝).  Proof of Theorem 3.2.3. By Theorem 3.2.2, M will be included in 𝐹𝑘 for some 𝑘 < 𝑞 with probability going to 1. Therefore, the forward stage stops at the 𝑘th step if EBIC(𝐹𝑘+1 ) > EBIC(𝐹𝑘 ). On the other hand, that EBIC(𝐹𝑘+1 ) < EBIC(𝐹𝑘 ) if and only if 2ℓ𝐹𝑘+1 ( β̂𝐹𝑘+1 ) − 2ℓ𝐹𝑘 ( β̂𝐹𝑘 ) ≥ (log 𝑛 + 2𝜂1 log 𝑝)/𝑛. 
Thus, to show the forward stage stops at the 𝑘th step, we only need to show that with probability tending to 1, 2ℓ𝐹𝑘+1 ( β̂𝐹𝑘+1 ) − 2ℓ𝐹𝑘 ( β̂𝐹𝑘 ) < (log 𝑛 + 2𝜂1 log 𝑝)/𝑛, (3.4) for all 𝜂1 > 0. To prove (3.4), we first verify the conditions (A4) and (A5) in Chen and Chen [115]. Given any index 𝑆 such that M ⊆ 𝑆 and |𝑆| ≤ 𝑞, let β∗𝑆 be the subvector of β∗ corresponding to 𝑆. We obtain that T     𝑇   𝐸 (𝑌 − 𝜇(β∗𝑆 X𝑆 ))X𝑆 = 𝐸 𝐸 (𝑌 − 𝜇(β∗M XM ))|X𝑆 X𝑆 = 0. This implies β𝑆∗ = β∗𝑆 .   2 Given any π ∈ R|𝑆| , let H𝑆 := ℎ(π, β𝑆 ) = (𝜎max 𝐾 2 |𝑆|) −1 𝜎 β𝑆T X𝑆 π T X𝑆 , kπk = 1, β𝑆 ∈  B𝑆0 (𝑑) . By Conditions (1) and (2), ℎ(π, β𝑆 ) is bounded between −1 and 1 uniformly over kπk = 1 and β𝑆 ∈ B𝑆0 (𝑑). p By Lemma 2.6.15 in van der Vaart et al. [124], the VC indices of W := {(𝐾 |𝑆|) −1 π T X𝑆 , kπk = 1} and V := {β𝑆T X𝑆 , β𝑆 ∈ B𝑆0 (𝑑)} are bounded by |𝑆| + 2. For the definitions of the VC index 41 and covering numbers, we refer to pages 83 and 85 in van der Vaart et al. [124]. The VC index of the class U := {(𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 , kπk = 1} is the VC index of the class of sets {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡, kπk = 1, 𝑡 ∈ R}. p √ Since {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡} = {(X𝑆 , 𝑡) : 0 < (𝐾 |𝑆|) −1 π T X𝑆 ≤ 𝑡} ∪ {(X𝑆 , 𝑡) : √ p − 𝑡 < (𝐾 |𝑆|) −1 π T X𝑆 ≤ 0}, each set of {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡, kπk = 1, 𝑡 ∈ R} is created by taking finite unions, intersections, and complements of of the basic sets {(X𝑆 , 𝑡) : p (𝐾 |𝑆|) −1 π T X𝑆 < 𝑡}. Therefore, the VC index of {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡, kπk = 1, 𝑡 ∈ p R} is of the same order as the VC index of {(X𝑆 , 𝑡) : (𝐾 |𝑆|) −1 π T X𝑆 < 𝑡}, by Lemma 2.6.17 in [124]. Then by Theorem 2.6.7 in van der Vaart et al. [124], for any probability measure 𝑄, there exists some universal constant 𝐶3 such that 𝑁 (𝜖, U, 𝐿 2 (𝑄)) ≤ (𝐶3 /𝜖) 2(|𝑆|+1) . Likewise, 𝑁 (𝑑𝜖, V, 𝐿 2 (𝑄)) ≤ (𝐶3 /𝜖) 2(|𝑆|+1) . Given a β𝑆,0 ∈ B𝑆0 (𝑑), for any β𝑆 in the ball {β𝑆 : supx |β𝑆T x − β𝑆,0 T x| < 𝑑𝜖 }, we have supx |𝜎(β𝑆T x) − 𝜎(β𝑆,0 T x)| < 𝐾 𝑑𝜖 by Condition (4).Let V 0 := {𝜎max −1 𝜎(β T X ), β ∈ B 0 (𝑑)}. 𝑆 𝑆 𝑆 𝑆 By the definition of covering number, 𝑁 (𝐾 𝑑𝜖, V 0, 𝐿 2 (𝑄)) ≤ (𝐶3 /𝜖) 2(|𝑆|+1) Given a 𝜎(β𝑆,0 T x) and π0T x, for any 𝜎(β𝑆T x) in the ball {𝜎(β𝑆T x) : supx |𝜎(β𝑆T x) − 𝜎(β𝑆,0 T x)| ≤ 𝐾 𝑑𝜖 } and π in the ball {π : supx |(π T x) 2 − (π0T x) 2 | < 𝜖 }, (𝜎max 𝐾 2 |𝑆|) −1 supx |𝜎(β𝑆T x)(π T x) 2 − 𝜎(β𝑆,0 T x)(π0T x) 2 | ≤ −1 𝐾 𝑑 + (𝐾 2 |𝑆|) −1 )𝜖. Thus, 𝑁 ((𝜎 −1 𝐾 𝑑 + (𝐾 2 |𝑆|) −1 )𝜖, H , 𝐿 (𝑄)) ≤ (𝐶 /𝜖) 4(|𝑆|+1) , and (𝜎max max 𝑆 2 3 consequently 𝑁 (𝜖, H𝑆 , 𝐿 2 (𝑄)) ≤ (𝐶4 /𝜖) 4(|𝑆|+1) for some constant 𝐶4 . By Theorem 1.1 in Talagrand [125] and |𝑆| ≤ 𝑞, we can find some constant 𝐶5 such that ! p 𝑃 sup |G𝑛 [ℎ(π, β𝑆 )]| ≥ 𝐶5 𝑞 log 𝑝 k π k=1,β𝑆 ∈B𝑆0 (𝑑) ! 4(|𝑆|+1) 𝐶40 𝐶40 𝐶52 𝑞 log 𝑝 ≤ p exp(−2𝐶52 𝑞 log 𝑝) 𝐶5 𝑞 log 𝑝 4(|𝑆| + 1)   ≤ exp 4(|𝑆| + 1) log(𝐶40 𝐶52 𝑞 log 𝑝) − 2𝐶52 𝑞 log 𝑝 ≤ exp (−5𝑞 log 𝑝) , 42 where 𝐶40 is some constant that depends on 𝐶4 only. Thus,  n   2o h   2i 𝑃 sup|𝑆|≤𝑞,k π k=1,β𝑆 ∈B 0 (𝑑) E𝑛 𝜎 XT𝑆 β𝑆 π T X𝑆 − 𝐸 𝜎 XT𝑆 β𝑆 π T X𝑆 𝑆 p Í ≥ 𝐶5 𝐾 2 𝑞 3 log 𝑝/𝑛 ≤𝑠=|M | 𝑒𝑝 𝑠 𝑞 𝑠 exp (−5𝑞 log 𝑝) ≤ exp(−3𝑞 log 𝑝). (3.5)  h   i By Condition (5), 𝜎min 𝜅min ≤ Λ 𝐸 𝜎 XT𝑆 β𝑆 X𝑆⊗2 ≤ 𝜎max 𝜅max , for all β𝑆 ∈ B𝑆0 (𝑑) and 𝑆 : M ⊆ 𝑆, |𝑆| < 𝑞. Then, by (3.5),  n   o T ⊗2 𝜎min 𝜅min /2 ≤ Λ E𝑛 𝜎 X𝑆 β∗𝑆 X𝑆 ≤ 2𝜎max 𝜅max uniformly over all 𝑆 satisfying M ⊆ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − exp(−3𝑞 log 𝑝). Hence, the condition (A4) in [115] is satisfied with probability at least 1 − exp(−3𝑞 log 𝑝). 
Also for any β𝑆 ∈ B𝑆0 (𝑑), n   2 o n   2 o T T T T E𝑛 𝜎 X 𝑆 β 𝑆 π X 𝑆 − E𝑛 𝜎 X𝑆 β∗𝑆 π X𝑆 n   2 o n   2 o ≤ 𝑛−1/2 G𝑛 𝜎 XT𝑆 β𝑆 π T X𝑆 + 𝑛−1/2 G𝑛 𝜎 XT𝑆 β∗𝑆 π T X𝑆 h   2 i h   2 i + 𝐸 𝜎 XT𝑆 β𝑆 π T X𝑆 − 𝐸 𝜎 XT𝑆 β∗𝑆 π T X𝑆 q p 2 ≤ 2𝐶5 𝐾 𝑞 3 log 𝑝/𝑛 + 𝜇max kβ𝑆 − β∗𝑆 k |𝑆|𝐾𝜆max . Hence, the condition (A5) in Chen and Chen [115] is satisfied uniformly over all 𝑆 such that M ⊆ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − exp(−3𝑞 log 𝑝). Then (3.4) can be shown by following the proof of Equation (3.2) in Chen and Chen [115]. Thus, our forward stage stops at the 𝑘th step with probability at least 1 − exp(−3𝑞 log 𝑝).  Proof of Theorem 3.2.4. Suppose that a covariate 𝑋𝑟 is removed from 𝑆. For any 𝑟 ∈ M, since M * 𝑆\{𝑟} and 𝑟 is the only element that is in (𝑆\{𝑟}) 𝑐 ∩ M, by Lemma 3.4.1 and (3.2) ℓ𝑆 ( β̂𝑆 ) − ℓ𝑆\{𝑟} ( β̂𝑆\{𝑟} ) ≥ ℓ𝑆 (β𝑆∗ ) − ℓ𝑆\{𝑟} ( β̂𝑆\{𝑟} ) ∗ = ℓ𝑆\{𝑟}∪{𝑟} (β𝑆\{𝑟}+𝑟 ) − ℓ𝑆\{𝑟} ( β̂𝑆\{𝑟} ) ≥ 𝐶1 𝑛−2𝛼 , 43 with probability at least 1 − 6 exp(−6𝑞 log 𝑝). From the proof of Theorem 3.2.1, we have for any 𝜂2 > 0, BIC(𝑆) − BIC(𝑆\{𝑟}) ≤ −2𝐶1 𝑛−2𝛼 + 𝜂2 𝑛−1 log 𝑛 < 0, uniformly over 𝑟 ∈ M and 𝑆 satisfying M ⊂ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − 6 exp(−6𝑞 log 𝑝).  Proof of Theorem 3.2.5. By Theorems 3.2.1, 3.2.2, and 3.2.3, we have that the event Ω1 := {| M̂ | ≤ 𝑞 and M ⊆ M̂} holds with probability at least 1 − 25 exp(−2𝑞 log 𝑝). Thus, in the rest of the proof, we restrict our attention on Ω1 . As shown in the proof of Theorem 3.2.3, we obtain that β ∗ = β∗M̂ . Then by Lemma 3.4.6, we M̂ p ∗ have k β̂M̂ −β k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛 with probability at least 1−6 exp(−6𝑞 log 𝑝). Withdrawing M̂ the attention on Ω1 , we obtain that q ∗ k β̂ − β∗ k = k β̂M̂ − β∗M̂ k = k β̂M̂ − βM̂ k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛, with probability at least 1 − 31 exp(−2𝑞 log 𝑝).  3.4 Additional Lemmas Lemma 3.4.1. Given a model 𝑆 such that |𝑆| < 𝑞, M * 𝑆, under Condition (6), (i): ∃𝑟 ∈ 𝑆 𝑐 ∩ M, such that β𝑆+𝑟 ∗ ≠ (β ∗T , 0) T . 𝑆 (ii): Suppose Conditions (1), (2), and (6’) hold. ∃𝑟 ∈ 𝑆 𝑐 ∩ M, such that kβ𝑆+𝑟 ∗T − (β ∗T , 0)k ≥ 𝑆 𝐶𝜎max−1 𝜅 −1 𝑛−𝛼 . max ∗     ∗ Proof. As β𝑆+ 𝑗 is the maximizer of 𝐸 ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) , by the concavity of 𝐸 ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) , β𝑆+ 𝑗 h  i is the solution to the equation 𝐸 𝑌 − 𝜇 β𝑆∗𝑇 X𝑆 + 𝛽 𝑗 𝑋 𝑗 X𝑆∪{ 𝑗 } = 0, (𝑖): Suppose that β𝑆+ ∗ ∗T T 𝑗 = (β𝑆 , 0) , ∀ 𝑗 ∈ 𝑆 ∩ M. Then, 𝑐 0 = 𝐸 𝑌 − 𝜇 β𝑆∗T X𝑆 𝑋 𝑗 = 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗𝑇 X𝑆 𝑋 𝑗        T  ∗𝑇    ⇒ max 𝐸 𝜇 β ∗ X − 𝜇 β 𝑆 X 𝑆 𝑋 𝑗 = 0, 𝑗 ∈𝑆 ∩M 𝑐 which violates the Condition (6). Therefore, we can find a 𝑟 ∈ 𝑆 𝑐 ∩ M, such that β𝑆+𝑟 ∗ ≠ (β ∗T , 0) T . 𝑆 44 (𝑖𝑖): Let 𝑟 ∈ 𝑆 𝑐 ∩M satisfy that 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 > 𝐶𝑛−𝛼 . Without loss of generality,     we assume that 𝑋𝑟 is the last element of X𝑆∪{𝑟} . By the mean value theorem, 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟     = 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 − 𝐸 𝜇 β∗T X − 𝜇 β𝑆+𝑟 ∗T X         𝑆∪{𝑟} 𝑋𝑟 ∗T X ∗T     = 𝐸 𝜇 β𝑆+𝑟 𝑆∪{𝑟} − 𝜇 (β𝑆 , 0)X𝑆∪{𝑟} 𝑋𝑟 ∗T − (β ∗T , 0) 𝐸 𝜎 β̃ T X    ⊗2  = β𝑆+𝑟 𝑆 𝑆+𝑟 𝑆∪{𝑟} X𝑆∪{𝑟} e𝑟 , (3.1) where β̃𝑆+𝑟 is some point between β𝑆+𝑟 ∗ and (β ∗T , 0) T and e is a vector of length (|𝑆| + 1) with the 𝑆 𝑟 𝑟th element being 1. As β̃𝑆+𝑟 is some point between β𝑆+𝑟 ∗ and (β ∗T , 0) T , 𝑆 ∗T T | β̃𝑆+𝑟 X𝑆∪{𝑟} | ≤ |β𝑆+𝑟 X𝑆∪{𝑟} | + |(β𝑆∗T , 0)X𝑆∪{𝑟} | ≤ 2𝐾 2 , by Conditions (1) and (2). Thus, |𝜎( β̃𝑆+𝑟 T X 𝑆∪{𝑟} )| ≤ 𝜎max . By (3.1) and Condition (5), h  i −𝛼 β∗T X β𝑆∗T X𝑆   𝐶𝑛 ≤ 𝐸 𝜇 −𝜇 𝑋𝑟    ∗T ≤ kβ𝑆+𝑟 − (β𝑆∗T , 0)k𝜎max𝜆 max 𝐸 X𝑆∪{𝑟} ⊗2 ke𝑟 k ∗T ≤ 𝜎max 𝜅max kβ𝑆+𝑟 − (β𝑆∗T , 0)k. ∗T − (β ∗T , 0)k ≥ 𝐶𝜎 −1 𝜅 −1 𝑛−𝛼 . Therefore, kβ𝑆+𝑟  𝑆 max max Lemma 3.4.2. Let 𝜉𝑖 , 𝑖 = 1, . . . 
, 𝑛 be 𝑛 i.i.d random variables such that |𝜉𝑖 | ≤ 𝐵 for a constant √ 𝐵 > 0. Under Conditions (1) – (3), we have 𝐸 [|𝑌𝑖 𝜉𝑖 − 𝐸 [𝑌𝑖 𝜉𝑖 ] | 𝑚 ] ≤ 𝑚!(2𝐵( 2𝑀 + 𝜇max )) 𝑚 , for every 𝑚 ≥ 1. Proof. By Conditions (1) and (2), |β∗T X𝑖 | ≤ 𝐾 𝐿, ∀𝑖 ≥ 1 and consequently 𝜇(β∗T X𝑖 ) ≤ 𝜇max . Then 45 by Condition (3), 𝑚   Õ 𝑚 𝜇(β∗T X𝑖 )| 𝑚 ]   𝑚−𝑡 𝑚 𝐸 [|𝑌𝑖 | ] = 𝐸 [|𝜖𝑖 + ≤ 𝐸 |𝜖𝑖 | 𝑡 𝜇max 𝑡=0 𝑡 𝑚   Õ 𝑚 𝑡 𝑚−𝑡 ≤ 𝑡! 𝑀 𝜇max ≤ 𝑚!(𝑀 + 𝜇max ) 𝑚 , 𝑡=0 𝑡 for every 𝑚 ≥ 1. By the same arguments, it can be shown that, for every 𝑚 ≥ 1, 𝐸 [|𝑌𝑖 𝜉𝑖 − 𝐸 [𝑌𝑖 𝜉𝑖 ] | 𝑚 ] ≤ 𝐸 [(|𝑌𝑖 𝜉𝑖 | + |𝐸 [𝑌𝑖 𝜉𝑖 ] |) 𝑚 ] ≤ 𝑚!(2𝐵(𝑀 + 𝜇max )) 𝑚 .  p Lemma 3.4.3. Under Conditions (1) – (3), when 𝑛 is sufficiently large such that 28 𝑞 log 𝑝/𝑛 < 1, we have supβ ∈B E𝑛 𝐿 (β T X, 𝑌 ) ≤ 2(𝑀 + 𝜇max )𝐾 3 + 𝑏 max , with probability 1−2 exp(−10𝑞 log 𝑝).  Proof. By Conditions (2), supβ ∈B β T X ≤ 𝐾 3 . Thus, sup E𝑛 𝐿 (β T X, 𝑌 ) ≤ sup E𝑛 𝑌 β T X   + 𝑏 max β ∈B β ∈B ≤ E𝑛 {|𝑌 | − 𝐸 [|𝑌 |]} + 𝐸 [|𝑌 |] 𝐾 3 + 𝑏 max  ≤ E𝑛 {|𝑌 | − 𝐸 [|𝑌 |]} 𝐾 3 + (𝑀 + 𝜇max )𝐾 3 + 𝑏 max ,  where the last inequality follows from that 𝐸 [|𝑌 |] ≤ 𝑀 + 𝜇max as shown in the proof of Lemma 3.4.2.  𝑚 Let 𝜉𝑖 = 1{𝑌𝑖 > 0}−1{𝑌𝑖 < 0}. Thus |𝜉𝑖 | ≤ 1. By Lemma 3.4.2, we have 𝐸 |𝑌𝑖 | − 𝐸 [|𝑌𝑖 |] ≤ 𝑚!(2(𝑀 + 𝜇max )) 𝑚 . Applying Bernstein’s inequality (e.g., Lemma 2.2.11 in van der Vaart et al. [124]) yields that  p  𝑃 |E𝑛 {|𝑌 | − 𝐸 [|𝑌 |]}| > 10(𝑀 + 𝜇max ) 𝑞 log 𝑝/𝑛   196𝑞 log 𝑝 ≤ 2 exp − 21 √ ≤ 2 exp(−10𝑞 log 𝑝), (3.2) 4+20 𝑞 log 𝑝/𝑛 46 p p when 𝑛 is sufficiently large such that 20 𝑞 log 𝑝/𝑛 < 1. Since 10(𝑀 + 𝜇max ) 𝑞 log 𝑝/𝑛 = 𝑜(1), then ! 𝑃 sup E𝑛 𝐿 (β T X, 𝑌 ) ≥ 2(𝑀 + 𝜇max )𝐾 3 + 𝑏 max ≤ 2 exp(−10𝑞 log 𝑝).  β ∈B  p Lemma 3.4.4. Given an index set 𝑆 and 𝑟 ∈ 𝑆 𝑐 , let B𝑆0 (𝑑) = {β𝑆 : kβ𝑆 − β𝑆∗ k ≤ 𝑑/(𝐾 |𝑆|)} and 𝐴1 := (𝑀 + 2𝜇max ). Under Conditions (1) – (3), when 𝑛 is sufficiently large such that p 10 𝑞 log 𝑝/𝑛 < 1, we have h    i p 1. |G𝑛 𝐿 β𝑆T X𝑆 , 𝑌 − 𝐿 β𝑆∗T X𝑆 , 𝑌 | ≤ 20𝐴1 𝑑 𝑞 log 𝑝, uniformly over β𝑆 ∈ B𝑆0 (𝑑) and |𝑆| ≤ 𝑞, with probability at least 1 − 4 exp(−6𝑞 log 𝑝). p 2. |G𝑛 𝐿 (β𝑆∗T X𝑆 , 𝑌 ) | ≤ 10( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝, uniformly over |𝑆| ≤ 𝑞, with probability   at least 1 − 2 exp(−8𝑞 log 𝑝). p Proof. : (1): Let R |𝑆| (𝑑) be a |𝑆|-dimensional ball with center at 0 and radius 𝑑/(𝐾 |𝑆|). Then B𝑆0 (𝑑) = R |𝑆| (𝑑)+β𝑆∗ . Let C|𝑆| := {C(ξ 𝑘 )} be a collection of cubes that cover the ball 𝑅 |𝑆| (𝑑), where p C(ξ 𝑘 ) is a cube containing ξ 𝑘 with sides of length 𝑑/(𝐾 |𝑆|𝑛2 ) and ξ 𝑘 is some point in R |𝑆| (𝑑). As p  |𝑆| p the volume of C(ξ 𝑘 ) is 𝑑/(𝐾 |𝑆|𝑛2 ) and the volume of R |𝑆| (𝑑) is less than (2𝑑/(𝐾 |𝑆|)) |𝑆| , we can select ξ 𝑘 ’s so that no more than (4𝑛2 ) |𝑆| cubes are needed to cover R |𝑆| (𝑑). We thus assume |C|𝑆| | ≤ (4𝑛2 ) |𝑆| . For any ξ ∈ C(ξ 𝑘 ), kξ − ξ 𝑘 k ≤ 𝑑/(𝐾𝑛2 ). In addition, let 𝑇1𝑆 (ξ) := E𝑛 𝑌 ξ T X𝑆 ,     T  𝑇2𝑆 (ξ) := E𝑛 𝑏 β𝑆∗ + ξ X𝑆 − 𝑏 β𝑆∗T X𝑆 , and 𝑇𝑆 (ξ) := 𝑇1𝑆 (ξ) − 𝑇2𝑆 (ξ).  Given any ξ ∈ R |𝑆| (𝑑), there exists C(ξ 𝑘 ) ∈ C|𝑆| such that ξ ∈ C(ξ 𝑘 ). Then |𝑇𝑆 (ξ) − 𝐸 [𝑇𝑆 (ξ)]| ≤ |𝑇𝑆 (ξ) − 𝑇𝑆 (ξ 𝑘 )| |𝑇𝑆 (ξ 𝑘 ) − 𝐸 [𝑇𝑆 (ξ 𝑘 )]| + |𝐸 [𝑇𝑆 (ξ)] − 𝐸 [𝑇𝑆 (ξ 𝑘 )]| =: 𝐼 + 𝐼 𝐼 + 𝐼 𝐼 𝐼. We deal with 𝐼 𝐼 𝐼 first. By the mean value theorem, there exists a ξ̃ between ξ and ξ 𝑘 such that   T  |𝐸 [𝑇𝑆 (ξ 𝑘 )] − 𝐸 [𝑇𝑆 (ξ)]| = 𝐸 𝑌 (ξ 𝑘 − ξ) T X𝑆 + 𝐸 𝜇 β𝑆∗ + ξ̃ X𝑆 (ξ 𝑘 − ξ) T X𝑆    p p ≤ 𝐸 [|𝑌 |] kξ 𝑘 − ξkkX𝑆 k + 𝜇max kξ 𝑘 − ξkkX𝑆 k ≤ (𝑀 + 2𝜇max )𝑑 |𝑆|𝑛−2 = 𝐴1 𝑑 |𝑆|𝑛−2(3.3) , 47 where the last inequality follows from Lemma 3.4.2 and 𝐴1 = 𝑀 + 2𝜇max . p p Next, we evaluate 𝐼 𝐼. By Condition (2), |X𝑖𝑆 T ξ| ≤ kX kkξk ≤ 𝑑/(𝐾 |𝑆|) |𝑆|𝐾 = 𝑑, for all 𝑖𝑆 ξ ∈ R |𝑆| (𝑑). 
Then by Lemma 3.4.2, h  T  𝑚i 𝐸 𝑌 ξ 𝑘T X𝑆 − 𝐸 𝑌 ξ 𝑘 X𝑆 ≤ 𝑚!(2(𝑀 + 𝜇max )𝑑) 𝑚 . p By Bernstein’s inequality, when 𝑛 is sufficiently large such that 10 𝑞 log 𝑝/𝑛 ≤ 1.  q  𝑃 |𝑇1𝑆 (ξ 𝑘 ) − 𝐸 [𝑇1𝑆 (ξ 𝑘 )]| > 10(𝑀 + 𝜇max )𝑑 𝑞𝑛 log 𝑝 −1 ! 1 100𝑞 log 𝑝 ≤ 2 exp − p ≤ 2 exp(−10𝑞 log 𝑝). (3.4) 2 4 + 20 𝑞 log 𝑝/𝑛  T Since |𝑏( β𝑆∗ + ξ 𝑘 X𝑆 ) − 𝑏(β𝑆∗T X𝑆 )| ≤ 𝜇max 𝑑, by the same arguments used for (3.4), we have  q  𝑃 |𝑇2𝑆 (ξ 𝑘 ) − 𝐸 [𝑇2𝑆 (ξ 𝑘 )]| > 10𝜇max 𝑑 𝑞𝑛−1 log 𝑝 ≤ 2 exp(−10𝑞 log 𝑝). (3.5) Combining (3.4) and (3.5) yields that uniformly over ξ 𝑘 q |𝑇𝑆 (ξ 𝑘 ) − 𝐸 [𝑇𝑆 (ξ 𝑘 )]| ≤ 10𝐴1 𝑑 𝑞𝑛−1 log 𝑝, (3.6) with probability at least 1 − 2(4𝑛2 ) |𝑆| exp(−10𝑞 log 𝑝). We now assess 𝐼. Following the same arguments as in Lemma 3.4.3,  p  𝑃 sup |𝑇𝑆 (ξ) − 𝑇𝑆 (ξ 𝑘 )| > (2𝑀 + 3𝜇max )𝑑 |𝑆|𝑛−2 ≤ 2 exp(−8𝑞 log 𝑝). (3.7) ξ ∈C( ξ 𝑘 ) p p Since |𝑆|𝑛−2 = 𝑜( 𝑞𝑛−1 log 𝑝), combining (3.3), (3.6), and (3.7) together yields that  q  𝑃 sup |𝑇𝑆 (ξ) − 𝐸 [𝑇𝑆 (ξ)]| ≥ 20𝐴1 𝑑 𝑞𝑛−1 log 𝑝 ξ ∈R |𝑆 | (𝑑) ≤ 2(4𝑛2 ) |𝑆| exp(−10𝑞 log 𝑝) + 2 exp(−8𝑞 log 𝑝) ≤ 4 exp(−8𝑞 log 𝑝). 𝑝 By the combinatoric inequality 𝑠 ≤ (𝑒 𝑝/𝑠) 𝑠 , we obtain that  h    i p  𝑃 sup G𝑛 𝐿 β𝑆T X𝑆 , 𝑌 − 𝐿 β𝑆∗T X𝑆 , 𝑌 ≥ 20𝐴1 𝑑 𝑞 log 𝑝 |𝑆|≤𝑞,β𝑆 ∈B𝑆0 (𝑑1 ) Õ 𝑞 ≤ (𝑒 𝑝/𝑠) 𝑠 4 exp(−8𝑞 log 𝑝) ≤ 4 exp(−6𝑞 log 𝑝). 𝑠=1 48 (2): We evaluate the 𝑚th moment of 𝐿(β𝑆∗ X𝑆 , 𝑌 ). " 𝑚   # Õ 𝑚 𝐸 𝑌 β𝑆∗ X𝑆 − 𝑏(β𝑆∗ X𝑆 ) |𝑌 | 𝑡 𝐾 2𝑡 𝑏 max   𝑚  𝑚−𝑡 ≤𝐸 𝑡=0 𝑡 𝑚   Õ 𝑚 𝑡! (𝑀 + 𝜇max )𝐾 2 𝑏 max ≤ 𝑚!((𝑀 + 𝜇max )𝐾 2 + 𝑏 max ) 𝑚 .  𝑡 𝑚−𝑡 ≤ 𝑡=0 𝑡 Then, by Bernstein’s inequality,  p  𝑃 |G𝑛 𝐿(β𝑆∗T X𝑆 , 𝑌 ) | > 10( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝 ≤ 2 exp(−10𝑞 log 𝑝).   By the same arguments used in (i), we obtain that  h  i p  𝑃 sup G𝑛 𝐿 β𝑆∗T X𝑆 , 𝑌 ≥ 10( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝 |𝑆|≤𝑞 Õ 𝑞 ≤ (𝑒 𝑝/𝑠) 𝑠 2 exp(−10𝑞 log 𝑝) ≤ 2 exp(−8𝑞 log 𝑝). 𝑠=1  Lemma 3.4.5. Given a model 𝑆 and 𝑟 ∈ 𝑆 𝑐 , under Conditions (1), (2), and (5), for any p kβ𝑆 − β𝑆∗ k ≤ 𝐾/ |𝑆|, 𝜎min 𝜅min kβ𝑆 − β𝑆∗ k 2 /2 ≤ 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )] ≤ 𝜎max 𝜅max kβ𝑆 − β𝑆∗ k 2 /2.   Proof. Due to the concavity of the log-likelihood in GLMs with the canonical link, p 𝐸 𝑌 X𝑆 − 𝜇(β𝑆∗T X𝑆 )X𝑆 = 0. Then for any kβ𝑆 − β𝑆∗ k ≤ 𝐾/ |𝑆|,   p p |β T X𝑆 | ≤ |β ∗T X𝑆 | + |(β − β ∗ ) T X𝑆 | ≤ 𝐾 2 + 𝐾/ |𝑆| × 𝐾 |𝑆| = 2𝐾 𝐿. Thus, by Taylor’s expansion, 1 h   i 𝐸 [ℓ𝑆 (β𝑆 )] − 𝐸 ℓ𝑆 (β𝑆∗ ) = − (β𝑆 − β𝑆∗ ) T 𝐸 𝜎 β̃𝑆T X𝑆 X𝑆⊗2 (β𝑆 − β𝑆∗ ),   2 where β̃𝑆 is between β𝑆 and β𝑆∗ . By Condition (5), 𝜎min 𝜅min kβ𝑆 − β𝑆∗ k 2 /2 ≤ 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )] ≤ 𝜎max 𝜅max kβ𝑆 − β𝑆∗ k 2 /2.    49 Lemma 3.4.6. Under Conditions (1) – (6), there exist some constants 𝐴2 and 𝐴3 that do not p depend on 𝑛, such that k β̂𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛 and |ℓ𝑆 ( β̂𝑆 ) − ℓ𝑆 (β𝑆∗ )| ≤ 𝐴3 𝑞 2 log 𝑝/𝑛 hold uniformly over 𝑆 : |𝑆| ≤ 𝑞, with probability at least 1 − 6 exp(−6𝑞 log 𝑝). Proof. Define n h    i p o Ω(𝑑) := sup G𝑛 𝐿 β𝑆T X𝑆 , 𝑌 − 𝐿 β𝑆∗T X𝑆 , 𝑌 < 20𝐴1 𝑑 𝑞 log 𝑝 . |𝑆|≤𝑞,β𝑆 ∈B𝑆0 (𝑑) By Lemma 3.4.4, the event Ω(𝑑) holds with probability at least 1 − 4 exp(−6𝑞 log 𝑝). Thus, p in the proof of Lemma 3.4.6, we shall assume Ω(𝑑) hold with 𝑑 = 𝐴2 𝑞 3 log 𝑝/𝑛 for some 𝐴2 > 20(𝜎min 𝜅 min ) −1 𝐾 2 𝐴1 . p p p p For any kβ𝑆 − β𝑆∗ k = 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛, since 𝑞 2 log 𝑝/𝑛 ≤ 𝑞 3 log 𝑝/𝑛/ |𝑆|, β𝑆 ∈ B𝑆0 (𝑑). By Lemma 3.4.5, ℓ𝑆 (β𝑆∗ ) − ℓ𝑆 (β𝑆 )   = ℓ𝑆 (β𝑆∗ ) − 𝐸 ℓ𝑆 (β𝑆∗ ) − (ℓ𝑆 (β𝑆 ) − 𝐸 [ℓ𝑆 (β𝑆 )]) + 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )]      p ≥ 𝜎min 𝜅min kβ𝑆 − β𝑆∗ k 2 /2 − 20𝐴1 𝑑 𝑞 log 𝑝/𝑛 = 𝜎min 𝜅min 𝐴22 𝑞 2 log 𝑝/(𝐾 2 𝑛) − 20𝐴1 𝐴2 𝑞 2 log 𝑝/𝑛 > 0. Thus, inf √ ℓ𝑆 (β𝑆∗ ) − ℓ𝑆 (β𝑆 ) > 0. |𝑆|≤𝑞,k β𝑆 −β𝑆∗ k=𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛 p Then by the concavity of ℓ𝑆 (·), we obtain that max|𝑆|≤𝑞 β̂𝑆 − β𝑆∗ ≤ 𝐴2 𝐾 −1 𝑞 2 𝑛−1 log 𝑝. 
p On the other hand, for any kβ𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛, ℓ𝑆 (β𝑆∗ ) − ℓ𝑆 (β𝑆 ) ≤ ℓ𝑆 (β𝑆∗ ) − 𝐸 ℓ𝑆 (β𝑆∗ ) − (ℓ𝑆 (β𝑆 ) − 𝐸 [ℓ𝑆 (β𝑆 )]) + 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )]     p ≤ 𝜎max 𝜅max kβ𝑆 − β𝑆∗ k 2 /2 + 20𝐴1 𝑑 𝑞 log 𝑝/𝑛 ≤ 𝐴3 𝑞 2 𝑛−1 log 𝑝, 50 where 𝐴3 := 4𝜎max𝜆max 𝐴22 𝐾 −2 + 20𝐴1 𝐴2 .  3.5 Simulations We compared the proposal with the other competing methods, including the penalized methods such as least absolute shrinkage and selection operator (LASSO), the differential geometric least angle regression (dgLARS) [90, 91], the forward regression (FR) approach [93], the sequentially conditioning (SC) approach [99], and the screening method such as sure independence screening (SIS) [87], which is popular in practice. As SIS does not directly generate a predictive model, we applied LASSO for the top [𝑛/log(𝑛)] variables chosen by SIS and denoted the procedure by SIS+LASSO. As the FR, SC and STEPWISE approaches involve forward searching and to make them comparable, we applied the same stopping rule, for example, Equation (3.3) with the same 𝛾, to their forward steps. In particular, the STEPWISE approach, with 𝜂1 = 𝛾 and 𝜂2 = 0, is equivalent to FR and asymptotically equivalent to SC. By varying 𝛾 in FR and SC between 𝛾 𝐿 and 𝛾𝐻 , we explored the impact of 𝛾 on inducing false positives and negatives. In our numerical studies, we fixed 𝛾𝐻 = 10 and set 𝛾 𝐿 = 𝜂1 . To choose 𝜂1 and 𝜂2 in (3.3) and (3.4) in STEPWISE, we performed 5-fold cross-validation to minimize the mean squared prediction error (MSPE), and reported the results in Table 3.1. Since the proposed STEPWISE algorithm uses the (E)BIC criterion, for a fair comparison we chose the tuning parameter in dgLARS by using the BIC criterion as well, and coined the corresponding approach as dgLARS(BIC). The regularization parameter in LASSO was chosen via the following two approaches: 1) giving the smallest BIC for the models on the LASSO path, denoted by LASSO(BIC); 2) using the one- standard-error rule, denoted by LASSO(1SE), which chooses the most parsimonious model whose error is no more than one standard error above the error of the best model in cross-validation [126]. 51 Table 3.1: The values of 𝜂1 and 𝜂2 used in the simulation studies Normal Model Binomial Model Poisson Model Example 1 (0.5, 3) (0.5, 3) (1, 3) Example 2 (0.5, 3) (1, 3) (1, 3) Example 3 (1, 3) (0.5, 3) (0.5, 1) Example 4 (1, 3.5) (0, 1) (1, 3) Example 5 (0.5, 3) (0.5, 2) (0.5, 3) Example 6 (0.5, 3) (0.5, 3) (1, 3) Example 7 (0.5, 3) (0.5, 3) (0.5, 4.5) Note: Values for 𝜂1 and 𝜂2 were searched on the grid {0, 0.25, 0.5, 1} and {1, 2, 3, 3.5, 4, 4.5, 5}, respectively. Denote by X𝑖 = (𝑋𝑖1 , . . . , 𝑋𝑖 𝑝 ) T and β = (𝛽1 , . . . , 𝛽 𝑝 ) T , the covariate vector for subject 𝑖 (1, . . . , 𝑛) and the true coefficient vector. The following five examples generated X𝑖T β, the inner product of the coefficient and covariate vectors for each individual, which were used to generate outcomes from the Normal, Binomial, and Poisson models. Example 1: For each 𝑖, 𝑝0 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝛽 𝑗 𝑋𝑖 𝑗 ® , 𝑖 = 1, . . . , 𝑛, ª « 𝑗=1 𝑗=𝑝 0 +1 ¬ √ where 𝛽 𝑗 = (−1) 𝑗 (4log 𝑛/ 𝑛 + |𝑍 𝑗 |), for 𝑗 = 1, . . . , 𝑝 0 and 𝛽 𝑗 =0, otherwise; 𝐵 𝑗 was a binary 𝐵 random variable with 𝑃(𝐵 𝑗 = 1) = 0.4 and 𝑍 𝑗 was generated by a standard normal distribution; 𝑝 0 = 8; 𝑋𝑖 𝑗 ’s were independently generated from a standardized exponential distribution, that is, exp(1) − 1. Here and also in the other examples, 𝑐 (specified later) controls the signal strengths. 
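As a concrete reading of this design, Example 1 can be generated with a few lines of numpy. The snippet below is our paraphrase of the description above (the random seed and variable names are ours, the coefficient magnitude is read as 4 log 𝑛/√𝑛 + |𝑍 𝑗 |, as in Example 4, and 𝑐 is the signal multiplier specified later).

```python
import numpy as np

rng = np.random.default_rng(2022)
n, p, p0, c = 400, 1000, 8, 1.0

# beta_j = (-1)^{B_j} (4 log n / sqrt(n) + |Z_j|) for j <= p0, and 0 otherwise
B = rng.binomial(1, 0.4, size=p0)              # P(B_j = 1) = 0.4
Z = rng.standard_normal(p0)
beta = np.zeros(p)
beta[:p0] = (-1.0) ** B * (4.0 * np.log(n) / np.sqrt(n) + np.abs(Z))

# Covariates: independent standardized exponentials, i.e. exp(1) - 1
X = rng.exponential(scale=1.0, size=(n, p)) - 1.0
lin_pred = c * (X @ beta)                      # c * X_i^T beta for each subject
```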
Example 2: This scenario is the same as Example 1 except that 𝑋𝑖 𝑗 was independently generated from a standard normal distribution. Example 3: For each 𝑖, 𝑝0 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝛽 𝑗 𝑋𝑖∗𝑗 ® , 𝑖 = 1, . . . , 𝑛, ª « 𝑗=1 𝑗=𝑝 0 +1 ¬ where 𝛽 𝑗 = 2j for 1 ≤ 𝑗 ≤ 𝑝 0 and 𝑝 0 = 5. We simulated every component of Z𝑖 = (𝑍𝑖 𝑗 ) ∈ 𝑅 𝑝 and W𝑖 = (𝑊𝑖 𝑗 ) ∈ 𝑅 𝑝 independently from a standard normal distribution. Next, we generated X𝑖 according √ 𝑝0 to 𝑋𝑖 𝑗 = (𝑍𝑖 𝑗 + 𝑊𝑖 𝑗 )/ 2 for 1 ≤ 𝑗 ≤ 𝑝 0 and 𝑋𝑖∗𝑗 = (𝑍𝑖 𝑗 + Í 𝑍𝑖 𝑗 0 )/2 for 𝑝 0 < 𝑗 ≤ 𝑝. 𝑗 0 =1 52 Example 4: For each 𝑖, 500 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝑖 = 1, . . . , 𝑛, ª 𝛽 𝑗 𝑋𝑖 𝑗 ® , « 𝑗=1 𝑗=501 ¬ where the first 500 𝑋𝑖 𝑗 ’s were generated from the multivariate normal distribution with mean 0 0 and a covariance matrix with all of the diagonal entries being 1 and cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0.5| 𝑗− 𝑗 | for 1 ≤ 𝑗, 𝑗 0 ≤ 𝑝. The remaining 𝑝 − 500 𝑋𝑖 𝑗 ’s were generated through the autoregressive processes with 𝑋𝑖,501 ∼ Unif (-2, 2), 𝑋𝑖 𝑗 = 0.5 𝑋𝑖, 𝑗−1 + 0.5 𝑋𝑖∗𝑗 , for 𝑗 = 502, . . . , 𝑝, where 𝑋𝑖∗𝑗 ∼ Unif (-2, 2) were generated independently. The coefficients 𝛽 𝑗 for 𝑗 = 1, . . . , 7, 501, . . . , 507 were generated √ from (−1) 𝐵 𝑗 (4log 𝑛/ 𝑛 + |𝑍 𝑗 |), where 𝐵 𝑗 was a binary random variable with 𝑃(𝐵 𝑗 = 1) = 0.4 and 𝑍 𝑗 was from a standard normal distribution. The remaining 𝛽 𝑗 were zeros. Example 5: For each 𝑖, 𝑐X𝑖T β = 𝑐 × −0.5𝑋𝑖1 + 𝑋𝑖2 + 0.5𝑋𝑖,100 ,  𝑖 = 1, . . . , 𝑛, where X𝑖 were generated from a multivariate normal distribution with mean 0 and a covariance 0 matrix with all of the diagonal entries being 1 and cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0.9| 𝑗− 𝑗 | for 1 ≤ j, j’ ≤ p. All of the coefficients were zero except for 𝑋𝑖1 , 𝑋𝑖2 and 𝑋𝑖,100 . Example 6: For each 𝑖, 𝑞 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝑖 = 1, . . . , 𝑛, ª 𝛽 𝑗 𝑋𝑖 𝑗 ® , « 𝑗=1 𝑗=𝑞+1 ¬ where 𝛽𝑞+1 = · · · = 𝛽 𝑝 = 0, (𝛽1 = · · · = 𝛽𝑞 ) = (3.75, 4.5, 5.25, 6, 6.75, 7.5, 8.25, 9, 9.75), and 𝑞 = 10. 𝑋𝑖 𝑗 ’s were generated from the multivariate standard normal distribution for 1 ≤ 𝑗 ≤ 𝑞 and Õ 𝑞 𝑋𝑖 𝑗 = 𝑑𝑖 𝑗 + 𝑏 𝑋𝑖𝑙 , 𝑖 = 1, . . . , 𝑛, 𝑙=1 for 𝑞 + 1 ≤ 𝑗 ≤ 𝑝, where 𝑏 = (3/4𝑞) 1/2 and (𝑑𝑖(𝑞+1) , . . . , 𝑑𝑖 𝑝 ) are generated from the multivariate normal distribution with mean 0 and a covariance matrix with all of the diagonal entries being 1 and off-diagonal being 0.25, and are independent of 𝑋𝑖 𝑗 for 1 ≤ 𝑗 ≤ 𝑞. 53 Example 7: For each 𝑖, 𝑝 ©Õ 𝑐X𝑖T β = 𝑐 × ­ 𝛽 𝑗 𝑋𝑖 𝑗 ® , 𝑖 = 1, . . . , 𝑛, ª « 𝑗=1 ¬ where 𝛽 𝑗 = 0.2 for 𝑗 = 1, . . . , 7, 501, . . . , 508 and 𝛽 𝑗 =0, otherwise. 𝑋𝑖 𝑗 ’s were generated from the multivariate normal distribution for 1 ≤ 𝑗 ≤ 500 with mean 0 and a covariance matrix with all 0 of the diagonal entries being 1, cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0.9| 𝑗− 𝑗 | for 1 ≤ j, j’ ≤ 15, and cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0 for 16 ≤ j, j’ ≤ 500. 𝑋𝑖 𝑗 ’s were generated from the multivariate double exponential distribution for 501 ≤ 𝑗 ≤ 508 with location parameter equal to 0 and a covariance matrix with all of the diagonal entries being 1, and independent of 𝑋𝑖 𝑗 for 1 ≤ 𝑗 ≤ 500. Examples 1 and 3 were adopted from Wang [93], while Examples 2, 4, and 6 were borrowed from Fan et al. [87], Hwang et al. [95], and Ing et al. [113], respectively. We then generated the responses from the following three models. Normal Model: 𝑌𝑖 = 𝑐X𝑖T β + 𝜖𝑖 with 𝜖𝑖 ∼ 𝑁 (0, 1). Binomial Model: 𝑌𝑖 ∼ Bernoulli( exp(𝑐X𝑖T β)/{1 + exp(𝑐X𝑖T β)}). Poisson Model: 𝑌𝑖 ∼ Poisson( exp(𝑐X𝑖T β)). 
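Given a linear predictor 𝑐X𝑖T β from any of the examples, the three response models can be drawn directly. The snippet below is a minimal sketch continuing the numpy conventions above (the function name is ours).

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_response(lin_pred, model):
    """Draw Y_i from the Normal, Binomial, or Poisson model given c * X_i^T beta."""
    if model == "normal":
        return lin_pred + rng.standard_normal(lin_pred.shape)   # N(0, 1) errors
    if model == "binomial":
        prob = 1.0 / (1.0 + np.exp(-lin_pred))                  # logistic link
        return rng.binomial(1, prob)
    if model == "poisson":
        return rng.poisson(np.exp(lin_pred))                    # log link
    raise ValueError(model)
```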
Our simulated examples cover a wide range of models (Normal, Binomial, Poisson), with covariates generated from various distributions (multivariate normal, exponential, double exponential, uniform, and mixture), a diverse set of covariance structures (independent, compound symmetry, autoregressive, unstructured), and both strong and weak signals, including hidden features. We considered n = 400 and p = 1,000 throughout all of the examples. We specified the magnitude of the coefficients in the GLMs with a constant multiplier, c. For Examples 1-7, this constant was set, respectively for the Normal, Binomial, and Poisson models, to be: (1, 1, 0.3), (1, 1.5, 0.3), (1, 1, 0.1), (1, 1.5, 0.3), (1, 3, 2), (1, 1, 1), and (1, 3, 2). For each parameter configuration, we simulated 500 independent data sets. We evaluated the performance of the methods by the criteria of true positives (TP), false positives (FP), the estimated probability of including the true model (PIT), the mean squared error (MSE) of β̂, and the mean squared prediction error (MSPE). To compute the MSPE, we randomly partitioned the samples into training (75%) and testing (25%) sets. The models obtained from the training datasets were used to predict the responses in the testing datasets. Tables 3.2-3.4 report the average TP, FP, PIT, MSE, and MSPE over the 500 datasets, along with their standard deviations. The findings are summarized below. First, the proposed STEPWISE method was able to detect all of the true signals with nearly zero FPs. Specifically, in all of the examples, STEPWISE outperformed the other methods by detecting more TPs with fewer FPs, whereas LASSO, SIS+LASSO, and dgLARS included many more FPs. Second, though a smaller γ in FR and SC led to the inclusion of all TPs with a PIT close to 1, it incurred more FPs; on the other hand, a larger γ could eliminate some TPs, resulting in a smaller PIT and a larger MSPE. Third, for the Normal model, the STEPWISE method yielded an MSE close to 0, the smallest among all of the competing methods. The Binomial and Poisson data challenged all of the methods, and the MSEs of all methods were non-negligible; however, STEPWISE still produced the lowest MSE. These results support the consistency of β̂, which distinguishes the proposed STEPWISE method from the other regularized methods and highlights its ability to provide a more accurate means of characterizing the effects of high dimensional predictors. Fourth, for all three models, the STEPWISE procedure demonstrated a clear advantage over the competing methods. For the Poisson model, it outperformed all methods by selecting the largest number of TPs while keeping the number of FPs low; in fact, SIS+LASSO failed to detect any TPs while including an exceedingly large number of FPs, and high FP counts were observed for dgLARS as well. Similarly, for the Binomial model, STEPWISE selected almost all TPs while including nearly zero FPs, whereas LASSO, SIS+LASSO, and dgLARS selected a large number of FPs while selecting fewer TPs than our proposed method.
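The evaluation criteria reported in Tables 3.2-3.4 are straightforward to compute from a selected index set and the fitted coefficients; the following R function is a minimal sketch (names and the inverse-link argument mu are illustrative, not the exact code used here):

```r
# selected: indices chosen by a method; truth: indices of the true signals
# beta_hat, beta_true: length-p coefficient vectors; X_test, y_test: 25% held-out data
evaluate_fit <- function(selected, truth, beta_hat, beta_true, X_test, y_test, mu = identity) {
  TP  <- length(intersect(selected, truth))        # true positives
  FP  <- length(setdiff(selected, truth))          # false positives
  PIT <- as.numeric(all(truth %in% selected))      # 1 if all true signals are included
  MSE <- mean((beta_hat - beta_true)^2)            # mean squared error of beta-hat
  # mean squared prediction error on the test split; mu() is the inverse link
  MSPE <- mean((y_test - mu(drop(X_test %*% beta_hat)))^2)
  c(TP = TP, FP = FP, PIT = PIT, MSE = MSE, MSPE = MSPE)
}
```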
55 Table 3.2: Normal model Example Method TP FP PIT MSE MSPE (×10−4 ) 1 (𝑝 0 = 8) LASSO(1SE) 8.00 (0.00) 5.48 (6.61) 1.00 (0.00) 2.45 1.148 LASSO(BIC) 8.00 (0.00) 2.55 (2.48) 1.00 (0.00) 2.58 1.172 SIS+LASSO(1SE) 8.00 (0.00) 6.59 (4.22) 1.00 (0.00) 1.49 1.042 SIS+LASSO(BIC) 8.00 (0.00) 6.04 (3.33) 1.00 (0.00) 1.37 1.025 dgLARS(BIC) 8.00 (0.00) 3.52(2.53) 1.00 (0.00) 2.25 1.130 SC (𝛾 𝐿 ) 8.00 (0.00) 3.01 (1.85) 1.00 (0.00) 1.09 0.895 SC (𝛾 𝐻 ) 7.60 (1.59) 0.00 (0.00) 0.94 (0.24) 14.56 5.081 FR (𝛾 𝐿 ) 8.00 (0.00) 2.96 (2.04) 1.00 (0.00) 1.08 0.896 FR (𝛾 𝐻 ) 7.88 (0.84) 0.00 (0.00) 0.98 (0.14) 3.74 2.040 STEPWISE 8.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.21 0.972 2 (𝑝 0 = 8) LASSO(1SE) 8.00 (0.00) 4.74 (4.24) 1.00 (0.00) 2.46 1.154 LASSO(BIC) 8.00 (0.00) 2.12 (2.02) 1.00 (0.00) 2.62 1.182 SIS+LASSO 7.99 (0.10) 6.84 (4.57) 0.99 (0.10) 1.65 1.058 SIS+LASSO(BIC) 7.99 (0.10) 6.11 (3.85) 0.99 (0.10) 1.56 1.041 dgLARS(BIC) 8.00 (0.00) 3.26(2.62) 1.00 (0.00) 2.28 1.138 SC (𝛾 𝐿 ) 8.00 (0.00) 2.73 (1.53) 1.00 (0.00) 0.98 0.901 SC (𝛾 𝐻 ) 7.30 (2.11) 0.00 (0.00) 0.90 (0.30) 23.70 6.397 FR (𝛾 𝐿 ) 8.00 (0.00) 2.45 (1.65) 1.00 (0.00) 0.92 0.907 FR (𝛾 𝐻 ) 7.94 (0.60) 0.00 (0.00) 0.99 (0.00) 2.69 2.062 STEPWISE 8.00 (0.00) 0.01 (0.10) 1.00 (0.00) 0.21 0.972 3 (𝑝 0 = 5) LASSO(1SE) 5.00 (0.00) 8.24 (2.63) 1.00 (0.00) 3.07 1.084 LASSO(BIC) 5.00 (0.00) 12.33 (3.28) 1.00 (0.00) 27.97 2.398 SIS+LASSO(1SE) 0.97 (0.26) 15.94 (2.93) 0.00 (0.00) 1406.22 76.024 SIS+LASSO(BIC) 0.97 (0.26) 16.20 (2.81) 0.00 (0.00) 1354.54 71.017 dgLARS(BIC) 5.00 (0.00) 53.91 (14.44) 1.00 (0.00) 6.63 0.979 Continued on next page 56 Table 3.2 (cont’d) Example Method TP FP PIT MSE MSPE (×10−4 ) SC (𝛾 𝐿 ) 4.48 (0.50) 0.25 (0.44) 0.48 (0.50) 21.74 3.086 SC (𝛾 𝐻 ) 4.48 (0.50) 0.14 (0.35) 0.48 (0.50) 21.70 2.065 FR (𝛾 𝐿 ) 5.00 (0.00) 0.23 (0.66) 1.00 (0.00) 0.27 0.973 FR (𝛾 𝐻 ) 5.00 (0.00) 0.14 (0.35) 1.00 (0.00) 0.15 0.074 STEPWISE 5.00 (0.00) 0.03 (0.22) 1.00 (0.00) 0.18 0.976 4 (𝑝 0 = 14) LASSO(1SE) 14.00 (0.00) 29.84 (15.25) 1.00 (0.00) 13.97 1.148 LASSO(BIC) 13.94 (0.24) 4.92 (5.54) 0.94 (0.24) 38.69 1.995 SIS+LASSO(1SE) 11.44 (1.45) 15.19 (7.29) 0.05 (0.21) 133.38 4.714 SIS+LASSO(BIC) 11.35 (1.51) 10.98 (7.19) 0.05 (0.21) 137.06 4.940 dgLARS(BIC) 14.00 (0.00) 13.93 (6.68) 1.00 (0.00) 18.08 1.329 SC (𝛾 𝐿 ) 13.68 (0.60) 0.86 (0.62) 0.75 (0.44) 11.80 1.148 SC (𝛾 𝐻 ) 4.20 (2.80) 0.03 (0.17) 0.03 (0.17) 407.86 6.567 FR (𝛾 𝐿 ) 14.00 (0.00) 0.50 (0.76) 1.00 (0.00) 1.23 0.940 FR (𝛾 𝐻 ) 4.99 (3.07) 0.00 (0.00) 0.03 (0.17) 360.65 6.640 STEPWISE 14.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.91 0.958 5 (𝑝 0 = 3) LASSO(1SE) 3.00 (0.00) 22.76 (9.05) 1.00 (0.00) 1.01 0.044 LASSO(BIC) 3.00 (0.00) 8.29 (3.23) 1.00 (0.00) 1.75 0.054 SIS+LASSO(1SE) 3.00 (0.00) 8.40 (3.10) 1.00 (0.00) 0.44 0.041 SIS+LASSO(BIC) 3.00 (0.00) 9.58 (3.36) 1.00 (0.00) 0.29 0.040 dgLARS(BIC) 3.00 (0.00) 13.39 (4.94) 1.00 (0.00) 1.28 0.048 SC (𝛾 𝐿 ) 3.00 (0.00) 1.47 (0.67) 1.00 (0.00) 0.03 0.038 SC (𝛾 𝐻 ) 2.01 (0.10) 0.01 (0.10) 0.01 (0.10) 4.51 0.008 FR (𝛾 𝐿 ) 3.00 (0.00) 1.21 (1.01) 1.00 (0.00) 0.03 0.038 FR ( 𝛾 𝐻 ) 3.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.01 0.003 STEPWISE 3.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.01 0.039 Continued on next page 57 Table 3.2 (cont’d) Example Method TP FP PIT MSE MSPE (×10−4 ) 6 (𝑝 0 = 10) LASSO(1SE) 10.00 (0.00) 46.23 (6.61) 1.00 (0.00) 37.70 1.50 LASSO(BIC) 9.86 (0.34) 59.05 (6.18) 0.86 (0.34) 698.70 13.82 SIS+LASSO(1SE) 0.00 (0.00) 37.53 (3.29) 0.00 (0.00) 4620.54 127.51 SIS+LASSO(BIC) 0.00 (0.00) 38.05 (3.21) 1.00 (0.00) 4644.21 118.64 dgLARS(BIC) 10.00 (0.00) 
156.15 (26.47) 1.00 (0.00) 20.96 0.88 SC (𝛾 𝐿 ) 2.99 (0.08) 1.41 (0.49) 0.00 (0.00) 2868.96 116.26 SC (𝛾 𝐻 ) 0.96 (1.26) 1.06 (0.23) 0.00 (0.00) 4775.34 51.16 FR (𝛾 𝐿 ) 7.45 (0.08) 3.34 (0.33) 0.00 (0.00) 657.77 30.03 FR ( 𝛾 𝐻 ) 1.48 (2.46) 2.00 (0.08) 0.00 (0.00) 4345.01 55.47 STEPWISE 7.45 (0.08) 2.11 (0.33) 0.00 (0.00) 653.07 29.05 7 (𝑝 0 = 15) LASSO(1SE) 14.71 (0.49) 8.64 (4.88) 0.73 (0.44) 0.77 1.16 LASSO(BIC) 14.67 (0.53) 6.57 (3.51) 0.71 (0.45) 0.76 1.16 SIS+LASSO(1SE) 13.45 (1.77) 10.16 (4.20) 0.38 (0.48) 1.32 1.07 SIS+LASSO(BIC) 13.44 (1.76) 9.55 (4.07) 0.36 (0.48) 1.34 1.05 dgLARS(BIC) 14.69 (0.51) 7.02 (3.54) 0.72 (0.45) 0.76 1.14 SC (𝛾 𝐿 ) 11.12 (0.53) 3.01 (1.82) 0.00 (0.00) 3.34 0.94 SC (𝛾 𝐻 ) 3.19 (3.51) 0.06 (0.08) 0.00 (0.00) 5.97 14.17 FR (𝛾 𝐿 ) 12.20 (0.60) 3.15 (2.00) 0.00 (0.00) 2.27 0.91 FR ( 𝛾 𝐻 ) 6.64 (3.38) 0.01 (0.11) 0.00 (0.00) 6.49 12.43 STEPWISE 12.20 (0.08) 0.09 (0.29) 0.00 (0.00) 2.99 1.03 Continued on next page 58 Table 3.2 (cont’d) Example Method TP FP PIT MSE MSPE (×10−4 ) Note: TP, true positives; FP, false positives; PIT, probability of including all true predictors in the selected predictors; MSE, mean squared error of β̂; MSPE, mean squared prediction error; numbers in the parentheses are standard deviations; LASSO(BIC), LASSO with the tuning parameter chosen to give the smallest BIC for the models on the LASSO path; LASSO(1SE), LASSO with the tuning parameter chosen by the one-standard -error rule; SIS+LASSO(BIC), sure independence screening by [87] followed by LASSO(BIC); SIS+LASSO(1SE), sure independence screening followed by LASSO(1SE); dgLARS(BIC), differential geo- metric least angle regression by [90, 91] with the tuning parameter chosen to give the smallest BIC on the dgLARS path; SC(𝛾), sequentially conditioning approach by [99]; FR(𝛾), forward regression by [93]; STEPWISE, the proposed method; In FR and SC, the smaller and large values of 𝛾 are presented as 𝛾 𝐿 and 𝛾 𝐻 , respectively; 𝑝 0 denotes the number of true signals; LASSO(1SE), LASSO(BIC), SIS, and dgLARS were conducted via R packages glmnet [127], ncvreg [128], screening [129], and dglars [130], respectively. 
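For reference, the two LASSO tuning rules named in the note above can be carried out with the glmnet package roughly as follows. This is a minimal sketch for the Normal (gaussian) case, assuming x is the design matrix and y the response; the BIC expression shown is one common deviance-based form, not necessarily the exact one used in this chapter.

```r
library(glmnet)

# LASSO(1SE): cross-validation with the one-standard-error rule
cvfit    <- cv.glmnet(x, y, family = "gaussian")
beta_1se <- coef(cvfit, s = "lambda.1se")

# LASSO(BIC): pick the lambda on the path minimizing BIC = n*log(RSS/n) + df*log(n)
fit      <- glmnet(x, y, family = "gaussian")
rss      <- colSums((y - predict(fit, newx = x))^2)
bic      <- nrow(x) * log(rss / nrow(x)) + fit$df * log(nrow(x))
beta_bic <- coef(fit, s = fit$lambda[which.min(bic)])
```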
Table 3.3: Binomial model Example Method TP FP PIT MSE MSPE 1 (𝑝 0 = 8) LASSO(1SE) 7.99 (0.10) 4.77 (5.56) 0.99 (0.10) 0.021 0.104 LASSO(BIC) 7.99 (0.10) 3.19 (2.34) 0.99 (0.10) 0.021 0.104 SIS+LASSO(1SE) 7.94 (0.24) 35.42 (6.77) 0.94 (0.24) 0.119 0.048 SIS+LASSO(BIC) 7.94 (0.24) 16.83 (21.60) 0.94 (0.24) 0.119 0.073 dgLARS(BIC) 8.00 (0.00) 3.27 (2.29) 1.00 (0.00) 0.019 0.102 SC (𝛾 𝐿 ) 8.00 (0.00) 2.81 (1.47) 1.00 (0.00) 0.009 0.073 SC (𝛾 𝐻 ) 1.02 (0.14) 0.00 (0.00) 0.00 (0.00) 0.030 0.028 Continued on next page 59 Table 3.3 (cont’d) Example Method TP FP PIT MSE MSPE FR (𝛾 𝐿 ) 8.00 (0.00) 3.90 (2.36) 1.00 (0.00) 0.032 0.066 FR (𝛾 𝐻 ) 2.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.025 0.027 STEPWISE 7.98 (0.14) 0.08 (0.53) 0.98 (0.14) 0.002 0.094 2 (𝑝 0 = 8) LASSO(1SE) 7.98 (0.14) 3.29 (2.76) 0.98 (0.14) 0.054 0.073 LASSO(BIC) 7.99 (0.10) 3.84 (2.72) 0.99 (0.10) 0.052 0.067 SIS+LASSO(1SE) 7.92 (0.27) 28.20 (7.31) 0.92 (0.27) 0.038 0.030 SIS+LASSO(BIC) 7.92 (0.27) 9.60 (12.92) 0.92 (0.27) 0.051 0.058 dgLARS(BIC) 7.99 (0.10) 3.94 (2.65) 0.99 (0.10) 0.050 0.067 SC (𝛾 𝐿 ) 7.72 (0.45) 0.39 (0.49) 0.72 (0.45) 0.005 0.063 SC (𝛾 𝐻 ) 1.13 (0.37) 0.00 (0.00) 0.00 (0.00) 0.069 0.044 FR (𝛾 𝐿 ) 7.99 (0.10) 0.66 (0.76) 0.99 (0.10) 0.014 0.051 FR (𝛾 𝐻 ) 2.10 (0.30) 0.00 (0.00) 0.00 (0.00) 0.061 0.033 STEPWISE 7.99 (0.10) 0.02 (0.14) 0.99 (0.10) 0.004 0.056 3 (𝑝 0 = 5) LASSO(1SE) 4.51 (0.52) 7.36 (2.57) 0.52 (0.50) 0.155 0.051 LASSO(BIC) 4.98 (0.14) 5.97 (2.25) 0.98 (0.14) 0.118 0.037 SIS+LASSO(1SE) 0.85 (0.46) 10.66 (3.01) 0.00 (0.00) 0.206 0.186 SIS+LASSO(BIC) 0.85 (0.46) 12.10 (3.13) 0.00 (0.00) 0.197 0.185 dgLARS(BIC) 4.92 (0.27) 16.21 (6.21) 0.92 (0.27) 0.112 0.035 SC (𝛾 𝐿 ) 4.32 (0.49) 0.47 (0.50) 0.33 (0.47) 0.016 0.048 SC (𝛾 𝐻 ) 2.62 (1.34) 0.42 (0.50) 0.00 (0.00) 0.104 0.066 FR (𝛾 𝐿 ) 4.98 (0.14) 0.67 (0.79) 0.98 (0.14) 0.020 0.033 FR (𝛾 𝐻 ) 2.98 (0.95) 0.40 (0.49) 0.00 (0.00) 0.087 0.043 STEPWISE 4.97 (0.17) 0.04 (0.28) 0.97 (0.17) 0.014 0.034 Continued on next page 60 Table 3.3 (cont’d) Example Method TP FP PIT MSE MSPE 4 (𝑝 0 = 14) LASSO(1SE) 9.96 (1.89) 6.78 (7.92) 0.01 (0.01) 0.112 0.107 LASSO(BIC) 9.33 (1.86) 2.79 (2.87) 0.00 (0.00) 0.112 0.118 SIS+LASSO(1SE) 10.03 (1.62) 28.01 (9.54) 0.03 (0.17) 0.098 0.070 SIS+LASSO(BIC) 8.90 (1.99) 5.42 (10.64) 0.01 (0.10) 0.114 0.120 dgLARS(BIC) 9.31 (1.85) 2.84 (2.86) 0.00 (0.00) 0.110 0.117 SC (𝛾 𝐿 ) 9.48 (1.40) 2.35 (2.14) 0.00 (0.00) 0.043 0.070 SC (𝛾 𝐻 ) 1.17 (0.40) 0.00 (0.00) 0.00 (0.00) 0.125 0.049 FR (𝛾 𝐿 ) 11.83 (1.39) 1.58 (1.60) 0.09 (0.29) 0.026 0.048 FR (𝛾 𝐻 ) 2.06 (0.24) 0.00 (0.00) 0.00 (0.00) 0.119 0.032 STEPWISE 11.81 (1.42) 1.52 (1.58) 0.09 (0.29) 0.026 0.048 5 (𝑝 0 = 3) LASSO(1SE) 2.00 (0.00) 1.55 (1.76) 0.00 (0.00) 0.008 0.215 LASSO(BIC) 2.00 (0.00) 1.86 (1.57) 0.00 (0.00) 0.008 0.213 SIS+LASSO(1SE) 2.23 (0.42) 10.81 (6.45) 0.23 (0.42) 0.007 0.192 SIS+LASSO(BIC) 2.10 (0.30) 3.60 (4.65) 0.10 (0.30) 0.007 0.206 dgLARS(BIC) 2.00 (0.00) 1.64 (1.49) 0.00 (0.00) 0.008 0.213 SC (𝛾 𝐿 ) 2.27 (0.49) 7.16 (3.20) 0.29 (0.46) 0.060 0.166 SC (𝛾 𝐻 ) 1.87 (0.34) 0.03 (0.17) 0.00 (0.00) 0.005 0.030 FR (𝛾 𝐿 ) 2.96 (0.20) 8.88 (5.39) 0.96 (0.20) 0.013 0.147 FR ( 𝛾 𝐻 ) 1.97 (0.17) 0.03 (0.17) 0.00 (0.00) 0.005 0.019 STEPWISE 2.89 (0.31) 0.76 (1.70) 0.89 (0.31) 0.001 0.194 6 (𝑝 0 = 10) LASSO(1SE) 6.10 (1.08) 31.66 (4.63) 0.00 (0.00) 0.41 0.07 LASSO(BIC) 7.88 (0.97) 30.41 (4.63) 0.02 (0.16) 0.38 0.05 Continued on next page 61 Table 3.3 (cont’d) Example Method TP FP PIT MSE MSPE SIS+LASSO(1SE) 0.00 (0.00) 28.40 (3.61) 0.00 (0.00) 0.45 0.15 SIS+LASSO(BIC) 0.00 
(0.00) 30.18 (3.24) 0.00 (0.00) 0.45 0.14 dgLARS(BIC) 4.12 (2.29) 32.37 (7.61) 0.00 (0.00) 0.42 0.09 SC (𝛾 𝐿 ) 7.71 (0.58) 2.95 (0.95) 0.00 (0.00) 0.19 0.05 SC (𝛾 𝐻 ) 0.00 (0.00) 1.00 (0.00) 0.00 (0.00) 0.45 0.02 FR (𝛾 𝐿 ) 8.52 (0.89) 3.24 (1.15) 0.00 (0.00) 0.09 0.04 FR ( 𝛾 𝐻 ) 0.12 (0.32) 1.88 (0.32) 0.00 (0.00) 0.45 0.01 STEPWISE 8.52 (0.92) 0.58 (0.77) 0.00 (0.00) 0.09 0.04 7 (𝑝 0 = 15) LASSO(1SE) 9.80 (1.14) 5.60 (4.29) 0.00 (0.00) 0.01 0.04 LASSO(BIC) 9.74 (1.11) 4.68 (2.63) 0.00 (0.00) 0.01 0.04 SIS+LASSO(1SE) 10.67 (1.39) 25.17 (5.85) 0.00 (0.00) 0.04 0.02 SIS+LASSO(BIC) 10.62 (1.64) 24.02 (15.84) 0.00 (0.00) 0.01 0.01 dgLARS(BIC) 9.84 (1.08) 4.58 (2.45) 0.00 (0.00) 0.01 0.04 SC (𝛾 𝐿 ) 8.92 (0.53) 1.75 (0.94) 0.00 (0.00) 0.01 0.02 SC (𝛾 𝐻 ) 1.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.01 0.02 FR (𝛾 𝐿 ) 9.56 (0.58) 1.52 (0.92) 0.00 (0.00) 0.07 0.02 FR ( 𝛾 𝐻 ) 2.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.01 0.02 STEPWISE 9.50 (0.54) 0.67 (0.83) 0.00 (0.00) 0.04 0.02 Note: Abbreviations are explained in the footnote of Table 3.2. 62 Table 3.4: Poisson model Example Method TP FP PIT MSE MSPE 1 (𝑝 0 = 8) LASSO(1SE) 7.93 (0.43) 4.64 (4.82) 0.96 (0.19) 0.001 4.236 LASSO(BIC) 7.99 (0.10) 14.37 (14.54) 0.99 (0.10) 0.001 3.133 SIS+LASSO(1SE) 7.89 (0.37) 25.37 (8.39) 0.91 (0.29) 0.001 3.247 SIS+LASSO(BIC) 7.89 (0.37) 17.77 (11.70) 0.91 (0.29) 0.001 3.078 dgLARS(BIC) 8.00 (0.00) 13.28 (14.31) 1.00 (0.00) 0.001 3.183 SC (𝛾 𝐿 ) 7.96 (0.20) 4.94 (3.46) 0.96 (0.20) 0.001 2.874 SC (𝛾 𝐻 ) 5.05 (1.70) 0.04 (0.24) 0.07 (0.26) 0.001 3.902 FR (𝛾 𝐿 ) 7.93 (0.26) 4.86 (3.73) 0.93 (0.26) 0.001 2.837 FR (𝛾 𝐻 ) 5.13 (1.61) 0.06 (0.31) 0.07 (0.26) 0.001 3.833 STEPWISE 7.91 (0.29) 2.77 (2.91) 0.91 (0.29) 0.001 3.410 2 (𝑝 0 = 8) LASSO(1SE) 8.00 (0.00) 2.23 (3.52) 1.00 (0.00) 0.001 3.981 LASSO(BIC) 8.00 (0.00) 8.98 (8.92) 1.00 (0.00) 0.001 3.107 SIS+LASSO(1SE) 7.98 (0.14) 22.85 (7.08) 0.98 (0.14) 0.001 2.824 SIS+LASSO(BIC) 7.98 (0.14) 13.55 (8.24) 0.98 (0.14) 0.001 2.937 dgLARS(BIC) 8.00 (0.00) 8.91 (9.10) 1.00 (0.00) 0.001 3.099 SC (𝛾 𝐿 ) 8.00 (0.00) 3.89 (2.89) 1.00 (0.00) 0.000 2.979 SC (𝛾 𝐻 ) 5.68 (1.45) 0.00 (0.00) 0.12 (0.33) 0.001 3.971 FR (𝛾 𝐿 ) 8.00 (0.00) 3.60 (2.80) 1.00 (0.00) 0.000 3.032 FR (𝛾 𝐻 ) 5.71 (1.42) 0.00 (0.00) 0.10 (0.30) 0.001 3.911 STEPWISE 7.98 (0.14) 2.00 (2.23) 0.98 (0.14) 0.000 3.589 3 (𝑝 0 = 5) LASSO(1SE) 4.37 (0.51) 6.88 (2.61) 0.38(0.48) 0.001 1.959 LASSO(BIC) 4.79 (0.41) 5.62 (2.17) 0.79 (0.41) 0.000 2.044 Continued on next page 63 Table 3.4 (cont’d) Example Method TP FP PIT MSE MSPE SIS+LASSO(1SE) 0.86 (0.47) 10.11 (2.55) 0.00 (0.00) 0.002 3.266 SIS+LASSO(BIC) 0.86 (0.47) 11.86 (2.99) 0.00 (0.00) 0.002 3.160 dgLARS(BIC) 4.55 (0.51) 18.29 (6.13) 0.56 (0.49) 0.001 1.877 SC (𝛾 𝐿 ) 4.73 (0.45) 0.53 (0.66) 0.73 (0.45) 0.000 2.479 SC (𝛾 𝐻 ) 2.84 (0.63) 0.40 (0.49) 0.00 (0.00) 0.001 0.664 FR (𝛾 𝐿 ) 4.54 (0.52) 1.98 (2.19) 0.55 (0.50) 0.000 2.128 FR (𝛾 𝐻 ) 2.71 (0.70) 0.43 (0.50) 0.00 (0.00) 0.001 0.605 STEPWISE 4.54 (0.52) 1.77 (2.01) 0.55 (0.50) 0.000 2.132 4 (𝑝 0 = 14) LASSO(1SE) 10.01 (1.73) 3.91 (6.03) 0.01 (0.10) 0.003 15.582 LASSO(BIC) 12.11 (1.46) 36.56 (22.43) 0.19 (0.39) 0.002 5.688 SIS+LASSO(1SE) 10.42 (1.66) 21.41 (8.87) 0.03 (0.17) 0.003 11.316 SIS+LASSO(BIC) 10.73 (1.66) 32.67 (8.92) 0.03 (0.17) 0.003 8.545 dgLARS(BIC) 12.05 (1.52) 38.70 (28.97) 0.18 (0.38) 0.002 5.111 SC (𝛾 𝐿 ) 10.33 (1.63) 10.48 (6.66) 0.02 (0.14) 0.002 4.499 SC (𝛾 𝐻 ) 5.32 (1.92) 0.52 (1.37) 0.00 (0.00) 0.003 14.005 FR (𝛾 𝐿 ) 12.00 (1.71) 8.93 (6.36) 0.23 (0.42) 0.001 4.503 FR (𝛾 𝐻 ) 5.65 (2.13) 0.38 (1.15) 0.00 
(0.00) 0.003 13.802 STEPWISE 11.80 (1.72) 5.97 (5.37) 0.19 (0.39) 0.001 5.809 5 (𝑝 0 = 3) LASSO(1SE) 2.00 (0.00) 1.13 (2.85) 0.00 (0.00) 0.003 2.674 LASSO(BIC) 2.01 (0.10) 2.82 (2.52) 0.01 (0.10) 0.003 2.583 SIS+LASSO(1SE) 2.87 (0.34) 9.28 (3.85) 0.87 (0.34) 0.002 2.455 SIS+LASSO(BIC) 2.87 (0.34) 9.88 (4.29) 0.87 (0.34) 0.002 2.355 dgLARS(BIC) 2.00 (0.00) 2.88 (2.38) 0.00 (0.00) 0.003 2.562 Continued on next page 64 Table 3.4 (cont’d) Example Method TP FP PIT MSE MSPE SC (𝛾 𝐿 ) 2.75 (0.44) 3.27 (1.75) 0.75 (0.44) 0.001 2.339 SC (𝛾 𝐻 ) 2.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.003 1.086 FR (𝛾 𝐿 ) 3.00 (0.00) 2.80 (1.73) 1.00 (0.00) 0.001 2.326 FR (𝛾 𝐻 ) 2.40 (0.49) 0.00 (0.00) 0.40 (0.49) 0.002 0.981 STEPWISE 3.00 (0.00) 0.35 (0.59) 1.00 (0.00) 0.001 2.977 6 (𝑝 0 = 10) LASSO(1SE) 6.08 (1.16) 32.54 (4.83) 0.00 (0.00) 0.01 7.64 LASSO(BIC) 8.15 (0.94) 37.56 (7.96) 0.06 (0.23) 0.01 5.93 SIS+LASSO(1SE) 0.00 (0.00) 26.34 (3.87) 0.00 (0.00) 0.01 12.27 SIS+LASSO(BIC) 0.00 (0.00) 28.03 (4.29) 0.00 (0.00) 0.01 12.06 dgLARS(BIC) 8.42 (1.22) 75.21 (15.56) 0.21 (0.41) 0.01 4.55 SC (𝛾 𝐿 ) 9.45 (0.72) 9.80 (3.12) 0.76 (0.42) 0.00 4.77 SC (𝛾 𝐻 ) 3.73 (1.73) 2.15 (0.45) 0.00 (0.00) 0.01 3.43 FR (𝛾 𝐿 ) 9.54 (1.05) 11.26 (2.97) 0.55 (0.50) 0.01 4.42 FR (𝛾 𝐻 ) 2.85 (1.85) 2.70 (0.57) 0.00 (0.00) 0.01 3.28 STEPWISE 9.54 (1.05) 4.30 (2.03) 0.55 (0.50) 0.01 6.01 7 (𝑝 0 = 15) LASSO(1SE) 11.98 (2.20) 4.00 (3.56) 0.12 (0.32) 0.01 20.71 LASSO(BIC) 14.93 (0.29) 51.44 (11.68) 0.94 (0.23) 0.00 7.17 SIS+LASSO(1SE) 12.51 (1.20) 17.32 (5.91) 0.03 (0.18) 0.00 15.24 SIS+LASSO(BIC) 12.50 (1.29) 33.31 (7.69) 0.05 (0.23) 0.01 11.93 dgLARS(BIC) 14.94 (0.27) 56.10 (17.38) 0.95 (0.20) 0.00 6.58 SC (𝛾 𝐿 ) 12.65 (1.43) 20.94 (4.47) 0.08 (0.28) 0.00 3.92 SC (𝛾 𝐻 ) 6.55 (1.98) 0.54 (0.84) 0.00 (0.00) 0.00 5.67 FR (𝛾 𝐿 ) 13.60 (1.12) 19.11 (4.03) 0.21 (0.41) 0.00 3.96 Continued on next page 65 Table 3.4 (cont’d) Example Method TP FP PIT MSE MSPE FR (𝛾 𝐻 ) 7.13 (1.87) 0.31 (0.55) 0.00 (0.00) 0.00 6.54 STEPWISE 13.54 (1.31) 5.30 (3.12) 0.21 (0.39) 0.00 9.23 Note: Abbreviations are explained in the footnote of Table 3.2. 3.6 Applications: Real Data Analysis 3.6.1 A Study of Gene Regulation in the Mammalian Eye To demonstrate the utility of our proposed method, we analyzed a microarray dataset from Scheetz et al. [41] with 120 twelve-week male rats selected for eye tissue harvesting. The dataset contained more than 31,042 different probe sets (Affymetric GeneChip Rat Genome 230 2.0 Array); see Scheetz et al. [41] for a more detailed description of the data. Although our method was applicable to the original 31,042 probe sets, many probes turned out to have very small variances and were unlikely to be informative for correlative analyses. Therefore, using variance as the screening criterion, we selected 5,000 genes with the largest variances in expressions and correlated them with gene TRIM32 that has been found to cause Bardet-Biedl syndrome, a genetically heterogeneous disease of multiple organ systems including the retina [42]. We applied the proposed STEPWISE method to the dataset with n = 120 and p = 5,000, and treated the TRIM32 gene expression as the response variable and the expressions of 5,000 genes as the predictors. With no prior biological information available, we started with the empty set. To choose 𝜂1 and 𝜂2 , we carried out 5-fold cross-validation to minimize the mean squared prediction error (MSPE) by using the following grid search: 𝜂1 = {0, 0.25, 0.5, 1} and 𝜂2 = {1, 2, 3, 4, 5}, and set 𝜂1 = 1 and 𝜂2 = 4. 
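A minimal sketch of this 5-fold grid search is given below; stepwise_glm() and its predict() method are hypothetical placeholders standing in for the STEPWISE procedure of this chapter (no such package interface is assumed to exist), and x, y denote the 5,000 screened gene expressions and the TRIM32 expression.

```r
# Hypothetical interface: stepwise_glm() is a placeholder for the STEPWISE procedure
grid  <- expand.grid(eta1 = c(0, 0.25, 0.5, 1), eta2 = c(1, 2, 3, 4, 5))
folds <- sample(rep(1:5, length.out = nrow(x)))

cv_mspe <- apply(grid, 1, function(par) {
  mean(sapply(1:5, function(k) {
    fit <- stepwise_glm(x[folds != k, ], y[folds != k],
                        eta1 = par[["eta1"]], eta2 = par[["eta2"]])      # hypothetical call
    mean((y[folds == k] - predict(fit, x[folds == k, ]))^2)              # fold-level MSPE
  }))
})
grid[which.min(cv_mspe), ]   # for the eye data this selection gave eta1 = 1, eta2 = 4
```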
We also performed the same procedure to choose γ for FR and SC. The regularization parameters in LASSO and dgLARS were selected to minimize their BIC values.

In the forward step, STEPWISE selected the probes 1376747_at, 1381902_at, 1382673_at, and 1375577_at, and the backward step eliminated probe 1375577_at. The STEPWISE procedure produced the following final predictive model:

TRIM32 = 4.6208 + 0.2310 × (1376747_at) + 0.1914 × (1381902_at) + 0.1263 × (1382673_at).

Table A.1 in Appendix A presents the numbers of overlapping genes among the competing methods. It shows that two of the three probes selected by our method, 1381902_at and 1376747_at, were also discovered by the other methods, except for dgLARS. Next, we performed leave-one-out cross-validation (LOOCV) to obtain the distribution of the model size (MS) and MSPE for the competing methods. As reported in Table 3.5 and Figure 3.1, LASSO, SIS+LASSO, and dgLARS tended to select more variables than the forward approaches and STEPWISE. Among all of the methods, STEPWISE selected the fewest variables but attained almost the same MSPE as the other methods.

Figure 3.1: Box plot of model sizes for each method over 120 different training samples from the mammalian eye data set. STEPWISE was performed with η1 = 1 and η2 = 4, and FR and SC were conducted with γ = 1.

Table 3.5: Comparisons of MSPE between competing methods using the mammalian eye data set

              STEPWISE   FR      LASSO   SIS+LASSO   SC      dgLARS
Training set  0.005      0.005   0.005   0.006       0.005   0.014
Testing set   0.011      0.012   0.010   0.009       0.014   0.020

Note: The mean squared prediction error (MSPE) was averaged over 120 splits. LASSO, least absolute shrinkage and selection operator with the regularization parameter that gives the smallest BIC; SIS+LASSO, sure independence screening by [87] followed by LASSO; dgLARS, differential geometric least angle regression by [90, 91] with the tuning parameter that gives the smallest BIC; SC(γ), sequentially conditioning approach by [99]; FR(γ), forward regression by [93]; STEPWISE, the proposed method. STEPWISE was performed with η1 = 1 and η2 = 4; FR and SC were performed with γ = 1.

3.6.2 An Esophageal Squamous Cell Carcinoma Study

Esophageal squamous cell carcinoma (ESCC), the most common histological type of esophageal cancer, is associated with poor overall survival, making early diagnosis crucial for treatment and disease management [47]. Several studies have investigated the roles of circulating microRNAs (miRNAs) in the diagnosis of ESCC [45]. Using a clinical study that investigated the roles of miRNAs in ESCC [57], we aimed to use miRNAs to predict ESCC risk and to estimate their impacts on the development of ESCC. Specifically, with a dataset of serum profiles of 2,565 miRNAs from 566 ESCC patients and 4,965 controls without cancer, we demonstrated the utility of the proposed STEPWISE method in predicting ESCC with miRNAs. To proceed, we used a balanced sampling scheme (283 cases and 283 controls) in the training dataset; the design of an equal number of cases and controls in the training set has proved useful [57] for handling imbalanced outcomes such as those encountered here. To validate our findings, samples were randomly divided into a training set (n1 = 566, p = 2,565) and a testing set (n2 = 4,965, p = 2,565).
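A balanced split of this kind can be drawn with base R; the following is a minimal sketch, where escc is an assumed data frame holding all 5,531 samples and case is the assumed 0/1 case indicator.

```r
set.seed(2022)
case_id    <- which(case == 1)        # 566 ESCC cases
control_id <- which(case == 0)        # 4,965 non-cancer controls

train_id <- c(sample(case_id, 283), sample(control_id, 283))   # balanced training set
test_id  <- setdiff(seq_along(case), train_id)                 # 283 cases + 4,682 controls

train <- escc[train_id, ];  test <- escc[test_id, ]
```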
The training set consisted of 283 patients with ESCC (median age of 65 years, 79% male) and 283 control patients (median age of 68 years, 46.3% male), and the testing set consisted of 283 patients with ESCC (median age of 67 years, 85.7% male) and 4,682 control patients (median age of 67.5 years, 44.5% male). Control patients without ESCC came from three sources: 323 individuals 68 from National Cancer Center Biobank (NCCB); 2,670 individuals from the Biobank of the National Center for Geriatrics and Gerontology (NCGG); and 1,972 individuals from Minoru Clinic (MC). More detailed characteristics of cases and controls in the training and testing sets are given in Table A.4. We defined the binary outcome variable to be 1 if the subject was a case and 0 otherwise. As age and gender (0 = female, 1 = male) are important risk factors for ESCC [131, 132] and it is common to adjust for them in clinical models, we set the initial set in STEPWISE to be 𝐹0 = {age, gender}. With 𝜂1 = 0 and 𝜂2 = 3.5 that were also chosen from 5-fold CV, our procedure recruited three miRNAs. More specifically, miR-4783-3p, miR-320b, miR-1225-3p and miR-6789-5p were selected among 2,565 miRNAs by the forward stage from the training set, and then the backward stage eliminated miR-6789-5p. In comparison, with 𝛾 = 0, both FR and SC selected four miRNAs, miR-4783-3p, miR-320b, miR-1225-3p, and miR-6789-5p. The list of selected miRNAs by different methods is given in Table A.2 in Appendix A. Our findings were biologically meaningful, as the selected miRNAs had been identified by other cancer studies as well. Specifically, miR-320b was found to promote colorectal cancer proliferation and invasion by competing with its homologous miR-320a [133]. In addition, serum levels of miR-320 family members were associated with clinical parameters and diagnosis in prostate cancer patients [134]. Mullany et al. [135] showed that miR-4783-3p was one of the miRNAs that could increase the risk of colorectal cancer death among rectal cancer cases. Finally, miR-1225-5p inhibited proliferation and metastasis of gastric carcinoma through repressing insulin receptor substrate-1 and activation of 𝛽-catenin signaling [136]. Aiming to identify a final model without resorting to a pre-screening procedure that may miss out on important biomarkers, we applied STEPWISE to reach the following predictive model for ESCC based on patients’ demographics and miRNAs: logit−1 (−35.70 + 1.41 × miR-4783-3p + 0.98 × miR-320b + 1.91 × miR-1225-3p + 0.10 × Age − 2.02 × Gender), where logit−1 (𝑥) = exp(𝑥)/(1 + exp(𝑥)). In the testing dataset, the model had an area under the receiver operating curve (AUC) of 0.99 and 69 achieved a high accuracy of 0.96, with a sensitivity and specificity of 0.97 and 0.95, respectively. Also using the testing cohort, we evaluated the performance of the models sequentially selected by STEPWISE. Starting with a model containing age and gender, STEPWISE selected miR- 4783-3p, miR-320b and miR-1225-3p in turn. Figure 3.3, showing the corresponding receiver operating curves (ROC) for these sequential models, revealed the improvement by sequentially adding predictors to the model and justified the importance of these variables in the final model. In addition, Figure 3.3 (e) illustrated that adding an extra miRNA selected by FR and SC made little improvement of the model’s predictive power. 
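Evaluating the final model above on the testing cohort amounts to computing the predicted probabilities and thresholding them. A minimal sketch using the pROC package is given below; the column names in test are assumptions, and the coefficients are those reported in the fitted model.

```r
library(pROC)

# Linear predictor of the final model reported above (column names are assumed)
lp <- -35.70 + 1.41 * test$miR_4783_3p + 0.98 * test$miR_320b +
       1.91 * test$miR_1225_3p + 0.10 * test$Age - 2.02 * test$Gender
prob <- plogis(lp)                        # logit^{-1}(lp)
pred <- as.numeric(prob > 0.5)

auc(roc(test$case, prob))                 # area under the ROC curve
mean(pred[test$case == 1] == 1)           # sensitivity
mean(pred[test$case == 0] == 0)           # specificity
mean(pred == test$case)                   # accuracy
```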
Furthermore, we conducted a subgroup analysis within the testing cohort to study how the sensitivity of the final model differed by cancer stage, one of the most important risk factors. The sensitivity for stages 0 (i.e., non-invasive cancer; n = 27), 1 (n = 128), 2 (n = 57), 3 (n = 61), and 4 (n = 10) was 1.00, 0.98, 0.97, 0.97, and 1.00, respectively. We next evaluated how the specificity varied across controls coming from different data sources. The specificity for the various control groups, namely NCCB (n = 306), NCGG (n = 2,512), and MC (n = 1,864), was 0.99, 0.99, and 0.98, respectively. These results indicate the robust performance of the miRNA-based model across cancer stages as well as data sources.

Finally, to compare STEPWISE with the other competing methods, we repeatedly applied the aforementioned balanced sampling procedure and split the ESCC data into training and testing sets 100 times. We obtained the MSPE and the averages of accuracy, sensitivity, specificity, and AUC. Figure 3.2 reports the model size of each method. Though STEPWISE selected fewer variables than the other variable selection methods (for example, LASSO selected 11-31 variables and dgLARS selected 12-51 variables), it achieved comparable prediction accuracy, specificity, sensitivity, and AUC (see Table 3.6), evidencing the utility of STEPWISE for generating parsimonious models while maintaining competitive predictability. We used R software [137] to obtain the numerical results in Sections 4 and 5, with the following packages: ggplot2 [138], ncvreg [128], glmnet [127], dglars [130], and screening [129].

Figure 3.2: Box plot of model sizes for each method based on 100 ESCC training datasets. Performance of STEPWISE is reported with η1 = 0 and η2 = 3.5. Performance of SC and FR is reported with γ = 0.

Table 3.6: Comparisons of competing methods over 100 independent splits of the ESCC data into training and testing sets

Training set   MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.02   0.97       0.98          0.97          1.00
SC             0.01   0.99       0.98          0.98          1.00
FR             0.02   0.99       0.97          0.97          1.00
LASSO          0.01   0.98       1.00          0.97          1.00
SIS+LASSO      0.01   0.99       1.00          0.99          1.00
dgLARS         0.04   0.96       0.99          0.94          1.00

Test set       MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.04   0.96       0.97          0.95          0.99
SC             0.03   0.96       0.97          0.96          0.99
FR             0.04   0.96       0.97          0.95          0.99
LASSO          0.03   0.96       0.99          0.95          1.00
SIS+LASSO      0.02   0.97       0.99          0.96          1.00
dgLARS         0.05   0.94       0.98          0.94          1.00

Note: Values are averaged over 100 splits. STEPWISE was performed with η1 = 0 and η2 = 1. SC and FR were performed with γ = 1. The regularization parameters in LASSO and dgLARS were selected to minimize the BIC.

Figure 3.3: Comparisons of ROC curves for the selected models in the ESCC data set, in the sequentially selected order: (a) Model 1, AUC = 0.71; (b) Model 2, AUC = 0.97; (c) Model 3, AUC = 0.98; (d) Model 4, AUC = 0.99; (e) Model 5, AUC = 0.99. Model 1 includes the Age and Gender features, and the following features are sequentially added to the model: miR-4783-3p, miR-320b, miR-1225-3p, miR-6789-5p.

3.6.3 Neurobehavioral Impairment from Total Sleep Deprivation

In this study we explore gene expression biomarker candidates for neurobehavioral impairment from total sleep deprivation.
Specifically, using a clinical study, we investigate the role of genes on the Total Sleep Deprivation (TSD) and we use these biomarkers, that is, gene expressions, to predict TSD and estimate their effect on the development of TSD. To perform analysis, data was obtained from NCBI GEO online repository, accession GSE98582. Blood samples were obtained from 17 healthy adults (ages 22–37, 7 females) who were not using drugs. Subjects remained in the sleep laboratory at the Sleep and Performance Research Center of Washington State University (Spokane, WA) for six consecutive nights. Meals were semi- standardized with selection from among a limited number of menu options; blood draws were performed immediately prior to meals. Blood samples were collected with an intravenous catheter approximately every 4 h during time awake on days two, four, and six. At each of the 12 timepoints, 2.5 mL blood was collected in a PAXgene™ Blood RNA tube, and the number of lapses per test bout was recorded from a 10 min PVT assay. Overall, the dataset contains 555 samples and 8284 gene features. We define the binary outcome variable to be 1 if a sample corresponds to the case when TSD is observed and 0 otherwise. Total, 342 samples with TSD and 213 controls (without TSD) were taken. Further, we split the dataset into training and testing sets in order to perform the data analysis. To preserve the underlying distribution of the response variable, a stratified sampling technique was implemented. We kept 70% of data in the training set (389 samples with 8284 features) and the remaining 30% (166 samples with 8284 features) was used for model validation. With 𝜂1 = 0.5 and 𝜂2 = 3 that were chosen from 5-fold cross-validation, the STEPWISE procedure recruited five genes. Particularly, PF4V1, USP32P1, EMR1, NBR2, and DUSP23 were selected among 8284 genes. In addition, our procedure was applied to identify a final model for predicting TSD based on gene biomarkers. As a result, the following model was produced: logit−1 (−322.02 + 13.01 × PF4V1 − 9.96 × USP32P1 + 15.17 × EMR1 + 17.66 × NBR2 + 15.34 × DUSP23), where logit−1 (𝑥) = exp(𝑥)/(1 + exp(𝑥)). 73 In the testing dataset, the final model had an area under the receiver operating curve (AUC) of 0.997 and achieved an accuracy of 0.983, with a sensitivity and specificity of 0.991 and 0.972, respectively. To compare STEPWISE with other competing methods, we repeatedly applied the aforementioned sampling procedure and split the dataset into training and testing sets 100 times. We obtained MSPE, the average of accuracy, sensitivity, specificity, and AUC. Figure 3.4 reports the model size of each method. Again, we observe that although STEPWISE procedure selects fewer variables that other methods, it achieves comparable prediction accuracy, specificity, sensitivity, and AUC. The results are presented in the Table 3.7. Figure 3.4: Box plot of model sizes for each method based on 100 total sleep deprivation training datasets. Performance of STEPWISE is reported with 𝜂1 = 0.5 and 𝜂2 = 3. 
Performance of SC and FR is reported with γ = 0.5.

Table 3.7: Comparisons of competing methods over 100 independent splits of the Total Sleep Deprivation data into training and testing sets

Training set   MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.02   0.98       0.98          0.97          0.99
SC             0.01   0.98       0.98          0.98          1.00
FR             0.02   0.98       0.98          0.97          0.99
LASSO          0.00   1.00       1.00          1.00          1.00
SIS+LASSO      0.00   1.00       1.00          1.00          1.00
dgLARS         0.07   0.91       0.92          0.89          0.95

Test set       MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.04   0.97       0.96          0.94          0.98
SC             0.03   0.96       0.97          0.95          0.99
FR             0.04   0.97       0.96          0.94          0.98
LASSO          0.01   0.98       0.98          0.99          1.00
SIS+LASSO      0.01   0.99       0.99          0.98          1.00
dgLARS         0.08   0.88       0.90          0.86          0.95

Note: Values are averaged over 100 splits. STEPWISE was performed with η1 = 0.5 and η2 = 3. SC and FR were performed with γ = 0.5. The regularization parameters in LASSO and dgLARS were selected to minimize the BIC.

CHAPTER 4 MULTI-STAGE HYBRID MACHINE LEARNING METHOD

4.1 Machine Learning Ensemble Methods: Categories and Types

In machine learning, ensemble methods are used to achieve better predictive performance by combining predictions from multiple models instead of relying on a single model [139]. These methods tend to provide better results when the models used in the ensemble are substantially diverse [140, 141]. Ensemble methods fall into two main categories: sequential ensemble techniques and parallel ensemble techniques. The former generates the base learners (the individual methods used in the model) in a sequence, making them dependent on one another; the performance of the model tends to improve because higher weights are assigned to the cases that earlier learners handled poorly. In contrast, parallel ensemble techniques generate the base learners in parallel. This introduces independence among the base learners, which significantly reduces the error when their results are averaged.

Besides being divided into categories, ensemble methods can be distinguished by their types. The most popular and well-known types are bagging, boosting, and stacking. Bagging methods train each base learner on a different sample of the training dataset (normally, bootstrap samples taken from the original training dataset). The predictions made by the ensemble members are then combined by averaging, which incorporates the possible outcomes of the prediction and randomizes the result [142]. To improve the predictive power of the model, the boosting technique learns from the mistakes made by previous predictors: it adds predictors to the model sequentially, where each successor corrects the mistakes of its predecessors [143], and a gradient descent algorithm is used to identify the points that need improvement the most. Finally, the stacking technique (also known as stacked generalization) trains a learning algorithm to combine the predictions of multiple other learning algorithms. It makes a final prediction by using the predictions made by the other algorithms; that is, the output values of these methods become the input values of the stacked model.

In this chapter, we utilize parallel ensemble techniques to improve the performance of the STEPWISE algorithm. All methods used in the final model are discussed in the subsequent sections.

4.2 A Review of Existing Methods

In this section we describe the methods included in the multi-stage hybrid machine learning model. They can be divided into two groups: model-based and model-free methods.
Model-based methods specifically define the relationship between the response and explanatory variables (also known as predictors) via a certain link function. These models are considered to have a math- ematical structure and involve various parameters that need to be estimated based on observed data. In addition, these models are accompanied by a set of statistical assumptions such as an underlying distribution of the response variable, relationship among predictors (mainly concerning their independence), variability of data, and etc. It becomes crucial to satisfy those assumptions as it guarantees the reliability of results. Thus, implementation of model-based methods should be done by carefully examining and confirming validity of the model assumptions and choosing the appropriate link function. Due to straightfor- ward interpretation and relatively small model complexity, model-based methods gained popularity among practitioners. We selected least absolute shrinkage and selection operator (LASSO) and our proposed STEPWISE method to represent model-based methods in the model. In contrast, model-free methods do not make any assumptions on the parametric form of the under- lying model explicitly. In other words, they adapt to the data characteristics without pre-specified model structure. Model-free algorithms are designed to automatically learn, adjust their actions and improve results with minimal or no human intervention. These algorithms help to gain some insights from data and enable to build right predictions and minimize chances of making any kind of errors. Given complex data, model-free methods are able to construct non-parametric repre- sentations (also known as non-parametric models). These methods develop their models based on 77 constant learning and retraining. Normally, model-free methods are used in problems where mathematical models are unavailable or hard to construct. Therefore, almost all model-free methods have some optimization techniques at their core. For instance, gradient decent optimization algorithm is commonly used in many non-parametric methods. As a result, model-free methods tend to achieve high accuracy in their predictions and are successfully used with complex data. We selected several such methods to include in our model. Specifically, random forest (RF), support vector machine (SVM), extreme gradient boosting machine (XGBoost), and artificial neural network(ANN). These methods are just selective examples of many other methods that are currently available. We decided to include this particular set of methods in our model because it is diverse in its nature and these methods are applicable in various real-life scenarios. For instance, random forest reduces drawback of large variance and is not prone to overfit the model; extreme gradient boosting machine provide lots of flexibility, can optimize on different loss functions and applicable to case with low variance and high bias; support vector machine is more effective in high dimensional spaces and relatively memory efficient; least absolute shrinkage and selection operator performs both auto- mated variable selection and regularization, and helps minimize the impact of multicollinearity among predictors; artificial neural network can handle comprehensive data structures due to its complexity. All these methods are described in subsequent sections. 
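For orientation before the detailed reviews, all of the learners listed above are available through standard R packages; the sketch below shows how they might be instantiated for a binary outcome. The package choices, settings, and variable names are illustrative assumptions (and nnet is only a single-hidden-layer stand-in for the deeper ANN discussed later); the STEPWISE fit is omitted here because it uses the procedure of Chapter 3 rather than an existing package.

```r
library(randomForest); library(e1071); library(xgboost); library(glmnet); library(nnet)

df <- data.frame(y = factor(y), x)       # y: 0/1 outcome vector, x: numeric feature matrix

fit_rf  <- randomForest(y ~ ., data = df, ntree = 500)                   # random forest
fit_svm <- svm(y ~ ., data = df, kernel = "radial", probability = TRUE)  # support vector machine
fit_xgb <- xgboost(data = as.matrix(x), label = y,                       # extreme gradient boosting
                   objective = "binary:logistic", nrounds = 100, verbose = 0)
fit_las <- cv.glmnet(as.matrix(x), y, family = "binomial")               # LASSO
fit_ann <- nnet(y ~ ., data = df, size = 8, decay = 0.01, maxit = 500)   # single-hidden-layer ANN
```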
4.2.1 Random Forest (RF) Random forests combine tree predictors that depend on values of a random vector sampled inde- pendently and identically for all trees in the forest. This methodology was proposed by Breiman [144] and quickly gained popularity among researchers and practitioners due to its simplicity and high accuracy. Random Forest is a generic method, but mostly has been used with classification trees. Random Forests grow multiple classification trees, and classify a new object from an input vector by putting it down each of the trees in the forest. Then each tree gives a classification, and the majority of 78 "votes" determines a class of a prediction. More specifically, each tree in random forest grows as follows. If a number of observations in the training set is n, it takes a sample of n observations at random with replacement, from the original data. This creates a training set for growing the tree. Then, if there are M input predictors, a number m < M is specified such that at each node, m variables are selected at random out of the M and the best split on these m divides the node. The value of m is being held constant during the forest growing. Then each tree grows to the largest extent possible with no pruning applied. Mainly, an error rate produced by random forests depends on two factors: First, increasing a corre- lation between any two trees in the forest increases the error rate. Second, increasing the strength of the individual trees decreases the forest error rate. Additionally, the error rate can be controlled by manipulating an m parameter. Reducing m reduces both the correlation and the strength and increasing it increases both. The advantages of using Random Forest is the ability to cope with thousands of features without variable deletion, provide variable importance assessment in the classification, generate an internal unbiased estimate of the generalization error, and have methods for balancing error in class population unbalanced data sets. 4.2.2 Support Vector Machines (SVM) In essence, the Support Vector Machine is a method proposed by Vapnik [145] that conceptually implements the following idea: it takes input vectors and maps them non-linearly to a very high dimension feature space. Then in this feature space it constructs a linear decision surface. Special properties of the decision surface guarantees high generalization ability of the learning method. SVMs can handle any number of classes, as well as observations of any dimension and can take almost any shape including linear, radial, and polynomial, among others. Particularly, SVMs construct a hyperplane or a set of hyperplanes, that is decision boundaries, in a high- or infinite- dimensional space, which can be used for classification, regression, and other type of problems. Good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class, also known as support vectors. 79 If classes are linearly separable, Hard Margin Classifier (HMC) can be implemented. HMC finds an optimal hyperplane, that separates classes while maximazing the distance to the closest points from the classes. The maximized distance is referred to as the Margin. HMC estimates the coefficients of hyperplanes by solving a quadratic programming problem with linear inequality constrains. If a perfect linear separation is not achievable or desirable, a Soft Margin Classifier (SMC) can be considered. 
While the data can be still linearly separable, the decision boundaries obtained using the HMC might not generalize well to new data and accuracy will suffer. To solve this issue, SMC loosens the constrains and allows some points to be wrongly classified. The set of points is called allowable budget. Finally, if classes are not linearly separable, Support Vector Machine projects data to higher dimensions, where they are linearly separable and constructs a hyperplane. Then it transforms this hyperplane back to the initial space and obtains a non-linear decision boundary. It is achieved by using a kernel trick that computes a dot product in some feature space without even knowing what the space is and what is a mapping function. Most commonly employed kernel functions are linear, polynomial, and radial basis functions. The advantages of Support Vector Machine are that it always guarantees to find a global optimum as it just solves convex optimization problem, relatively robust to outliers (soft margin), and is flexible (implements various kernel functions). The main drawback of SVM is that it slows down the training process as data become taller (when the number of observations is significantly greater than the number of predictors) as it has to estimate parameters for each row. 4.2.3 Gradient Boosting Machine (GBM) Gradient Boosting Machines, proposed by Friedman [146], quickly gained popularity due to their high accuracy and effectiveness in solving complex problems. Typically, it is hard for other methods to outperform the performance of GBMs and it is the algorithm of choice for many teams of machine learning competitions. GBMs build an ensemble of shallow trees in sequence where each tree learns and improves on the previous one. Even though shallow trees by themselves are 80 more of weak predictive models, they are able to boost and produce a powerful committee. The main idea of boosting algorithms is to add new models to the ensemble sequentially. It starts with a weak tree and sequentially continues to build new trees, where each new tree in the sequence fixes up where the previous one made the mistakes (for instance, each new tree in the sequence focuses on the training rows where the previous tree had the largest prediction errors). Specifically, at any instant the model outcomes are weighed according to the outcomes of the previous instant. The outcomes that are predicted correctly are given a lower weight and the ones that are miss- classified are given higher weights. Gradient Boosting Machines are considered a gradient decent algorithm. Gradient descent is a generic optimization algorithm that is capable of finding optimal solutions to a wide range of problems and can be used on any loss function that is differentiable. The fundamental idea of gradient descent is to search parameter values iteratively that will minimize a loss function. Here it is used to estimate the weights assigned to correctly and incorrectly predicted outcomes. Unlike bagging algorithms, GBM deals with bias variance trade-off by controlling both bias and variance and is proven to be more effective when applied to models with high bias and low variance. There are various versions of Gradient Boosting Machine. Particularly useful are Stochastic GBM and Extreme GBM. Stochastic GBM takes a random subsample of the training dataset that offers additional reduction in tree correlation which improves a prediction accuracy. 
Extreme GBM is an optimized distributed gradient boosting machine that improves the accuracy and speed of the method by employing parallelism in its algorithm and adding regularization parameters to the model. The main disadvantage of GBM method is that it is a complex and less intuitive algorithms. In addition, it is time and computationally expensive method. 4.2.4 Artificial Neural Network (ANN) An important subfield of Machine Learning is Deep Learning, which focuses on building predictive models based on artificial neural networks with two or more hidden layers. Artificial Neural Networks (ANN), first proposed by McCullough and Pitts [147], a model structure used in most 81 of the Deep Learning models, is inspired by the biological neural networks and mimics the way humans gain certain types of information through a combination of data inputs, weights, and bias. Like other machine learning algorithms, neural networks perform learning by mapping features to targets through a process of simple data transformations and feedback signals. Fundamental to most of the deep learning methods is the feedforward ANN. Feedforward ANNs are densely connected layers where inputs influence each successive layer which then influences the final output layer. Basic neural networks have three layers: an input layer, a hidden layer, and an output layer. The input layer consists of all of the original input features. Most of the learning happens in the hidden layer, and the output layer produces the final predictions. The layers and nodes are the building blocks of our ANN and they decide how complex the network will be. Layers are called dense if all the nodes in successive layer are connected. Consequently, the more layers and nodes you add the more opportunities you create for new features to be learned. There is no unique approach for determining the number of hidden layers and nodes; basically, these are the first hyperparameters among many others to tune. Mainly, features in your data largely determine the number of nodes you define in these hidden layers. The modeling task drives the choice of output layer. For regression problems, the output layer contains just one node that outputs the final prediction. If you are predicting a binary output, your output layer will still contain only one node and that node will predict the probability of success. Finally, if you predict an output with several classes, the output layer will contain the same number of nodes as the number of classes. A crucial component of artificial neural networks is activation. Each node in ANN is connected to all the nodes in the previous layer. Each connection gets a weight and then that node adds all the incoming inputs multiplied by its corresponding connection weight plus an extra bias parameter. This summation becomes an input to an activation function. The activation function is a mathematical function that determines whether to fire a signal to the next layer. On the forward pass, the ANN will select a batch of observations, randomly assign weights across all the node connections, and predict the output. Then, it assesses its own accuracy and adjusts the weights across all the node connections in order to improve the accuracy. This process is called 82 backpropagation. To carry out backpropagation, first, you need to establish an objective (loss) function to measure performance. On each forward pass the ANN will measure its performance based on the loss function chosen. 
The ANN then works backwards through the layers, computes the gradient of the loss with respect to the network weights, adjusts the weights a little in the opposite direction of the gradient, grabs another batch of observations to run through the model, and repeats until the loss function is minimized. The performance of an ANN can be optimized by tuning its hyperparameters, for example by adjusting the model capacity (layers and nodes), adding batch normalization, adjusting the learning rate, or trying out different activation functions. Another way of improving an ANN's performance is to place constraints on the model's complexity with regularization, for example through dropout.

4.2.5 Least Absolute Shrinkage and Selection Operator (LASSO)

Nowadays, data sets typically contain a large number of features. As the number of features grows, certain assumptions required by traditional methods (e.g., linear models) break down and these models tend to overfit the data, causing the out-of-sample error to increase and making the results unreliable. One possible solution is to use regularization methods, which constrain or regularize the estimated coefficients and can reduce the variance and uncertainty of the estimates. As mentioned earlier, having a large number of features invites various issues in classic regression models: the model becomes much less interpretable, there may be infinitely many solutions for the model coefficients, and the predictors are likely to be highly correlated, which invites multicollinearity issues. Regularized techniques constrain the total size of the coefficient estimates, which helps reduce the magnitude and fluctuations of the coefficients and therefore the variance of the model.

Arguably one of the most well-known and frequently used regularized methods is the least absolute shrinkage and selection operator (LASSO). The method was proposed by Tibshirani [84]; it minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Because of the nature of this constraint, it pushes coefficients all the way to zero and produces some coefficients that are exactly 0, which yields interpretable models. The LASSO method can be summarized as follows. Suppose we have data (X_i, y_i), i = 1, 2, ..., N, where X_i = (X_{i1}, ..., X_{ip})^T are the predictors and y_i is the response variable. We assume that either the observations are independent or the responses are conditionally independent given the predictors. Finally, we assume that all predictors are standardized. Letting β̂ = (β̂_1, ..., β̂_p)^T, the LASSO estimate (α̂, β̂) is defined as

$$(\hat{\alpha}, \hat{\boldsymbol{\beta}}) = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j X_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j} |\beta_j| \le \lambda,$$

where λ ≥ 0 is a tuning parameter that controls the amount of shrinkage applied to the estimates. The LASSO combines automated feature selection with the shrinkage of ridge regression, and it exhibits the stability of the latter. The main disadvantage of the technique is that it achieves these results at the cost of producing biased estimates.

4.2.6 STEPWISE Method

STEPWISE is the procedure proposed in this thesis and described in Chapter 3. The proposed method fits GLMs with ultrahigh-dimensional predictors. It starts with an empty set or a pre-specified set of predictors, scans all features and sequentially selects them, and conducts backward elimination once the forward selection is completed.
The forward selection steps recruit variables in an inclusive way by allowing some false positives for the sake of avoiding false negatives, while backward selection steps eliminate the potential false positives from the recruited variables. STEPWISE algorithm embraces model selection and estimation, controls both false negatives and positives by using different stopping criteria in the forward and backward selection steps, yields consistent estimates, and accommodates a wide range of data types, such as binary, categorical, and count data. In addition, under a sparsity assumption of the true model, it can discover all of the relevant predictors within a finite number of steps, and can produce a final model in ultrahigh dimensional settings without applying a pre-screening step which may introduce unintended false negatives. 84 4.3 Multi-Stage Hybrid Machine Learning Method 4.3.1 Introduction The proposed multi-stage hybrid machine learning method carries out a stacking technique. Stack- ing method is designed to boost predictive accuracy by blending the predictions of multiple machine learning models. Stacked generalization or stacked was proposed by Wolpert [148] and is widely used by other researchers and practitioners, [149, 150, 151]. Stacking is a technique in which the predictions produced by a collection of models are given as inputs to a second-level learning algorithm. This second-level algorithm is trained optimally to combine the model predictions and form a final set of predictions. Specifically, stacking method trains a new learning algorithm to combine predictions of several base-learners, also known as, individual models. First, base-learners are trained using the training data, then a combiner, called a super learner, is trained to make a final prediction based on the predictions of the base learners. It is important that the dataset collected for the stacked model consists of out-of-sample model predictions. In other words, to obtain the prediction for a certain data point in the data set, the model parameters should be estimated on a training set which does not include that particular data point. This is normally achieved via 𝐾-fold cross-validation. The training data is split into almost equal 𝐾 subsets and 𝐾 versions of the model are trained, each on the data with a different subset removed. Thus, model predictions for the 𝑘th subset are produced from the model trained on a set that did not include that subset. To set up a multi-stage hybrid machine learning method, the following steps are being completed. First, we specify a list of base learners and a super learner, also know as, a meta algorithm. We select Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting Machine (XGBoost), Least Absolute Shrinkage and Selection Operator (LASSO), Artificial Neural Network (ANN), and STEPWISE method as the base learners. Linear weighted summation combiner is being used as a super learner. Next, we train each of these base learners on the training data. Specifically, we employ 𝐾-fold cross-validation for each of the base learners and collect the cross- 85 validated class probabilities from them. The cross-validated class probabilities are combined to create a new feature matrix. Finally, we use this new feature matrix to train the meta learning algorithm, which can be used to generate predictions on new data. In other words, output values of each base learner become input values for the super learner. 
To generate ensemble predictions, we first generate class probabilities from each of the base learners and then feed these class probabilities into the super learner, which produces the ensemble prediction. Algorithm 1 summarizes the proposed method and provides a high-level explanation.

Algorithm 1 MULTI-STAGE HYBRID MACHINE LEARNING METHOD
1. Set up the stacked model
   • Specify a list of M base learners with a determined set of model parameters
   • Specify a super learner algorithm
2. Train the model
   • Train each of the M base learners on the training set
   • Perform cross-validation on each of the base learners and collect the cross-validated class probabilities from each (denoted as p_1, ..., p_M)
   • Combine the N (the number of observations in the training set) cross-validated class probability values from each of the M base learners into a new N × M feature matrix. This matrix, along with the original response vector (y), is called level-one data
   • Train the super learner on the level-one data
3. Predict on new data
   • To generate stacked predictions, first generate class probabilities from the base learners
   • Feed those class probabilities to the super learner and produce the final predictions

Our method is called hybrid because it employs both model-free and model-based methods, and we call it multi-stage because it consists of two stages: setting up and training the base learners, and then training the super learner and generating class probabilities from it. It is worth mentioning that the stacked model works best when a diverse set of methods is selected as base learners. Therefore, our model includes both types of methods.

4.3.2 Algorithm

Define D = {(X_i, Y_i), i = 1, ..., N} to be a data set (D is also referred to as level-0 data) consisting of vectors X_i representing the attribute values of the ith instance and Y_i representing the class value. Let A_j, j = 1, ..., J, be the base-learner algorithms, also known as level-0 estimators. Given the data set D, we randomly split it into K nearly equal parts, D_1, ..., D_K. Let D_k, k = 1, ..., K, and D^(-k) = D \ D_k be the test and training sets for the kth fold of K-fold cross-validation. Given the J base-learner algorithms, we separately invoke the jth algorithm on the data in the training set D^(-k) to induce a model M_j^(-k) for j = 1, ..., J. For each instance (X_i, Y_i) ∈ D_k, let g_j(X_i) denote the class probability estimated by the model M_j^(-k) for X_i. At the end of the cross-validation process, after applying the testing data set D_k for each k = 1, ..., K to each M_j^(-k) for j = 1, ..., J, the base-learner class probabilities form a meta-instance (g_1(X_i), ..., g_J(X_i), Y_i) together with the output variable for the original instance. A new data set
$$D_{CV} = \Big\{\big(g_1(\mathbf{X}_i), \ldots, g_J(\mathbf{X}_i), Y_i\big),\; i = 1, \ldots, N\Big\}$$
is assembled, also known as level-1 data. Note that the original X_i is replaced with the corresponding level-0 output vector (g_1(X_i), ..., g_J(X_i)). Now, at the second stage, referred to as the level-1 learning stage, we derive our final level-1 model, M̃, from D_CV. The level-1 model is constructed in the following form:
$$\tilde{M}(\mathbf{X}) = \sum_{j=1}^{J} \alpha_j\, g_j(\mathbf{X}),$$
where g_j(X) is the jth level-0 class probability and α_j is its corresponding weight.
The values of the α's are computed based on the corresponding level-0 model performances and are derived as
$$\alpha_j = K_j^2 \Big/ \sum_{j=1}^{J} K_j^2,$$
where K_j is the Kappa coefficient of the jth base learner, defined (dropping the subscript) as
$$K = \frac{\text{accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}},$$
with
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
and
$$\text{expected accuracy} = \frac{TP + FP}{N} \times \frac{TP + FN}{N} + \frac{TN + FP}{N} \times \frac{TN + FN}{N},$$
where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. Breiman [152] suggests that the non-negativity constraint α_j ≥ 0 provides consistently good results. Now, to make the final prediction, we use the models M_j, j = 1, ..., J, in conjunction with M̃. Given a new instance X_l, the models M_j produce a vector (g_1(X_l), ..., g_J(X_l)). This vector is then used as the input to the level-1 model M̃, whose output is the final prediction for that instance.

4.4 Application: Bladder Cancer Prediction

4.4.1 Data Description

To utilize the aforementioned method, we obtained data from Usuba et al. [69]. The goal of building this model is to identify important miRNA biomarkers that have an impact on bladder cancer development and can help in early detection of the disease. Moreover, the ensemble method is expected to enhance the predictive power of the proposed STEPWISE method taken separately. The data consist of 972 samples profiling 2565 miRNAs. Specifically, 392 serum samples were obtained from bladder cancer patients who were admitted or referred to the National Cancer Center Hospital (NCCH) between 2008 and 2016. A total of 580 serum samples from non-cancer individuals were collected from 2 independent cohorts: the first cohort included individuals whose serum samples were collected and stored by the National Center for Geriatrics and Gerontology (NCGG) Biobank between 2010 and 2012, and the second cohort included volunteers aged over 35 years who were recruited from the Yokohama Minoru Clinic in 2015. We defined the binary outcome variable to be 1 if the subject was a case and 0 otherwise. To proceed, we randomly divided the samples into training and testing sets. The training set consists of 80% of the original data (310 samples with bladder cancer and 468 non-cancer controls) and the testing set consists of the remaining 20% (82 samples with bladder cancer and 112 non-cancer controls). Table A.3 summarizes the characteristics of the samples used in the study.

4.4.2 Results

We perform the data analysis in two parts. First, we employ the STEPWISE procedure separately and evaluate its performance based on the obtained results. Then, we implement the multi-stage hybrid machine learning method and demonstrate its advantages over the former model. To begin with, we further split the training set into training and validation sets using 5-fold cross-validation in order to identify the best configuration of the η1 and η2 parameters. Specifically, we use a grid search approach and specify a set of values for each of these two parameters: η1 is searched over the grid {0, 0.25, 0.5, 0.75, 1} and η2 over {1, 2, 3, 3.5, 4, 4.5, 5}. The results from the cross-validation procedure are presented in Table 4.1. They show that the STEPWISE method with η1 = 0.5 and η2 = 3 performed the best, so this pair of values is used in the subsequent analysis.
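The grid search just described can be sketched generically in R. In the sketch below, stepwise_cv_auc() is a hypothetical placeholder standing for "fit STEPWISE on the training folds with the given tuning parameters and return the AUC on the held-out fold"; it is not part of the STEPWISE package interface, and train_dat stands for the training data:

# Sketch of the 5-fold grid search over (eta1, eta2); stepwise_cv_auc() and
# train_dat are hypothetical placeholders used only for illustration.
eta1_grid <- c(0, 0.25, 0.5, 0.75, 1)
eta2_grid <- c(1, 2, 3, 3.5, 4, 4.5, 5)
K <- 5
fold <- sample(rep(1:K, length.out = nrow(train_dat)))

cv_auc <- matrix(NA, length(eta1_grid), length(eta2_grid),
                 dimnames = list(eta1_grid, eta2_grid))
for (i in seq_along(eta1_grid)) {
  for (j in seq_along(eta2_grid)) {
    aucs <- sapply(1:K, function(k)
      stepwise_cv_auc(train = train_dat[fold != k, ],
                      valid = train_dat[fold == k, ],
                      eta1  = eta1_grid[i], eta2 = eta2_grid[j]))
    cv_auc[i, j] <- mean(aucs)                 # average AUC over the 5 folds
  }
}
which(cv_auc == max(cv_auc), arr.ind = TRUE)   # best (eta1, eta2) configuration

The configuration maximizing the average held-out AUC is the one carried forward, as in Table 4.1.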
Next, we applied the proposed STEPWISE method to the training set with n = 778 and p = 2565. With no prior biological information available, we started with an empty set. In the forward step, STEPWISE selected the miRNAs miR-6087, miR-5100, miR-1914-3p, miR-6831-5p, miR-2110, miR-6717-5p, miR-1343-3p, miR-6069, miR-6780b-5p, and miR-1343-5p, and the backward step eliminated miR-6780b-5p and miR-1343-5p. The STEPWISE procedure produced the following final predictive model:
logit^{-1}(88.33 − 8.08 × miR-6087 + 2.53 × miR-5100 − 3.54 × miR-1914-3p + 1.22 × miR-6831-5p − 1.57 × miR-2110 + 2.26 × miR-6717-5p − 2.51 × miR-1343-3p + 0.75 × miR-6069),
where logit^{-1}(x) = exp(x)/(1 + exp(x)).

Table 4.1: Results of the 5-fold cross-validation procedure for the STEPWISE method

                           η2
            1      2      3      3.5    4      4.5    5
     0      0.885  0.893  0.893  0.895  0.895  0.905  0.905
     0.25   0.885  0.893  0.893  0.895  0.895  0.905  0.905
η1   0.5    0.917  0.917  0.931  0.925  0.925  0.911  0.911
     0.75   0.900  0.900  0.895  0.895  0.883  0.879  0.879
     1      0.875  0.870  0.870  0.863  0.861  0.861  0.859

Note: values for η1 and η2 were searched over the grids {0, 0.25, 0.5, 0.75, 1} and {1, 2, 3, 3.5, 4, 4.5, 5}, respectively. The optimal configuration of the parameters was identified by comparing values of the area under the receiver operating characteristic curve (AUC-ROC); the pair of parameter values that maximized the AUC was selected for further analysis.

In the testing data set, the model had an AUC of 0.92 and achieved an accuracy of 0.91, with sensitivity, specificity, and precision of 0.93, 0.86, and 0.90, respectively. Finally, we repeatedly applied the sampling procedure, splitting the data into training and testing sets 100 times, and obtained the average accuracy, sensitivity, specificity, precision, and AUC. The results are presented in Table 4.2.

Table 4.2: Assessment of the proposed STEPWISE procedure using the bladder cancer data set

              Accuracy  Sensitivity  Specificity  Precision  AUC
Training set  0.92      0.94         0.92         0.92       0.94
Testing set   0.91      0.93         0.90         0.89       0.92

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits.

In order to develop the final multi-stage hybrid machine learning model, we first built the other base-learner models included in the stacked method, namely RF, SVM, XGBoost, LASSO, and ANN. After carrying out a 5-fold cross-validation procedure, the following sets of hyperparameters were identified for each of these methods: 600 trees were selected for the RF model, along with 5 instances in the terminal nodes and 250 randomly selected predictors in each tree; SVM achieved its best results with parameter C = 0.01 and a polynomial kernel of the second order; XGBoost performed best with 300 trees, tree depth and learning rate equal to 8 and 0.1, respectively, and a minimum of 5 samples in the terminal nodes; ANN constructed its model with two hidden layers containing 70% and 35% of the number of predictors as nodes, respectively, a learning rate of 0.1, and a dropout rate of 0.6; LASSO selected its penalty parameter to be 0.0361. The results of their performance on the training and testing sets are summarized in Table 4.3. The weights, α's, for the super learner combiner are computed based on the base learners' performance achieved during the first stage of modeling. Specifically, 0.17, 0.17, 0.14, 0.18, 0.18, and 0.16 are the weights assigned to STEPWISE, RF, SVM, XGBoost, ANN, and LASSO, respectively. Table 4.4 presents the results obtained from evaluating the multi-stage hybrid machine learning model. It can be observed that the hybrid model markedly improved the performance of the STEPWISE method and also outperformed the other methods included in the model.
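To make the Kappa-based weighting of Section 4.3.2 concrete, the following sketch computes Kappa statistics from confusion-matrix counts and turns them into normalized weights for the weighted-sum super learner. The counts and class probabilities below are hypothetical and used only to illustrate the computation; they are not the values obtained in this analysis:

# Kappa-based weights for the weighted-sum super learner (illustrative counts only).
kappa_stat <- function(TP, FP, TN, FN) {
  N       <- TP + FP + TN + FN
  acc     <- (TP + TN) / N
  exp_acc <- (TP + FP) / N * (TP + FN) / N + (TN + FP) / N * (TN + FN) / N
  (acc - exp_acc) / (1 - exp_acc)
}

kappas <- c(kappa_stat(TP = 290, FP = 30, TN = 430, FN = 28),   # base learner 1
            kappa_stat(TP = 280, FP = 45, TN = 415, FN = 38),   # base learner 2
            kappa_stat(TP = 295, FP = 25, TN = 435, FN = 23))   # base learner 3

alpha <- kappas^2 / sum(kappas^2)   # weights alpha_j = K_j^2 / sum_j K_j^2

# Weighted-sum prediction for a new instance, given its level-0 class probabilities
g <- c(0.90, 0.75, 0.85)            # hypothetical base-learner class probabilities
ensemble_prob <- sum(alpha * g)     # super learner (level-1) prediction

Because the Kappa statistics are squared before normalization, base learners with stronger level-0 performance receive disproportionately larger weights.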
Finally, we performed a sensitivity analysis to quantify the relationship between the model performance and the weights assigned to the base learners. Specifically, we tried out 7 model settings with different weight configurations and compared them with our existing model: one model assigns equal weights to each method, while each of the remaining 6 models assigns a high weight of 0.8 to one of the base learners and keeps the other weights equal to 0.04. The results are summarized in Table 4.5 and illustrate the advantage of our model over the other competing model settings. The proposed multi-stage hybrid model outperformed Models 2-7 in all evaluation metrics; Model 1 achieved comparable results, as expected, since its assigned weights were similar to ours.

Table 4.3: Comparison of base-learner methods included in the multi-stage hybrid machine learning model over 100 independent splits of the bladder cancer data into training and testing sets

Training set  Accuracy  Sensitivity  Specificity  Precision  AUC
RF            0.95      0.95         0.96         0.94       0.95
SVM           0.91      0.93         0.92         0.93       0.92
XGBoost       0.95      0.94         0.96         0.97       0.96
ANN           0.95      0.96         0.94         0.97       0.96
LASSO         0.94      0.93         0.95         0.94       0.95

Test set      Accuracy  Sensitivity  Specificity  Precision  AUC
RF            0.93      0.94         0.93         0.92       0.93
SVM           0.89      0.92         0.90         0.87       0.89
XGBoost       0.94      0.93         0.95         0.93       0.95
ANN           0.94      0.95         0.93         0.91       0.95
LASSO         0.92      0.91         0.92         0.89       0.94

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits; RF - Random Forest; SVM - Support Vector Machine; XGBoost - Extreme Gradient Boosting Machine; ANN - Artificial Neural Network; LASSO - Least Absolute Shrinkage and Selection Operator.

Table 4.4: Evaluation of the proposed multi-stage hybrid machine learning model with the bladder cancer data set

              Accuracy  Sensitivity  Specificity  Precision  AUC
Training set  0.99      1.00         0.99         0.99       1.00
Testing set   0.98      0.98         0.98         0.97       0.99

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits.

Table 4.5: Comparison of various model configurations included in the sensitivity analysis

Training set  Accuracy  Sensitivity  Specificity  Precision  AUC
Model 1       0.98      0.98         0.97         0.96       0.97
Model 2       0.96      0.96         0.96         0.95       0.95
Model 3       0.93      0.94         0.94         0.92       0.93
Model 4       0.97      0.98         0.97         0.97       0.95
Model 5       0.97      0.97         0.96         0.97       0.95
Model 6       0.94      0.95         0.93         0.94       0.93
Model 7       0.94      0.94         0.93         0.94       0.93

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits; Model 1 corresponds to the equal-weights scenario; Models 2-7 correspond to the scenarios with a high weight of 0.8 assigned to one of the base learners while keeping the other weights equal to 0.04; the high weight was assigned to the methods in the following order: Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting Machine (XGBoost), Artificial Neural Network (ANN), Least Absolute Shrinkage and Selection Operator (LASSO), STEPWISE procedure.

4.5 Web Application

An R-Shiny web application was developed to enable users to employ the proposed multi-stage hybrid machine learning method in practice. The main goal of this app is to help users analyze their own data and build predictive models according to our algorithm. The web application can be accessed online at Multi-Stage_Hybrid_ML_Method. Specifically, it is aimed at solving classification problems and has the following features. First, users have the option to either upload their own data sets or use pre-built sets.
The pre-built option includes two well-known data sets: Iris and Abalone. Once the data are uploaded or selected from the given options, users can split the data into training and testing sets. For instance, specifying a validation split of 0.8 will split the data into training and testing sets with a 4:1 ratio. In addition, one can apply pre-processing steps to the data, which is an important part of data analysis. As of now, three options are available: standardizing numerical features (making each feature have mean zero and variance one), imputing missing values via the K-Nearest Neighbor technique (a method that takes into account the relationships among predictors), and removing zero- and near-zero-variance variables (features that have no or minimal effect on the response). Figure A.1 illustrates this step; a brief R sketch of the data-splitting and pre-processing steps is given at the end of this section.

Next, users can tune the hyperparameters of the base-learner methods. Almost all machine learning methods have hyperparameters that can be tuned, and tuning them can improve model performance and reduce the chances of overfitting. In order to accomplish this task, the web application employs a grid search technique: users specify a set of values for each of the hyperparameters provided in the menu bar. Two resampling methods are available: k-fold cross-validation and repeated k-fold cross-validation. Once all necessary items are selected, the web application tunes these parameters and displays numeric and visual results in the "Numerical Results" and "Visualize Results" tabs, respectively. Figure A.2 illustrates the hyperparameter tuning procedure for the Random Forest method.

Lastly, after the hyperparameter tuning step is complete and a set of values for these parameters is selected, users can proceed further and build their final predictive model. At this step, they can indicate the parameter values for each base learner obtained from the previous step and set weights for them. The app will train the model and display the results, which include model performance metrics (computed for both training and testing sets), feature importance, and final predictions. Figure A.3 illustrates the output for the final predictive model, including model evaluations for both the training and testing sets.
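As referenced above, the data-splitting and pre-processing options offered by the app can be sketched in R with the caret package. The package, the simulated data frame, and the column names are assumptions made for illustration only; the app's internal implementation is not shown in this thesis:

# Sketch of the data-splitting and pre-processing options (caret assumed for illustration).
library(caret)

set.seed(1)
dat <- data.frame(a = rnorm(100), b = rnorm(100), c = c(rep(0, 97), 1, 2, 3))
dat$a[1:5] <- NA                                   # a few missing values
dat$y <- factor(rbinom(100, 1, 0.5))               # binary outcome

idx      <- createDataPartition(dat$y, p = 0.8, list = FALSE)   # roughly 4:1 split
train_df <- dat[idx, ]
test_df  <- dat[-idx, ]

pp <- preProcess(train_df[, c("a", "b", "c")],
                 method = c("center", "scale",   # standardize: mean 0, variance 1
                            "knnImpute",         # impute missing values via k-nearest neighbors
                            "nzv"))              # drop zero- and near-zero-variance features

train_pp <- predict(pp, train_df[, c("a", "b", "c")])   # estimated on the training set ...
test_pp  <- predict(pp, test_df[, c("a", "b", "c")])    # ... and applied to the testing set

Estimating the transformations on the training set and then applying them unchanged to the testing set avoids leaking information from the held-out data.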
CHAPTER 5

CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH

In this thesis we have proposed to apply STEPWISE to produce final models in ultrahigh-dimensional settings, without resorting to a pre-screening step. We have shown that the method identifies or includes the true model with probability going to 1, and produces consistent coefficient estimates, which are useful for properly interpreting the actual impacts of risk factors. The theoretical properties of STEPWISE are established under mild conditions, which are worth discussing. Because in practice covariates are often standardized for various reasons, Condition (2) is assumed without loss of generality. Conditions (3) and (4) are generally satisfied under common GLM models, including Gaussian, Binomial, Poisson, and Gamma distributions. Condition (5) is also often satisfied in practice, and Proposition 2 in Zhang et al. [121] may be used as a tool to verify it. Conditions (1) and (6), which involve the unknown true model size |M| and the minimum signal strength n^{-α}, must be taken in good faith in practice. The "irrepresentable" condition (6) is strong and may not hold in some real data sets (see, e.g., [153, 154]). However, the condition holds under some commonly used covariance structures, including AR(1) and the compound symmetry structure [153].

As shown in the simulation studies and real data analyses, STEPWISE tends to generate models as predictive as the other well-known methods, with fewer variables (Figure 3.2). Parsimonious models are useful for biomedical studies, as they explain the data with a small number of important predictors and offer practitioners a realistic list of biomarkers to investigate. With categorical outcome data frequently observed in biomedical studies (e.g., histology types of cancer), STEPWISE can be extended to accommodate multinomial classification, with more involved notation and computation. We will pursue this elsewhere.

As shown and discussed in the previous chapters of the thesis, the STEPWISE procedure controls both false positives and false negatives in high-dimensional settings. This is achieved by employing different stopping criteria in the forward and backward selection steps, which adds flexibility to our algorithm. Specifically, the stopping criterion in the forward selection step avoids false negatives by allowing some false positives in the early stages of model building, while the stopping criterion in the backward elimination step removes the potential false positives from the selected variables. In addition, the two extra parameters η1 and η2 involved in computing the stopping criteria determine how restrictive the variable screening process should be. Specifically, large values of η1 in the forward selection step will recruit fewer variables, and vice versa. Similarly, large values of η2 in the backward elimination step will remove more features. Thus, this framework can address different needs. For instance, if controlling false positives is the priority, then we recommend applying large values for the parameters, and if it is more meaningful to control false negatives, then small values should be used. It is worth noting that our method includes forward selection as a special case when the parameter value is equal to 0, making it even more flexible.

Moreover, in this thesis we prove that, under a sparsity assumption of the true model, the proposed STEPWISE approach can discover all of the relevant predictors within a finite number of steps. Sparse models are common in high-dimensional settings: among the hundreds or thousands of predictors involved in model development, only a handful have a significant relationship with the response variable. Including too many predictors in the model may result in overfitting, while keeping too few variables may lead to high bias and low predictive accuracy. Thus, correctly identifying the true signals and significant predictors and including them in the final predictive model is a crucial step in the model building process.

Finally, we developed a multi-stage hybrid machine learning method to boost predictive accuracy and improve the performance of the proposed method. It carries out a stacking technique and combines model-free and model-based methods, including the proposed STEPWISE method. Ting and Witten [155] suggested that users of the stacking method have a free choice of base-learner models. Therefore, we have selected heterogeneous machine learning methods (e.g., boosting, bagging, neural nets, and model-based methods) that have different strengths and disadvantages.
Having a diverse set of base learners makes our method applicable in various scenarios. In addition, Ting and Witten claimed and demonstrated, with empirical examples, that successful stacked generalization requires using class probabilities rather than class predictions. We also adopted this technique in our model. Ueda [149] defined several combination types that can be used to combine base-learner outputs: weighted sum (WS), class-dependent weighted sum (CSW), and linear stacked generalization (LSG). Erdogan and Sen [156] showed that none of these methods is superior to the others and that performance is data-driven and data-dependent. We have selected the WS technique to be implemented in the super learner. A Kappa statistic (K) was used to estimate and assign weights to the individual base-learner outputs, as it is considered a more accurate metric for model evaluation [157, 158]. These weights reflect the base learners' performance on the level-0 data: greater weights are assigned to base learners with stronger performance, and vice versa. This weight assignment method is believed to be more effective as it incorporates the significance of each method included in the model [159]. Finally, Breiman [152] reported that a non-negativity constraint on the assigned weights provides consistently good results; this constraint was added to our model as well. The numerical examples we provided have vividly demonstrated the improved predictive power of the proposed method. Moreover, we performed a sensitivity analysis to illustrate the superiority of the weight assignment technique used in the model over the other competing techniques. Lastly, we proposed and developed a web application that enables users to employ the proposed multi-stage hybrid machine learning method in practice.

There are several open questions. First, in our numerical experiments, we used cross-validation to choose values for η1 and η2, which seemed to work well. However, more in-depth research is needed to find their optimal values and strike a balance between false positives and false negatives. Second, despite our consistent estimates, drawing inference based on them remains challenging. Statistical inference, which accounts for uncertainty in estimation, is key for properly interpreting analysis results and drawing appropriate conclusions. Our asymptotic results, nevertheless, are a stepping stone toward this important problem. Third, although the proposed STEPWISE procedure is designed to deal with binary classification, it can be extended to accommodate multinomial classification, a commonly observed problem in biological or biomedical research. Most multinomial classification methods rely on sequential binary classification by way of one-versus-all or direct pairwise comparison [160], which requires selecting a reduction method from multiclass to binary. Further investigation will be needed to identify such methods, as this is not a trivial task and must be handled on a case-by-case basis.

APPENDICES

APPENDIX A

SUPPLEMENTARY MATERIALS

A.1 Supplementary Materials

An R package, STEPWISE, was developed and is available at https://github.com/AlexPijyan/STEPWISE, along with the examples shown in the dissertation.
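A typical installation call, assuming the repository above follows the standard R package layout, is:

# Install the STEPWISE package from the GitHub repository listed above
# (assumes a standard R package layout).
# install.packages("devtools")                  # if devtools is not yet installed
devtools::install_github("AlexPijyan/STEPWISE")
library(STEPWISE)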
A.2 Additional Results in the Real Data Analysis

Table A.1: Comparison of genes selected by each competing method from the mammalian eye data set

            STEPWISE  FR  LASSO  SIS+LASSO  SC  dgLARS
STEPWISE    3         3   2      2          2   0
FR                    4   2      2          2   0
LASSO                     16     5          2   0
SIS+LASSO                        9          2   0
SC                                          4   0
dgLARS                                          7

Note: Diagonal and off-diagonal elements of the table represent the model sizes for each method and the number of overlapping genes selected by the two methods corresponding to the row and column, respectively.

Table A.2: Selected miRNAs for the ESCC training data set

Methods      Selected miRNAs
STEPWISE     miR-4783-3p; miR-320b; miR-1225-3p
FR           miR-4783-3p; miR-320b; miR-1225-3p; miR-6789-5p
SC           miR-4783-3p; miR-320b; miR-1225-3p; miR-6789-5p
LASSO        miR-6789-5p; miR-6781-5p; miR-1225-3p; miR-1238-5p; miR-320b; miR-6794-5p; miR-6877-5p; miR-6785-5p; miR-718; miR-195-5p
SIS+LASSO    miR-6785-5p; miR-1238-5p; miR-1225-3p; miR-6789-5p; miR-320b; miR-6875-5p; miR-6127; miR-1268b; miR-6781-5p; miR-125a-3p
dgLARS       miR-891b; miR-6127; miR-151a-5p; miR-195-5p; miR-3688-5p; miR-125b-1-3p; miR-1273c; miR-6501-5p; miR-4666a-5p; miR-514a-3p

Note: LASSO, SIS+LASSO, and dgLARS selected 20, 17, and 33 miRNAs, respectively, and we only reported the top 10 miRNAs.

Table A.3: Clinicopathologic characteristics of participants in the bladder cancer study

Covariates                      Training Set n1 (%)   Testing Set n2 (%)
Bladder cancer patients
  Total number of patients      310                   82
  Age, median (range)           68 (32-90)            70 (34-93)
  Gender: Male                  233 (74.9%)           54 (65.8%)
          Female                77 (25.1%)            28 (34.2%)
  Tumor Stage (%):