SIMULTANEOUS MODEL SELECTION AND ESTIMATION OF GENERALIZED LINEAR MODELS WITH HIGH DIMENSIONAL PREDICTORS By Alex Pijyan A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy 2022 ABSTRACT SIMULTANEOUS MODEL SELECTION AND ESTIMATION OF GENERALIZED LINEAR MODELS WITH HIGH DIMENSIONAL PREDICTORS By Alex Pijyan In the past couple of decades the progressive use of technology made the enormous amount of data in different formats available and easily accessible. The size and volume of available data sets have grown rapidly and the technological capacity of the world to store information has almost doubled every 40 months since the 1980s [1]. As of 2020, every day 2.5 quintillion of data are generated. Based on an International Data Group (IDG) report, the global data volume was predicted to grow exponentially and by 2025, IDG predicts there will be 163 zettabytes of data [2]. This enormous amount of data is often characterized by its high dimensionality. Quite often, well- known statistical methods fail to manage such data due to their limitations (e.g., in high-dimensional settings they often encounter various issues such as no unique solution for the model parameters, inflated standard errors, overfitted models, multicollinearity). This resulted in resurging interest in the algorithms that are capable of handling massive quantities of data, extracting and analysing information from it, and uncovering key insights that subsequently will lead to decision making. Techniques used by these algorithms are tend to speed up and improve the quality of predictive analysis, thus, they found their application in various fields. For instance, medicine becomes more and more individualized nowadays and drugs or treatments can be designed to target small groups, rather than big populations, based on characteristics such as medical history, genetic makeup etc. This kind of treatment is referred to as precision medicine. In the era of precision medicine, constructing interpretable and accurate predictive models, based on patients’ demographic characteristics, clinical conditions, and molecular biomarkers, has been crucial for disease prevention, early diagnosis and targeted therapy [3]. The models, for example, can be used to predict patients’ susceptibility to disease [4], identify high risk groups [5], and guide behavioral changes [6]. Therefore, predictive models play a central role in decision making. Several well-known approaches can be used to solve the problem mentioned above. Penalized regression approaches, such as least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after applying variable screening methods to downsize variables. In this dissertation, we would like to propose a procedure for fitting generalized linear models with ultrahigh-dimensional predictors. Our procedure can provide a final model, control both false negatives and false positives, and yield consistent estimates, which are useful to gauge the actual effect size of risk factors. In addition, under a sparsity assumption of the true model, the proposed approach can discover all of the relevant predictors within a finite number of steps. The thesis work is organized as follows. 
Chapter 1 highlights an importance of predictive models and names several examples where these models can be implemented. The main focus of Chapter 2 is to describe all well-known and already existing in the theory methods that attempted to solve the aforementioned problems, along with their shortcomings and disadvantages. Chapter 3 proposes STEPWISE algorithm and introduces the model setup and its detailed description, followed by its theoretical properties and proof of theorems and lemmas used throughout the thesis. Additional lemmas used to construct the theory of the STEPWISE method are also stated. Later it presents results obtained from various numerical studies such as simulations and real data analysis. Simulation studies comprise seven examples and are aimed to compare STEPWISE algorithm to other competing methods, and provide numerical evidence of its superiority. Real data analysis involves studies of gene regulation in the mammalian eye, esophageal squamous cell carcinoma, and neurobehavioral impairment from total sleep deprivation, and demonstrates the utility of the proposed method in real life scenarios. Chapter 4 proposes a multi-stage hybrid machine learning ensemble method that is aimed to enhance STEPWISE’s performance. It also introduces a web application that employs the method. Finally, Chapter 5 completes the thesis with final conclusion and discussions. Appendices include some tables and figures used throughout the thesis. ACKNOWLEDGEMENTS First and foremost, it is my genuine pleasure to express my sincere gratitude and the deepest appreciation to my academic advisor at Michigan State University, Dr. Hyokyoung Hong. This dis- sertation would not have been possible without her dedication, advice, continuous encouragement, priceless support, and persistent help. I have been extremely fortunate to have Dr. Hong as my advisor as her insistence on perfection and careful supervision have given me a lot of confidence and inspiration to achieve my goals. I would like to thank our collaborators, Dr. Qi Zheng and Dr. Yi Li, for their significant contribution to the work presented in this thesis. My sincere thanks also go to my committee members, Dr. Taps Maiti, Dr. Lyudmila Sakhanenko, and Dr. Chenxi Li, for generously giving their time to offer me valuable and insightful comments toward improving my work. Finally, I would like to express special thanks to my friends and family for overwhelming love and support, for being by my side when I was going through hard times, and for celebrating each accomplishment. Through the struggles and challenges of writing this dissertation, you have been a constant source of joy and motivation. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Predictive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Gene Regulation in the Mammalian Eye . . . . . . . . . . . . . . . . . . . 4 1.1.2 An Esophageal Squamous Cell Carcinoma Study . . . . . . . . . . . . . . 6 1.1.3 Bladder Cancer Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.1.4 Neurobehavioral Impairment from Total Sleep Deprivation . . . . . . . . . 10 CHAPTER 2 LITERATURE REVIEW . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 12 2.1 Sequential Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 Feature Selection using BIC criteria . . . . . . . . . . . . . . . . . . . . . 14 2.1.2 Forward Regression with High-Dimensional Predictors . . . . . . . . . . . 17 2.1.3 A Stepwise Regression Algorithm for High-Dimensional Variable Selection 18 2.1.4 Generalized Linear Models with High-Dimensional Predictors via For- ward Regression: offset approach . . . . . . . . . . . . . . . . . . . . . . 20 2.1.5 Cox Models with High-Dimensional Predictors via Forward Regression . . 22 2.1.6 A Stepwise Regression Method and Consistent Model Selection for High- Dimensional Sparse Linear Models . . . . . . . . . . . . . . . . . . . . . . 24 2.2 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 CHAPTER 3 STEPWISE METHOD: THEORY AND APPLICATIONS . . . . . . . . . . 30 3.1 Model Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Theoretical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 Proof of the Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.4 Additional Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.6 Applications: Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.6.1 A Study of Gene Regulation in the Mammalian Eye . . . . . . . . . . . . . 66 3.6.2 An Esophageal Squamous Cell Carcinoma Study . . . . . . . . . . . . . . 68 3.6.3 Neurobehavioral Impairment from Total Sleep Deprivation . . . . . . . . . 73 CHAPTER 4 MULTI-STAGE HYBRID MACHINE LEARNING METHOD . . . . . . . 76 4.1 Machine Learning Ensemble Methods: Categories and Types . . . . . . . . . . . . 76 4.2 A Review on Existing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.2.1 Random Forest (RF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.2.2 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . . . . . . 79 4.2.3 Gradient Boosting Machine (GBM) . . . . . . . . . . . . . . . . . . . . . 80 v 4.2.4 Artificial Neural Network (ANN) . . . . . . . . . . . . . . . . . . . . . . . 81 4.2.5 Least Absolute Shrinkage and Selection Operator (LASSO) . . . . . . . . . 83 4.2.6 STEPWISE Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3 Multi-Stage Hybrid Machine Learning Method . . . . . . . . . . . . . . . . . . . 85 4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.4 Application: Bladder Cancer Prediction . . . . . . . . . . . . . . . . . . . . . . . 88 4.4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.5 Web Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 CHAPTER 5 CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 APPENDIX A SUPPLEMENT MATERIALS . . . . . . . . . . . . . . . . . . . . 100 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 106 vi LIST OF TABLES Table 3.1: The values of 𝜂1 and 𝜂2 used in the simulation studies . . . . . . . . . . . . . . . 52 Table 3.2: Normal model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Table 3.3: Binomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Table 3.4: Poisson model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Table 3.5: Comparisons of MSPE between competing methods using the mammalian eye data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 3.6: Comparisons of competing methods over 100 independent splits of the ESCC data into training and testing sets . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Table 3.7: Comparisons of competing methods over 100 independent splits of the Total Sleep Deprivation data into training and testing sets . . . . . . . . . . . . . . . . 75 Table 4.1: Results of the 5-fold cross-validation procedure for the STEPWISE method . . . 90 Table 4.2: Assessment of the proposed STEPWISE procedure using the bladder cancer data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Table 4.3: Comparison of base-learner methods included in the multi-stage hybrid ma- chine learning model over 100 independent splits of the bladder cancer data into training and testing sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Table 4.4: Evaluation of the proposed multi-stage hybrid machine learning model with the bladder cancer data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Table 4.5: Comparison of various model configurations included in the sensitivity analysis . 93 Table A.1: Comparison of genes selected by each competing method from the mammalian eye data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Table A.2: Selected miRNAs for ESCC training dataset . . . . . . . . . . . . . . . . . . . . 101 Table A.3: Clinicopathologic characteristics of participants in bladder cancer study . . . . . 101 Table A.4: Clinicopathologic characteristics of study participants of the ESCC data set . . . 102 vii LIST OF FIGURES Figure 3.1: Box plot of model sizes for each method over 120 different training samples from the mammalian eye data set. STEPWISE was performed with 𝜂1 = 1 and 𝜂2 = 4, and FR and SC were conducted with 𝛾 = 1. . . . . . . . . . . . . . 67 Figure 3.2: Box plot of model sizes for each method based on 100 ESCC training datasets. Performance of STEPWISE is reported with 𝜂1 = 0 and 𝜂2 = 3.5. Performance of SC and FR are reported with 𝛾 = 0. . . . . . . . . . . . . . . . . . . . . . . 71 Figure 3.3: Comparisons of ROC curves for the selected models in the ESCC data set by the sequentially selected order. Model 1 includes Age and Gender feature, and the following features are sequnatially added to the model: miR-4783-3p, miR-320b, miR-1225-3p, miR-6789-5p. . . . . . . . . . . . . . . . . . . . . . . 72 Figure 3.4: Box plot of model sizes for each method based on 100 total sleep deprivation training datasets. Performance of STEPWISE is reported with 𝜂1 = 0.5 and 𝜂2 = 3. Performance of SC and FR are reported with 𝛾 = 0.5 . . . . . . . . . . 74 Figure A.1: R-Shiny Web Application for solving classification problems. The plot illus- trates uploading and splitting a dataset into training and testing sets. . . . . . . . 103 Figure A.2: R-Shiny Web Application for solving classification problems. 
The plot depicts a tuning parameters step for a Random Forest method. . . . . . . . . . . . . . . 104 Figure A.3: R-Shiny Web Application for solving classification problems. The plot depicts an output of the final predictive model developed by the web application. . . . . 105 viii LIST OF ALGORITHMS Algorithm 1 MULTI-STAGE HYBRID MACHINE LEARNING METHOD . . . . . . . . . 86 ix CHAPTER 1 INTRODUCTION 1.1 Predictive Models In biomedical research and clinical studies predictive model are utilized for several purposes such as risk management and prognosis. Consequently, the reliability of clinical data is directly related to the quality of predictive analysis. Nowadays, electronic health records became more available and contain rich information, which enables researchers to develop and deploy highly efficient clin- ical predictive methods. These methods have potential to be key components in making decisions related to patient treatments, drug development, and so on. Over the last decade, the technological advances and explosion of information profounded the un- derstanding of the molecular basis of tumor progression and identified numerous tumor biomarkers [7]. A certain type of biomarkers, which posses predictive power, are capable of assessing the benefit from clinical interventions and has a significant impact on clinical research. For instance, a cancer screening biomarker is a prognostic biomarker that can be used to predict the development of symptomatic cancer even in asymptomatic persons. In practice, such screening biomarkers can be used as a cancer prediction model [8]. The main purpose of building these models is discovering new cancer screening biomarkers and assessing their effect on the disease. Insights and information obtained from these models can potentially lead to early detection of disease in patients, early intervention and prevention from its further development. Further, these predictive models can be feasibly used as cancer screening tests for patients, including ones with no symptoms [9]. Technological advances have also made possible detailed genetic characterization of biological specimens. High-throughput genomic technologies, including gene expression microarray, mi- croRNA (micro Ribonucleic acid) array, RNA-seq, ChIP-seq (chromatin immunoprecipitation se- quencing), and whole genome sequencing, have become powerful tools and dramatically changed 1 the landscape of biological research. For instance, a gene expression profile can be extracted for a specimen by simultaneously evaluating expression levels of thousands of genes on that single specimen using complementary DNA (cDNA) microarray technology [10]. Single nucleotide polymorphisms (SNPs) are one form of natural sequence variation common to all genomes [11]. These SNPs are highly abundant, and are estimated to occur at 1 out of every 1,000 bases in the human genome [12, 13]. SNPs are particularly useful as DNA markers for mapping susceptibility genes for complex diseases and population genetics since they demonstrate the high density and mutational stability [14, 15]. SNPs in the coding regions of genes that alter the function or structure of the encoded proteins can be a necessary and sufficient cause of most of the known recessively or dominantly inherited mono- genic disorders [16], and are analyzed for diagnostic purposes. Moreover, SNPs can be analysed to assess the risk of an individual for a particular disease. 
For instance, the identification of SNPs made possible to screen somatic (non-tumor) DNA for mutations that alter treatment response or predispose to cancer [7]. In addition, a large number of profiles based on the abundance of micro-RNAs (miRNAs) have been used to predict prognosis or treatment response in cancer [7]. For example, Genome-wide associa- tion (GWA) have identified cancer-causing mutations in breast [17] and colon [18] tumors, somatic genetic screens can also identify predictors of radiation sensitivity [19] and the pharmacodynamics of anticancer drugs [20]. Moreover, sets of genes identified through mRNA profiling have been used to classify tumors into oncogenic subtypes of breast cancer [21, 22]; many individual miRNAs have been associated with patient survival and drug treatment response in a number of different cancers [23, 24]. Clearly, the enormous interest in genomic data is determined by the hope of finding candidate biomarkers and using them to identify genes that predispose individuals to common diseases. Al- though genome data analysis has already made a significant impact on biological and biomedical research, it is still accompanied by certain challenges that yet have to be overcome. Specifi- cally, complex genomic data introduce substantial challenges for statistical data analysis as its 2 high-dimensionality makes the classical statistical model framework no longer implementable. As opposed to low-dimensional data when the number of observations is greater than the number of explanatory features (also known as predictors), high- and ultrahigh-dimensional settings are comprised of data in which number of predictors is greater than or is in the exponential order of the sample size, respectively. Most of the traditional statistical methods are developed around the concept of low-dimensional data and are not aimed to accommodate high- or ultrahigh-dimensional data. Thus, high- dimensionality has significantly challenged traditional statistical theory. Applying these meth- ods to high-dimensional data leads to unstable, unreliable, biased, and inconsistent results which demolishes the main purpose of predictive model development. The problems that arise while analyzing such data are typically referred to as the ’curse of dimensionality’, a term introduced by mathematician Richard Bellman. Some aspects of it are discussed further. First, classical statistical models applied to high-dimensional data have no unique solution for their parameters. In fact, these models will have infinitely many solutions. This is mainly induced by ill-defined, uninvertible, and singular matrices involved in the computation of parameter estimates, making the estimation process ill-posed. These models are also know as unidentifiable models. Consequently, effect size estimation in predictive models will become meaningless. Second, as the number of predictors increases and surpasses the number of observations in the model, variances of the parameter estimates will become large (even infinitely large in some cases), resulting in inflated standard errors. In other words, a wide range of values of parameter estimates will be consistent with data, making the confidence intervals uncommonly wide. Hence, validating a significance of the predictors included in the predictive model will be nearly impossible. 
Third, employment of classical statistical models in high-dimensional settings can provide incon- sistent estimates as a small corruption of data can result in very different estimated parameters. Furthermore, these models tend to capture the artificial trends of measurement noise, also know as, overfitting. Overfitted models fit training data too closely and normally capture trends in data that are applicable to this particular data set only. This decays their ability to generalized results with 3 new unseen data and results in poor predictive capability. Lastly, using classical statistical methods with high-dimensional data often introduces multi- collinearity issues that violate the underlying assumption of independent predictors in the model. Multicollinearity implies the existence of highly correlated predictors among predictor features. These structures are commonly observed in genomic data. Multicollinearity can create inaccurate estimates of the model parameters; make insignificant predictors significant and vice versa, that is, imposing false positives and false negatives in the predictive model; and, finally, it can degrade the predictability of the model. As it was shown, the traditional methods that perform well in low-dimensional settings run into severe problems in analyzing high- or ultrahigh-dimensional data. They cannot cope with the explosive growth of dimensionality of data. Therefore, in order to face the problem of high- dimensionality, we must reshape the classical statistical thinking. These problems create signifi- cant challenges, but, on the other hand, they create great opportunities for the development of new statistical methodologies. It is worth mentioning that developing predictive models along with feature selection and estima- tion play crucial and fundamental role in knowledge discovery. As more amount of massive and complex data become available, there is no doubt that high-dimensional data analysis will be one of the most important and demanding research topics in our field. In this thesis we propose a new method (introduced and described in Chapter 3) for model selection and estimation that will overcome aforementioned limitations in high-dimensional problems. The remaining sections of this chapter discuss a few problems from various research areas that will illustrate challenges of high-dimensional data and to which the proposed method could be applied. 1.1.1 Gene Regulation in the Mammalian Eye Human genetics has sparked a revolution in medical research on the basis of the seemingly unthink- able notion that one can systematically discover the genes causing inherited diseases without any prior biological knowledge as to how they function [25]. Most characteristics of medical pertinence 4 do not follow simple Mendelian monogenic inheritance. Such complex traits include vulnerability to heart disease, hypertension, diabetes, cancer, and infection. The genetic dissection of complex traits is attracting many investigators with the promise of solving old problems and is generating a variety of analytical methods. Recent advancement in microarray technology and bioinformatics made it possible to examine the expression of numerous genes in a large number of individuals and enabled researcher to identify genetic elements that cause the gene expression to vary among individuals [26, 27, 28, 29]. Dis- covering specific disease mechanics is a big challenge that biomedical researchers face nowadays. 
These mechanics might potentially underlie heritable disorders that reveal complex inheritance, for instance, Mendelian disorders [25, 30, 31]. In addition, these approaches can help identify genes related to development of Mendelian forms of complex diseases such as obesity [32, 33, 34], macular disease [35, 36, 37], hypertension [38], and glaucoma [39, 40]. Mutations that alter gene expression might play a significant role in complex disease. Transgenic animal studies revealed that gene dosage of mutant genes can have a keen effect on phenotype [41]. It was shown that the cause of disease can become an improper regulation of structurally normal genes and alterations in gene dosage [39]. For example, overexpression and haploinsufficiency of the FOXC1 gene can lead to developmental defects of the anterior chamber of the eye [39]. Scheetz et al. [41] used expression quantitative trait locus mapping in the laboratory rat to gain a broad perspective of gene regulation in the mammalian eye and to identify genetic variation relevant to human eye disease. They analyzed data obtained from Rat Genome Database by using analy- sis of variance (ANOVA) technique and identified significant genes based on their corresponding p-values. Certainly, Scheetz’s results provide meaningful insights on how genetic variation can be associated with specific diseases, but they do not estimate the magnitude of effects these genes are having on the disease. In addition, they have not built a predictive model that will enable researchers to link genes and assess their contribution toward developing diseases, and have not evaluate its predictive power. We adopted their data and aimed to improve results achieved by Scheetz et al. [41]. Data contained 120 observation profiling 31042 probes of genes, but due to a 5 small variation in many of these probes, the number of probes was reduced to 5000. A gene TRIM32 that has been found to cause Bardet-Biedl syndrome [42] was treated a response variable, and the expression of 5000 genes as the predictors. Our predictive model has identified three probes of genes (1376747_at, 1381902_at, 1382673_at) that can be potentially linked to TRIM32. We achieved a high accuracy with the mean squared prediction error (MSPE) as low as 0.0012. Detailed description of the results can be found in Chapter 3. 1.1.2 An Esophageal Squamous Cell Carcinoma Study Esophageal cancer is the 7th most common cancer among males and among both sexes combined in the world and ranks 6th in terms of mortality overall because of the poor survival rate it confers [43, 44]. Additionaly, incidence and mortality rates in males are 2- to 3-fold higher than the rates in females [43]. Compared with more developed geographic regions, overall incidence rates are 2-fold higher in less-developed countries, with the highest rates occurring in Asia [43]. Esophageal squamous cell carcinoma (ESCC) is the predominant histologic type with the highest incidence rate in populations within Southeastern and Central Asia [44]. There are two major histological types of esophageal carcinoma: esophageal squamous cell carcinoma (ESCC) and adenocarcinoma [45]. ESCC is the major type in China, where it accounts for more than 90% of cases of esophageal carcinoma; whereas adenocarcinoma is more common in the United States and in European countries [46]. ESCC is often diagnosed at a locally advanced stage and the outcomes for affected patients are poor [45]. 
With various treatment methods employed in clinical practice after extensive research, the diagno- sis and treatment of ESCC have been greatly improved [47]. Esophagectomy, chemotherapy, and radiotherapy are currently the main treatments for ESCC [45]. However, the prognosis remains poor, with 5-year survival proportions of 21% and 14% (2005–2011) in the United States for whites and blacks, respectively, and 12% (2000–2007) in Europe [44], which is far below the estimated effectiveness of the therapy [47]. An accurate clinical staging and prognostic information is essential to direct appropriate treatment 6 strategies [45]. Accumulating evidence suggests that the prognosis is affected by several factors, including the delayed diagnosis, high recurrence, and metastasis rate [47, 48]. Thus, identifying the diagnostic and prognostic tumor markers and further elucidating their clinical implications are urgently needed. To develop new diagnostic methods and treatment strategies, investigators have focused on the potential of a particular class of microRNAs (miRNAs) to provide additional information about the characteristics and survival prospects of patients with ESCC. miRNAs are small (22-24 nucleotides), noncoding RNA molecules that play important roles in regulating cell differentiation, proliferation, migration and apoptosis [44]. Altered miRNA expres- sion in cancer tissue has been reported in most tumor types [49, 50]. There is increasing evidence that miRNA expression in cancer tissue is a useful prognostic marker [51, 52, 53]. In addition, the application of miRNA expression levels as a blood biomarker has been explored in various types of cancer, including gastric, hepatocellular, and non-small cell lung cancer [54, 55, 56]. However, whether miRNA levels in plasma are a useful biomarker for patients with ESCC remains largely unexplored [45]. Sudo et al. [57] explored ways of developing a detection model for ESCC based on large-scale miRNA profiling. For these purposes, they analyzed data submitted to the National Center for Biotechnology Gene Expression Omnibus (NCBI GEO) database, available under accession num- ber GSE122497. To establish a diagnostic model, they developed a model based on the observations obtained from 566 patients (283 with ESCC and 283 healthy controls) profiling 2565 miRNAs by carrying out Fisher’s linear discriminant analysis with a greedy algorithm. Although their model has achieved high predictive accuracy, it has some drawbacks. Given the nature of the algorithm that has been employed, they developed a predictive model for the pre- determined model size: they built models with model sizes ranging 2-8 and selected the one that achieved higher accuracy with fewer variables (model size = 6). The disadvantage of this method is that it might lead to false negatives and false positive in the final model. Moreover, the importance of the miRNAs included in the model will also be determined by the model size. This might lead to a wrong assessment of the effect sizes identified in the model. 7 We adopted this dataset and demonstrated the utility of our proposed method (introduced and de- scribed in Chapter 3) and its superiority over other methods. Our model achieved similar accuracy by recruiting fewer variables (3 miRNAs were selected: miR - 4783 - 3p, miR - 320b, miR - 1225 - 3p). It is worth mentioning that our model overcomes the issues associated with the model introduced by Sudo et al. 
as our model size was defined by scanning the entire feature space and selecting features based on their significance. Detailed description of our results and methodology is presented in Chapter 3. 1.1.3 Bladder Cancer Study Bladder cancer is any of several types of cancer arising from the tissues of the urinary bladder and has high prevalence and recurrence rates [58, 59, 60]. According to American Society of Clinical Oncology, among men bladder cancer is the fourth most common cancer and men are 4 times more likely to be diagnosed with the disease. In addition, incidence in white men is twice more than that in black men. The earlier bladder cancer is found, the better the chance for successful treatment and cure. Prog- nosis varies inversely with higher tumor stage and lymph node involvement [61]. Typically, the 5- and 10-year survival rates for patients with lymph node involvement are 31% and 23%, respectively [62]. Combination platinum-based chemotherapy is an potion for patients with metastatic disease, but the survival is only 15 months, with a 5-year survival rate of 15% [63]. Since there is not yet an accurate test to screen the general population for bladder cancer, most people are diagnosed with bladder cancer once they have developed symptoms. As a result, some people have more advanced (later stage) disease when the cancer is found. In the year 2000, the total expenditure for lower tract urothelial cancers in the United States sur- passed one billion dollars [64]. Bladder cancer affected about 1.6 million people globally in 2020 with 549,000 new cases and 200,000 deaths, and the late stage disease is associated with poor survival. The cost of bladder cancer per patient from diagnosis to death is the highest of all cancers [65]. 8 Identifying the related biomarkers and predicting the disease at its early stage is crucial for better prognosis. Discovery of diagnostic, prognostic, and predictive biomarkers in bladder cancer made molecular markers an area of research. Potential biomarkers include the overexpression of mu- tated genes, whole genome-wide array signatures, and microRNAs. For instance, microarray gene expression profiling is studied in the blood of cancer patients in order to detect gene expression patterns representing the cancer itself or a host’s reaction to the tumor [66, 67]. Recent studies have suggested that 70% of bladder cancer involve a specific mutation in genes [68], therefore it can be potentially used as a biomarker in early detections of the disease and preventing it from further development. Gene changes can also assist doctors in choosing the best treatment possible or be useful in finding bladder cancers that can potentially come back after treatment. Although recent progress made by scientists is significant, classification of bladder cancer patients using gene expression data with regular statistical tools can become complicated and sometimes be even impossible due to incapability of these methods to process data with a large scale, also known as, high-dimensional data. Usuba et al. [69] attempted to develop a predictive model for an early detection in bladder cancer. They applied similar technique as described in Sudo et al. [57] via Fisher’s linear discriminant analysis and recruited seven miRNAs in their final model. They achieved high accuracy in predic- tion, but their model suffered from the same issues mentioned in Sudo et al. [57]. 
Specifically, the method they employed might lead to false negatives and false positives in the final model, and won’t be able to assess the effect sizes correctly due to pre-determined model size. We aimed to improve given results and adopted data utilized in Usuba’s model. These data were submitted to the NCBI GEO under accession number GSE113486. The predictive model was built based on observations obtained from 768 patients (310 patients with bladder cancer and 468 healthy controls) profiling 2565 miRNAs. We demonstrated that our proposed multi-stage hybrid machine learning method (introduced and described in Chapter 4) has achieved high prediction accuracy with sensitivity, specificity, and area under the receiver operating curve (AUC) of 0.98, 0.98, and 0.99, respectively, and outperformed 9 Usuba’s model. Detailed description of our results can be found in Chapter 4. 1.1.4 Neurobehavioral Impairment from Total Sleep Deprivation Sleep plays a key role in health, performance, and cognition [70]. Sleep deprivation is common- place in modern society and its effects on neurobehavioral function (e.g., vigilance and cognition) are well studied and documented. Sleep deprivation can induce giddiness, child-like behaviors, and silliness [71], as well as more widely recognized negative effects including dysphoria, increased irritability, and lowered frustration tolerance. The increased irritability that often accompanies sleep deprivation hints that sleep-deprived in- dividuals are highly reactive to emotional signals. These effects on mood can lead to negative consequences and impact functioning abilities [72]. For example, sleep duration is inversely asso- ciated with interpersonal difficulties and even violence has been observed in medical residents [73], and sleeping less than 8 hours is associated with increased risk for adolescent suicidal behavior [74]. Interestingly, alertness and vigilance also appear to be the cognitive capacities most consistently and dramatically impacted by insufficient sleep [75]. When the envelope of continuous wakeful- ness is pushed beyond about 16 hours, most individuals begin to show a substantial slowing of reaction time (RT) and worsening of performance accuracy on tests of psychomotor vigilance [76]. Moreover, neurobehavioral tests have revealed assorted forms of performance deficits from sleep loss, including impairment of learning and of responses to feedback in decision making [77]. The Psychomotor Vigilance Test (PVT) is one of the most commonly applied neurobehavioral assays of performance impairment due to sleep loss [75]. This test assays stimulus-response time, with failure to respond within 500 ms recorded as a lapse. Sleep deprivation is associated with increased variability in stimulus-response times, and more lapses, on the PVT [78]. Besides neurobehavioral testing, efforts have been made to identify molecular biomarkers such as differentially expressed genes or metabolites affected by sleep loss [79, 80]. A biomarker has been defined as “a characteristic that is objectively measured and evaluated as an indicator of normal 10 biological processes, pathogenic processes, or pharmacologic responses to a therapeutic interven- tion” [81]. Many biomarkers such as differentially expressed genes can provide meaningful insights including identification of a process or response. 
Humans are known to differ in their sensitivity to sleep loss [82], and recent work has sought to identify biomarkers distinguishing individuals as susceptible or resistant to sleep deprivation [83]. Yet surprisingly little effort has been made to research molecular biomarkers with results from neurobehavioral assays. Microarrays and bioinformatics analyses can be employed to explore candidate gene expression biomarkers associated with total sleep deprivation (TSD), and more specifically, the phenotype of neurobehavioral impairment from TSD. Uyhelji et al. [70] explored gene expression biomarker candidates for neurobehavioral impairment from total sleep deprivation. They employed Weighted Gene Co-expression Network Analysis (WGCNA) using data obtained from the NCBI GEO under accession number GSE98582. Data contain 555 samples profiling 8284 gene features. In the treatment effect analysis, they identified 212 genes that exhibited a significant difference between TSD and control group, and 91 of them passed human blood biomarker filter. Although Uyhelji et al. have done a great job in identifying important gene biomarkers associated with TSD, effect sizes of these genes have not been estimated. Moreover, neither predictive power of the diagnostic model was assessed. Thus, we employed our proposed method (discussed in Chapter 3) to overcome the issues introduced in Uyhelji’s model. We have built a model based on 389 observations profiling 8284 gene features. Our model recruited five genes (PF4V1, USP32P1, EMR1, NBR2, and DUSP23) and achieved high accuracy with sensitivity, specificity, and AUC of 0.99, 0.97, and 0.99, respectively. 11 CHAPTER 2 LITERATURE REVIEW When the number of predictors is moderate, penalized regression approaches such as least absolute shrinkage and selection operator (LASSO) by Tibshirani [84] have been used to construct predictive models and explain the impacts of the selected predictors. LASSO minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. However, in ultrahigh-dimensional settings where a number of predictors 𝑝 is in the exponential order of the sample size 𝑛, penalized methods may incur computational challenges [85], may not reach globally optimal solutions, and often generate biased estimates [86]. Sure independence screening (SIS) proposed by Fan and Lv [87] has emerged as a powerful tool for modeling ultrahigh dimensional data. This method is based on correlation learning, which filters out the features that have weak correlation with the response and reduces dimensionality from high to a moderate that is below the sample size. Specifically, such correlation learning ranks the importance of features according to their marginal correlation with the response variable and eliminates the ones with weak marginal correlations. However, the method relies on a partial faithfulness assumption, which stipulates that jointly im- portant variables must be marginally important, an assumption that may not be always realistic. To relieve this condition, some iterative procedures, such as ISIS [87], have been adopted to repeatedly screen variables based on the residuals from the previous iterations, but with heavy computation and unclear theoretical properties. Conditional screening approaches (see, e.g. [88]) have, to some extent, addressed the challenge. 
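To make the marginal screening idea just described concrete, here is a minimal Python sketch of SIS-style correlation ranking; it is an illustration only, not the implementation used in the cited work, and the helper name sis_screen, the n/log(n) retention size (a common rule of thumb), and the toy data are my own assumptions.

```python
import numpy as np

def sis_screen(X, y, keep=None):
    """Marginal (sure independence) screening: rank predictors by absolute
    sample correlation with the response and retain only the strongest ones.
    The default retained-set size follows the common n/log(n) rule of thumb."""
    n, p = X.shape
    keep = keep or int(n / np.log(n))
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
    yc = (y - y.mean()) / y.std()               # standardized response
    corr = np.abs(Xc.T @ yc) / n                # marginal correlations
    return np.argsort(corr)[::-1][:keep]        # indices of top predictors

# Toy example: p = 2000 predictors, n = 100 observations, three true signals.
rng = np.random.default_rng(5)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 3 * X[:, 10] - 2 * X[:, 500] + 1.5 * X[:, 1500] + rng.standard_normal(n)
kept = sis_screen(X, y)
print(len(kept), "predictors retained;",
      {10, 500, 1500} <= set(kept.tolist()))   # were the signals retained?
```

After screening, a moderate number of predictors remains, and a conventional selection or penalization step can be applied to them.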
However, screening methods do not directly generate a final model, and post-screening regularization methods, such as LASSO, are recommended by Fan and Lv [87] to produce a final model. Closely related to forward selection is least angle regression (LARS) by Efron et al. [89], a widely 12 used model selection algorithm for high-dimensional models. In the LARS method, a multivariate solution path is defined by using the geometrical theory of the linear regression model. The result- ing method defines a continuous solution path for Generalized Linear Models (GLMs), with on the extreme of the path the maximum likelihood estimate of the coefficient vector and on the other side the intercept-only estimate. The LARS method is based on a recursive procedure selecting, at each step, the covariates having largest absolute correlation with the response feature [89]. It is worth mentioning that a simple modification of the LARS algorithm implements the LASSO method and calculates all possible LASSO estimates for a given problem. In addition, an approx- imation for the degrees of freedom of a LARS estimate is available, from which Mallows’s 𝐶 𝑝 estimate of prediction error can be derived; this allows a principled choice among the range of possible LARS estimates. Though LARS achieves impressive results in its performance, some researches raised concerns in the following regards. Ishwaran (see discussion section in Efron et al. [89]) suggests that the use of 𝐶 𝑝 coupled with LARS forward optimization procedure might raise some potential flags. Specifically, the use of 𝐶 𝑝 will encourage large models in LARS, especially in high-dimensional orthogonal problems, and will have a negative impact on variable selection performance. The claim was supported with the high-dimensional simulation examples. Moreover, Weisberg (see discussion section in Efron et al. [89]) believes that multicollinearity problem among independent features and presence of noise in the dependent variable will affect the performance of LARS in regards of variable selection, specifically, reducing chances of selecting significant variables in the model. Examples supporting the claim were provided. In the GLM setting, Augugliaro et al. [90] and Pazira et al. [91] proposed differential geometrical LARS (dgLARS) based on a differential geometrical extension of LARS. The dgLARS estimator follows naturally from a differential geometric interpretation of a GLM, generalizing the LARS method. The subsequent section discusses sequential model selection techniques known to the literature. Sequential model selection assumes including features into the final model sequentially with the 13 entry order determined by their relative importance based on certain criteria. Although the methods described in this section have achieved significant results and have been implemented in different scenarios, lack of certain properties make them less reliable and more vulnerable against challenges introduced by the size and volume of data available nowadays. 2.1 Sequential Model Selection For generating a final predictive model in ultrahigh-dimensional settings, recent years have seen a surging interest of performing forward regression, an old technique for model selection that has been widely used for model building when the number of covariates is relatively low. But due to its complicated computations and unknown theoretical properties, forward regression technique is rarely used in high-dimensional settings. 
Under some regularity conditions and with proper stopping criteria, forward regression can achieve screening consistency, selecting variables sequentially according to metrics such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or $R^2$. The methods listed below attempt to overcome the limitations of forward regression and adapt it to models with high-dimensional predictors.

2.1.1 Feature Selection using BIC criteria

Variable selection with ultrahigh-dimensional predictors is a problem of fundamental importance. The traditional method of best subset selection is computationally infeasible for high-dimensional data, and as a result various shrinkage methods have gained popularity. These methods can be formulated as penalized optimization problems and can be selection consistent when the sample size is much larger than the predictor dimension. However, when the predictor dimension is much larger than the sample size, the story changes drastically. One frequently used assumption is the so-called sparsity condition, which assumes that the effective contribution to the dependent variable rests on a much smaller number of regressors than the sample size. The challenge then is to find those "true" regressors among a much larger number of candidate variables. This has led to a surging interest in new methods and theory for regression model selection with $p \gg n$.

An et al. [92] revisited the classical forward and backward stepwise regression methods for model selection and adapted them to the case where the number of candidate variables $p$ exceeds the number of observations $n$. In the noiseless case, they gave definite upper bounds on the number of forward search steps needed to recover all relevant variables, provided each step of the forward search is approximately optimal, up to a fraction, in reducing the residual sum of squares. In the presence of noise, they proposed two information criteria, BICP (BIC modified for a large number of predictors) and BICC (BIC with an added constant), that overcome the difficulties of employing the regular BIC and AIC. These criteria serve as stopping rules in the stepwise search: BICP increases the penalty to prevent overfitting, while BICC controls the residuals in the sense that it stops the search before the residuals diminish to 0 as the number of selected variables increases to $n$. In addition, they proved that BICP stops the forward search as soon as all relevant variables are recovered and that all extra variables are removed in the backward deletion, which leads to selection consistency of the estimated models.

The algorithm can be summarized as follows. Consider a linear regression model
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad (2.1)$$
where $\mathbf{y} = (y_1, \ldots, y_n)^{\mathrm{T}}$ is an $n$-vector of random responses, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_p)$ is an $n \times p$ design matrix, $\boldsymbol{\beta}$ is a $p$-vector of regression coefficients, and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^{\mathrm{T}} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, where $\sigma^2 > 0$ is an unknown but fixed constant and $\mathbf{I}$ denotes an identity matrix. Let $I_n = \{1 \le i \le p : \beta_{n,i} \ne 0\}$ and let $d_n = |I_n|$ denote the number of elements in $I_n$. Further, for any subset $J \subset \{1, \ldots, p\}$, let $\mathbf{X}_J$ denote the $n \times |J|$ matrix consisting of the columns of $\mathbf{X}$ corresponding to the indices in $J$, and let $\boldsymbol{\beta}_J$ be the $|J|$-vector consisting of the components of $\boldsymbol{\beta}$ corresponding to the indices in $J$.
Put
$$\mathbf{P}_J = \mathbf{X}_J(\mathbf{X}_J^{\mathrm{T}}\mathbf{X}_J)^{-}\mathbf{X}_J^{\mathrm{T}}, \qquad \mathbf{P}_J^{\perp} = \mathbf{I}_n - \mathbf{P}_J, \qquad L_{u,v}(J) = u^{\mathrm{T}}\mathbf{P}_J^{\perp}v, \quad u, v \in \mathbb{R}^n, \qquad (2.2)$$
where $\mathbf{P}_J$ is the projection matrix onto the linear space spanned by the columns of $\mathbf{X}_J$, and $L_{y,y}(J)$ is the sum of squared residuals resulting from the least squares fit $\hat{\mathbf{y}} = \mathbf{X}_J\hat{\boldsymbol{\beta}}_J = \mathbf{P}_J\mathbf{y}$. The algorithm combines standard stepwise addition and deletion with adjusted information criteria and can be described in the following steps.

Stage I - Forward Addition:
1. Let $J_1 = \{j_1\}$, where $j_1 = \arg\min_{1 \le i \le p} L_{y,y}(\{i\})$. Put $\mathrm{BICP}_1 = \log\{L_{y,y}(J_1)/n\} + 2\log p/n$.
2. Continue with $k = 1, 2, 3, \ldots$, provided $\mathrm{BICP}_k < \mathrm{BICP}_{k-1}$, where $\mathrm{BICP}_k = \log\{L_{y,y}(J_k)/n\} + 2k\log p/n$ and $J_k = J_{k-1} \cup \{j_k\}$.
3. When $\mathrm{BICP}_k \ge \mathrm{BICP}_{k-1}$, let $\tilde{k} = k - 1$ and $\hat{I}_{n,1} = J_{\tilde{k}}$.

Stage II - Backward Deletion:
1. Let $\mathrm{BICP}^*_{\tilde{k}} = \mathrm{BICP}_{\tilde{k}}$ and $J^*_{\tilde{k}} = \hat{I}_{n,1}$.
2. Continue with $k = \tilde{k} - 1, \tilde{k} - 2, \ldots$, provided $\mathrm{BICP}^*_k \le \mathrm{BICP}^*_{k+1}$, where $\mathrm{BICP}^*_k = \log\{L_{y,y}(J^*_k)/n\} + 2k\log p/n$ and $J^*_k = J^*_{k+1} \setminus \{j_k\}$.
3. When $\mathrm{BICP}^*_k > \mathrm{BICP}^*_{k+1}$, let $\tilde{k} = k + 1$ and $\hat{I}_{n,2} = J^*_{\tilde{k}}$.

The drawback of the method is that it is unclear whether its results carry over to high-dimensional GLMs.
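To make the two-stage search concrete, the following is a minimal Python sketch of forward addition with the BICP stop followed by backward deletion. It is an illustration under stated assumptions, not An et al.'s implementation: the function names, the toy data, and the simplified backward rule (re-examining all current variables each pass) are my own choices.

```python
import numpy as np

def rss(y, X, J):
    """Residual sum of squares L_{y,y}(J) from least squares on the columns in J."""
    if not J:
        return float(y @ y)
    XJ = X[:, sorted(J)]
    beta, *_ = np.linalg.lstsq(XJ, y, rcond=None)
    resid = y - XJ @ beta
    return float(resid @ resid)

def bicp(y, X, J):
    """BICP = log{L_{y,y}(J)/n} + 2|J| log(p)/n, as reconstructed above."""
    n, p = X.shape
    return np.log(rss(y, X, J) / n) + 2 * len(J) * np.log(p) / n

def stepwise_bicp(y, X):
    n, p = X.shape
    # Stage I: forward addition, stop when BICP no longer decreases.
    J, crit = set(), np.inf
    while len(J) < min(n - 1, p):
        cand = min((j for j in range(p) if j not in J),
                   key=lambda j: rss(y, X, J | {j}))
        new_crit = bicp(y, X, J | {cand})
        if new_crit >= crit:
            break
        J, crit = J | {cand}, new_crit
    # Stage II: backward deletion, drop variables while BICP does not increase.
    improved = True
    while improved and len(J) > 1:
        improved = False
        drop = min(J, key=lambda j: bicp(y, X, J - {j}))
        if bicp(y, X, J - {drop}) <= crit:
            J, crit = J - {drop}, bicp(y, X, J - {drop})
            improved = True
    return sorted(J)

# Toy example: n = 50, p = 200, true model uses the first three predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + 0.1 * rng.standard_normal(50)
print(stepwise_bicp(y, X))
```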
2.1.2 Forward Regression with High-Dimensional Predictors

Consider, for example, those useful methods with non-convex objective functions (e.g., bridge regression, the SCAD, etc.). With the predictor dimension much larger than the sample size, how to optimize such non-convex objective functions computationally remains a nontrivial task. Efficient algorithms (e.g., LARS) do exist for LASSO-type methods, where the objective functions are strictly convex; however, those methods are not selection consistent under a general design condition. Another reasonable solution is variable screening, such as the popular yet classical forward regression. Motivated by the SIS method [87], Wang [93] proposed a Forward Regression (FR) method for ultrahigh-dimensional variable selection. It was shown that the FR method can identify all relevant predictors consistently, even if the number of predictors is significantly larger than the sample size. In particular, FR is capable of discovering all relevant predictors within a finite number of steps, provided the dimension of the true model is finite. To select the final model from the set of candidate models, Wang [93] makes use of the BIC criterion introduced by Chen and Chen [94]. The resulting model can then serve as an excellent starting point from which many existing variable selection methods can be applied directly.

The FR algorithm can be summarized as follows. Suppose $(\mathbf{X}_i, Y_i)$ are observations from the $i$th subject ($1 \le i \le n$), $Y_i \in \mathbb{R}^1$ is the response variable, and $\mathbf{X}_i = (X_{i1}, \ldots, X_{id})^{\mathrm{T}} \in \mathbb{R}^d$ is an ultrahigh-dimensional predictor with $d \gg n$. The response and predictors are linked as $Y_i = \mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta} + \varepsilon_i$, where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)^{\mathrm{T}} \in \mathbb{R}^d$ and $\varepsilon_i$ is random noise. Let $\mathcal{M} = \{j_1, \ldots, j_{d^*}\}$ denote an arbitrary model with $X_{j_1}, \ldots, X_{j_{d^*}}$ as relevant predictors. The full model is defined as $\boldsymbol{\Gamma} = \{1, \ldots, d\}$ and the true model as $\boldsymbol{\tau} = \{j : \beta_j \ne 0\}$. Additionally, $\mathbf{X}_i^{(\mathcal{M})} = \{X_{ij} : j \in \mathcal{M}\}$ denotes the subvector of $\mathbf{X}_i$ corresponding to $\mathcal{M}$, $\mathbf{Y} = (Y_1, \ldots, Y_n)^{\mathrm{T}} \in \mathbb{R}^n$ is the response vector, and $\boldsymbol{\xi}_{\mathcal{M}} = (\mathbf{X}_1^{(\mathcal{M})}, \ldots, \mathbf{X}_n^{(\mathcal{M})})^{\mathrm{T}}$ is the sub-design matrix corresponding to $\mathcal{M}$. The FR algorithm is implemented in three major steps:

1. (Initialization) It starts by setting $S^{(0)} = \emptyset$.
2. (Forward Regression) At the $k$th step, $S^{(k-1)}$ is given. For every $j \in \boldsymbol{\Gamma} \setminus S^{(k-1)}$, it constructs a candidate model $\mathcal{M}_j^{(k-1)} = S^{(k-1)} \cup \{j\}$ and computes $\mathrm{RSS}_j^{(k-1)} = \mathbf{Y}^{\mathrm{T}}\{\mathbf{I} - \hat{\mathbf{H}}_j^{(k-1)}\}\mathbf{Y}$, where $\hat{\mathbf{H}}_j^{(k-1)} = \boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}\bigl(\boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}^{\mathrm{T}}\boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}\bigr)^{-1}\boldsymbol{\xi}_{\mathcal{M}_j^{(k-1)}}^{\mathrm{T}}$ is a projection matrix. It finds $a_k = \arg\min_{j \in \boldsymbol{\Gamma} \setminus S^{(k-1)}} \mathrm{RSS}_j^{(k-1)}$ and updates $S^{(k)} = S^{(k-1)} \cup \{a_k\}$ accordingly.
3. (Solution Path) The FR algorithm iterates step 2 a total of $n$ times, generating $n$ nested candidate models, and collects them in a solution path $\mathbb{S} = \{S^{(k)} : 1 \le k \le n\}$ with $S^{(k)} = \{a_1, \ldots, a_k\}$.

The authors showed both theoretically and numerically that FR can discover all relevant predictors consistently, even if the predictor dimension is substantially larger than the sample size. However, the proposed method is limited to linear regression models in high-dimensional settings.
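The FR path, together with a BIC-type choice along it, can be sketched in a few lines of Python. This is only an illustration: the particular EBIC form, the gamma value, the ten-step cap, and the toy data are my own assumptions rather than the exact criterion and settings used by Wang [93].

```python
import numpy as np

def forward_regression_path(y, X, n_steps=None):
    """Forward regression solution path: at each step add the predictor that
    minimizes the residual sum of squares of the enlarged model."""
    n, d = X.shape
    n_steps = n_steps or min(n - 1, d)
    S, path = [], []
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(d):
            if j in S:
                continue
            XM = X[:, S + [j]]
            beta, *_ = np.linalg.lstsq(XM, y, rcond=None)
            r = float(np.sum((y - XM @ beta) ** 2))
            if r < best_rss:
                best_j, best_rss = j, r
        S = S + [best_j]
        path.append(list(S))
    return path

def ebic(y, X, S, gamma=1.0):
    """An extended-BIC-style score used here to pick a model on the path;
    this particular form is an illustration, not the thesis's exact formula."""
    n, d = X.shape
    XM = X[:, S]
    beta, *_ = np.linalg.lstsq(XM, y, rcond=None)
    r = float(np.sum((y - XM @ beta) ** 2))
    return n * np.log(r / n) + len(S) * (np.log(n) + 2 * gamma * np.log(d))

# Toy data: d >> n with three true signals.
rng = np.random.default_rng(1)
n, d = 60, 500
X = rng.standard_normal((n, d))
y = 2 * X[:, 5] - 1.5 * X[:, 17] + X[:, 42] + 0.2 * rng.standard_normal(n)
path = forward_regression_path(y, X, n_steps=10)
best = min(path, key=lambda S: ebic(y, X, S))
print("selected indices:", sorted(best))
```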
2.1.3 A Stepwise Regression Algorithm for High-Dimensional Variable Selection

Hwang et al. [95] proposed a stepwise regression algorithm with a simple stopping rule for identifying significant predictors and interactions among a huge number of variables in various statistical models. It improves on an earlier forward regression method, the paring-down variation (SPV) algorithm proposed by Hwang and Hu [96], which was limited to the analysis of the variation model for continuous responses and required independence between factor predictors. Like ordinary stepwise regression, the new algorithm includes a variable in the current model at each forward selection step if the test statistic of the enlarged model against the current model has the minimum p-value among all candidates and that p-value is smaller than a predetermined threshold. Instead of using conventional information criteria, the threshold is determined by a lower percentile of a beta distribution. The proposed stopping rule rests on the well-known properties that (1) the p-values of the test statistics are Unif(0, 1) distributed if the predictors are irrelevant to the response, and (2) the minimum of $m$ independent Unif(0, 1) random variables is approximately beta distributed with parameters 1 and $m$ [97, 98].

The algorithm can be summarized as follows. Suppose $\mathbf{Y}$ is an $n$-vector of responses and $\mathbf{X} = (X_1, \ldots, X_p)$ is an $n \times p$ design matrix of variables with $p \gg n$. Let $S$ be a subset of $\{0, 1, \ldots, p\}$ and denote by $\mathbf{X}_S$ the sub-matrix of $\mathbf{X}$ obtained by extracting the columns corresponding to the indices in $S$. In addition, let $M_S$ be the model relating the distribution of $\mathbf{Y}$ to the predictors $\mathbf{X}_S$ through a function of the linear predictor $\mathbf{X}_S\boldsymbol{\beta}_S$, with parameter vector $\boldsymbol{\beta}_S$ of size $|S|$. Finally, the vector of residuals from the fitted model $M_S$ is denoted by $\mathbf{R}_S$. The algorithm then proceeds in the following steps.

Forward Selection:
Step 1: Start with the null model $M_S$, where $S = \emptyset$.
Step 2: Let the step count be $l = |S| + 1$.
Step 3: Calculate the correlation between $\mathbf{R}_S$ and $X_j$, denoted $r_j$, for $j = 1, \ldots, p$. Set $D = \{1 \le j \le p : |r_j| \text{ is among the first } d \text{ largest of all}\}$, where $d = [n/\log\{n\}]$.
Step 4: Test the difference in goodness-of-fit between each $M_{S \cup \{j\}}$ and $M_S$ for all $j \in D \setminus S$.
Step 5: Replace $S$ with $S \cup \{j\}$ when the test statistic of $M_{S \cup \{j\}}$ against $M_S$ has the minimum p-value, denoted $p_l$, among the $|D \setminus S|$ competing models.
Step 6: Stop forward selection and go to Step 7 when $p_h$ exceeds the 10th percentile of the Beta$(1, p - h + 1)$ distribution for $h = l, l - 1, \ldots, l - 9$; otherwise, go to Step 2.

Backward Selection:
Step 7: Set $\pi = \exp(\mu - z \times v)$, where $\mu$ and $v$ are the sample mean and standard deviation of $\log\{p_h\}$.
Step 8: Test the difference in goodness-of-fit between $M_S$ and $M_{S \setminus \{j\}}$ for each $j \in S$.
Step 9: Replace $S$ with $S \setminus \{j_S\}$ and go to Step 8 if the test statistic of $M_S$ against $M_{S \setminus \{j_S\}}$ has the largest p-value among all the reduced models and that p-value is larger than $\pi$; otherwise, stop and report the set of remaining predictors $\{X_j, j \in S\}$ as the final influential predictors.

The main drawback of this method is that it was not supported by theoretical properties on model selection.
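A minimal sketch of the forward stage of this rule is given below, assuming a Gaussian linear model and partial F-tests. It only illustrates the Beta(1, m) percentile threshold: it tests all remaining candidates rather than the top-d correlation screen of Step 3, and it omits the ten-step look-back of Step 6 and the backward stage. The function names and toy data are my own, not Hwang et al.'s code.

```python
import numpy as np
from scipy import stats

def beta_threshold(m, q=0.10):
    """q-th percentile of Beta(1, m): the stopping threshold for the minimum
    p-value among m candidate tests."""
    return stats.beta.ppf(q, 1, m)

def forward_step(y, X, selected, candidates):
    """One forward step: for each candidate j, compare the enlarged linear
    model with the current one via a partial F-test; return the best (j, p)."""
    n = len(y)
    Xs = np.column_stack([np.ones(n)] + [X[:, k] for k in selected])
    rss0 = float(np.sum((y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]) ** 2))
    best = (None, 1.0)
    for j in candidates:
        Xj = np.column_stack([Xs, X[:, j]])
        rssj = float(np.sum((y - Xj @ np.linalg.lstsq(Xj, y, rcond=None)[0]) ** 2))
        df2 = n - Xj.shape[1]
        F = (rss0 - rssj) / (rssj / df2)
        pval = stats.f.sf(F, 1, df2)
        if pval < best[1]:
            best = (j, pval)
    return best

# Keep adding variables while the minimum p-value beats the Beta(1, m)
# percentile, m being the number of candidates tested at that step.
rng = np.random.default_rng(2)
n, p = 80, 300
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 3] - X[:, 7] + 0.3 * rng.standard_normal(n)
selected = []
while len(selected) < 10:
    candidates = [j for j in range(p) if j not in selected]
    j_star, p_min = forward_step(y, X, selected, candidates)
    if p_min > beta_threshold(len(candidates)):
        break
    selected.append(j_star)
print("selected:", selected)
```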
This model with a canonical link function and a unit dispersion parameter, belongs to a larger exponential family [100]. It also assumes that 𝑏(·) is twice continuously differentiable with a non-negative second derivative 𝑏00 (·). In addition, it uses 𝜇(·) and 𝜎(·) to denote 𝑏0 (·) and 𝑏00 (·), i.e. the mean and variance functions, respectively. For example, 𝑏(𝜃) = log(1 + exp(𝜃)) in a logistic distribution and 𝑏(𝜃) = exp(𝜃) in a Poisson distribution. Let E𝑛 { 𝑓 (𝜉)} = 𝑛−1 𝑖=1 Í𝑛 𝑓 (𝜉𝑖 ) denote the mean of { 𝑓 (𝜉𝑖 )}𝑖=1 𝑛 for a sequence of i.i.d. random variables 𝜉 (𝑖 = 1, . . . , 𝑛) and a 𝑖 non-random function 𝑓 (·). The loglikelihood function, apart from an additive constant is Õ𝑛 −1 𝑛 𝐿 (X𝑖T β, 𝑌𝑖 ) = E𝑛 {𝐿 (XT β, 𝑌 )} (2.4) 𝑖=1 It uses β∗ = (𝛽∗0 , 𝛽∗1 , . . . , 𝛽∗𝑝 ) T to denote the true values of β. Then the true model is M = { 𝑗 : 𝛽∗ 𝑗 ≠ 0, 𝑗 ≥ 1} ∪ {0}, which consists of the intercept and all variables with nonzero effects. The estimate of M is denoted as M̂. It elaborate on the idea of building model with the proposed SC approach. The key is to include an offset term which summarizes the information acquired from the previous selection steps and to search for a new candidate variable that maximizes the likelihood with such an offset. An SC approach algorithm starts with initial index set, S0 , and initial offset, O0 . Having S0 = {0}, O0 = 𝛽ˆS0 , where 𝛽ˆS0 is estimated intercept without any covariates. First, with such O0 , it computes 𝛽ˆ (1) 𝑗 = arg maxβ 𝑙 O0 , 𝑗 (𝛽) for j ∈ {0, 1, . . . , 𝑝}, where 𝑙 O0 , 𝑗 (β) is the average log-likelihood of the regression model. Then, 𝑗1 = arg max 𝑗 ∈{0,1,...,𝑝} 𝑙O0 , 𝑗 ( 𝛽ˆ (1) T 𝑗 ), S1 = {0, 𝑗 1 }, and O1 = XS1 β̂S1 . Iteratively, for k ≥ 1, given S 𝑘 and O 𝑘 , it computes 𝛽ˆ (𝑘+1) 𝑗 = arg max 𝛽 𝑙O𝑘 , 𝑗 (β) for 𝑗 ∈ 𝑆 𝑐𝑘 . Then, 𝑗 𝑘+1 = arg max 𝑗 ∈𝑆 𝑐 𝑙 O𝑘 , 𝑗 ( β̂ (𝑘+1) 𝑗 ), 𝑆 𝑘+1 = 𝑆 𝑘 ∪ { 𝑗 𝑘+1 }, and O 𝑘+1 = XTS𝑘+1 β̂S𝑘+1 . To decide whether 𝑘 it recruits another variable 𝑗 𝑘+1 or stops procedure at 𝑘th step, it computed EBIC on set 𝑆 𝑘+1 , where EBIC(𝑆 𝑘+1 ) = −2ℓ𝑆 𝑘+1 ( β̂𝑆 𝑘+1 ) + (𝑘 + 1)(log 𝑛 + 2𝜂 log 𝑝)/𝑛 21 If EBIC(𝑆 𝑘+1 )≥EBIC(𝑆 𝑘 ), it stops and declares 𝑆 𝑘 the final model. The SC approach is computationally efficient as it maximizes the likelihood with respect to only one covariate at each step given the offset, the property that was not observed in other methods. The main drawback of SC approach is that it may be suboptimal compared to a full scale forward optimization approach. Additionally, the consistency of the estimated model parameters has not been addressed in related literature. 2.1.5 Cox Models with High-Dimensional Predictors via Forward Regression As mentioned, forward regression can consistently identify all relevant predictors in high-dimensional linear regression settings by using EBIC stopping rule. However, existing results from recent works are based on the sum of residual squares from linear models and it is not certain whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models since the results are based on the sum of residual squares from linear models. There has been active research in developing high-dimensional screening tools for survival data. These works include principled sure screening [101], feature aberration at survival times screening [102] and conditional screening [103], quantile adaptive sure independence screening [104], a censored rank independence screening procedure [105], and integrated powered density screening [106]. 
However, the screening methods require a threshold to dictate how many variables to retain, and unfortunately there are no clear rules for choosing it. Zhao and Li [101] did tie the threshold to false discoveries, but their approach still requires prespecifying the number of false positives that users are willing to tolerate. Recently, Li et al. [107] designed a model-free measure, the survival impact index, that sensibly captures the overall influence of a covariate on the survival outcome and can help guide the selection of important variables. However, even this method, like the other screening methods, does not directly lead to a final model, for which extra modeling steps have to be implemented.
Hong et al. [108] introduced a new forward variable selection procedure for survival data based on the partial likelihood. It selects important variables sequentially according to the increment of the partial likelihood, with a stopping rule based on EBIC. The algorithm for the proposed method is the following. Suppose $n$ subjects with $p$ covariates are observed, where $p \gg n$. For subject $i$, denote by $X_{ij}$ the $j$th covariate, write $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})^{\mathrm{T}}$, and let $T_i$ and $C_i$ be the underlying survival and censoring times. We only observe $Y_i = \min(T_i, C_i)$ and the event indicator $\delta_i = I(T_i \le C_i)$, where $I$ is the indicator function. We assume random censoring such that $C_i$ and $T_i$ are independent given $\mathbf{X}$. To link $T_i$ to $\mathbf{X}_i$, for each $i \in \{1, \ldots, n\}$, we consider the Cox proportional hazards model
$$\lambda(t \mid \mathbf{X}_i) = \lambda_0(t) \exp\{\boldsymbol{\beta}_0^{\mathrm{T}} \mathbf{X}_i\}, \qquad (2.5)$$
where $\lambda_0$ is the unspecified baseline hazard function and $\boldsymbol{\beta}_0 = (\beta_{01}, \ldots, \beta_{0p})^{\mathrm{T}}$ is the vector of regression coefficients. Additionally, let $S \subset \{1, 2, \ldots, p\}$ be an index set and $|S|$ its cardinality.
First, we initialize $S_0 = \emptyset$ and sequentially select nested sets of covariates $S_0 \subset S_1 \subset \cdots \subset S_k$. At the $(k+1)$th step the algorithm selects a new covariate not contained in $S_k$ and then decides whether to include it and proceed to the next step or to stop at the $k$th step. The selection criterion is based on the partial likelihood. Given $S_k$, for every $j \in S_k^c$, it fits a Cox model on the variables indexed by $S_{k,j} = S_k \cup \{j\}$. It then computes the increment of the log partial likelihood for each $j \in S_k^c$, that is, $l_{S_{k,j}}(\hat{\boldsymbol{\beta}}_{S_{k,j}}) - l_{S_k}(\hat{\boldsymbol{\beta}}_{S_k})$, where $l_S(\boldsymbol{\beta}_S)$ is the log partial likelihood given $\mathbf{X}_S$:
$$l_S(\boldsymbol{\beta}_S) = \sum_{i=1}^{n} \int_0^{\tau} \left[ \boldsymbol{\beta}_S^{\mathrm{T}} \mathbf{X}_{iS} - \ln\left\{ \sum_{l=1}^{n} \bar{Y}_l(t) \exp\left( \boldsymbol{\beta}_S^{\mathrm{T}} \mathbf{X}_{lS} \right) \right\} \right] dN_i(t), \qquad (2.6)$$
where $N_i(t) = I(Y_i \le t, \delta_i = 1)$ is the counting process, $\bar{Y}_l(t) = I(Y_l \ge t)$ is the at-risk process, and $\tau > 0$ is the study duration such that $P(Y \ge \tau) > 0$. The candidate index is chosen as $j^* = \arg\max_{j \notin S_k} \, l_{S_{k,j}}(\hat{\boldsymbol{\beta}}_{S_{k,j}}) - l_{S_k}(\hat{\boldsymbol{\beta}}_{S_k})$ and the index set is updated as $S_{k+1} = S_k \cup \{j^*\}$. To decide whether to stop at the $k$th step or to include $j^*$ and proceed to the next step, the algorithm uses the following EBIC criterion:
$$\mathrm{EBIC}(S_{k+1}) = -2\, l_{S_{k+1}}(\hat{\boldsymbol{\beta}}_{S_{k+1}}) + (k+1)\left( \ln d + 2\eta \ln p \right), \qquad (2.7)$$
where $d = \delta_1 + \cdots + \delta_n$ is the number of events and $\eta$ is some positive constant. If $\mathrm{EBIC}(S_{k+1}) > \mathrm{EBIC}(S_k)$, the algorithm stops and declares $S_k$ the final model; otherwise it proceeds to the next step. They showed that if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps, with the entry order determined by the size of the likelihood increment.
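To illustrate the search just described, the following sketch implements partial-likelihood forward selection with the EBIC stopping rule (2.7). It is a minimal illustration rather than the authors' implementation: it assumes the data sit in a pandas DataFrame with hypothetical columns `time` and `status`, it uses the `lifelines` package (whose fitted `CoxPHFitter` exposes the maximized log partial likelihood as `log_likelihood_`), and for brevity it omits the EBIC of the null model, so the first variable is always admitted.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def ebic_cox(loglik, k, d, p, eta=1.0):
    # EBIC (2.7) for a Cox model with k selected covariates, d events, p candidates
    return -2.0 * loglik + k * (np.log(d) + 2.0 * eta * np.log(p))

def forward_cox_ebic(df, covariates, eta=1.0):
    """Greedy partial-likelihood forward selection with an EBIC stopping rule."""
    d = int(df["status"].sum())            # number of observed events
    p = len(covariates)
    selected, best_ebic = [], np.inf       # null-model EBIC omitted in this sketch
    while True:
        best_j, best_loglik = None, -np.inf
        for j in set(covariates) - set(selected):
            cols = selected + [j] + ["time", "status"]
            fit = CoxPHFitter().fit(df[cols], duration_col="time", event_col="status")
            if fit.log_likelihood_ > best_loglik:
                best_j, best_loglik = j, fit.log_likelihood_
        cand_ebic = ebic_cox(best_loglik, len(selected) + 1, d, p, eta)
        if cand_ebic > best_ebic:          # EBIC no longer improves: stop and report
            return selected
        selected.append(best_j)
        best_ebic = cand_ebic
```

Each forward step refits one small Cox model per remaining candidate, so the per-step cost grows linearly in the number of candidate covariates.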
The proposed method is potentially the first work to investigate partial likelihood-based forward regression in survival models with high-dimensional predictors. Moreover, it represents technical advances and a broadened scope compared to the existing forward regression literature (e.g., [109], [110], [93]), and it improves upon the partial likelihood-based variable selection developed by [111], [112] for survival data in low-dimensional settings. The disadvantage of the proposed work is that it does not address parameter estimation, which limits its usage in building predictive models.
2.1.6 A Stepwise Regression Method and Consistent Model Selection for High-Dimensional Sparse Linear Models
Stepwise least squares regression is widely used in applied regression analysis to handle a large number of input variables. It consists of forward selection of input variables in a "greedy" manner, so that the selected variable at each step minimizes the residual sum of squares after least squares regression is performed on it together with the previously selected variables, a stopping rule to terminate the forward inclusion of variables, and stepwise backward elimination of variables according to some criterion. Ing et al. [113] developed an asymptotic theory for a version of stepwise regression in the context of high-dimensional regression under certain sparsity assumptions. They introduced a fast stepwise regression method, called the orthogonal greedy algorithm (OGA), that selects input variables to enter a $p$-dimensional linear regression model (with $p \gg n$, the sample size) sequentially, so that the selected variable at each step minimizes the residual sum of squares. They derived the convergence rate of OGA and developed a consistent model selection procedure along the OGA path that can adjust for potential spuriousness of the greedily chosen regressors among a large number of candidate variables. The resulting regression estimate is shown to have the oracle property of being equivalent to least squares regression on an asymptotically minimal set of relevant regressors under a strong sparsity condition. The forward stepwise component of the procedure is known as an orthogonal greedy algorithm in compressed sensing and approximation theory, where the focus is on approximations in noiseless models. They also developed a fast iterative procedure for updating OGA that uses componentwise linear regression, similar to the $L_2$-boosting procedure of Buhlmann and Yu [114], and does not require matrix inversion.
Consider the linear regression model
$$y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (2.8)$$
with $p$ predictors $x_{i1}, x_{i2}, \ldots, x_{ip}$ that are uncorrelated with the mean-zero random disturbances $\epsilon_i$. As mentioned, $L_2$-boosting is an iterative procedure that generates a sequence of linear approximations $\hat{y}_k(x)$ of the regression function (with $\alpha = 0$) by applying componentwise linear least squares to the residuals obtained at each iteration. Initializing with $\hat{y}_0(\cdot) = 0$, it computes the residuals $U_i^{(k)} := y_i - \hat{y}_k(x_i)$, $1 \le i \le n$, at the end of the $k$th iteration and chooses the predictor $x_{\hat{j}_{k+1}}$ on which the pseudo-responses $U_i^{(k)}$ are regressed, such that
$$\hat{j}_{k+1} = \arg\min_{1 \le j \le p} \sum_{i=1}^{n} \left( U_i^{(k)} - \tilde{\beta}_j^{(k)} x_{ij} \right)^2, \qquad (2.9)$$
where
$$\tilde{\beta}_j^{(k)} = \sum_{i=1}^{n} U_i^{(k)} x_{ij} \Big/ \sum_{i=1}^{n} x_{ij}^2 .$$
This yields the update
$$\hat{y}_{k+1}(x) = \hat{y}_k(x) + \tilde{\beta}_{\hat{j}_{k+1}}^{(k)} x_{\hat{j}_{k+1}} . \qquad (2.10)$$
The procedure is then repeated until a pre-specified upper bound $m$ on the number of iterations is reached. When the procedure stops at the $m$th iteration, $y(x)$ is approximated by $\hat{y}_m(x)$.
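As a concrete illustration of the selector (2.9) and the update (2.10), the short numpy sketch below implements plain componentwise $L_2$-boosting with a fixed number of iterations $m$. It is our own schematic (the function name is ours), not code from [113] or [114].

```python
import numpy as np

def l2_boost(X, y, m):
    """Componentwise L2-boosting: m greedy least-squares updates of the fit."""
    n, p = X.shape
    coef = np.zeros(p)                   # coefficients accumulated in y_hat_m(x)
    resid = y.copy()                     # U_i^(0) = y_i, since y_hat_0 = 0 and alpha = 0
    for _ in range(m):
        slopes = X.T @ resid / (X ** 2).sum(axis=0)        # beta_tilde_j^(k)
        rss = ((resid[:, None] - X * slopes) ** 2).sum(axis=0)
        j = int(np.argmin(rss))          # selector (2.9): best single predictor
        coef[j] += slopes[j]             # update (2.10); the same j may re-enter later
        resid -= slopes[j] * X[:, j]     # new pseudo-responses U_i^(k+1)
    return coef
```

OGA modifies only the update step: after selecting $\hat{j}_{k+1}$ it orthogonalizes $\mathbf{X}_{\hat{j}_{k+1}}$ against the previously chosen columns, as described next.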
(𝑈𝑖(𝑘) − 𝛽˜ (𝑘) 2 Í𝑛 (𝑘) 2 2 Í𝑛 OGA uses the variable selector (2.9). Since 𝑖=1 𝑗 𝑥𝑖 𝑗 ) / 𝑖=1 (𝑈𝑖 ) = 1−𝑟 𝑗 , where 𝑟 𝑗 is the correlation coefficient between 𝑥𝑡 𝑗 and 𝑈𝑖(𝑘) , (2.9) chooses the predictor that is most correlated with 𝑈𝑖(𝑘) at the 𝑘th stage. However,the implementation of OGA updates (2.10) in another way and also carries out an additional linear transformation of the vector X 𝑗ˆ𝑘+1 to form X⊥𝑗ˆ , where 𝑘+1 X 𝑗 = (𝑥𝑖 𝑗 , . . . , 𝑥 𝑛 𝑗 )T. The idea is to orthogonalize the predictor variables sequentially so that OLS can be computed by componentwise linear regression, thereby circumventing difficulties with inverting high-dimensional matrices in the usual implementation of OLS. With the orthogonal vectors X 𝑗ˆ1 , X⊥𝑗ˆ , . . . , X⊥𝑗ˆ already computed in the previous stages, it can 2 𝑘 compute the projection X̂ 𝑗ˆ𝑘+1 of X 𝑗ˆ𝑘+1 into the linear space by adding the 𝑘 projections into the respective one-dimensional linear spaces. This also yields the residual vector X⊥𝑗ˆ = X 𝑗ˆ𝑘+1 − X̂ 𝑗ˆ𝑘+1 . 𝑘+1 OGA uses the following updates in lieu of (2.10): 𝑦ˆ 𝑘+1 (𝑥) = 𝑦ˆ 𝑘 (𝑥) + 𝛽ˆ (𝑘) ⊥ ˆ 𝑥 𝑗ˆ , (2.11) 𝑗 𝑘+1 𝑘+1 where 𝑛 , 𝑛 Õ Õ 𝛽ˆ (𝑘) ˆ = 𝑈𝑖(𝑘) 𝑥𝑖,⊥𝑗ˆ (𝑥𝑖,⊥𝑗ˆ ) 2 . 𝑗 𝑘+1 𝑘+1 𝑘+1 𝑖=1 𝑖=1 By sequentially orthogonalizing the input variables, OGA preserves the attractive computational features of componentwise linear regression in PGA. However, unlike PGA for which the same predictor variable can be entered repeatedly, OGA excludes variables that are already precluded from further consideration in (2.9). In addition, they developed a consistent model selection procedure along an OGA path under a ”strong sparsity” condition that the nonzero regression coefficients satisfying the weak sparsity condition are not too small. Applying the convergence rate of OGA, they proved that, with probability approaching 1 as 𝑛 → ∞, the OGA path includes all relevant regressors when the 26 number of iterations is large enough. They modified the model selection criteria like BIC and called it high-dimensional information criterion (HDIC), which is defined as:  2  HDIC(𝐽) = 𝑛log 𝜎 ˆ 𝐽 + |𝐽 |𝑤 𝑛 log 𝑝 , (2.12) ˆ 𝐽2 = 𝑛−1 − 𝑦ˆ 𝑖;𝐽 ) 2 . OGA+HDIC is shown Í𝑛 where 𝐽 is a non-empty subset of {1, . . . , 𝑝} and 𝜎 𝑖=1 (𝑦 𝑖 to select the smallest set of all relevant variables along the OGA path with probability approaching 1 (and is therefore variable-selection consistent). 2.2 Our Contribution Although methods described in the previous sections brought novelty to the research area and made a significant contribution to it, they still have some drawbacks. First, once a variable is identified by the forward selection, it is not removable from the list of selected variables. Hence, false positives are unavoidable without a systematic elimination procedure. Second, most of the existing works focus on variable selection and are silent with respect to estimation accuracy. To address the first issue, some works have been proposed to add backward elimination steps once forward selection is accomplished, as backward elimination may further eliminate false positives from the variables selected by forward selection. For example, An et al. [92] and Ing et al. [113] proposed a stepwise selection for linear regression models in high-dimensional settings and proved model selection consistency. However, it is unclear whether the results hold for high-dimensional GLMs; Hwang et al. [95] proposed a similar stepwise algorithm in high-dimensional GLM settings, but with no theoretical properties on model selection. 
Moreover, none of the relevant works has touched upon the accuracy of estimation.
We extend a stepwise regression method to accommodate GLMs with high-dimensional predictors. Our method, termed STEPWISE hereafter and introduced in Chapter 3, embraces both model selection and estimation. It starts with an empty model or a set of pre-specified predictors, scans all features and selects them sequentially, and conducts backward elimination once forward selection is completed. Our proposal controls both false negatives and false positives in high-dimensional settings: the forward selection steps recruit variables inclusively, allowing some false positives for the sake of avoiding false negatives, while the backward elimination steps remove the potential false positives from the recruited variables. We use different stopping criteria in the forward and backward selection steps to control the numbers of false positives and false negatives. This is achieved by adding flexibility through two tuning parameters in the stopping criteria, which strengthens our algorithm. The values of these parameters determine how many variables are likely to be included in the model. For instance, a small value of the tuning parameter in the forward selection will include more variables, whereas a large value will recruit too few features. Similarly, in the backward elimination step, a large value of the parameter will eliminate more variables, and vice versa for a small value. It is worth mentioning that our method includes forward selection as a special case when the tuning parameter equals 0, making the algorithm more flexible. Moreover, we prove that, under a sparsity assumption on the true model, the proposed approach can discover all of the relevant predictors within a finite number of steps, and the estimated coefficients are consistent, a property not previously established in the literature. Finally, our GLM framework enables our work to accommodate a wide range of data types, such as binary, categorical and count data.
Extensive numerical studies have been conducted to compare the STEPWISE procedure with the competing methods mentioned in the previous subsections. Specifically, we compared our algorithm with LASSO, dgLARS, forward regression (FR), the SC approach, and screening methods such as SIS. Our numerical studies included simulations: we compared the proposed method with the other methods in comprehensive simulated examples covering different model structures and variable dependencies. All of these examples were tested under various model types, including Normal, Binomial, and Poisson models. The results indicate that the STEPWISE algorithm was able to detect all the true signals with a nearly zero false positive rate. In addition, it outperformed the other methods by selecting more true positives with fewer false positives. Our numerical studies also included real data analyses, which aimed to demonstrate the utility of our method in real-life scenarios. We analyzed data from studies of gene regulation in the mammalian eye, esophageal squamous cell carcinoma, and neurobehavioral impairment from total sleep deprivation. We demonstrated that STEPWISE achieved comparable prediction accuracy, specificity, sensitivity, and AUC while selecting fewer variables than the other variable selection methods. Finally, to enhance the predictive power of the proposed method, we developed a multi-stage hybrid machine learning method.
It incorporates a stacking technique and includes both model- free and model-based methods. Specifically, it comprises Random Forest (RF), Extreme Gradient Boosting Machine (XGBoost), Support Vector Machine (SVM), Artificial Neural Network (ANN), and LASSO models along with the STEPWISE procedure. The numerical study has shown an improvement in the predictive power of our method. Furthermore, we developed a web application that enables users to utilize the aforementioned method in practice. To recap, our proposed method distinguishes from the existing stepwise approaches in high- dimensional settings. For example, it improves An et al. [92] and Ing et al. [113] by extending the work to a more broad GLM setting and Hwang et al. [95] by establishing the theoretical properties. Compared with the other variable selection and screening works, our method produces a final model in ultrahigh-dimensional settings, without applying a pre-screening step which may produce unintended false negatives. Under some regularity conditions, the method identifies or includes the true model with probability going to 1. Moreover, unlike the penalized approaches such as LASSO, the coefficients estimated by our STEPWISE selection procedure in the final model will be consistent, which are useful for gauging the real effect sizes of risk factors. 29 CHAPTER 3 STEPWISE METHOD: THEORY AND APPLICATIONS 3.1 Model Setup Let (X𝑖 , 𝑌𝑖 ), 𝑖 = 1, . . . , 𝑛, denote 𝑛 independent and identically distributed (i.i.d.) copies of (X, 𝑌 ). Here, X = (1, 𝑋1 , . . . , 𝑋 𝑝 ) T is a ( 𝑝 + 1)-dimensional predictor vector with 𝑋0 = 1 corresponding to the intercept term, and 𝑌 is an outcome. Suppose that the conditional density of 𝑌 , given X, belongs to a linear exponential family: 𝜋(𝑌 | X) = exp{𝑌 XT β − 𝑏(XT β) + A (𝑌 )}, (3.1) where β = (𝛽0 , 𝛽1 , . . . , 𝛽 𝑝 ) T is the vector of coefficients, 𝛽0 is the intercept, and A (·) and 𝑏(·) are known functions. Model (3.1), with a canonical link function and a unit dispersion parameter, belongs to a larger exponential family [100]. Further, 𝑏(·) is assumed twice continuously differentiable with a non-negative second derivative 𝑏00 (·). We use 𝜇(·) and 𝜎(·) to denote 𝑏0 (·) and 𝑏00 (·), i.e. the mean and variance functions, respectively. For example, 𝑏(𝜃) = log(1 + exp(𝜃)) in a logistic distribution and 𝑏(𝜃) = exp(𝜃) in a Poisson distribution. Let 𝐿 (𝑢, 𝑣) = 𝑢𝑣 − 𝑏(𝑢) and E𝑛 { 𝑓 (𝜉)} = 𝑛−1 𝑖=1 Í𝑛 𝑓 (𝜉𝑖 ) denote the mean of { 𝑓 (𝜉𝑖 )}𝑖=1 𝑛 for a sequence of i.i.d. random variables 𝜉𝑖 (𝑖 = 1, . . . , 𝑛) and a non-random function 𝑓 (·). Based on the i.i.d. observations, the log-likelihood function is Õ𝑛 −1 ℓ(β) = 𝑛 𝐿 (X𝑖T β, 𝑌𝑖 ) = E𝑛 {𝐿 (XT β, 𝑌 )}. (3.2) 𝑖=1 We use β∗ = (𝛽∗0 , 𝛽∗1 , . . . , 𝛽∗𝑝 ) T to denote the true values of β. Then the true model is M = { 𝑗 : 𝛽∗ 𝑗 ≠ 0, 𝑗 ≥ 1} ∪ {0}, which consists of the intercept and all variables with nonzero effects. Overarching goals of ultrahigh-dimensional data analysis are to identify M and estimate 𝛽∗ 𝑗 for 𝑗 ∈ M. While most of the relevant literature [99, 108] is on estimating M, this work is to accomplish both identification of M and estimation of 𝛽∗ 𝑗 . 30 When 𝑝 is in the exponential order of 𝑛, we aim to generate a predictive model that contains the true model with high probability, and provide consistent estimates of regression coefficients. We further introduce the following notation. For a generic index set 𝑆 ⊂ {0, 1, . . . 
, 𝑝} and a ( 𝑝+1)-dimensional vector A, we use 𝑆 𝑐 to denote the complement of a set 𝑆 and A𝑆 = {𝐴 𝑗 : 𝑗 ∈ 𝑆} to denote the subvector of A corresponding to 𝑆. For instance, if 𝑆 = {2, 3, 4}, then X𝑖𝑆 = (𝑋𝑖2 , 𝑋𝑖3 , 𝑋𝑖4 ) T . Moreover, denote by ℓ𝑆 (β𝑆 ) = E𝑛 {𝐿(XT𝑆 β𝑆 , 𝑌 )} the log-likelihood of the regression model of 𝑌 on X𝑆 and denote by β̂𝑆 the maximizer of ℓ𝑆 (β𝑆 ). Under model (3.1), we elaborate on the idea of stepwise selection, consisting of the forward and backward stages. Forward stage: We start with 𝐹0 , a set of variables that need to be included according to some a priori knowledge, such as clinically important factors and conditions. If no such information is available, 𝐹0 is set to be {0}, corresponding to a null model. We sequentially add covariates as follows: 𝐹0 ⊂ 𝐹1 ⊂ 𝐹2 ⊂ · · · ⊂ 𝐹𝑘 , where 𝐹𝑘 ⊂ {0, 1, . . . , 𝑝} is the index set of the selected covariates upon completion of the 𝑘th step, with 𝑘 ≥ 0. At the (𝑘 + 1)th step, we append new variables to 𝐹𝑘 one at a time and refit GLMs: for every 𝑗 ∈ 𝐹𝑘𝑐 , we let 𝐹𝑘, 𝑗 = 𝐹𝑘 ∪ { 𝑗 }, obtain β̂𝐹𝑘, 𝑗 by maximizing ℓ𝐹𝑘, 𝑗 (β𝐹𝑘, 𝑗 ), and compute the increment of log-likelihood, ℓ𝐹𝑘, 𝑗 ( β̂𝐹𝑘, 𝑗 ) − ℓ𝐹𝑘 ( β̂𝐹𝑘 ). Then the index of a new candidate variable is determined to be 𝑗 𝑘+1 = arg max ℓ𝐹𝑘, 𝑗 ( β̂𝐹𝑘, 𝑗 ) − ℓ𝐹𝑘 ( β̂𝐹𝑘 ). 𝑗 ∈𝐹𝑘𝑐 And we update 𝐹𝑘+1 = 𝐹𝑘 ∪ { 𝑗 𝑘+1 }. We then need to decide whether to stop at the 𝑘th step or move on to the (𝑘 + 1)th step with 𝐹𝑘+1 . To do so, we use the following EBIC criterion: EBIC(𝐹𝑘+1 ) = −2ℓ𝐹𝑘+1 ( β̂𝐹𝑘+1 ) + |𝐹𝑘+1 |𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝), (3.3) where the second term is motivated by Chen and Chen [115] and |𝐹 | denotes the cardinality of a set 𝐹. 31 The forward selection stops if EBIC(𝐹𝑘+1 ) > EBIC(𝐹𝑘 ). We denote the stopping step by 𝑘 ∗ and the set of variables selected so far by 𝐹𝑘 ∗ . Backward stage: Upon the completion of forward stage, backward elimination, starting with 𝐵0 = 𝐹𝑘 ∗ , sequentially drops covariates as follows: 𝐵0 ⊃ 𝐵1 ⊃ 𝐵2 ⊃ · · · ⊃ 𝐵 𝑘 , where 𝐵 𝑘 is the index set of the remaining covariates upon the completion of the 𝑘th step of the backward stage, with 𝑘 ≥ 0. At the (𝑘 + 1)th backward step and for every 𝑗 ∈ 𝐵 𝑘 , we let 𝐵 𝑘/ 𝑗 = 𝐵 𝑘 \ { 𝑗 }, obtain β̂𝐵 𝑘/ 𝑗 by maximizing ℓ(β𝐵 𝑘/ 𝑗 ), and calculate the difference of the log-likelihoods between these two nested models, ℓ𝐵 𝑘 ( β̂𝐵 𝑘 ) − ℓ𝐵 𝑘/ 𝑗 ( β̂𝐵 𝑘/ 𝑗 ). The variable that can be removed from the current set of variables is indexed by 𝑗 𝑘+1 = arg min ℓ𝐵 𝑘 ( β̂𝐵 𝑘 ) − ℓ𝐵 𝑘/ 𝑗 ( β̂𝐵 𝑘/ 𝑗 ). 𝑗 ∈𝐵 𝑘 Let 𝐵 𝑘+1 = 𝐵 𝑘 \ { 𝑗 𝑘+1 }. We determine whether to stop at the 𝑘th step or move on to the (𝑘 + 1)th step of the backward stage according to the following BIC criterion: BIC(𝐵 𝑘+1 ) = −2ℓ𝐵 𝑘+1 ( β̂𝐵 𝑘+1 ) + 𝜂2 𝑛−1 |𝐵 𝑘+1 | log 𝑛. (3.4) If BIC(𝐵 𝑘+1 ) > BIC(𝐵 𝑘 ), we end the backward stage at the 𝑘th step. Let 𝑘 ∗∗ denote the stopping step and we declare the selected model 𝐵 𝑘 ∗∗ to be the final model. Thus, M̂ = 𝐵 𝑘 ∗∗ is the estimate of M. As the backward stage starts with the 𝑘 ∗ variables selected by forward selection, 𝑘 ∗∗ cannot exceed 𝑘 ∗ . A strength of our algorithm is the added flexibility with 𝜂1 and 𝜂2 in the stopping criteria for controlling the false negatives and positives. For example, a smaller value of 𝜂1 close to zero in the forward selection step will likely include more variables, thus incur more false positives and less false negatives, whereas a larger value of 𝜂1 will recruit too few variables and cause too many false negatives. 
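To make the two stages and the roles of 𝜂1 and 𝜂2 concrete, the following schematic sketch implements the forward EBIC search (3.3) and the backward BIC elimination (3.4) for a logistic GLM. It is an illustration of the procedure, not the software used in this thesis; it assumes the design matrix X already contains the intercept as column 0, relies on statsmodels (whose fitted GLM exposes the maximized log-likelihood as `llf`), and uses purely illustrative default values for the tuning parameters.

```python
import numpy as np
import statsmodels.api as sm

def stepwise_glm(y, X, eta1=0.5, eta2=3.0):
    """Schematic STEPWISE selection for a logistic GLM.
    X must contain the intercept as column 0; returns the selected column indices."""
    n, p1 = X.shape
    p = p1 - 1

    def avg_loglik(cols):
        # average log-likelihood (3.2) of the GLM using the columns in `cols`
        return sm.GLM(y, X[:, cols], family=sm.families.Binomial()).fit().llf / n

    def ebic(ll, k):          # forward criterion (3.3)
        return -2.0 * ll + k * (np.log(n) + 2.0 * eta1 * np.log(p)) / n

    def bic(ll, k):           # backward criterion (3.4)
        return -2.0 * ll + eta2 * k * np.log(n) / n

    # Forward stage: add the variable giving the largest log-likelihood increment
    F = [0]
    crit = ebic(avg_loglik(F), len(F))
    while len(F) < p1:
        gains = {j: avg_loglik(F + [j]) for j in range(1, p1) if j not in F}
        j_new = max(gains, key=gains.get)
        new_crit = ebic(gains[j_new], len(F) + 1)
        if new_crit > crit:                    # EBIC increases: stop the forward stage
            break
        F, crit = F + [j_new], new_crit

    # Backward stage: drop the variable whose removal loses the least log-likelihood
    B = list(F)
    crit = bic(avg_loglik(B), len(B))
    while len(B) > 1:
        fits = {j: avg_loglik([c for c in B if c != j]) for j in B if j != 0}
        j_out = max(fits, key=fits.get)
        new_crit = bic(fits[j_out], len(B) - 1)
        if new_crit > crit:                    # BIC increases: stop the backward stage
            break
        B, crit = [c for c in B if c != j_out], new_crit
    return B
```

In this sketch 𝜂1 and 𝜂2 enter only through the two information criteria, which is precisely where their effect on false positives and false negatives arises.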
Similarly, in the backward selection step, a large 𝜂2 would eliminate more variables and 32 therefore further reduce more false positives, and vice versa for a small 𝜂2 . While finding optimal 𝜂1 and 𝜂2 is not trivial, our numerical experiences suggest a small 𝜂1 and a large 𝜂2 may well balance the false negatives and positives. When 𝜂2 = 0, no variables can be dropped after forward selection; hence, our proposal includes forward selection as a special case. Moreover, Zheng et al. [99] proposed a sequentially conditioning approach based on offset terms that absorb the prior information. However, our numerical experiments indicate that the offset approach may be suboptimal compared to our full stepwise optimization approach, which will be demonstrated in the simulation studies. 3.2 Theoretical Properties With a column vector v, let kvk 𝑞 denote the 𝐿 𝑞 -norm for any 𝑞 ≥ 1. For simplicity, we denote the 𝐿 2 -norm of v by kvk, and denote vvT by v⊗2 . We use 𝐶1 , 𝐶2 , . . . , to denote some generic constants that do not depend on 𝑛 and may change from line to line. The following regularity conditions are set. 1. There exist a positive integer 𝑞 satisfying |M| ≤ 𝑞 and 𝑞 log 𝑝 = 𝑜(𝑛1/3 ) and a constant 𝐾 > 0 such that sup|𝑆|≤𝑞 kβ𝑆∗ k 1 ≤ 𝐾, where β𝑆∗ = arg maxβ𝑆 𝐸 [ℓ𝑆 (β𝑆 )] is termed the least false value of model 𝑆. 2. kXk ∞ ≤ 𝐾. In addition, 𝐸 (𝑋 𝑗 ) = 0 and 𝐸 (𝑋 2𝑗 ) = 1 for 𝑗 ≥ 1. 3. Let 𝜖𝑖 = 𝑌𝑖 − 𝜇(β∗T X𝑖 ). There exists a positive constant 𝑀 such that the Cramer condition holds, i.e., 𝐸 [|𝜖𝑖 | 𝑚 ] ≤ 𝑚!𝑀 𝑚 for all 𝑚 ≥ 1. 4. |𝜎(𝑎) − 𝜎(𝑏)| ≤ 𝐾 |𝑎 − 𝑏| and 𝜎min := inf |𝑡|≤𝐾 3 |𝑏00 (𝑡)| is bounded below.    5. There exist two positive constants, 𝜅 min and 𝜅 max such that 0 < 𝜅min < Λ 𝐸 X𝑆⊗2 < 𝜅max < ∞, uniformly in 𝑆 ⊂ {0, 1, . . . , 𝑝} satisfying |𝑆| ≤ 𝑞, where Λ(A) is the collection of all eigenvalues of a square matrix A. 33 6. min𝑆:M*𝑆,|𝑆|≤𝑞 𝐷 𝑆 > 𝐶𝑛−𝛼 for some constants 𝐶 > 0 and 𝛼 > 0 that satisfies 𝑞𝑛−1+4𝛼 log 𝑝 → h  i 0, where 𝐷 𝑆 = max 𝑗 ∈S 𝑐 ∩M 𝐸 𝜇(β∗T X) − 𝜇(β𝑆∗T X𝑆 ) 𝑋 𝑗 . Condition (1), as assumed in Buhlmann et al. [116] and Zheng et al. [99], is an alternative to the Lipschitz assumption [117, 87]. The bound of the model size allowed in the selection procedure or 𝑞 is often required in model-based screening methods (see, e.g. [94, 118, 109, 99]). The bound should be large enough so that the correct model can be included, but not too large; otherwise, excessive noise variables would be included, leading to unstable and inconsistent estimates. In- deed, Conditions (1) and (6) reveal that the range of 𝑞 depends on the true model size |M|, the minimum signal strength, 𝑛−𝛼 , and the total number of covariates, 𝑝. The upper bound of 𝑞 is 𝑜((𝑛1−4𝛼 /log 𝑝) ∧ (𝑛1/3 /log 𝑝)), ensuring the consistency of EBIC [115]. Condition (1) also implies that the parameter space under consideration can be restricted to B := {β ∈ R 𝑝+1 : kβk 1 ≤ 𝐾 2 }, for any model 𝑆 with |𝑆| ≤ 𝑞. Condition (2), as assumed in Zhao et al. [101] and Kwemou et al. [119], reflects that data are often standardized at the pre-processing stage. Condition (3) ensures that 𝑌 has a light tail, and is satisfied by Gaussian and discrete data, such as binary and count data [120]. Condition (4) is satisfied by common GLM models, such as Gaussian, Binomial, Poisson and Gamma distributions. Condition (5) represents the sparse Riesz condition [121] and Condition (6) is a strong "irrepre- sentable" condition, suggesting that M cannot be represented by a set of variables that does not include the true model. 
It further implies that adding a signal variable to a mis-specified model will increase the log-likelihood by a certain lower bound [99]. The signal rate is comparable to the conditions required by the other sequential methods (see, e.g. [93, 109]). Theorem 3.2.1 develops a lower bound of the increment of the log-likelihood if the true model M is not yet included in a selected model 𝑆. Theorem 3.2.1. Suppose Conditions (1) – (6) hold. There exists some constant 𝐶1 such that with probability at least 1 − 6 exp(−6𝑞 log 𝑝),   min max𝑐 ℓ𝑆∪{ 𝑗 } ( β̂𝑆∪{ 𝑗 } ) − ℓ𝑆 ( β̂𝑆 ) ≥ 𝐶1 𝑛−2𝛼 . 𝑆:M*𝑆,|𝑆|<𝑞 𝑗 ∈𝑆 34 Theorem 3.2.1 shows that, before the true model is included in the selected model, we can append a variable which will increase the log-likelihood by at least 𝐶1 𝑛−2𝛼 with probability tending to 1. This ensures that in the forward stage, our proposed STEPWISE approach will keep searching for signal variables until the true model is contained. To see this, suppose at the 𝑘th step of the forward stage that 𝐹𝑘 satisfies 𝑀 * 𝐹𝑘 and |𝐹𝑘 | < 𝑞, and let 𝑟 be the index selected by Stepwise. By Theorem 3.2.1, we obtain that, for any 𝜂1 > 0, when 𝑛 is sufficiently large, EBIC(𝐹𝑘,𝑟 ) − EBIC(𝐹𝑘 ) = −2ℓ𝐹𝑘,𝑟 ( β̂𝐹𝑘,𝑟 ) + (|𝐹𝑘 | + 1)𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝) − −2ℓ𝐹𝑘 ( β̂𝐹𝑘 ) + |𝐹𝑘 |𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝)   ≤ −2𝐶1 𝑛−2𝛼 + 𝑛−1 (log 𝑛 + 2𝜂1 log 𝑝) < 0, with probability at least 1 − 6 exp(−6𝑞 log 𝑝), where the last inequality is due to Condition (6). Therefore, with high probability the forward stage of STEPWISE continues as long as M * 𝐹𝑘 and |𝐹𝑘 | < 𝑞. We next establish an upper bound of the number of steps in the forward stage needed to include the true model. Theorem 3.2.2. Under the same conditions as in Theorem 3.2.1 and if   𝐸 𝑌 − 𝜇(β𝑆∗T X𝑆 ) 𝑋 𝑗 = 𝑜(𝑛−𝛼 ),   max max 𝑆:|𝑆|≤𝑞 𝑗 ∈S 𝑐 ∩M 𝑐 then there exists some constant 𝐶2 > 2 such that M ⊂ 𝐹𝑘 , for some 𝐹𝑘 in the forward stage of Stepwise and 𝑘 ≤ 𝐶2 |M|, with probability at least 1 − 18 exp(−4𝑞 log 𝑝). The "max" condition, as assumed in Section 5.3 of Fan et al. [122], relaxes the partial orthogonality assumption that XM 𝑐 are independent of XM , and ensures that with probability tending to 1, appending a signal variable increases log-likelihood more than adding a noise variable does, uniformly over all possible models 𝑆 satisfying M * 𝑆, |𝑆| < 𝑞. This entails that the proposed procedure is much more likely to select a signal variable, in lieu of a noise variable, at each step. Since EBIC is a consistent model selection criterion [110, 123], the following theorem guarantees termination of the proposed procedure with M ⊂ 𝐹𝑘 for some 𝑘. 35 Theorem 3.2.3. Under the same conditions as in Theorem 3.2.2 and if M ⊄ 𝐹𝑘−1 and M ⊂ 𝐹𝑘 , the forward stage stops at the 𝑘th step with probability going to 1 − exp(−3𝑞 log 𝑝). Theorem 3.2.3 ensures that the forward stage of STEPWISE will stop within a finite number of steps and will cover the true model with probability at least 1 − 𝑞 exp(−3𝑞 log 𝑝) ≥ 1 − exp(−2𝑞 log 𝑝). We next consider the backward stage and provide a probability bound of removing a signal from a set in which the set of true signals M is contained. Theorem 3.2.4. Under the same conditions as in Theorem 3.2.2, BIC(𝑆\{𝑟}) − BIC(𝑆) > 0 uniformly over 𝑟 ∈ M and 𝑆 satisfying M ⊂ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − 6 exp(−6𝑞 log 𝑝). Theorem 3.2.4 indicates that with probability at 1 − 6 exp(−6𝑞 log 𝑝), BIC would decrease when removing a signal variable from a model that contains the true model. 
That is, with high probability, back elimination is to reduce false positives. Recall that 𝐹𝑘 ∗ denotes the model selected at the end of the forward selection stage. By Theorem 3.2.2, M ⊂ 𝐹𝑘 ∗ with probability at least 1 − 18 exp(−4𝑞 log 𝑝). Then Theorem 3.2.4 implies that at each step of the backward stage, a signal variable will not be removed from the model with probability at least 1 − 6 exp(−6𝑞 log 𝑝). By Theorem 3.2.2, |𝐹𝑘 ∗ | ≤ 𝐶2 |M|. Thus, the backward elimination will carry out at most (𝐶2 − 1)|M| steps. Combining results from Theorems 3.2.2 and 3.2.3 yields that M ⊂ M̂ with probability at least 1 − 18 exp(−4𝑞 log 𝑝) − 6(𝐶2 − 1)|M | exp(−6𝑞 log 𝑝). Let β̂ be the estimate of β∗ in model (3.1) at the termination of STEPWISE. By convention, the estimates of the coefficients of the unselected covariates are 0. Theorem 3.2.5. Under the same conditions as in Theorem 3.2.2, we have that M ⊆ M̂ and k β̂ − β∗ k → 0 in probability. 36 The theorem warrants that the proposed STEPWISE yields consistent estimates, a property not shared by many regularized methods, including LASSO. Our later simulations verified this. Proof of main theorems and lemmas are provided in the following Chapter. 3.3 Proof of the Theorems Since 𝑏(·) is twice continuously differentiable with a nonnegative second derivative 𝑏00 (·), 𝑏 max := max|𝑡|≤𝐾 3 |𝑏(𝑡)|, 𝜇max := max|𝑡|≤𝐾 3 |𝑏0 (𝑡)|, and 𝜎max := sup|𝑡|≤𝐾 3 |𝑏00 (𝑡)| are bounded above, where 𝐿 and 𝐾 are some constants from Conditions (1) and (2), respectively. Let G𝑛 { 𝑓 (𝜉)} = 𝑛−1/2 𝑖=1 Í𝑛 ( 𝑓 (𝜉𝑖 ) − 𝐸 [ 𝑓 (𝜉𝑖 )]) for a sequence of i.i.d. random variables 𝜉𝑖 (𝑖 = 1, . . . , 𝑛) and a non-random function 𝑓 (·). Given any β𝑆 , when a variable 𝑋𝑟 , 𝑟 ∈ 𝑆 𝑐 is added into the model 𝑆, we define the augmented log-likelihood as n  o ℓ𝑆∪{𝑟} (β𝑆+𝑟 ) := E𝑛 𝐿 β𝑆T X𝑆 + 𝛽𝑟 𝑋𝑟 , 𝑌 . (3.1) We use β̂𝑆+𝑟 to denote the maximizer of (3.1). Thus, β̂𝑆+𝑟 = β̂𝑆∪{𝑟} . In addition, denote the maximizer of 𝐸 [ℓ𝑆∪{𝑟} (β𝑆+𝑟 )] by β𝑆+𝑟 ∗ . Due to the concavity of the log-likelihood in GLMs with the canonical link, β𝑆+𝑟 ∗ is unique. Proof of Theorem 3.2.1. Given an index set 𝑆 and 𝑟 ∈ 𝑆 𝑐 , let B𝑆0 (𝑑) = {β𝑆 : kβ𝑆 − β𝑆∗ k ≤ p p 𝑑/(𝐾 |𝑆|)} where 𝑑 = 𝐴2 𝑞 3 log 𝑝/𝑛 with 𝐴2 defined in Lemma 3.4.6. Let Ω be the event that n h    i p T ∗T sup G𝑛 𝐿 β 𝑆 X 𝑆 , 𝑌 − 𝐿 β 𝑆 X 𝑆 , 𝑌 ≤ 20𝐴1 𝑑 𝑞 log 𝑝 and |𝑆|≤𝑞,β𝑆 ∈B𝑆0 (𝑑) p o max |G𝑛 𝐿 (β𝑆∗T X𝑆 , 𝑌 ) 2   | ≤ 10( 𝐴1 𝐾 + 𝑏 max ) 𝑞 log 𝑝 , |𝑆|≤𝑞 where 𝐴1 is some constant defined in Lemma 3.4.4. By Lemma 3.4.4, 𝑃(Ω) ≥ 1−6 exp(−6𝑞 log 𝑝). Thus in the rest of the proof, we only consider the sample points in Ω. In the proof of Lemma 3.4.6, we show that max|𝑆|≤𝑞 k β̂𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 under Ω. Then given an index set 𝑆 and β𝑆 such that |𝑆| < 𝑞, kβ𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 , and for 37 any 𝑗 ∈ 𝑆 𝑐 , ∗ ∗ ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 ( β̂𝑆 ) ≥ inf ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 (β𝑆 ) = k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i −1/2 ∗T − 𝑛−1/2 G𝑛 𝐿(β𝑆∗T X𝑆 , 𝑌 ) −   𝑛 G𝑛 𝐿(β𝑆+ 𝑗 X𝑆∪{ 𝑗 } , 𝑌 ) 𝑛−1/2 G𝑛 𝐿 (β𝑆T X𝑆 , 𝑌 ) − 𝐿 (β𝑆∗T X𝑆 , 𝑌 )   sup k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i ∗T − 𝐸 𝐿(β𝑆∗T X𝑆 , 𝑌 ) ≥   + 𝐸 𝐿(β𝑆+ X 𝑗 𝑆∪{ 𝑗 } , 𝑌 ) p − 20( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝/𝑛 − 20𝐴1 𝐴2 𝑞 2 log 𝑝/𝑛+ 𝜎min 𝜅 min ∗ kβ𝑆+ 𝑗 − (β𝑆∗T , 0) T k 2 , 2 where the second inequality follows from the event Ω and Lemma 3.4.5. By Lemma 3.4.1, if M * 𝑆, there exists 𝑟 ∈ 𝑆 𝑐 ∩ M, such that ∗T kβ𝑆+𝑟 − (β𝑆∗T , 0)k ≥ 𝐶𝜎max −1 −1 −𝛼 𝜅max 𝑛 . 
Thus, there exists some constant 𝐶1 that does not depend on 𝑛 such that max 𝑗 ∈𝑆 𝑐 ℓ𝑆∪{ 𝑗 } ( β̂𝑆+ 𝑗 ) − ℓ𝑆 ( β̂𝑆 ) ≥ max 𝑗 ∈𝑆 𝑐 ℓ𝑆∪{ 𝑗 } (β𝑆+ ∗ ) − ℓ ( β̂ ) ≥ 𝑗 𝑆 𝑆 ∗ ) − ℓ ( β̂ ) ≥ −20( 𝐴 𝐾 2 + 𝑏 p ℓ𝑆∪{𝑟} (β𝑆+𝑟 𝑆 𝑆 1 max ) 𝑞 log 𝑝/𝑛 − 𝐶 2 𝜎min 𝜅 min 𝑛−2𝛼 20𝐴1 𝐴2 𝑞 2 log 𝑝/𝑛 + 2 𝜅2 2𝜎max ≥ 𝐶1 𝑛−2𝛼 , (3.2) max where the first inequality follows from β̂𝑆+ 𝑗 being the maximizer of (3.1) and the second inequality follows from Conditions (1) and (6). Withdrawing the restriction to Ω, we obtain that   −2𝛼 𝑃 min max𝑐 ℓ𝑆∪{ 𝑗 } ( β̂𝑆∪{ 𝑗 } ) − ℓ𝑆 ( β̂𝑆 ) ≥ 𝐶1 𝑛 ≥ 1 − 6 exp(−6𝑞 log 𝑝). |𝑆|<𝑞,M*𝑆 𝑗 ∈𝑆 38  Proof of Theorem 3.2.2. We have shown that our forward stage will not stop when M * 𝑆 and |𝑆| < 𝑞 with probability converging to 1. ∗ is the unique solution to the equation 𝐸 𝑌 −𝜇 β T X    For any 𝑟 ∈ 𝑆 𝑐 ∩M 𝑐 , β𝑆+𝑟 𝑆+𝑟 𝑆∪{𝑟 } X 𝑆∪{𝑟} = 0. By the mean value theorem, 𝐸 𝑌 − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 = 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟        𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 − 𝐸 𝜇 β∗T X − 𝜇 β𝑆+𝑟 ∗T         =𝐸 X𝑆∪{𝑟 } 𝑋𝑟 ∗T − (β𝑆∗T , 0) 𝐸 𝜎 β̃𝑆+𝑟   T  ⊗2  = β𝑆+𝑟 X𝑆∪{𝑟} X𝑆∪{𝑟} e𝑟 , where β̃𝑆+𝑟 is some point between β𝑆+𝑟 and (β𝑆∗T , 0) T and e𝑟 is a vector of length (|𝑆| + 1) with the 𝑟th element being 1. T X Since | β̃𝑆+𝑟 ∗T ∗T 2 𝑆∪{𝑟} | ≤ |β𝑆+𝑟 X𝑆∪{𝑟} | + |(β𝑆 , 0)X𝑆∪{𝑟 } | ≤ 2𝐾 by Conditions (1) and (2), T |𝜎( β̃𝑆+𝑟 X𝑆∪{𝑟} )| ≥ 𝜎min and hn o i 𝑜(𝑛−𝛼 ) = 𝐸 𝑌 − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 ≥ 𝜎min 𝜅 min kβ𝑆+𝑟 ∗T − (β𝑆∗T , 0)k. Therefore, ∗T max kβ𝑆+𝑟 − (β𝑆∗T , 0)k = 𝑜(𝑛−𝛼 ). 𝑆:|𝑆|≤𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 Under Ω that is defined in Theorem 3.2.1, max|𝑆|≤𝑞 k β̂𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 . 39 For any 𝑗 ∈ 𝑆 𝑐 , ∗ ∗ ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 ( β̂𝑆 ) ≤ sup ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) − ℓ𝑆 (β𝑆 ) k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i ≤ 𝑛−1/2 G𝑛 𝐿 (β𝑆+ ∗T + 𝑛−1/2 G𝑛 𝐿(β𝑆∗T X𝑆 , 𝑌 )   X 𝑗 𝑆∪{ 𝑗 } , 𝑌 ) 𝑛−1/2 G𝑛 𝐿(β𝑆T X𝑆 , 𝑌 ) − 𝐿(β𝑆∗T X𝑆 , 𝑌 )   + sup k β𝑆 −β𝑆∗ k≤ 𝐴2 𝐾 −1 (𝑞 2 log 𝑝/𝑛) 1/2 h i ∗T − 𝐸 𝐿 (β𝑆∗T X𝑆 , 𝑌 ) ≤   + 𝐸 𝐿 (β𝑆+ X 𝑗 𝑆∪{ 𝑗 } , 𝑌 ) q 20( 𝐴1 𝐾 + 𝑏 max ) 𝑞𝑛−1 log 𝑝 + 20𝐴1 𝐴2 𝑞 2 𝑛−1 log 𝑝+ 2 ∗ ∗T T 2 𝜎max 𝜅 max kβ𝑆+ 𝑗 − (β𝑆 , 0) k /2, where the second inequality follows from the event Ω and Lemma 3.4.5. Since ∗ max𝑐 kβ𝑆+𝑟 − (β𝑆∗T , 0) T k = 𝑜(𝑛−𝛼 ) and 𝑞𝑛−1+4𝛼 log 𝑝 → 0, 𝑆:|𝑆|<𝑞,𝑟∈𝑆 ∩M 𝑐 we have q ∗ 2 max ℓ𝑆∪{𝑟} (β𝑆+𝑟 ) − ℓ𝑆 ( β̂𝑆 ) ≤ 20( 𝐴1 𝐾 + 𝑏 max ) 𝑞𝑛−1 log 𝑝+ 𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 20𝐴1 𝐴2 𝑞 2 𝑛−1 log 𝑝 + 𝜎max 𝜅max kβ𝑆+ ∗ ∗T T 2 𝑗 − (β𝑆 , 0) k /2 = 𝑜(𝑛 −2𝛼 ), with probability at least 1 − 6 exp(−6𝑞 log 𝑝). Then by Lemma 3.4.6, max𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 ℓ𝑆∪{𝑟} ( β̂𝑆+𝑟 ) − ℓ𝑆 ( β̂𝑆 ) ≤ max𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 |ℓ𝑆∪{𝑟} ( β̂𝑆+𝑟 ) − ℓ𝑆∪{𝑟} (β𝑆+𝑟 ∗ )| + max ∗ 𝑆:|𝑆|<𝑞,𝑟∈𝑆 𝑐 ∩M 𝑐 ℓ𝑆∪{𝑟} (β𝑆+𝑟 ) − ℓ𝑆 ( β̂𝑆 ) ≤ 𝐴3 𝑞 2 𝑛−1 log 𝑝 + 𝑜(𝑛−2𝛼 ) = 𝑜(𝑛−2𝛼 ), (3.3) 40 with probability at least 1 − 12 exp(−6𝑞 log 𝑝). By Theorem 3.2.1, if M * 𝑆, the forward stage would select a noise variable with probability less than 18 exp(−6𝑞 log 𝑝). For 𝑘 > |M|, M * 𝑆 𝑘 implies that at least 𝑘 − |M| noise variables are selected within the 𝑘 steps. Then for 𝑘 = 𝐶2 |M| with 𝐶2 > 2, 𝑘   Õ 𝑘  𝑘−|M | ≤ |M|𝑘 |M | 18 exp(−6𝑞 log 𝑝) 𝑗  𝑃 (M * 𝑆 𝑘 ) ≤ 18 exp(−6𝑞 log 𝑝) 𝑗 𝑗=𝑘−|M | ≤ 18 exp(−6𝑞 log 𝑝 + log |M| + |M| log 𝑘) ≤ 18 exp(−4𝑞 log 𝑝). Therefore, M ⊂ 𝑆𝐶2 |M | with probability at least 1 − 18 exp(−4𝑞 log 𝑝).  Proof of Theorem 3.2.3. By Theorem 3.2.2, M will be included in 𝐹𝑘 for some 𝑘 < 𝑞 with probability going to 1. Therefore, the forward stage stops at the 𝑘th step if EBIC(𝐹𝑘+1 ) > EBIC(𝐹𝑘 ). On the other hand, that EBIC(𝐹𝑘+1 ) < EBIC(𝐹𝑘 ) if and only if 2ℓ𝐹𝑘+1 ( β̂𝐹𝑘+1 ) − 2ℓ𝐹𝑘 ( β̂𝐹𝑘 ) ≥ (log 𝑛 + 2𝜂1 log 𝑝)/𝑛. 
Thus, to show the forward stage stops at the 𝑘th step, we only need to show that with probability tending to 1, 2ℓ𝐹𝑘+1 ( β̂𝐹𝑘+1 ) − 2ℓ𝐹𝑘 ( β̂𝐹𝑘 ) < (log 𝑛 + 2𝜂1 log 𝑝)/𝑛, (3.4) for all 𝜂1 > 0. To prove (3.4), we first verify the conditions (A4) and (A5) in Chen and Chen [115]. Given any index 𝑆 such that M ⊆ 𝑆 and |𝑆| ≤ 𝑞, let β∗𝑆 be the subvector of β∗ corresponding to 𝑆. We obtain that T     𝑇   𝐸 (𝑌 − 𝜇(β∗𝑆 X𝑆 ))X𝑆 = 𝐸 𝐸 (𝑌 − 𝜇(β∗M XM ))|X𝑆 X𝑆 = 0. This implies β𝑆∗ = β∗𝑆 .   2 Given any π ∈ R|𝑆| , let H𝑆 := ℎ(π, β𝑆 ) = (𝜎max 𝐾 2 |𝑆|) −1 𝜎 β𝑆T X𝑆 π T X𝑆 , kπk = 1, β𝑆 ∈  B𝑆0 (𝑑) . By Conditions (1) and (2), ℎ(π, β𝑆 ) is bounded between −1 and 1 uniformly over kπk = 1 and β𝑆 ∈ B𝑆0 (𝑑). p By Lemma 2.6.15 in van der Vaart et al. [124], the VC indices of W := {(𝐾 |𝑆|) −1 π T X𝑆 , kπk = 1} and V := {β𝑆T X𝑆 , β𝑆 ∈ B𝑆0 (𝑑)} are bounded by |𝑆| + 2. For the definitions of the VC index 41 and covering numbers, we refer to pages 83 and 85 in van der Vaart et al. [124]. The VC index of the class U := {(𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 , kπk = 1} is the VC index of the class of sets {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡, kπk = 1, 𝑡 ∈ R}. p √ Since {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡} = {(X𝑆 , 𝑡) : 0 < (𝐾 |𝑆|) −1 π T X𝑆 ≤ 𝑡} ∪ {(X𝑆 , 𝑡) : √ p − 𝑡 < (𝐾 |𝑆|) −1 π T X𝑆 ≤ 0}, each set of {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡, kπk = 1, 𝑡 ∈ R} is created by taking finite unions, intersections, and complements of of the basic sets {(X𝑆 , 𝑡) : p (𝐾 |𝑆|) −1 π T X𝑆 < 𝑡}. Therefore, the VC index of {(X𝑆 , 𝑡) : (𝐾 2 |𝑆|) −1 (π T X𝑆 ) 2 ≤ 𝑡, kπk = 1, 𝑡 ∈ p R} is of the same order as the VC index of {(X𝑆 , 𝑡) : (𝐾 |𝑆|) −1 π T X𝑆 < 𝑡}, by Lemma 2.6.17 in [124]. Then by Theorem 2.6.7 in van der Vaart et al. [124], for any probability measure 𝑄, there exists some universal constant 𝐶3 such that 𝑁 (𝜖, U, 𝐿 2 (𝑄)) ≤ (𝐶3 /𝜖) 2(|𝑆|+1) . Likewise, 𝑁 (𝑑𝜖, V, 𝐿 2 (𝑄)) ≤ (𝐶3 /𝜖) 2(|𝑆|+1) . Given a β𝑆,0 ∈ B𝑆0 (𝑑), for any β𝑆 in the ball {β𝑆 : supx |β𝑆T x − β𝑆,0 T x| < 𝑑𝜖 }, we have supx |𝜎(β𝑆T x) − 𝜎(β𝑆,0 T x)| < 𝐾 𝑑𝜖 by Condition (4).Let V 0 := {𝜎max −1 𝜎(β T X ), β ∈ B 0 (𝑑)}. 𝑆 𝑆 𝑆 𝑆 By the definition of covering number, 𝑁 (𝐾 𝑑𝜖, V 0, 𝐿 2 (𝑄)) ≤ (𝐶3 /𝜖) 2(|𝑆|+1) Given a 𝜎(β𝑆,0 T x) and π0T x, for any 𝜎(β𝑆T x) in the ball {𝜎(β𝑆T x) : supx |𝜎(β𝑆T x) − 𝜎(β𝑆,0 T x)| ≤ 𝐾 𝑑𝜖 } and π in the ball {π : supx |(π T x) 2 − (π0T x) 2 | < 𝜖 }, (𝜎max 𝐾 2 |𝑆|) −1 supx |𝜎(β𝑆T x)(π T x) 2 − 𝜎(β𝑆,0 T x)(π0T x) 2 | ≤ −1 𝐾 𝑑 + (𝐾 2 |𝑆|) −1 )𝜖. Thus, 𝑁 ((𝜎 −1 𝐾 𝑑 + (𝐾 2 |𝑆|) −1 )𝜖, H , 𝐿 (𝑄)) ≤ (𝐶 /𝜖) 4(|𝑆|+1) , and (𝜎max max 𝑆 2 3 consequently 𝑁 (𝜖, H𝑆 , 𝐿 2 (𝑄)) ≤ (𝐶4 /𝜖) 4(|𝑆|+1) for some constant 𝐶4 . By Theorem 1.1 in Talagrand [125] and |𝑆| ≤ 𝑞, we can find some constant 𝐶5 such that ! p 𝑃 sup |G𝑛 [ℎ(π, β𝑆 )]| ≥ 𝐶5 𝑞 log 𝑝 k π k=1,β𝑆 ∈B𝑆0 (𝑑) ! 4(|𝑆|+1) 𝐶40 𝐶40 𝐶52 𝑞 log 𝑝 ≤ p exp(−2𝐶52 𝑞 log 𝑝) 𝐶5 𝑞 log 𝑝 4(|𝑆| + 1)   ≤ exp 4(|𝑆| + 1) log(𝐶40 𝐶52 𝑞 log 𝑝) − 2𝐶52 𝑞 log 𝑝 ≤ exp (−5𝑞 log 𝑝) , 42 where 𝐶40 is some constant that depends on 𝐶4 only. Thus,  n   2o h   2i 𝑃 sup|𝑆|≤𝑞,k π k=1,β𝑆 ∈B 0 (𝑑) E𝑛 𝜎 XT𝑆 β𝑆 π T X𝑆 − 𝐸 𝜎 XT𝑆 β𝑆 π T X𝑆 𝑆 p Í ≥ 𝐶5 𝐾 2 𝑞 3 log 𝑝/𝑛 ≤𝑠=|M | 𝑒𝑝 𝑠 𝑞 𝑠 exp (−5𝑞 log 𝑝) ≤ exp(−3𝑞 log 𝑝). (3.5)  h   i By Condition (5), 𝜎min 𝜅min ≤ Λ 𝐸 𝜎 XT𝑆 β𝑆 X𝑆⊗2 ≤ 𝜎max 𝜅max , for all β𝑆 ∈ B𝑆0 (𝑑) and 𝑆 : M ⊆ 𝑆, |𝑆| < 𝑞. Then, by (3.5),  n   o T ⊗2 𝜎min 𝜅min /2 ≤ Λ E𝑛 𝜎 X𝑆 β∗𝑆 X𝑆 ≤ 2𝜎max 𝜅max uniformly over all 𝑆 satisfying M ⊆ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − exp(−3𝑞 log 𝑝). Hence, the condition (A4) in [115] is satisfied with probability at least 1 − exp(−3𝑞 log 𝑝). 
Also for any β𝑆 ∈ B𝑆0 (𝑑), n   2 o n   2 o T T T T E𝑛 𝜎 X 𝑆 β 𝑆 π X 𝑆 − E𝑛 𝜎 X𝑆 β∗𝑆 π X𝑆 n   2 o n   2 o ≤ 𝑛−1/2 G𝑛 𝜎 XT𝑆 β𝑆 π T X𝑆 + 𝑛−1/2 G𝑛 𝜎 XT𝑆 β∗𝑆 π T X𝑆 h   2 i h   2 i + 𝐸 𝜎 XT𝑆 β𝑆 π T X𝑆 − 𝐸 𝜎 XT𝑆 β∗𝑆 π T X𝑆 q p 2 ≤ 2𝐶5 𝐾 𝑞 3 log 𝑝/𝑛 + 𝜇max kβ𝑆 − β∗𝑆 k |𝑆|𝐾𝜆max . Hence, the condition (A5) in Chen and Chen [115] is satisfied uniformly over all 𝑆 such that M ⊆ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − exp(−3𝑞 log 𝑝). Then (3.4) can be shown by following the proof of Equation (3.2) in Chen and Chen [115]. Thus, our forward stage stops at the 𝑘th step with probability at least 1 − exp(−3𝑞 log 𝑝).  Proof of Theorem 3.2.4. Suppose that a covariate 𝑋𝑟 is removed from 𝑆. For any 𝑟 ∈ M, since M * 𝑆\{𝑟} and 𝑟 is the only element that is in (𝑆\{𝑟}) 𝑐 ∩ M, by Lemma 3.4.1 and (3.2) ℓ𝑆 ( β̂𝑆 ) − ℓ𝑆\{𝑟} ( β̂𝑆\{𝑟} ) ≥ ℓ𝑆 (β𝑆∗ ) − ℓ𝑆\{𝑟} ( β̂𝑆\{𝑟} ) ∗ = ℓ𝑆\{𝑟}∪{𝑟} (β𝑆\{𝑟}+𝑟 ) − ℓ𝑆\{𝑟} ( β̂𝑆\{𝑟} ) ≥ 𝐶1 𝑛−2𝛼 , 43 with probability at least 1 − 6 exp(−6𝑞 log 𝑝). From the proof of Theorem 3.2.1, we have for any 𝜂2 > 0, BIC(𝑆) − BIC(𝑆\{𝑟}) ≤ −2𝐶1 𝑛−2𝛼 + 𝜂2 𝑛−1 log 𝑛 < 0, uniformly over 𝑟 ∈ M and 𝑆 satisfying M ⊂ 𝑆 and |𝑆| ≤ 𝑞, with probability at least 1 − 6 exp(−6𝑞 log 𝑝).  Proof of Theorem 3.2.5. By Theorems 3.2.1, 3.2.2, and 3.2.3, we have that the event Ω1 := {| M̂ | ≤ 𝑞 and M ⊆ M̂} holds with probability at least 1 − 25 exp(−2𝑞 log 𝑝). Thus, in the rest of the proof, we restrict our attention on Ω1 . As shown in the proof of Theorem 3.2.3, we obtain that β ∗ = β∗M̂ . Then by Lemma 3.4.6, we M̂ p ∗ have k β̂M̂ −β k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛 with probability at least 1−6 exp(−6𝑞 log 𝑝). Withdrawing M̂ the attention on Ω1 , we obtain that q ∗ k β̂ − β∗ k = k β̂M̂ − β∗M̂ k = k β̂M̂ − βM̂ k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛, with probability at least 1 − 31 exp(−2𝑞 log 𝑝).  3.4 Additional Lemmas Lemma 3.4.1. Given a model 𝑆 such that |𝑆| < 𝑞, M * 𝑆, under Condition (6), (i): ∃𝑟 ∈ 𝑆 𝑐 ∩ M, such that β𝑆+𝑟 ∗ ≠ (β ∗T , 0) T . 𝑆 (ii): Suppose Conditions (1), (2), and (6’) hold. ∃𝑟 ∈ 𝑆 𝑐 ∩ M, such that kβ𝑆+𝑟 ∗T − (β ∗T , 0)k ≥ 𝑆 𝐶𝜎max−1 𝜅 −1 𝑛−𝛼 . max ∗     ∗ Proof. As β𝑆+ 𝑗 is the maximizer of 𝐸 ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) , by the concavity of 𝐸 ℓ𝑆∪{ 𝑗 } (β𝑆+ 𝑗 ) , β𝑆+ 𝑗 h  i is the solution to the equation 𝐸 𝑌 − 𝜇 β𝑆∗𝑇 X𝑆 + 𝛽 𝑗 𝑋 𝑗 X𝑆∪{ 𝑗 } = 0, (𝑖): Suppose that β𝑆+ ∗ ∗T T 𝑗 = (β𝑆 , 0) , ∀ 𝑗 ∈ 𝑆 ∩ M. Then, 𝑐 0 = 𝐸 𝑌 − 𝜇 β𝑆∗T X𝑆 𝑋 𝑗 = 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗𝑇 X𝑆 𝑋 𝑗        T  ∗𝑇    ⇒ max 𝐸 𝜇 β ∗ X − 𝜇 β 𝑆 X 𝑆 𝑋 𝑗 = 0, 𝑗 ∈𝑆 ∩M 𝑐 which violates the Condition (6). Therefore, we can find a 𝑟 ∈ 𝑆 𝑐 ∩ M, such that β𝑆+𝑟 ∗ ≠ (β ∗T , 0) T . 𝑆 44 (𝑖𝑖): Let 𝑟 ∈ 𝑆 𝑐 ∩M satisfy that 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 > 𝐶𝑛−𝛼 . Without loss of generality,     we assume that 𝑋𝑟 is the last element of X𝑆∪{𝑟} . By the mean value theorem, 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟     = 𝐸 𝜇 β∗T X − 𝜇 β𝑆∗T X𝑆 𝑋𝑟 − 𝐸 𝜇 β∗T X − 𝜇 β𝑆+𝑟 ∗T X         𝑆∪{𝑟} 𝑋𝑟 ∗T X ∗T     = 𝐸 𝜇 β𝑆+𝑟 𝑆∪{𝑟} − 𝜇 (β𝑆 , 0)X𝑆∪{𝑟} 𝑋𝑟 ∗T − (β ∗T , 0) 𝐸 𝜎 β̃ T X    ⊗2  = β𝑆+𝑟 𝑆 𝑆+𝑟 𝑆∪{𝑟} X𝑆∪{𝑟} e𝑟 , (3.1) where β̃𝑆+𝑟 is some point between β𝑆+𝑟 ∗ and (β ∗T , 0) T and e is a vector of length (|𝑆| + 1) with the 𝑆 𝑟 𝑟th element being 1. As β̃𝑆+𝑟 is some point between β𝑆+𝑟 ∗ and (β ∗T , 0) T , 𝑆 ∗T T | β̃𝑆+𝑟 X𝑆∪{𝑟} | ≤ |β𝑆+𝑟 X𝑆∪{𝑟} | + |(β𝑆∗T , 0)X𝑆∪{𝑟} | ≤ 2𝐾 2 , by Conditions (1) and (2). Thus, |𝜎( β̃𝑆+𝑟 T X 𝑆∪{𝑟} )| ≤ 𝜎max . By (3.1) and Condition (5), h  i −𝛼 β∗T X β𝑆∗T X𝑆   𝐶𝑛 ≤ 𝐸 𝜇 −𝜇 𝑋𝑟    ∗T ≤ kβ𝑆+𝑟 − (β𝑆∗T , 0)k𝜎max𝜆 max 𝐸 X𝑆∪{𝑟} ⊗2 ke𝑟 k ∗T ≤ 𝜎max 𝜅max kβ𝑆+𝑟 − (β𝑆∗T , 0)k. ∗T − (β ∗T , 0)k ≥ 𝐶𝜎 −1 𝜅 −1 𝑛−𝛼 . Therefore, kβ𝑆+𝑟  𝑆 max max Lemma 3.4.2. Let 𝜉𝑖 , 𝑖 = 1, . . . 
, 𝑛 be 𝑛 i.i.d random variables such that |𝜉𝑖 | ≤ 𝐵 for a constant √ 𝐵 > 0. Under Conditions (1) – (3), we have 𝐸 [|𝑌𝑖 𝜉𝑖 − 𝐸 [𝑌𝑖 𝜉𝑖 ] | 𝑚 ] ≤ 𝑚!(2𝐵( 2𝑀 + 𝜇max )) 𝑚 , for every 𝑚 ≥ 1. Proof. By Conditions (1) and (2), |β∗T X𝑖 | ≤ 𝐾 𝐿, ∀𝑖 ≥ 1 and consequently 𝜇(β∗T X𝑖 ) ≤ 𝜇max . Then 45 by Condition (3), 𝑚   Õ 𝑚 𝜇(β∗T X𝑖 )| 𝑚 ]   𝑚−𝑡 𝑚 𝐸 [|𝑌𝑖 | ] = 𝐸 [|𝜖𝑖 + ≤ 𝐸 |𝜖𝑖 | 𝑡 𝜇max 𝑡=0 𝑡 𝑚   Õ 𝑚 𝑡 𝑚−𝑡 ≤ 𝑡! 𝑀 𝜇max ≤ 𝑚!(𝑀 + 𝜇max ) 𝑚 , 𝑡=0 𝑡 for every 𝑚 ≥ 1. By the same arguments, it can be shown that, for every 𝑚 ≥ 1, 𝐸 [|𝑌𝑖 𝜉𝑖 − 𝐸 [𝑌𝑖 𝜉𝑖 ] | 𝑚 ] ≤ 𝐸 [(|𝑌𝑖 𝜉𝑖 | + |𝐸 [𝑌𝑖 𝜉𝑖 ] |) 𝑚 ] ≤ 𝑚!(2𝐵(𝑀 + 𝜇max )) 𝑚 .  p Lemma 3.4.3. Under Conditions (1) – (3), when 𝑛 is sufficiently large such that 28 𝑞 log 𝑝/𝑛 < 1, we have supβ ∈B E𝑛 𝐿 (β T X, 𝑌 ) ≤ 2(𝑀 + 𝜇max )𝐾 3 + 𝑏 max , with probability 1−2 exp(−10𝑞 log 𝑝).  Proof. By Conditions (2), supβ ∈B β T X ≤ 𝐾 3 . Thus, sup E𝑛 𝐿 (β T X, 𝑌 ) ≤ sup E𝑛 𝑌 β T X   + 𝑏 max β ∈B β ∈B ≤ E𝑛 {|𝑌 | − 𝐸 [|𝑌 |]} + 𝐸 [|𝑌 |] 𝐾 3 + 𝑏 max  ≤ E𝑛 {|𝑌 | − 𝐸 [|𝑌 |]} 𝐾 3 + (𝑀 + 𝜇max )𝐾 3 + 𝑏 max ,  where the last inequality follows from that 𝐸 [|𝑌 |] ≤ 𝑀 + 𝜇max as shown in the proof of Lemma 3.4.2.  𝑚 Let 𝜉𝑖 = 1{𝑌𝑖 > 0}−1{𝑌𝑖 < 0}. Thus |𝜉𝑖 | ≤ 1. By Lemma 3.4.2, we have 𝐸 |𝑌𝑖 | − 𝐸 [|𝑌𝑖 |] ≤ 𝑚!(2(𝑀 + 𝜇max )) 𝑚 . Applying Bernstein’s inequality (e.g., Lemma 2.2.11 in van der Vaart et al. [124]) yields that  p  𝑃 |E𝑛 {|𝑌 | − 𝐸 [|𝑌 |]}| > 10(𝑀 + 𝜇max ) 𝑞 log 𝑝/𝑛   196𝑞 log 𝑝 ≤ 2 exp − 21 √ ≤ 2 exp(−10𝑞 log 𝑝), (3.2) 4+20 𝑞 log 𝑝/𝑛 46 p p when 𝑛 is sufficiently large such that 20 𝑞 log 𝑝/𝑛 < 1. Since 10(𝑀 + 𝜇max ) 𝑞 log 𝑝/𝑛 = 𝑜(1), then ! 𝑃 sup E𝑛 𝐿 (β T X, 𝑌 ) ≥ 2(𝑀 + 𝜇max )𝐾 3 + 𝑏 max ≤ 2 exp(−10𝑞 log 𝑝).  β ∈B  p Lemma 3.4.4. Given an index set 𝑆 and 𝑟 ∈ 𝑆 𝑐 , let B𝑆0 (𝑑) = {β𝑆 : kβ𝑆 − β𝑆∗ k ≤ 𝑑/(𝐾 |𝑆|)} and 𝐴1 := (𝑀 + 2𝜇max ). Under Conditions (1) – (3), when 𝑛 is sufficiently large such that p 10 𝑞 log 𝑝/𝑛 < 1, we have h    i p 1. |G𝑛 𝐿 β𝑆T X𝑆 , 𝑌 − 𝐿 β𝑆∗T X𝑆 , 𝑌 | ≤ 20𝐴1 𝑑 𝑞 log 𝑝, uniformly over β𝑆 ∈ B𝑆0 (𝑑) and |𝑆| ≤ 𝑞, with probability at least 1 − 4 exp(−6𝑞 log 𝑝). p 2. |G𝑛 𝐿 (β𝑆∗T X𝑆 , 𝑌 ) | ≤ 10( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝, uniformly over |𝑆| ≤ 𝑞, with probability   at least 1 − 2 exp(−8𝑞 log 𝑝). p Proof. : (1): Let R |𝑆| (𝑑) be a |𝑆|-dimensional ball with center at 0 and radius 𝑑/(𝐾 |𝑆|). Then B𝑆0 (𝑑) = R |𝑆| (𝑑)+β𝑆∗ . Let C|𝑆| := {C(ξ 𝑘 )} be a collection of cubes that cover the ball 𝑅 |𝑆| (𝑑), where p C(ξ 𝑘 ) is a cube containing ξ 𝑘 with sides of length 𝑑/(𝐾 |𝑆|𝑛2 ) and ξ 𝑘 is some point in R |𝑆| (𝑑). As p  |𝑆| p the volume of C(ξ 𝑘 ) is 𝑑/(𝐾 |𝑆|𝑛2 ) and the volume of R |𝑆| (𝑑) is less than (2𝑑/(𝐾 |𝑆|)) |𝑆| , we can select ξ 𝑘 ’s so that no more than (4𝑛2 ) |𝑆| cubes are needed to cover R |𝑆| (𝑑). We thus assume |C|𝑆| | ≤ (4𝑛2 ) |𝑆| . For any ξ ∈ C(ξ 𝑘 ), kξ − ξ 𝑘 k ≤ 𝑑/(𝐾𝑛2 ). In addition, let 𝑇1𝑆 (ξ) := E𝑛 𝑌 ξ T X𝑆 ,     T  𝑇2𝑆 (ξ) := E𝑛 𝑏 β𝑆∗ + ξ X𝑆 − 𝑏 β𝑆∗T X𝑆 , and 𝑇𝑆 (ξ) := 𝑇1𝑆 (ξ) − 𝑇2𝑆 (ξ).  Given any ξ ∈ R |𝑆| (𝑑), there exists C(ξ 𝑘 ) ∈ C|𝑆| such that ξ ∈ C(ξ 𝑘 ). Then |𝑇𝑆 (ξ) − 𝐸 [𝑇𝑆 (ξ)]| ≤ |𝑇𝑆 (ξ) − 𝑇𝑆 (ξ 𝑘 )| |𝑇𝑆 (ξ 𝑘 ) − 𝐸 [𝑇𝑆 (ξ 𝑘 )]| + |𝐸 [𝑇𝑆 (ξ)] − 𝐸 [𝑇𝑆 (ξ 𝑘 )]| =: 𝐼 + 𝐼 𝐼 + 𝐼 𝐼 𝐼. We deal with 𝐼 𝐼 𝐼 first. By the mean value theorem, there exists a ξ̃ between ξ and ξ 𝑘 such that   T  |𝐸 [𝑇𝑆 (ξ 𝑘 )] − 𝐸 [𝑇𝑆 (ξ)]| = 𝐸 𝑌 (ξ 𝑘 − ξ) T X𝑆 + 𝐸 𝜇 β𝑆∗ + ξ̃ X𝑆 (ξ 𝑘 − ξ) T X𝑆    p p ≤ 𝐸 [|𝑌 |] kξ 𝑘 − ξkkX𝑆 k + 𝜇max kξ 𝑘 − ξkkX𝑆 k ≤ (𝑀 + 2𝜇max )𝑑 |𝑆|𝑛−2 = 𝐴1 𝑑 |𝑆|𝑛−2(3.3) , 47 where the last inequality follows from Lemma 3.4.2 and 𝐴1 = 𝑀 + 2𝜇max . p p Next, we evaluate 𝐼 𝐼. By Condition (2), |X𝑖𝑆 T ξ| ≤ kX kkξk ≤ 𝑑/(𝐾 |𝑆|) |𝑆|𝐾 = 𝑑, for all 𝑖𝑆 ξ ∈ R |𝑆| (𝑑). 
Then by Lemma 3.4.2, h  T  𝑚i 𝐸 𝑌 ξ 𝑘T X𝑆 − 𝐸 𝑌 ξ 𝑘 X𝑆 ≤ 𝑚!(2(𝑀 + 𝜇max )𝑑) 𝑚 . p By Bernstein’s inequality, when 𝑛 is sufficiently large such that 10 𝑞 log 𝑝/𝑛 ≤ 1.  q  𝑃 |𝑇1𝑆 (ξ 𝑘 ) − 𝐸 [𝑇1𝑆 (ξ 𝑘 )]| > 10(𝑀 + 𝜇max )𝑑 𝑞𝑛 log 𝑝 −1 ! 1 100𝑞 log 𝑝 ≤ 2 exp − p ≤ 2 exp(−10𝑞 log 𝑝). (3.4) 2 4 + 20 𝑞 log 𝑝/𝑛  T Since |𝑏( β𝑆∗ + ξ 𝑘 X𝑆 ) − 𝑏(β𝑆∗T X𝑆 )| ≤ 𝜇max 𝑑, by the same arguments used for (3.4), we have  q  𝑃 |𝑇2𝑆 (ξ 𝑘 ) − 𝐸 [𝑇2𝑆 (ξ 𝑘 )]| > 10𝜇max 𝑑 𝑞𝑛−1 log 𝑝 ≤ 2 exp(−10𝑞 log 𝑝). (3.5) Combining (3.4) and (3.5) yields that uniformly over ξ 𝑘 q |𝑇𝑆 (ξ 𝑘 ) − 𝐸 [𝑇𝑆 (ξ 𝑘 )]| ≤ 10𝐴1 𝑑 𝑞𝑛−1 log 𝑝, (3.6) with probability at least 1 − 2(4𝑛2 ) |𝑆| exp(−10𝑞 log 𝑝). We now assess 𝐼. Following the same arguments as in Lemma 3.4.3,  p  𝑃 sup |𝑇𝑆 (ξ) − 𝑇𝑆 (ξ 𝑘 )| > (2𝑀 + 3𝜇max )𝑑 |𝑆|𝑛−2 ≤ 2 exp(−8𝑞 log 𝑝). (3.7) ξ ∈C( ξ 𝑘 ) p p Since |𝑆|𝑛−2 = 𝑜( 𝑞𝑛−1 log 𝑝), combining (3.3), (3.6), and (3.7) together yields that  q  𝑃 sup |𝑇𝑆 (ξ) − 𝐸 [𝑇𝑆 (ξ)]| ≥ 20𝐴1 𝑑 𝑞𝑛−1 log 𝑝 ξ ∈R |𝑆 | (𝑑) ≤ 2(4𝑛2 ) |𝑆| exp(−10𝑞 log 𝑝) + 2 exp(−8𝑞 log 𝑝) ≤ 4 exp(−8𝑞 log 𝑝). 𝑝 By the combinatoric inequality 𝑠 ≤ (𝑒 𝑝/𝑠) 𝑠 , we obtain that  h    i p  𝑃 sup G𝑛 𝐿 β𝑆T X𝑆 , 𝑌 − 𝐿 β𝑆∗T X𝑆 , 𝑌 ≥ 20𝐴1 𝑑 𝑞 log 𝑝 |𝑆|≤𝑞,β𝑆 ∈B𝑆0 (𝑑1 ) Õ 𝑞 ≤ (𝑒 𝑝/𝑠) 𝑠 4 exp(−8𝑞 log 𝑝) ≤ 4 exp(−6𝑞 log 𝑝). 𝑠=1 48 (2): We evaluate the 𝑚th moment of 𝐿(β𝑆∗ X𝑆 , 𝑌 ). " 𝑚   # Õ 𝑚 𝐸 𝑌 β𝑆∗ X𝑆 − 𝑏(β𝑆∗ X𝑆 ) |𝑌 | 𝑡 𝐾 2𝑡 𝑏 max   𝑚  𝑚−𝑡 ≤𝐸 𝑡=0 𝑡 𝑚   Õ 𝑚 𝑡! (𝑀 + 𝜇max )𝐾 2 𝑏 max ≤ 𝑚!((𝑀 + 𝜇max )𝐾 2 + 𝑏 max ) 𝑚 .  𝑡 𝑚−𝑡 ≤ 𝑡=0 𝑡 Then, by Bernstein’s inequality,  p  𝑃 |G𝑛 𝐿(β𝑆∗T X𝑆 , 𝑌 ) | > 10( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝 ≤ 2 exp(−10𝑞 log 𝑝).   By the same arguments used in (i), we obtain that  h  i p  𝑃 sup G𝑛 𝐿 β𝑆∗T X𝑆 , 𝑌 ≥ 10( 𝐴1 𝐾 2 + 𝑏 max ) 𝑞 log 𝑝 |𝑆|≤𝑞 Õ 𝑞 ≤ (𝑒 𝑝/𝑠) 𝑠 2 exp(−10𝑞 log 𝑝) ≤ 2 exp(−8𝑞 log 𝑝). 𝑠=1  Lemma 3.4.5. Given a model 𝑆 and 𝑟 ∈ 𝑆 𝑐 , under Conditions (1), (2), and (5), for any p kβ𝑆 − β𝑆∗ k ≤ 𝐾/ |𝑆|, 𝜎min 𝜅min kβ𝑆 − β𝑆∗ k 2 /2 ≤ 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )] ≤ 𝜎max 𝜅max kβ𝑆 − β𝑆∗ k 2 /2.   Proof. Due to the concavity of the log-likelihood in GLMs with the canonical link, p 𝐸 𝑌 X𝑆 − 𝜇(β𝑆∗T X𝑆 )X𝑆 = 0. Then for any kβ𝑆 − β𝑆∗ k ≤ 𝐾/ |𝑆|,   p p |β T X𝑆 | ≤ |β ∗T X𝑆 | + |(β − β ∗ ) T X𝑆 | ≤ 𝐾 2 + 𝐾/ |𝑆| × 𝐾 |𝑆| = 2𝐾 𝐿. Thus, by Taylor’s expansion, 1 h   i 𝐸 [ℓ𝑆 (β𝑆 )] − 𝐸 ℓ𝑆 (β𝑆∗ ) = − (β𝑆 − β𝑆∗ ) T 𝐸 𝜎 β̃𝑆T X𝑆 X𝑆⊗2 (β𝑆 − β𝑆∗ ),   2 where β̃𝑆 is between β𝑆 and β𝑆∗ . By Condition (5), 𝜎min 𝜅min kβ𝑆 − β𝑆∗ k 2 /2 ≤ 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )] ≤ 𝜎max 𝜅max kβ𝑆 − β𝑆∗ k 2 /2.    49 Lemma 3.4.6. Under Conditions (1) – (6), there exist some constants 𝐴2 and 𝐴3 that do not p depend on 𝑛, such that k β̂𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛 and |ℓ𝑆 ( β̂𝑆 ) − ℓ𝑆 (β𝑆∗ )| ≤ 𝐴3 𝑞 2 log 𝑝/𝑛 hold uniformly over 𝑆 : |𝑆| ≤ 𝑞, with probability at least 1 − 6 exp(−6𝑞 log 𝑝). Proof. Define n h    i p o Ω(𝑑) := sup G𝑛 𝐿 β𝑆T X𝑆 , 𝑌 − 𝐿 β𝑆∗T X𝑆 , 𝑌 < 20𝐴1 𝑑 𝑞 log 𝑝 . |𝑆|≤𝑞,β𝑆 ∈B𝑆0 (𝑑) By Lemma 3.4.4, the event Ω(𝑑) holds with probability at least 1 − 4 exp(−6𝑞 log 𝑝). Thus, p in the proof of Lemma 3.4.6, we shall assume Ω(𝑑) hold with 𝑑 = 𝐴2 𝑞 3 log 𝑝/𝑛 for some 𝐴2 > 20(𝜎min 𝜅 min ) −1 𝐾 2 𝐴1 . p p p p For any kβ𝑆 − β𝑆∗ k = 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛, since 𝑞 2 log 𝑝/𝑛 ≤ 𝑞 3 log 𝑝/𝑛/ |𝑆|, β𝑆 ∈ B𝑆0 (𝑑). By Lemma 3.4.5, ℓ𝑆 (β𝑆∗ ) − ℓ𝑆 (β𝑆 )   = ℓ𝑆 (β𝑆∗ ) − 𝐸 ℓ𝑆 (β𝑆∗ ) − (ℓ𝑆 (β𝑆 ) − 𝐸 [ℓ𝑆 (β𝑆 )]) + 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )]      p ≥ 𝜎min 𝜅min kβ𝑆 − β𝑆∗ k 2 /2 − 20𝐴1 𝑑 𝑞 log 𝑝/𝑛 = 𝜎min 𝜅min 𝐴22 𝑞 2 log 𝑝/(𝐾 2 𝑛) − 20𝐴1 𝐴2 𝑞 2 log 𝑝/𝑛 > 0. Thus, inf √ ℓ𝑆 (β𝑆∗ ) − ℓ𝑆 (β𝑆 ) > 0. |𝑆|≤𝑞,k β𝑆 −β𝑆∗ k=𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛 p Then by the concavity of ℓ𝑆 (·), we obtain that max|𝑆|≤𝑞 β̂𝑆 − β𝑆∗ ≤ 𝐴2 𝐾 −1 𝑞 2 𝑛−1 log 𝑝. 
p On the other hand, for any kβ𝑆 − β𝑆∗ k ≤ 𝐴2 𝐾 −1 𝑞 2 log 𝑝/𝑛, ℓ𝑆 (β𝑆∗ ) − ℓ𝑆 (β𝑆 ) ≤ ℓ𝑆 (β𝑆∗ ) − 𝐸 ℓ𝑆 (β𝑆∗ ) − (ℓ𝑆 (β𝑆 ) − 𝐸 [ℓ𝑆 (β𝑆 )]) + 𝐸 ℓ𝑆 (β𝑆∗ ) − 𝐸 [ℓ𝑆 (β𝑆 )]     p ≤ 𝜎max 𝜅max kβ𝑆 − β𝑆∗ k 2 /2 + 20𝐴1 𝑑 𝑞 log 𝑝/𝑛 ≤ 𝐴3 𝑞 2 𝑛−1 log 𝑝, 50 where 𝐴3 := 4𝜎max𝜆max 𝐴22 𝐾 −2 + 20𝐴1 𝐴2 .  3.5 Simulations We compared the proposal with the other competing methods, including the penalized methods such as least absolute shrinkage and selection operator (LASSO), the differential geometric least angle regression (dgLARS) [90, 91], the forward regression (FR) approach [93], the sequentially conditioning (SC) approach [99], and the screening method such as sure independence screening (SIS) [87], which is popular in practice. As SIS does not directly generate a predictive model, we applied LASSO for the top [𝑛/log(𝑛)] variables chosen by SIS and denoted the procedure by SIS+LASSO. As the FR, SC and STEPWISE approaches involve forward searching and to make them comparable, we applied the same stopping rule, for example, Equation (3.3) with the same 𝛾, to their forward steps. In particular, the STEPWISE approach, with 𝜂1 = 𝛾 and 𝜂2 = 0, is equivalent to FR and asymptotically equivalent to SC. By varying 𝛾 in FR and SC between 𝛾 𝐿 and 𝛾𝐻 , we explored the impact of 𝛾 on inducing false positives and negatives. In our numerical studies, we fixed 𝛾𝐻 = 10 and set 𝛾 𝐿 = 𝜂1 . To choose 𝜂1 and 𝜂2 in (3.3) and (3.4) in STEPWISE, we performed 5-fold cross-validation to minimize the mean squared prediction error (MSPE), and reported the results in Table 3.1. Since the proposed STEPWISE algorithm uses the (E)BIC criterion, for a fair comparison we chose the tuning parameter in dgLARS by using the BIC criterion as well, and coined the corresponding approach as dgLARS(BIC). The regularization parameter in LASSO was chosen via the following two approaches: 1) giving the smallest BIC for the models on the LASSO path, denoted by LASSO(BIC); 2) using the one- standard-error rule, denoted by LASSO(1SE), which chooses the most parsimonious model whose error is no more than one standard error above the error of the best model in cross-validation [126]. 51 Table 3.1: The values of 𝜂1 and 𝜂2 used in the simulation studies Normal Model Binomial Model Poisson Model Example 1 (0.5, 3) (0.5, 3) (1, 3) Example 2 (0.5, 3) (1, 3) (1, 3) Example 3 (1, 3) (0.5, 3) (0.5, 1) Example 4 (1, 3.5) (0, 1) (1, 3) Example 5 (0.5, 3) (0.5, 2) (0.5, 3) Example 6 (0.5, 3) (0.5, 3) (1, 3) Example 7 (0.5, 3) (0.5, 3) (0.5, 4.5) Note: Values for 𝜂1 and 𝜂2 were searched on the grid {0, 0.25, 0.5, 1} and {1, 2, 3, 3.5, 4, 4.5, 5}, respectively. Denote by X𝑖 = (𝑋𝑖1 , . . . , 𝑋𝑖 𝑝 ) T and β = (𝛽1 , . . . , 𝛽 𝑝 ) T , the covariate vector for subject 𝑖 (1, . . . , 𝑛) and the true coefficient vector. The following five examples generated X𝑖T β, the inner product of the coefficient and covariate vectors for each individual, which were used to generate outcomes from the Normal, Binomial, and Poisson models. Example 1: For each 𝑖, 𝑝0 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝛽 𝑗 𝑋𝑖 𝑗 ® , 𝑖 = 1, . . . , 𝑛, ª « 𝑗=1 𝑗=𝑝 0 +1 ¬ √ where 𝛽 𝑗 = (−1) 𝑗 (4log 𝑛/ 𝑛 + |𝑍 𝑗 |), for 𝑗 = 1, . . . , 𝑝 0 and 𝛽 𝑗 =0, otherwise; 𝐵 𝑗 was a binary 𝐵 random variable with 𝑃(𝐵 𝑗 = 1) = 0.4 and 𝑍 𝑗 was generated by a standard normal distribution; 𝑝 0 = 8; 𝑋𝑖 𝑗 ’s were independently generated from a standardized exponential distribution, that is, exp(1) − 1. Here and also in the other examples, 𝑐 (specified later) controls the signal strengths. 
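As a concrete reading of this design, Example 1 can be generated with a few lines of numpy. The snippet below is our paraphrase of the description above (the random seed and variable names are ours, the coefficient magnitude is read as 4 log 𝑛/√𝑛 + |𝑍 𝑗 |, as in Example 4, and 𝑐 is the signal multiplier specified later).

```python
import numpy as np

rng = np.random.default_rng(2022)
n, p, p0, c = 400, 1000, 8, 1.0

# beta_j = (-1)^{B_j} (4 log n / sqrt(n) + |Z_j|) for j <= p0, and 0 otherwise
B = rng.binomial(1, 0.4, size=p0)              # P(B_j = 1) = 0.4
Z = rng.standard_normal(p0)
beta = np.zeros(p)
beta[:p0] = (-1.0) ** B * (4.0 * np.log(n) / np.sqrt(n) + np.abs(Z))

# Covariates: independent standardized exponentials, i.e. exp(1) - 1
X = rng.exponential(scale=1.0, size=(n, p)) - 1.0
lin_pred = c * (X @ beta)                      # c * X_i^T beta for each subject
```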
Example 2: This scenario is the same as Example 1 except that 𝑋𝑖 𝑗 was independently generated from a standard normal distribution. Example 3: For each 𝑖, 𝑝0 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝛽 𝑗 𝑋𝑖∗𝑗 ® , 𝑖 = 1, . . . , 𝑛, ª « 𝑗=1 𝑗=𝑝 0 +1 ¬ where 𝛽 𝑗 = 2j for 1 ≤ 𝑗 ≤ 𝑝 0 and 𝑝 0 = 5. We simulated every component of Z𝑖 = (𝑍𝑖 𝑗 ) ∈ 𝑅 𝑝 and W𝑖 = (𝑊𝑖 𝑗 ) ∈ 𝑅 𝑝 independently from a standard normal distribution. Next, we generated X𝑖 according √ 𝑝0 to 𝑋𝑖 𝑗 = (𝑍𝑖 𝑗 + 𝑊𝑖 𝑗 )/ 2 for 1 ≤ 𝑗 ≤ 𝑝 0 and 𝑋𝑖∗𝑗 = (𝑍𝑖 𝑗 + Í 𝑍𝑖 𝑗 0 )/2 for 𝑝 0 < 𝑗 ≤ 𝑝. 𝑗 0 =1 52 Example 4: For each 𝑖, 500 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝑖 = 1, . . . , 𝑛, ª 𝛽 𝑗 𝑋𝑖 𝑗 ® , « 𝑗=1 𝑗=501 ¬ where the first 500 𝑋𝑖 𝑗 ’s were generated from the multivariate normal distribution with mean 0 0 and a covariance matrix with all of the diagonal entries being 1 and cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0.5| 𝑗− 𝑗 | for 1 ≤ 𝑗, 𝑗 0 ≤ 𝑝. The remaining 𝑝 − 500 𝑋𝑖 𝑗 ’s were generated through the autoregressive processes with 𝑋𝑖,501 ∼ Unif (-2, 2), 𝑋𝑖 𝑗 = 0.5 𝑋𝑖, 𝑗−1 + 0.5 𝑋𝑖∗𝑗 , for 𝑗 = 502, . . . , 𝑝, where 𝑋𝑖∗𝑗 ∼ Unif (-2, 2) were generated independently. The coefficients 𝛽 𝑗 for 𝑗 = 1, . . . , 7, 501, . . . , 507 were generated √ from (−1) 𝐵 𝑗 (4log 𝑛/ 𝑛 + |𝑍 𝑗 |), where 𝐵 𝑗 was a binary random variable with 𝑃(𝐵 𝑗 = 1) = 0.4 and 𝑍 𝑗 was from a standard normal distribution. The remaining 𝛽 𝑗 were zeros. Example 5: For each 𝑖, 𝑐X𝑖T β = 𝑐 × −0.5𝑋𝑖1 + 𝑋𝑖2 + 0.5𝑋𝑖,100 ,  𝑖 = 1, . . . , 𝑛, where X𝑖 were generated from a multivariate normal distribution with mean 0 and a covariance 0 matrix with all of the diagonal entries being 1 and cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0.9| 𝑗− 𝑗 | for 1 ≤ j, j’ ≤ p. All of the coefficients were zero except for 𝑋𝑖1 , 𝑋𝑖2 and 𝑋𝑖,100 . Example 6: For each 𝑖, 𝑞 𝑝 ©Õ Õ 𝑐X𝑖T β =𝑐×­ 𝛽 𝑗 𝑋𝑖 𝑗 + 𝑖 = 1, . . . , 𝑛, ª 𝛽 𝑗 𝑋𝑖 𝑗 ® , « 𝑗=1 𝑗=𝑞+1 ¬ where 𝛽𝑞+1 = · · · = 𝛽 𝑝 = 0, (𝛽1 = · · · = 𝛽𝑞 ) = (3.75, 4.5, 5.25, 6, 6.75, 7.5, 8.25, 9, 9.75), and 𝑞 = 10. 𝑋𝑖 𝑗 ’s were generated from the multivariate standard normal distribution for 1 ≤ 𝑗 ≤ 𝑞 and Õ 𝑞 𝑋𝑖 𝑗 = 𝑑𝑖 𝑗 + 𝑏 𝑋𝑖𝑙 , 𝑖 = 1, . . . , 𝑛, 𝑙=1 for 𝑞 + 1 ≤ 𝑗 ≤ 𝑝, where 𝑏 = (3/4𝑞) 1/2 and (𝑑𝑖(𝑞+1) , . . . , 𝑑𝑖 𝑝 ) are generated from the multivariate normal distribution with mean 0 and a covariance matrix with all of the diagonal entries being 1 and off-diagonal being 0.25, and are independent of 𝑋𝑖 𝑗 for 1 ≤ 𝑗 ≤ 𝑞. 53 Example 7: For each 𝑖, 𝑝 ©Õ 𝑐X𝑖T β = 𝑐 × ­ 𝛽 𝑗 𝑋𝑖 𝑗 ® , 𝑖 = 1, . . . , 𝑛, ª « 𝑗=1 ¬ where 𝛽 𝑗 = 0.2 for 𝑗 = 1, . . . , 7, 501, . . . , 508 and 𝛽 𝑗 =0, otherwise. 𝑋𝑖 𝑗 ’s were generated from the multivariate normal distribution for 1 ≤ 𝑗 ≤ 500 with mean 0 and a covariance matrix with all 0 of the diagonal entries being 1, cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0.9| 𝑗− 𝑗 | for 1 ≤ j, j’ ≤ 15, and cov(𝑋𝑖 𝑗 , 𝑋𝑖 𝑗 0 ) = 0 for 16 ≤ j, j’ ≤ 500. 𝑋𝑖 𝑗 ’s were generated from the multivariate double exponential distribution for 501 ≤ 𝑗 ≤ 508 with location parameter equal to 0 and a covariance matrix with all of the diagonal entries being 1, and independent of 𝑋𝑖 𝑗 for 1 ≤ 𝑗 ≤ 500. Examples 1 and 3 were adopted from Wang [93], while Examples 2, 4, and 6 were borrowed from Fan et al. [87], Hwang et al. [95], and Ing et al. [113], respectively. We then generated the responses from the following three models. Normal Model: 𝑌𝑖 = 𝑐X𝑖T β + 𝜖𝑖 with 𝜖𝑖 ∼ 𝑁 (0, 1). Binomial Model: 𝑌𝑖 ∼ Bernoulli( exp(𝑐X𝑖T β)/{1 + exp(𝑐X𝑖T β)}). Poisson Model: 𝑌𝑖 ∼ Poisson( exp(𝑐X𝑖T β)). 
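Given a linear predictor 𝑐X𝑖T β from any of the examples, the three response models can be drawn directly. The snippet below is a minimal sketch continuing the numpy conventions above (the function name is ours).

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_response(lin_pred, model):
    """Draw Y_i from the Normal, Binomial, or Poisson model given c * X_i^T beta."""
    if model == "normal":
        return lin_pred + rng.standard_normal(lin_pred.shape)   # N(0, 1) errors
    if model == "binomial":
        prob = 1.0 / (1.0 + np.exp(-lin_pred))                  # logistic link
        return rng.binomial(1, prob)
    if model == "poisson":
        return rng.poisson(np.exp(lin_pred))                    # log link
    raise ValueError(model)
```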
Our simulated examples cover a wide range of models (Normal, Binomial, Poisson), with covariates generated from various distributions (multivariate normal, exponential, double exponential, uniform, and mixture), a diverse set of covariance structures (independent, compound symmetry, autoregressive, unstructured), and both strong and weak signals, including hidden features. We considered n = 400 and p = 1,000 throughout all of the examples. We specified the magnitude of the coefficients in the GLMs with a constant multiplier, c. For Examples 1-7, this constant was set, respectively for the Normal, Binomial, and Poisson models, to be: (1, 1, 0.3), (1, 1.5, 0.3), (1, 1, 0.1), (1, 1.5, 0.3), (1, 3, 2), (1, 1, 1), and (1, 3, 2). For each parameter configuration, we simulated 500 independent data sets. We evaluated the performance of the methods by the criteria of true positives (TP), false positives (FP), the estimated probability of including the true model (PIT), the mean squared error (MSE) of β̂, and the mean squared prediction error (MSPE). To compute the MSPE, we randomly partitioned the samples into training (75%) and testing (25%) sets. The models obtained from the training datasets were used to predict the responses in the testing datasets. Tables 3.2-3.4 report the average TP, FP, PIT, MSE, and MSPE over the 500 datasets, along with their standard deviations. The findings are summarized below. First, the proposed STEPWISE method was able to detect all of the true signals with nearly zero FPs. Specifically, in all of the examples, STEPWISE outperformed the other methods by detecting more TPs with fewer FPs, whereas LASSO, SIS+LASSO, and dgLARS included many more FPs. Second, though a smaller γ in FR and SC led to the inclusion of all TPs with a PIT close to 1, it incurred more FPs; on the other hand, a larger γ could eliminate some TPs, resulting in a smaller PIT and a larger MSPE. Third, for the Normal model, the STEPWISE method yielded an MSE close to 0, the smallest among all of the competing methods. The Binomial and Poisson data challenged all of the methods, and the MSEs of all methods were non-negligible; however, STEPWISE still produced the lowest MSE. These results support the consistency of β̂, which distinguishes the proposed STEPWISE method from the other regularized methods and highlights its ability to provide a more accurate means of characterizing the effects of high dimensional predictors. Fourth, for all three models, the STEPWISE procedure demonstrated a clear advantage over the competing methods. For the Poisson model, it outperformed all methods by selecting the largest number of TPs while keeping the number of FPs low; in fact, SIS+LASSO failed to detect any TPs while including an exceedingly large number of FPs, and high FP counts were observed for dgLARS as well. Similarly, for the Binomial model, STEPWISE selected almost all TPs while including nearly zero FPs, whereas LASSO, SIS+LASSO, and dgLARS selected a large number of FPs while selecting fewer TPs than our proposed method.
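The evaluation criteria reported in Tables 3.2-3.4 are straightforward to compute from a selected index set and the fitted coefficients; the following R function is a minimal sketch (names and the inverse-link argument mu are illustrative, not the exact code used here):

```r
# selected: indices chosen by a method; truth: indices of the true signals
# beta_hat, beta_true: length-p coefficient vectors; X_test, y_test: 25% held-out data
evaluate_fit <- function(selected, truth, beta_hat, beta_true, X_test, y_test, mu = identity) {
  TP  <- length(intersect(selected, truth))        # true positives
  FP  <- length(setdiff(selected, truth))          # false positives
  PIT <- as.numeric(all(truth %in% selected))      # 1 if all true signals are included
  MSE <- mean((beta_hat - beta_true)^2)            # mean squared error of beta-hat
  # mean squared prediction error on the test split; mu() is the inverse link
  MSPE <- mean((y_test - mu(drop(X_test %*% beta_hat)))^2)
  c(TP = TP, FP = FP, PIT = PIT, MSE = MSE, MSPE = MSPE)
}
```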
55 Table 3.2: Normal model Example Method TP FP PIT MSE MSPE (×10−4 ) 1 (𝑝 0 = 8) LASSO(1SE) 8.00 (0.00) 5.48 (6.61) 1.00 (0.00) 2.45 1.148 LASSO(BIC) 8.00 (0.00) 2.55 (2.48) 1.00 (0.00) 2.58 1.172 SIS+LASSO(1SE) 8.00 (0.00) 6.59 (4.22) 1.00 (0.00) 1.49 1.042 SIS+LASSO(BIC) 8.00 (0.00) 6.04 (3.33) 1.00 (0.00) 1.37 1.025 dgLARS(BIC) 8.00 (0.00) 3.52(2.53) 1.00 (0.00) 2.25 1.130 SC (𝛾 𝐿 ) 8.00 (0.00) 3.01 (1.85) 1.00 (0.00) 1.09 0.895 SC (𝛾 𝐻 ) 7.60 (1.59) 0.00 (0.00) 0.94 (0.24) 14.56 5.081 FR (𝛾 𝐿 ) 8.00 (0.00) 2.96 (2.04) 1.00 (0.00) 1.08 0.896 FR (𝛾 𝐻 ) 7.88 (0.84) 0.00 (0.00) 0.98 (0.14) 3.74 2.040 STEPWISE 8.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.21 0.972 2 (𝑝 0 = 8) LASSO(1SE) 8.00 (0.00) 4.74 (4.24) 1.00 (0.00) 2.46 1.154 LASSO(BIC) 8.00 (0.00) 2.12 (2.02) 1.00 (0.00) 2.62 1.182 SIS+LASSO 7.99 (0.10) 6.84 (4.57) 0.99 (0.10) 1.65 1.058 SIS+LASSO(BIC) 7.99 (0.10) 6.11 (3.85) 0.99 (0.10) 1.56 1.041 dgLARS(BIC) 8.00 (0.00) 3.26(2.62) 1.00 (0.00) 2.28 1.138 SC (𝛾 𝐿 ) 8.00 (0.00) 2.73 (1.53) 1.00 (0.00) 0.98 0.901 SC (𝛾 𝐻 ) 7.30 (2.11) 0.00 (0.00) 0.90 (0.30) 23.70 6.397 FR (𝛾 𝐿 ) 8.00 (0.00) 2.45 (1.65) 1.00 (0.00) 0.92 0.907 FR (𝛾 𝐻 ) 7.94 (0.60) 0.00 (0.00) 0.99 (0.00) 2.69 2.062 STEPWISE 8.00 (0.00) 0.01 (0.10) 1.00 (0.00) 0.21 0.972 3 (𝑝 0 = 5) LASSO(1SE) 5.00 (0.00) 8.24 (2.63) 1.00 (0.00) 3.07 1.084 LASSO(BIC) 5.00 (0.00) 12.33 (3.28) 1.00 (0.00) 27.97 2.398 SIS+LASSO(1SE) 0.97 (0.26) 15.94 (2.93) 0.00 (0.00) 1406.22 76.024 SIS+LASSO(BIC) 0.97 (0.26) 16.20 (2.81) 0.00 (0.00) 1354.54 71.017 dgLARS(BIC) 5.00 (0.00) 53.91 (14.44) 1.00 (0.00) 6.63 0.979 Continued on next page 56 Table 3.2 (cont’d) Example Method TP FP PIT MSE MSPE (×10−4 ) SC (𝛾 𝐿 ) 4.48 (0.50) 0.25 (0.44) 0.48 (0.50) 21.74 3.086 SC (𝛾 𝐻 ) 4.48 (0.50) 0.14 (0.35) 0.48 (0.50) 21.70 2.065 FR (𝛾 𝐿 ) 5.00 (0.00) 0.23 (0.66) 1.00 (0.00) 0.27 0.973 FR (𝛾 𝐻 ) 5.00 (0.00) 0.14 (0.35) 1.00 (0.00) 0.15 0.074 STEPWISE 5.00 (0.00) 0.03 (0.22) 1.00 (0.00) 0.18 0.976 4 (𝑝 0 = 14) LASSO(1SE) 14.00 (0.00) 29.84 (15.25) 1.00 (0.00) 13.97 1.148 LASSO(BIC) 13.94 (0.24) 4.92 (5.54) 0.94 (0.24) 38.69 1.995 SIS+LASSO(1SE) 11.44 (1.45) 15.19 (7.29) 0.05 (0.21) 133.38 4.714 SIS+LASSO(BIC) 11.35 (1.51) 10.98 (7.19) 0.05 (0.21) 137.06 4.940 dgLARS(BIC) 14.00 (0.00) 13.93 (6.68) 1.00 (0.00) 18.08 1.329 SC (𝛾 𝐿 ) 13.68 (0.60) 0.86 (0.62) 0.75 (0.44) 11.80 1.148 SC (𝛾 𝐻 ) 4.20 (2.80) 0.03 (0.17) 0.03 (0.17) 407.86 6.567 FR (𝛾 𝐿 ) 14.00 (0.00) 0.50 (0.76) 1.00 (0.00) 1.23 0.940 FR (𝛾 𝐻 ) 4.99 (3.07) 0.00 (0.00) 0.03 (0.17) 360.65 6.640 STEPWISE 14.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.91 0.958 5 (𝑝 0 = 3) LASSO(1SE) 3.00 (0.00) 22.76 (9.05) 1.00 (0.00) 1.01 0.044 LASSO(BIC) 3.00 (0.00) 8.29 (3.23) 1.00 (0.00) 1.75 0.054 SIS+LASSO(1SE) 3.00 (0.00) 8.40 (3.10) 1.00 (0.00) 0.44 0.041 SIS+LASSO(BIC) 3.00 (0.00) 9.58 (3.36) 1.00 (0.00) 0.29 0.040 dgLARS(BIC) 3.00 (0.00) 13.39 (4.94) 1.00 (0.00) 1.28 0.048 SC (𝛾 𝐿 ) 3.00 (0.00) 1.47 (0.67) 1.00 (0.00) 0.03 0.038 SC (𝛾 𝐻 ) 2.01 (0.10) 0.01 (0.10) 0.01 (0.10) 4.51 0.008 FR (𝛾 𝐿 ) 3.00 (0.00) 1.21 (1.01) 1.00 (0.00) 0.03 0.038 FR ( 𝛾 𝐻 ) 3.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.01 0.003 STEPWISE 3.00 (0.00) 0.00 (0.00) 1.00 (0.00) 0.01 0.039 Continued on next page 57 Table 3.2 (cont’d) Example Method TP FP PIT MSE MSPE (×10−4 ) 6 (𝑝 0 = 10) LASSO(1SE) 10.00 (0.00) 46.23 (6.61) 1.00 (0.00) 37.70 1.50 LASSO(BIC) 9.86 (0.34) 59.05 (6.18) 0.86 (0.34) 698.70 13.82 SIS+LASSO(1SE) 0.00 (0.00) 37.53 (3.29) 0.00 (0.00) 4620.54 127.51 SIS+LASSO(BIC) 0.00 (0.00) 38.05 (3.21) 1.00 (0.00) 4644.21 118.64 dgLARS(BIC) 10.00 (0.00) 
156.15 (26.47) 1.00 (0.00) 20.96 0.88 SC (𝛾 𝐿 ) 2.99 (0.08) 1.41 (0.49) 0.00 (0.00) 2868.96 116.26 SC (𝛾 𝐻 ) 0.96 (1.26) 1.06 (0.23) 0.00 (0.00) 4775.34 51.16 FR (𝛾 𝐿 ) 7.45 (0.08) 3.34 (0.33) 0.00 (0.00) 657.77 30.03 FR ( 𝛾 𝐻 ) 1.48 (2.46) 2.00 (0.08) 0.00 (0.00) 4345.01 55.47 STEPWISE 7.45 (0.08) 2.11 (0.33) 0.00 (0.00) 653.07 29.05 7 (𝑝 0 = 15) LASSO(1SE) 14.71 (0.49) 8.64 (4.88) 0.73 (0.44) 0.77 1.16 LASSO(BIC) 14.67 (0.53) 6.57 (3.51) 0.71 (0.45) 0.76 1.16 SIS+LASSO(1SE) 13.45 (1.77) 10.16 (4.20) 0.38 (0.48) 1.32 1.07 SIS+LASSO(BIC) 13.44 (1.76) 9.55 (4.07) 0.36 (0.48) 1.34 1.05 dgLARS(BIC) 14.69 (0.51) 7.02 (3.54) 0.72 (0.45) 0.76 1.14 SC (𝛾 𝐿 ) 11.12 (0.53) 3.01 (1.82) 0.00 (0.00) 3.34 0.94 SC (𝛾 𝐻 ) 3.19 (3.51) 0.06 (0.08) 0.00 (0.00) 5.97 14.17 FR (𝛾 𝐿 ) 12.20 (0.60) 3.15 (2.00) 0.00 (0.00) 2.27 0.91 FR ( 𝛾 𝐻 ) 6.64 (3.38) 0.01 (0.11) 0.00 (0.00) 6.49 12.43 STEPWISE 12.20 (0.08) 0.09 (0.29) 0.00 (0.00) 2.99 1.03 Continued on next page 58 Table 3.2 (cont’d) Example Method TP FP PIT MSE MSPE (×10−4 ) Note: TP, true positives; FP, false positives; PIT, probability of including all true predictors in the selected predictors; MSE, mean squared error of β̂; MSPE, mean squared prediction error; numbers in the parentheses are standard deviations; LASSO(BIC), LASSO with the tuning parameter chosen to give the smallest BIC for the models on the LASSO path; LASSO(1SE), LASSO with the tuning parameter chosen by the one-standard -error rule; SIS+LASSO(BIC), sure independence screening by [87] followed by LASSO(BIC); SIS+LASSO(1SE), sure independence screening followed by LASSO(1SE); dgLARS(BIC), differential geo- metric least angle regression by [90, 91] with the tuning parameter chosen to give the smallest BIC on the dgLARS path; SC(𝛾), sequentially conditioning approach by [99]; FR(𝛾), forward regression by [93]; STEPWISE, the proposed method; In FR and SC, the smaller and large values of 𝛾 are presented as 𝛾 𝐿 and 𝛾 𝐻 , respectively; 𝑝 0 denotes the number of true signals; LASSO(1SE), LASSO(BIC), SIS, and dgLARS were conducted via R packages glmnet [127], ncvreg [128], screening [129], and dglars [130], respectively. 
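For reference, the two LASSO tuning rules named in the note above can be carried out with the glmnet package roughly as follows. This is a minimal sketch for the Normal (gaussian) case, assuming x is the design matrix and y the response; the BIC expression shown is one common deviance-based form, not necessarily the exact one used in this chapter.

```r
library(glmnet)

# LASSO(1SE): cross-validation with the one-standard-error rule
cvfit    <- cv.glmnet(x, y, family = "gaussian")
beta_1se <- coef(cvfit, s = "lambda.1se")

# LASSO(BIC): pick the lambda on the path minimizing BIC = n*log(RSS/n) + df*log(n)
fit      <- glmnet(x, y, family = "gaussian")
rss      <- colSums((y - predict(fit, newx = x))^2)
bic      <- nrow(x) * log(rss / nrow(x)) + fit$df * log(nrow(x))
beta_bic <- coef(fit, s = fit$lambda[which.min(bic)])
```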
Table 3.3: Binomial model Example Method TP FP PIT MSE MSPE 1 (𝑝 0 = 8) LASSO(1SE) 7.99 (0.10) 4.77 (5.56) 0.99 (0.10) 0.021 0.104 LASSO(BIC) 7.99 (0.10) 3.19 (2.34) 0.99 (0.10) 0.021 0.104 SIS+LASSO(1SE) 7.94 (0.24) 35.42 (6.77) 0.94 (0.24) 0.119 0.048 SIS+LASSO(BIC) 7.94 (0.24) 16.83 (21.60) 0.94 (0.24) 0.119 0.073 dgLARS(BIC) 8.00 (0.00) 3.27 (2.29) 1.00 (0.00) 0.019 0.102 SC (𝛾 𝐿 ) 8.00 (0.00) 2.81 (1.47) 1.00 (0.00) 0.009 0.073 SC (𝛾 𝐻 ) 1.02 (0.14) 0.00 (0.00) 0.00 (0.00) 0.030 0.028 Continued on next page 59 Table 3.3 (cont’d) Example Method TP FP PIT MSE MSPE FR (𝛾 𝐿 ) 8.00 (0.00) 3.90 (2.36) 1.00 (0.00) 0.032 0.066 FR (𝛾 𝐻 ) 2.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.025 0.027 STEPWISE 7.98 (0.14) 0.08 (0.53) 0.98 (0.14) 0.002 0.094 2 (𝑝 0 = 8) LASSO(1SE) 7.98 (0.14) 3.29 (2.76) 0.98 (0.14) 0.054 0.073 LASSO(BIC) 7.99 (0.10) 3.84 (2.72) 0.99 (0.10) 0.052 0.067 SIS+LASSO(1SE) 7.92 (0.27) 28.20 (7.31) 0.92 (0.27) 0.038 0.030 SIS+LASSO(BIC) 7.92 (0.27) 9.60 (12.92) 0.92 (0.27) 0.051 0.058 dgLARS(BIC) 7.99 (0.10) 3.94 (2.65) 0.99 (0.10) 0.050 0.067 SC (𝛾 𝐿 ) 7.72 (0.45) 0.39 (0.49) 0.72 (0.45) 0.005 0.063 SC (𝛾 𝐻 ) 1.13 (0.37) 0.00 (0.00) 0.00 (0.00) 0.069 0.044 FR (𝛾 𝐿 ) 7.99 (0.10) 0.66 (0.76) 0.99 (0.10) 0.014 0.051 FR (𝛾 𝐻 ) 2.10 (0.30) 0.00 (0.00) 0.00 (0.00) 0.061 0.033 STEPWISE 7.99 (0.10) 0.02 (0.14) 0.99 (0.10) 0.004 0.056 3 (𝑝 0 = 5) LASSO(1SE) 4.51 (0.52) 7.36 (2.57) 0.52 (0.50) 0.155 0.051 LASSO(BIC) 4.98 (0.14) 5.97 (2.25) 0.98 (0.14) 0.118 0.037 SIS+LASSO(1SE) 0.85 (0.46) 10.66 (3.01) 0.00 (0.00) 0.206 0.186 SIS+LASSO(BIC) 0.85 (0.46) 12.10 (3.13) 0.00 (0.00) 0.197 0.185 dgLARS(BIC) 4.92 (0.27) 16.21 (6.21) 0.92 (0.27) 0.112 0.035 SC (𝛾 𝐿 ) 4.32 (0.49) 0.47 (0.50) 0.33 (0.47) 0.016 0.048 SC (𝛾 𝐻 ) 2.62 (1.34) 0.42 (0.50) 0.00 (0.00) 0.104 0.066 FR (𝛾 𝐿 ) 4.98 (0.14) 0.67 (0.79) 0.98 (0.14) 0.020 0.033 FR (𝛾 𝐻 ) 2.98 (0.95) 0.40 (0.49) 0.00 (0.00) 0.087 0.043 STEPWISE 4.97 (0.17) 0.04 (0.28) 0.97 (0.17) 0.014 0.034 Continued on next page 60 Table 3.3 (cont’d) Example Method TP FP PIT MSE MSPE 4 (𝑝 0 = 14) LASSO(1SE) 9.96 (1.89) 6.78 (7.92) 0.01 (0.01) 0.112 0.107 LASSO(BIC) 9.33 (1.86) 2.79 (2.87) 0.00 (0.00) 0.112 0.118 SIS+LASSO(1SE) 10.03 (1.62) 28.01 (9.54) 0.03 (0.17) 0.098 0.070 SIS+LASSO(BIC) 8.90 (1.99) 5.42 (10.64) 0.01 (0.10) 0.114 0.120 dgLARS(BIC) 9.31 (1.85) 2.84 (2.86) 0.00 (0.00) 0.110 0.117 SC (𝛾 𝐿 ) 9.48 (1.40) 2.35 (2.14) 0.00 (0.00) 0.043 0.070 SC (𝛾 𝐻 ) 1.17 (0.40) 0.00 (0.00) 0.00 (0.00) 0.125 0.049 FR (𝛾 𝐿 ) 11.83 (1.39) 1.58 (1.60) 0.09 (0.29) 0.026 0.048 FR (𝛾 𝐻 ) 2.06 (0.24) 0.00 (0.00) 0.00 (0.00) 0.119 0.032 STEPWISE 11.81 (1.42) 1.52 (1.58) 0.09 (0.29) 0.026 0.048 5 (𝑝 0 = 3) LASSO(1SE) 2.00 (0.00) 1.55 (1.76) 0.00 (0.00) 0.008 0.215 LASSO(BIC) 2.00 (0.00) 1.86 (1.57) 0.00 (0.00) 0.008 0.213 SIS+LASSO(1SE) 2.23 (0.42) 10.81 (6.45) 0.23 (0.42) 0.007 0.192 SIS+LASSO(BIC) 2.10 (0.30) 3.60 (4.65) 0.10 (0.30) 0.007 0.206 dgLARS(BIC) 2.00 (0.00) 1.64 (1.49) 0.00 (0.00) 0.008 0.213 SC (𝛾 𝐿 ) 2.27 (0.49) 7.16 (3.20) 0.29 (0.46) 0.060 0.166 SC (𝛾 𝐻 ) 1.87 (0.34) 0.03 (0.17) 0.00 (0.00) 0.005 0.030 FR (𝛾 𝐿 ) 2.96 (0.20) 8.88 (5.39) 0.96 (0.20) 0.013 0.147 FR ( 𝛾 𝐻 ) 1.97 (0.17) 0.03 (0.17) 0.00 (0.00) 0.005 0.019 STEPWISE 2.89 (0.31) 0.76 (1.70) 0.89 (0.31) 0.001 0.194 6 (𝑝 0 = 10) LASSO(1SE) 6.10 (1.08) 31.66 (4.63) 0.00 (0.00) 0.41 0.07 LASSO(BIC) 7.88 (0.97) 30.41 (4.63) 0.02 (0.16) 0.38 0.05 Continued on next page 61 Table 3.3 (cont’d) Example Method TP FP PIT MSE MSPE SIS+LASSO(1SE) 0.00 (0.00) 28.40 (3.61) 0.00 (0.00) 0.45 0.15 SIS+LASSO(BIC) 0.00 
(0.00) 30.18 (3.24) 0.00 (0.00) 0.45 0.14 dgLARS(BIC) 4.12 (2.29) 32.37 (7.61) 0.00 (0.00) 0.42 0.09 SC (𝛾 𝐿 ) 7.71 (0.58) 2.95 (0.95) 0.00 (0.00) 0.19 0.05 SC (𝛾 𝐻 ) 0.00 (0.00) 1.00 (0.00) 0.00 (0.00) 0.45 0.02 FR (𝛾 𝐿 ) 8.52 (0.89) 3.24 (1.15) 0.00 (0.00) 0.09 0.04 FR ( 𝛾 𝐻 ) 0.12 (0.32) 1.88 (0.32) 0.00 (0.00) 0.45 0.01 STEPWISE 8.52 (0.92) 0.58 (0.77) 0.00 (0.00) 0.09 0.04 7 (𝑝 0 = 15) LASSO(1SE) 9.80 (1.14) 5.60 (4.29) 0.00 (0.00) 0.01 0.04 LASSO(BIC) 9.74 (1.11) 4.68 (2.63) 0.00 (0.00) 0.01 0.04 SIS+LASSO(1SE) 10.67 (1.39) 25.17 (5.85) 0.00 (0.00) 0.04 0.02 SIS+LASSO(BIC) 10.62 (1.64) 24.02 (15.84) 0.00 (0.00) 0.01 0.01 dgLARS(BIC) 9.84 (1.08) 4.58 (2.45) 0.00 (0.00) 0.01 0.04 SC (𝛾 𝐿 ) 8.92 (0.53) 1.75 (0.94) 0.00 (0.00) 0.01 0.02 SC (𝛾 𝐻 ) 1.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.01 0.02 FR (𝛾 𝐿 ) 9.56 (0.58) 1.52 (0.92) 0.00 (0.00) 0.07 0.02 FR ( 𝛾 𝐻 ) 2.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.01 0.02 STEPWISE 9.50 (0.54) 0.67 (0.83) 0.00 (0.00) 0.04 0.02 Note: Abbreviations are explained in the footnote of Table 3.2. 62 Table 3.4: Poisson model Example Method TP FP PIT MSE MSPE 1 (𝑝 0 = 8) LASSO(1SE) 7.93 (0.43) 4.64 (4.82) 0.96 (0.19) 0.001 4.236 LASSO(BIC) 7.99 (0.10) 14.37 (14.54) 0.99 (0.10) 0.001 3.133 SIS+LASSO(1SE) 7.89 (0.37) 25.37 (8.39) 0.91 (0.29) 0.001 3.247 SIS+LASSO(BIC) 7.89 (0.37) 17.77 (11.70) 0.91 (0.29) 0.001 3.078 dgLARS(BIC) 8.00 (0.00) 13.28 (14.31) 1.00 (0.00) 0.001 3.183 SC (𝛾 𝐿 ) 7.96 (0.20) 4.94 (3.46) 0.96 (0.20) 0.001 2.874 SC (𝛾 𝐻 ) 5.05 (1.70) 0.04 (0.24) 0.07 (0.26) 0.001 3.902 FR (𝛾 𝐿 ) 7.93 (0.26) 4.86 (3.73) 0.93 (0.26) 0.001 2.837 FR (𝛾 𝐻 ) 5.13 (1.61) 0.06 (0.31) 0.07 (0.26) 0.001 3.833 STEPWISE 7.91 (0.29) 2.77 (2.91) 0.91 (0.29) 0.001 3.410 2 (𝑝 0 = 8) LASSO(1SE) 8.00 (0.00) 2.23 (3.52) 1.00 (0.00) 0.001 3.981 LASSO(BIC) 8.00 (0.00) 8.98 (8.92) 1.00 (0.00) 0.001 3.107 SIS+LASSO(1SE) 7.98 (0.14) 22.85 (7.08) 0.98 (0.14) 0.001 2.824 SIS+LASSO(BIC) 7.98 (0.14) 13.55 (8.24) 0.98 (0.14) 0.001 2.937 dgLARS(BIC) 8.00 (0.00) 8.91 (9.10) 1.00 (0.00) 0.001 3.099 SC (𝛾 𝐿 ) 8.00 (0.00) 3.89 (2.89) 1.00 (0.00) 0.000 2.979 SC (𝛾 𝐻 ) 5.68 (1.45) 0.00 (0.00) 0.12 (0.33) 0.001 3.971 FR (𝛾 𝐿 ) 8.00 (0.00) 3.60 (2.80) 1.00 (0.00) 0.000 3.032 FR (𝛾 𝐻 ) 5.71 (1.42) 0.00 (0.00) 0.10 (0.30) 0.001 3.911 STEPWISE 7.98 (0.14) 2.00 (2.23) 0.98 (0.14) 0.000 3.589 3 (𝑝 0 = 5) LASSO(1SE) 4.37 (0.51) 6.88 (2.61) 0.38(0.48) 0.001 1.959 LASSO(BIC) 4.79 (0.41) 5.62 (2.17) 0.79 (0.41) 0.000 2.044 Continued on next page 63 Table 3.4 (cont’d) Example Method TP FP PIT MSE MSPE SIS+LASSO(1SE) 0.86 (0.47) 10.11 (2.55) 0.00 (0.00) 0.002 3.266 SIS+LASSO(BIC) 0.86 (0.47) 11.86 (2.99) 0.00 (0.00) 0.002 3.160 dgLARS(BIC) 4.55 (0.51) 18.29 (6.13) 0.56 (0.49) 0.001 1.877 SC (𝛾 𝐿 ) 4.73 (0.45) 0.53 (0.66) 0.73 (0.45) 0.000 2.479 SC (𝛾 𝐻 ) 2.84 (0.63) 0.40 (0.49) 0.00 (0.00) 0.001 0.664 FR (𝛾 𝐿 ) 4.54 (0.52) 1.98 (2.19) 0.55 (0.50) 0.000 2.128 FR (𝛾 𝐻 ) 2.71 (0.70) 0.43 (0.50) 0.00 (0.00) 0.001 0.605 STEPWISE 4.54 (0.52) 1.77 (2.01) 0.55 (0.50) 0.000 2.132 4 (𝑝 0 = 14) LASSO(1SE) 10.01 (1.73) 3.91 (6.03) 0.01 (0.10) 0.003 15.582 LASSO(BIC) 12.11 (1.46) 36.56 (22.43) 0.19 (0.39) 0.002 5.688 SIS+LASSO(1SE) 10.42 (1.66) 21.41 (8.87) 0.03 (0.17) 0.003 11.316 SIS+LASSO(BIC) 10.73 (1.66) 32.67 (8.92) 0.03 (0.17) 0.003 8.545 dgLARS(BIC) 12.05 (1.52) 38.70 (28.97) 0.18 (0.38) 0.002 5.111 SC (𝛾 𝐿 ) 10.33 (1.63) 10.48 (6.66) 0.02 (0.14) 0.002 4.499 SC (𝛾 𝐻 ) 5.32 (1.92) 0.52 (1.37) 0.00 (0.00) 0.003 14.005 FR (𝛾 𝐿 ) 12.00 (1.71) 8.93 (6.36) 0.23 (0.42) 0.001 4.503 FR (𝛾 𝐻 ) 5.65 (2.13) 0.38 (1.15) 0.00 
(0.00) 0.003 13.802 STEPWISE 11.80 (1.72) 5.97 (5.37) 0.19 (0.39) 0.001 5.809 5 (𝑝 0 = 3) LASSO(1SE) 2.00 (0.00) 1.13 (2.85) 0.00 (0.00) 0.003 2.674 LASSO(BIC) 2.01 (0.10) 2.82 (2.52) 0.01 (0.10) 0.003 2.583 SIS+LASSO(1SE) 2.87 (0.34) 9.28 (3.85) 0.87 (0.34) 0.002 2.455 SIS+LASSO(BIC) 2.87 (0.34) 9.88 (4.29) 0.87 (0.34) 0.002 2.355 dgLARS(BIC) 2.00 (0.00) 2.88 (2.38) 0.00 (0.00) 0.003 2.562 Continued on next page 64 Table 3.4 (cont’d) Example Method TP FP PIT MSE MSPE SC (𝛾 𝐿 ) 2.75 (0.44) 3.27 (1.75) 0.75 (0.44) 0.001 2.339 SC (𝛾 𝐻 ) 2.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.003 1.086 FR (𝛾 𝐿 ) 3.00 (0.00) 2.80 (1.73) 1.00 (0.00) 0.001 2.326 FR (𝛾 𝐻 ) 2.40 (0.49) 0.00 (0.00) 0.40 (0.49) 0.002 0.981 STEPWISE 3.00 (0.00) 0.35 (0.59) 1.00 (0.00) 0.001 2.977 6 (𝑝 0 = 10) LASSO(1SE) 6.08 (1.16) 32.54 (4.83) 0.00 (0.00) 0.01 7.64 LASSO(BIC) 8.15 (0.94) 37.56 (7.96) 0.06 (0.23) 0.01 5.93 SIS+LASSO(1SE) 0.00 (0.00) 26.34 (3.87) 0.00 (0.00) 0.01 12.27 SIS+LASSO(BIC) 0.00 (0.00) 28.03 (4.29) 0.00 (0.00) 0.01 12.06 dgLARS(BIC) 8.42 (1.22) 75.21 (15.56) 0.21 (0.41) 0.01 4.55 SC (𝛾 𝐿 ) 9.45 (0.72) 9.80 (3.12) 0.76 (0.42) 0.00 4.77 SC (𝛾 𝐻 ) 3.73 (1.73) 2.15 (0.45) 0.00 (0.00) 0.01 3.43 FR (𝛾 𝐿 ) 9.54 (1.05) 11.26 (2.97) 0.55 (0.50) 0.01 4.42 FR (𝛾 𝐻 ) 2.85 (1.85) 2.70 (0.57) 0.00 (0.00) 0.01 3.28 STEPWISE 9.54 (1.05) 4.30 (2.03) 0.55 (0.50) 0.01 6.01 7 (𝑝 0 = 15) LASSO(1SE) 11.98 (2.20) 4.00 (3.56) 0.12 (0.32) 0.01 20.71 LASSO(BIC) 14.93 (0.29) 51.44 (11.68) 0.94 (0.23) 0.00 7.17 SIS+LASSO(1SE) 12.51 (1.20) 17.32 (5.91) 0.03 (0.18) 0.00 15.24 SIS+LASSO(BIC) 12.50 (1.29) 33.31 (7.69) 0.05 (0.23) 0.01 11.93 dgLARS(BIC) 14.94 (0.27) 56.10 (17.38) 0.95 (0.20) 0.00 6.58 SC (𝛾 𝐿 ) 12.65 (1.43) 20.94 (4.47) 0.08 (0.28) 0.00 3.92 SC (𝛾 𝐻 ) 6.55 (1.98) 0.54 (0.84) 0.00 (0.00) 0.00 5.67 FR (𝛾 𝐿 ) 13.60 (1.12) 19.11 (4.03) 0.21 (0.41) 0.00 3.96 Continued on next page 65 Table 3.4 (cont’d) Example Method TP FP PIT MSE MSPE FR (𝛾 𝐻 ) 7.13 (1.87) 0.31 (0.55) 0.00 (0.00) 0.00 6.54 STEPWISE 13.54 (1.31) 5.30 (3.12) 0.21 (0.39) 0.00 9.23 Note: Abbreviations are explained in the footnote of Table 3.2. 3.6 Applications: Real Data Analysis 3.6.1 A Study of Gene Regulation in the Mammalian Eye To demonstrate the utility of our proposed method, we analyzed a microarray dataset from Scheetz et al. [41] with 120 twelve-week male rats selected for eye tissue harvesting. The dataset contained more than 31,042 different probe sets (Affymetric GeneChip Rat Genome 230 2.0 Array); see Scheetz et al. [41] for a more detailed description of the data. Although our method was applicable to the original 31,042 probe sets, many probes turned out to have very small variances and were unlikely to be informative for correlative analyses. Therefore, using variance as the screening criterion, we selected 5,000 genes with the largest variances in expressions and correlated them with gene TRIM32 that has been found to cause Bardet-Biedl syndrome, a genetically heterogeneous disease of multiple organ systems including the retina [42]. We applied the proposed STEPWISE method to the dataset with n = 120 and p = 5,000, and treated the TRIM32 gene expression as the response variable and the expressions of 5,000 genes as the predictors. With no prior biological information available, we started with the empty set. To choose 𝜂1 and 𝜂2 , we carried out 5-fold cross-validation to minimize the mean squared prediction error (MSPE) by using the following grid search: 𝜂1 = {0, 0.25, 0.5, 1} and 𝜂2 = {1, 2, 3, 4, 5}, and set 𝜂1 = 1 and 𝜂2 = 4. 
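A minimal sketch of this 5-fold grid search is given below; stepwise_glm() and its predict() method are hypothetical placeholders standing in for the STEPWISE procedure of this chapter (no such package interface is assumed to exist), and x, y denote the 5,000 screened gene expressions and the TRIM32 expression.

```r
# Hypothetical interface: stepwise_glm() is a placeholder for the STEPWISE procedure
grid  <- expand.grid(eta1 = c(0, 0.25, 0.5, 1), eta2 = c(1, 2, 3, 4, 5))
folds <- sample(rep(1:5, length.out = nrow(x)))

cv_mspe <- apply(grid, 1, function(par) {
  mean(sapply(1:5, function(k) {
    fit <- stepwise_glm(x[folds != k, ], y[folds != k],
                        eta1 = par[["eta1"]], eta2 = par[["eta2"]])      # hypothetical call
    mean((y[folds == k] - predict(fit, x[folds == k, ]))^2)              # fold-level MSPE
  }))
})
grid[which.min(cv_mspe), ]   # for the eye data this selection gave eta1 = 1, eta2 = 4
```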
We also performed the same procedure to choose γ for FR and SC. The regularization parameters in LASSO and dgLARS were selected to minimize their BIC values.

In the forward step, STEPWISE selected the probes 1376747_at, 1381902_at, 1382673_at, and 1375577_at, and the backward step eliminated probe 1375577_at. The STEPWISE procedure produced the following final predictive model:

TRIM32 = 4.6208 + 0.2310 × (1376747_at) + 0.1914 × (1381902_at) + 0.1263 × (1382673_at).

Table A.1 in Appendix A presents the numbers of overlapping genes among the competing methods. It shows that two of the three probes selected by our method, 1381902_at and 1376747_at, were also discovered by the other methods, except for dgLARS. Next, we performed leave-one-out cross-validation (LOOCV) to obtain the distribution of the model size (MS) and MSPE for the competing methods. As reported in Table 3.5 and Figure 3.1, LASSO, SIS+LASSO, and dgLARS tended to select more variables than the forward approaches and STEPWISE. Among all of the methods, STEPWISE selected the fewest variables but attained almost the same MSPE as the other methods.

Figure 3.1: Box plot of model sizes for each method over 120 different training samples from the mammalian eye data set. STEPWISE was performed with η1 = 1 and η2 = 4, and FR and SC were conducted with γ = 1.

Table 3.5: Comparisons of MSPE between competing methods using the mammalian eye data set

              STEPWISE   FR      LASSO   SIS+LASSO   SC      dgLARS
Training set  0.005      0.005   0.005   0.006       0.005   0.014
Testing set   0.011      0.012   0.010   0.009       0.014   0.020

Note: The mean squared prediction error (MSPE) was averaged over 120 splits. LASSO, least absolute shrinkage and selection operator with the regularization parameter that gives the smallest BIC; SIS+LASSO, sure independence screening by [87] followed by LASSO; dgLARS, differential geometric least angle regression by [90, 91] with the tuning parameter that gives the smallest BIC; SC(γ), sequentially conditioning approach by [99]; FR(γ), forward regression by [93]; STEPWISE, the proposed method. STEPWISE was performed with η1 = 1 and η2 = 4; FR and SC were performed with γ = 1.

3.6.2 An Esophageal Squamous Cell Carcinoma Study

Esophageal squamous cell carcinoma (ESCC), the most common histological type of esophageal cancer, is associated with poor overall survival, making early diagnosis crucial for treatment and disease management [47]. Several studies have investigated the roles of circulating microRNAs (miRNAs) in the diagnosis of ESCC [45]. Using a clinical study that investigated the roles of miRNAs in ESCC [57], we aimed to use miRNAs to predict ESCC risk and to estimate their impacts on the development of ESCC. Specifically, with a dataset of serum profiles of 2,565 miRNAs from 566 ESCC patients and 4,965 controls without cancer, we demonstrated the utility of the proposed STEPWISE method in predicting ESCC with miRNAs. To proceed, we used a balanced sampling scheme (283 cases and 283 controls) in the training dataset; the design of an equal number of cases and controls in the training set has proved useful [57] for handling imbalanced outcomes such as those encountered here. To validate our findings, samples were randomly divided into a training set (n1 = 566, p = 2,565) and a testing set (n2 = 4,965, p = 2,565).
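A balanced split of this kind can be drawn with base R; the following is a minimal sketch, where escc is an assumed data frame holding all 5,531 samples and case is the assumed 0/1 case indicator.

```r
set.seed(2022)
case_id    <- which(case == 1)        # 566 ESCC cases
control_id <- which(case == 0)        # 4,965 non-cancer controls

train_id <- c(sample(case_id, 283), sample(control_id, 283))   # balanced training set
test_id  <- setdiff(seq_along(case), train_id)                 # 283 cases + 4,682 controls

train <- escc[train_id, ];  test <- escc[test_id, ]
```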
The training set consisted of 283 patients with ESCC (median age of 65 years, 79% male) and 283 control patients (median age of 68 years, 46.3% male), and the testing set consisted of 283 patients with ESCC (median age of 67 years, 85.7% male) and 4,682 control patients (median age of 67.5 years, 44.5% male). Control patients without ESCC came from three sources: 323 individuals 68 from National Cancer Center Biobank (NCCB); 2,670 individuals from the Biobank of the National Center for Geriatrics and Gerontology (NCGG); and 1,972 individuals from Minoru Clinic (MC). More detailed characteristics of cases and controls in the training and testing sets are given in Table A.4. We defined the binary outcome variable to be 1 if the subject was a case and 0 otherwise. As age and gender (0 = female, 1 = male) are important risk factors for ESCC [131, 132] and it is common to adjust for them in clinical models, we set the initial set in STEPWISE to be 𝐹0 = {age, gender}. With 𝜂1 = 0 and 𝜂2 = 3.5 that were also chosen from 5-fold CV, our procedure recruited three miRNAs. More specifically, miR-4783-3p, miR-320b, miR-1225-3p and miR-6789-5p were selected among 2,565 miRNAs by the forward stage from the training set, and then the backward stage eliminated miR-6789-5p. In comparison, with 𝛾 = 0, both FR and SC selected four miRNAs, miR-4783-3p, miR-320b, miR-1225-3p, and miR-6789-5p. The list of selected miRNAs by different methods is given in Table A.2 in Appendix A. Our findings were biologically meaningful, as the selected miRNAs had been identified by other cancer studies as well. Specifically, miR-320b was found to promote colorectal cancer proliferation and invasion by competing with its homologous miR-320a [133]. In addition, serum levels of miR-320 family members were associated with clinical parameters and diagnosis in prostate cancer patients [134]. Mullany et al. [135] showed that miR-4783-3p was one of the miRNAs that could increase the risk of colorectal cancer death among rectal cancer cases. Finally, miR-1225-5p inhibited proliferation and metastasis of gastric carcinoma through repressing insulin receptor substrate-1 and activation of 𝛽-catenin signaling [136]. Aiming to identify a final model without resorting to a pre-screening procedure that may miss out on important biomarkers, we applied STEPWISE to reach the following predictive model for ESCC based on patients’ demographics and miRNAs: logit−1 (−35.70 + 1.41 × miR-4783-3p + 0.98 × miR-320b + 1.91 × miR-1225-3p + 0.10 × Age − 2.02 × Gender), where logit−1 (𝑥) = exp(𝑥)/(1 + exp(𝑥)). In the testing dataset, the model had an area under the receiver operating curve (AUC) of 0.99 and 69 achieved a high accuracy of 0.96, with a sensitivity and specificity of 0.97 and 0.95, respectively. Also using the testing cohort, we evaluated the performance of the models sequentially selected by STEPWISE. Starting with a model containing age and gender, STEPWISE selected miR- 4783-3p, miR-320b and miR-1225-3p in turn. Figure 3.3, showing the corresponding receiver operating curves (ROC) for these sequential models, revealed the improvement by sequentially adding predictors to the model and justified the importance of these variables in the final model. In addition, Figure 3.3 (e) illustrated that adding an extra miRNA selected by FR and SC made little improvement of the model’s predictive power. 
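Evaluating the final model above on the testing cohort amounts to computing the predicted probabilities and thresholding them. A minimal sketch using the pROC package is given below; the column names in test are assumptions, and the coefficients are those reported in the fitted model.

```r
library(pROC)

# Linear predictor of the final model reported above (column names are assumed)
lp <- -35.70 + 1.41 * test$miR_4783_3p + 0.98 * test$miR_320b +
       1.91 * test$miR_1225_3p + 0.10 * test$Age - 2.02 * test$Gender
prob <- plogis(lp)                        # logit^{-1}(lp)
pred <- as.numeric(prob > 0.5)

auc(roc(test$case, prob))                 # area under the ROC curve
mean(pred[test$case == 1] == 1)           # sensitivity
mean(pred[test$case == 0] == 0)           # specificity
mean(pred == test$case)                   # accuracy
```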
Furthermore, we conducted a subgroup analysis within the testing cohort to study how the sensitivity of the final model differed by cancer stage, one of the most important risk factors. The sensitivity for stages 0 (i.e., non-invasive cancer; n = 27), 1 (n = 128), 2 (n = 57), 3 (n = 61), and 4 (n = 10) was 1.00, 0.98, 0.97, 0.97, and 1.00, respectively. We next evaluated how the specificity varied across controls coming from different data sources. The specificity for the various control groups, namely NCCB (n = 306), NCGG (n = 2,512), and MC (n = 1,864), was 0.99, 0.99, and 0.98, respectively. These results indicate the robust performance of the miRNA-based model across cancer stages as well as data sources.

Finally, to compare STEPWISE with the other competing methods, we repeatedly applied the aforementioned balanced sampling procedure and split the ESCC data into training and testing sets 100 times. We obtained the MSPE and the averages of accuracy, sensitivity, specificity, and AUC. Figure 3.2 reports the model size of each method. Though STEPWISE selected fewer variables than the other variable selection methods (for example, LASSO selected 11-31 variables and dgLARS selected 12-51 variables), it achieved comparable prediction accuracy, specificity, sensitivity, and AUC (see Table 3.6), evidencing the utility of STEPWISE for generating parsimonious models while maintaining competitive predictability. We used R software [137] to obtain the numerical results in Sections 4 and 5, with the following packages: ggplot2 [138], ncvreg [128], glmnet [127], dglars [130], and screening [129].

Figure 3.2: Box plot of model sizes for each method based on 100 ESCC training datasets. Performance of STEPWISE is reported with η1 = 0 and η2 = 3.5. Performance of SC and FR is reported with γ = 0.

Table 3.6: Comparisons of competing methods over 100 independent splits of the ESCC data into training and testing sets

Training set   MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.02   0.97       0.98          0.97          1.00
SC             0.01   0.99       0.98          0.98          1.00
FR             0.02   0.99       0.97          0.97          1.00
LASSO          0.01   0.98       1.00          0.97          1.00
SIS+LASSO      0.01   0.99       1.00          0.99          1.00
dgLARS         0.04   0.96       0.99          0.94          1.00

Test set       MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.04   0.96       0.97          0.95          0.99
SC             0.03   0.96       0.97          0.96          0.99
FR             0.04   0.96       0.97          0.95          0.99
LASSO          0.03   0.96       0.99          0.95          1.00
SIS+LASSO      0.02   0.97       0.99          0.96          1.00
dgLARS         0.05   0.94       0.98          0.94          1.00

Note: Values are averaged over 100 splits. STEPWISE was performed with η1 = 0 and η2 = 1. SC and FR were performed with γ = 1. The regularization parameters in LASSO and dgLARS were selected to minimize the BIC.

Figure 3.3: Comparisons of ROC curves for the selected models in the ESCC data set, in the sequentially selected order: (a) Model 1, AUC = 0.71; (b) Model 2, AUC = 0.97; (c) Model 3, AUC = 0.98; (d) Model 4, AUC = 0.99; (e) Model 5, AUC = 0.99. Model 1 includes the Age and Gender features, and the following features are sequentially added to the model: miR-4783-3p, miR-320b, miR-1225-3p, miR-6789-5p.

3.6.3 Neurobehavioral Impairment from Total Sleep Deprivation

In this study we explore gene expression biomarker candidates for neurobehavioral impairment from total sleep deprivation.
Specifically, using a clinical study, we investigate the role of genes on the Total Sleep Deprivation (TSD) and we use these biomarkers, that is, gene expressions, to predict TSD and estimate their effect on the development of TSD. To perform analysis, data was obtained from NCBI GEO online repository, accession GSE98582. Blood samples were obtained from 17 healthy adults (ages 22–37, 7 females) who were not using drugs. Subjects remained in the sleep laboratory at the Sleep and Performance Research Center of Washington State University (Spokane, WA) for six consecutive nights. Meals were semi- standardized with selection from among a limited number of menu options; blood draws were performed immediately prior to meals. Blood samples were collected with an intravenous catheter approximately every 4 h during time awake on days two, four, and six. At each of the 12 timepoints, 2.5 mL blood was collected in a PAXgene™ Blood RNA tube, and the number of lapses per test bout was recorded from a 10 min PVT assay. Overall, the dataset contains 555 samples and 8284 gene features. We define the binary outcome variable to be 1 if a sample corresponds to the case when TSD is observed and 0 otherwise. Total, 342 samples with TSD and 213 controls (without TSD) were taken. Further, we split the dataset into training and testing sets in order to perform the data analysis. To preserve the underlying distribution of the response variable, a stratified sampling technique was implemented. We kept 70% of data in the training set (389 samples with 8284 features) and the remaining 30% (166 samples with 8284 features) was used for model validation. With 𝜂1 = 0.5 and 𝜂2 = 3 that were chosen from 5-fold cross-validation, the STEPWISE procedure recruited five genes. Particularly, PF4V1, USP32P1, EMR1, NBR2, and DUSP23 were selected among 8284 genes. In addition, our procedure was applied to identify a final model for predicting TSD based on gene biomarkers. As a result, the following model was produced: logit−1 (−322.02 + 13.01 × PF4V1 − 9.96 × USP32P1 + 15.17 × EMR1 + 17.66 × NBR2 + 15.34 × DUSP23), where logit−1 (𝑥) = exp(𝑥)/(1 + exp(𝑥)). 73 In the testing dataset, the final model had an area under the receiver operating curve (AUC) of 0.997 and achieved an accuracy of 0.983, with a sensitivity and specificity of 0.991 and 0.972, respectively. To compare STEPWISE with other competing methods, we repeatedly applied the aforementioned sampling procedure and split the dataset into training and testing sets 100 times. We obtained MSPE, the average of accuracy, sensitivity, specificity, and AUC. Figure 3.4 reports the model size of each method. Again, we observe that although STEPWISE procedure selects fewer variables that other methods, it achieves comparable prediction accuracy, specificity, sensitivity, and AUC. The results are presented in the Table 3.7. Figure 3.4: Box plot of model sizes for each method based on 100 total sleep deprivation training datasets. Performance of STEPWISE is reported with 𝜂1 = 0.5 and 𝜂2 = 3. 
Performance of SC and FR is reported with γ = 0.5.

Table 3.7: Comparisons of competing methods over 100 independent splits of the Total Sleep Deprivation data into training and testing sets

Training set   MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.02   0.98       0.98          0.97          0.99
SC             0.01   0.98       0.98          0.98          1.00
FR             0.02   0.98       0.98          0.97          0.99
LASSO          0.00   1.00       1.00          1.00          1.00
SIS+LASSO      0.00   1.00       1.00          1.00          1.00
dgLARS         0.07   0.91       0.92          0.89          0.95

Test set       MSPE   Accuracy   Sensitivity   Specificity   AUC
STEPWISE       0.04   0.97       0.96          0.94          0.98
SC             0.03   0.96       0.97          0.95          0.99
FR             0.04   0.97       0.96          0.94          0.98
LASSO          0.01   0.98       0.98          0.99          1.00
SIS+LASSO      0.01   0.99       0.99          0.98          1.00
dgLARS         0.08   0.88       0.90          0.86          0.95

Note: Values are averaged over 100 splits. STEPWISE was performed with η1 = 0.5 and η2 = 3. SC and FR were performed with γ = 0.5. The regularization parameters in LASSO and dgLARS were selected to minimize the BIC.

CHAPTER 4 MULTI-STAGE HYBRID MACHINE LEARNING METHOD

4.1 Machine Learning Ensemble Methods: Categories and Types

In machine learning, ensemble methods are used to achieve better predictive performance by combining predictions from multiple models instead of relying on a single model [139]. These methods tend to provide better results when the models used in the ensemble are substantially diverse [140, 141]. Ensemble methods fall into two main categories: sequential ensemble techniques and parallel ensemble techniques. The former generates the base learners (the individual methods used in the model) in a sequence, making them dependent on one another; the performance of the model tends to improve because higher weights are assigned to the cases that earlier learners handled poorly. In contrast, parallel ensemble techniques generate the base learners in parallel. This introduces independence among the base learners, which significantly reduces the error when their results are averaged.

Besides being divided into categories, ensemble methods can be distinguished by their types. The most popular and well-known types are bagging, boosting, and stacking. Bagging methods train each base learner on a different sample of the training dataset (normally, bootstrap samples taken from the original training dataset). The predictions made by the ensemble members are then combined by averaging, which incorporates the possible outcomes of the prediction and randomizes the result [142]. To improve the predictive power of the model, the boosting technique learns from the mistakes made by previous predictors: it adds predictors to the model sequentially, where each successor corrects the mistakes of its predecessors [143], and a gradient descent algorithm is used to identify the points that need improvement the most. Finally, the stacking technique (also known as stacked generalization) trains a learning algorithm to combine the predictions of multiple other learning algorithms. It makes a final prediction by using the predictions made by the other algorithms; that is, the output values of these methods become the input values of the stacked model.

In this chapter, we utilize parallel ensemble techniques to improve the performance of the STEPWISE algorithm. All methods used in the final model are discussed in the subsequent sections.

4.2 A Review of Existing Methods

In this section we describe the methods included in the multi-stage hybrid machine learning model. They can be divided into two groups: model-based and model-free methods.
Model-based methods specifically define the relationship between the response and explanatory variables (also known as predictors) via a certain link function. These models are considered to have a math- ematical structure and involve various parameters that need to be estimated based on observed data. In addition, these models are accompanied by a set of statistical assumptions such as an underlying distribution of the response variable, relationship among predictors (mainly concerning their independence), variability of data, and etc. It becomes crucial to satisfy those assumptions as it guarantees the reliability of results. Thus, implementation of model-based methods should be done by carefully examining and confirming validity of the model assumptions and choosing the appropriate link function. Due to straightfor- ward interpretation and relatively small model complexity, model-based methods gained popularity among practitioners. We selected least absolute shrinkage and selection operator (LASSO) and our proposed STEPWISE method to represent model-based methods in the model. In contrast, model-free methods do not make any assumptions on the parametric form of the under- lying model explicitly. In other words, they adapt to the data characteristics without pre-specified model structure. Model-free algorithms are designed to automatically learn, adjust their actions and improve results with minimal or no human intervention. These algorithms help to gain some insights from data and enable to build right predictions and minimize chances of making any kind of errors. Given complex data, model-free methods are able to construct non-parametric repre- sentations (also known as non-parametric models). These methods develop their models based on 77 constant learning and retraining. Normally, model-free methods are used in problems where mathematical models are unavailable or hard to construct. Therefore, almost all model-free methods have some optimization techniques at their core. For instance, gradient decent optimization algorithm is commonly used in many non-parametric methods. As a result, model-free methods tend to achieve high accuracy in their predictions and are successfully used with complex data. We selected several such methods to include in our model. Specifically, random forest (RF), support vector machine (SVM), extreme gradient boosting machine (XGBoost), and artificial neural network(ANN). These methods are just selective examples of many other methods that are currently available. We decided to include this particular set of methods in our model because it is diverse in its nature and these methods are applicable in various real-life scenarios. For instance, random forest reduces drawback of large variance and is not prone to overfit the model; extreme gradient boosting machine provide lots of flexibility, can optimize on different loss functions and applicable to case with low variance and high bias; support vector machine is more effective in high dimensional spaces and relatively memory efficient; least absolute shrinkage and selection operator performs both auto- mated variable selection and regularization, and helps minimize the impact of multicollinearity among predictors; artificial neural network can handle comprehensive data structures due to its complexity. All these methods are described in subsequent sections. 
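For orientation before the detailed reviews, all of the learners listed above are available through standard R packages; the sketch below shows how they might be instantiated for a binary outcome. The package choices, settings, and variable names are illustrative assumptions (and nnet is only a single-hidden-layer stand-in for the deeper ANN discussed later); the STEPWISE fit is omitted here because it uses the procedure of Chapter 3 rather than an existing package.

```r
library(randomForest); library(e1071); library(xgboost); library(glmnet); library(nnet)

df <- data.frame(y = factor(y), x)       # y: 0/1 outcome vector, x: numeric feature matrix

fit_rf  <- randomForest(y ~ ., data = df, ntree = 500)                   # random forest
fit_svm <- svm(y ~ ., data = df, kernel = "radial", probability = TRUE)  # support vector machine
fit_xgb <- xgboost(data = as.matrix(x), label = y,                       # extreme gradient boosting
                   objective = "binary:logistic", nrounds = 100, verbose = 0)
fit_las <- cv.glmnet(as.matrix(x), y, family = "binomial")               # LASSO
fit_ann <- nnet(y ~ ., data = df, size = 8, decay = 0.01, maxit = 500)   # single-hidden-layer ANN
```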
4.2.1 Random Forest (RF) Random forests combine tree predictors that depend on values of a random vector sampled inde- pendently and identically for all trees in the forest. This methodology was proposed by Breiman [144] and quickly gained popularity among researchers and practitioners due to its simplicity and high accuracy. Random Forest is a generic method, but mostly has been used with classification trees. Random Forests grow multiple classification trees, and classify a new object from an input vector by putting it down each of the trees in the forest. Then each tree gives a classification, and the majority of 78 "votes" determines a class of a prediction. More specifically, each tree in random forest grows as follows. If a number of observations in the training set is n, it takes a sample of n observations at random with replacement, from the original data. This creates a training set for growing the tree. Then, if there are M input predictors, a number m < M is specified such that at each node, m variables are selected at random out of the M and the best split on these m divides the node. The value of m is being held constant during the forest growing. Then each tree grows to the largest extent possible with no pruning applied. Mainly, an error rate produced by random forests depends on two factors: First, increasing a corre- lation between any two trees in the forest increases the error rate. Second, increasing the strength of the individual trees decreases the forest error rate. Additionally, the error rate can be controlled by manipulating an m parameter. Reducing m reduces both the correlation and the strength and increasing it increases both. The advantages of using Random Forest is the ability to cope with thousands of features without variable deletion, provide variable importance assessment in the classification, generate an internal unbiased estimate of the generalization error, and have methods for balancing error in class population unbalanced data sets. 4.2.2 Support Vector Machines (SVM) In essence, the Support Vector Machine is a method proposed by Vapnik [145] that conceptually implements the following idea: it takes input vectors and maps them non-linearly to a very high dimension feature space. Then in this feature space it constructs a linear decision surface. Special properties of the decision surface guarantees high generalization ability of the learning method. SVMs can handle any number of classes, as well as observations of any dimension and can take almost any shape including linear, radial, and polynomial, among others. Particularly, SVMs construct a hyperplane or a set of hyperplanes, that is decision boundaries, in a high- or infinite- dimensional space, which can be used for classification, regression, and other type of problems. Good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class, also known as support vectors. 79 If classes are linearly separable, Hard Margin Classifier (HMC) can be implemented. HMC finds an optimal hyperplane, that separates classes while maximazing the distance to the closest points from the classes. The maximized distance is referred to as the Margin. HMC estimates the coefficients of hyperplanes by solving a quadratic programming problem with linear inequality constrains. If a perfect linear separation is not achievable or desirable, a Soft Margin Classifier (SMC) can be considered. 
While the data can be still linearly separable, the decision boundaries obtained using the HMC might not generalize well to new data and accuracy will suffer. To solve this issue, SMC loosens the constrains and allows some points to be wrongly classified. The set of points is called allowable budget. Finally, if classes are not linearly separable, Support Vector Machine projects data to higher dimensions, where they are linearly separable and constructs a hyperplane. Then it transforms this hyperplane back to the initial space and obtains a non-linear decision boundary. It is achieved by using a kernel trick that computes a dot product in some feature space without even knowing what the space is and what is a mapping function. Most commonly employed kernel functions are linear, polynomial, and radial basis functions. The advantages of Support Vector Machine are that it always guarantees to find a global optimum as it just solves convex optimization problem, relatively robust to outliers (soft margin), and is flexible (implements various kernel functions). The main drawback of SVM is that it slows down the training process as data become taller (when the number of observations is significantly greater than the number of predictors) as it has to estimate parameters for each row. 4.2.3 Gradient Boosting Machine (GBM) Gradient Boosting Machines, proposed by Friedman [146], quickly gained popularity due to their high accuracy and effectiveness in solving complex problems. Typically, it is hard for other methods to outperform the performance of GBMs and it is the algorithm of choice for many teams of machine learning competitions. GBMs build an ensemble of shallow trees in sequence where each tree learns and improves on the previous one. Even though shallow trees by themselves are 80 more of weak predictive models, they are able to boost and produce a powerful committee. The main idea of boosting algorithms is to add new models to the ensemble sequentially. It starts with a weak tree and sequentially continues to build new trees, where each new tree in the sequence fixes up where the previous one made the mistakes (for instance, each new tree in the sequence focuses on the training rows where the previous tree had the largest prediction errors). Specifically, at any instant the model outcomes are weighed according to the outcomes of the previous instant. The outcomes that are predicted correctly are given a lower weight and the ones that are miss- classified are given higher weights. Gradient Boosting Machines are considered a gradient decent algorithm. Gradient descent is a generic optimization algorithm that is capable of finding optimal solutions to a wide range of problems and can be used on any loss function that is differentiable. The fundamental idea of gradient descent is to search parameter values iteratively that will minimize a loss function. Here it is used to estimate the weights assigned to correctly and incorrectly predicted outcomes. Unlike bagging algorithms, GBM deals with bias variance trade-off by controlling both bias and variance and is proven to be more effective when applied to models with high bias and low variance. There are various versions of Gradient Boosting Machine. Particularly useful are Stochastic GBM and Extreme GBM. Stochastic GBM takes a random subsample of the training dataset that offers additional reduction in tree correlation which improves a prediction accuracy. 
Extreme GBM is an optimized distributed gradient boosting machine that improves the accuracy and speed of the method by employing parallelism in its algorithm and adding regularization parameters to the model. The main disadvantage of GBM method is that it is a complex and less intuitive algorithms. In addition, it is time and computationally expensive method. 4.2.4 Artificial Neural Network (ANN) An important subfield of Machine Learning is Deep Learning, which focuses on building predictive models based on artificial neural networks with two or more hidden layers. Artificial Neural Networks (ANN), first proposed by McCullough and Pitts [147], a model structure used in most 81 of the Deep Learning models, is inspired by the biological neural networks and mimics the way humans gain certain types of information through a combination of data inputs, weights, and bias. Like other machine learning algorithms, neural networks perform learning by mapping features to targets through a process of simple data transformations and feedback signals. Fundamental to most of the deep learning methods is the feedforward ANN. Feedforward ANNs are densely connected layers where inputs influence each successive layer which then influences the final output layer. Basic neural networks have three layers: an input layer, a hidden layer, and an output layer. The input layer consists of all of the original input features. Most of the learning happens in the hidden layer, and the output layer produces the final predictions. The layers and nodes are the building blocks of our ANN and they decide how complex the network will be. Layers are called dense if all the nodes in successive layer are connected. Consequently, the more layers and nodes you add the more opportunities you create for new features to be learned. There is no unique approach for determining the number of hidden layers and nodes; basically, these are the first hyperparameters among many others to tune. Mainly, features in your data largely determine the number of nodes you define in these hidden layers. The modeling task drives the choice of output layer. For regression problems, the output layer contains just one node that outputs the final prediction. If you are predicting a binary output, your output layer will still contain only one node and that node will predict the probability of success. Finally, if you predict an output with several classes, the output layer will contain the same number of nodes as the number of classes. A crucial component of artificial neural networks is activation. Each node in ANN is connected to all the nodes in the previous layer. Each connection gets a weight and then that node adds all the incoming inputs multiplied by its corresponding connection weight plus an extra bias parameter. This summation becomes an input to an activation function. The activation function is a mathematical function that determines whether to fire a signal to the next layer. On the forward pass, the ANN will select a batch of observations, randomly assign weights across all the node connections, and predict the output. Then, it assesses its own accuracy and adjusts the weights across all the node connections in order to improve the accuracy. This process is called 82 backpropagation. To carry out backpropagation, first, you need to establish an objective (loss) function to measure performance. On each forward pass the ANN will measure its performance based on the loss function chosen. 
The ANN then works backwards through the layers, computes the gradient of the loss with respect to the network weights, adjusts the weights a little in the opposite direction of the gradient, grabs another batch of observations to run through the model, and repeats until the loss function is minimized. The performance of an ANN can be optimized by tuning its hyperparameters, for example by adjusting the model capacity (layers and nodes), adding batch normalization, adjusting the learning rate, or trying out different activation functions. Another way of improving an ANN's performance is to place constraints on the model's complexity with regularization, for example through dropout.

4.2.5 Least Absolute Shrinkage and Selection Operator (LASSO)

Nowadays, data sets typically contain a large number of features. As the number of features grows, certain assumptions required by traditional methods (e.g., linear models) break down and these models tend to overfit the data, causing the out-of-sample error to increase and making the results unreliable. One possible solution is to use regularization methods, which constrain or regularize the estimated coefficients and can reduce the variance and uncertainty of the estimates. As mentioned earlier, having a large number of features invites various issues in classic regression models: the model becomes much less interpretable, there may be infinitely many solutions for the model coefficients, and the predictors are likely to be highly correlated, which invites multicollinearity issues. Regularized techniques constrain the total size of the coefficient estimates, which helps reduce the magnitude and fluctuations of the coefficients and therefore the variance of the model.

Arguably one of the most well-known and frequently used regularized methods is the least absolute shrinkage and selection operator (LASSO). The method was proposed by Tibshirani [84]; it minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Because of the nature of this constraint, it pushes coefficients all the way to zero and produces some coefficients that are exactly 0, which yields interpretable models. The LASSO method can be summarized as follows. Suppose we have data (X_i, y_i), i = 1, 2, ..., N, where X_i = (X_{i1}, ..., X_{ip})^T are the predictors and y_i is the response variable. We assume that either the observations are independent or the responses are conditionally independent given the predictors. Finally, we assume that all predictors are standardized. Letting β̂ = (β̂_1, ..., β̂_p)^T, the LASSO estimate (α̂, β̂) is defined as

$$(\hat{\alpha}, \hat{\boldsymbol{\beta}}) = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j X_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j} |\beta_j| \le \lambda,$$

where λ ≥ 0 is a tuning parameter that controls the amount of shrinkage applied to the estimates. The LASSO combines automated feature selection with the shrinkage of ridge regression, and it exhibits the stability of the latter. The main disadvantage of the technique is that it achieves these results at the cost of producing biased estimates.

4.2.6 STEPWISE Method

STEPWISE is the procedure proposed in this thesis and described in Chapter 3. The proposed method fits GLMs with ultrahigh-dimensional predictors. It starts with an empty set or a pre-specified set of predictors, scans all features and sequentially selects them, and conducts backward elimination once the forward selection is completed.
The forward selection steps recruit variables in an inclusive way by allowing some false positives for the sake of avoiding false negatives, while backward selection steps eliminate the potential false positives from the recruited variables. STEPWISE algorithm embraces model selection and estimation, controls both false negatives and positives by using different stopping criteria in the forward and backward selection steps, yields consistent estimates, and accommodates a wide range of data types, such as binary, categorical, and count data. In addition, under a sparsity assumption of the true model, it can discover all of the relevant predictors within a finite number of steps, and can produce a final model in ultrahigh dimensional settings without applying a pre-screening step which may introduce unintended false negatives. 84 4.3 Multi-Stage Hybrid Machine Learning Method 4.3.1 Introduction The proposed multi-stage hybrid machine learning method carries out a stacking technique. Stack- ing method is designed to boost predictive accuracy by blending the predictions of multiple machine learning models. Stacked generalization or stacked was proposed by Wolpert [148] and is widely used by other researchers and practitioners, [149, 150, 151]. Stacking is a technique in which the predictions produced by a collection of models are given as inputs to a second-level learning algorithm. This second-level algorithm is trained optimally to combine the model predictions and form a final set of predictions. Specifically, stacking method trains a new learning algorithm to combine predictions of several base-learners, also known as, individual models. First, base-learners are trained using the training data, then a combiner, called a super learner, is trained to make a final prediction based on the predictions of the base learners. It is important that the dataset collected for the stacked model consists of out-of-sample model predictions. In other words, to obtain the prediction for a certain data point in the data set, the model parameters should be estimated on a training set which does not include that particular data point. This is normally achieved via 𝐾-fold cross-validation. The training data is split into almost equal 𝐾 subsets and 𝐾 versions of the model are trained, each on the data with a different subset removed. Thus, model predictions for the 𝑘th subset are produced from the model trained on a set that did not include that subset. To set up a multi-stage hybrid machine learning method, the following steps are being completed. First, we specify a list of base learners and a super learner, also know as, a meta algorithm. We select Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting Machine (XGBoost), Least Absolute Shrinkage and Selection Operator (LASSO), Artificial Neural Network (ANN), and STEPWISE method as the base learners. Linear weighted summation combiner is being used as a super learner. Next, we train each of these base learners on the training data. Specifically, we employ 𝐾-fold cross-validation for each of the base learners and collect the cross- 85 validated class probabilities from them. The cross-validated class probabilities are combined to create a new feature matrix. Finally, we use this new feature matrix to train the meta learning algorithm, which can be used to generate predictions on new data. In other words, output values of each base learner become input values for the super learner. 
To generate ensemble predictions, we first generate class probabilities from each of the base learners and then feed these class probabilities into the super learner, which produces the ensemble prediction. Algorithm 1 summarizes the proposed method and provides a high-level explanation.

Algorithm 1 MULTI-STAGE HYBRID MACHINE LEARNING METHOD
1. Set up the stacked model
   • Specify a list of M base learners with a determined set of model parameters
   • Specify a super learner algorithm
2. Train the model
   • Train each of the M base learners on the training set
   • Perform cross-validation on each of the base learners and collect the cross-validated class probabilities from each (denoted as p_1, ..., p_M)
   • Combine the N (the number of observations in the training set) cross-validated class probability values from each of the M base learners into a new N × M feature matrix. This matrix, along with the original response vector (y), is called level-one data
   • Train the super learner on the level-one data
3. Predict on new data
   • To generate stacked predictions, first generate class probabilities from the base learners
   • Feed those class probabilities to the super learner and produce the final predictions

Our method is called hybrid because it employs both model-free and model-based methods, and we call it multi-stage because it consists of two stages: setting up and training the base learners, and then training the super learner and generating class probabilities from it. It is worth mentioning that the stacked model works best when a diverse set of methods is selected as base learners. Therefore, our model includes both types of methods.

4.3.2 Algorithm

Define D = {(X_i, Y_i), i = 1, ..., N} to be a data set (D is also referred to as level-0 data) consisting of vectors X_i representing the attribute values of the ith instance and Y_i representing the class value. Let A_j, j = 1, ..., J, be the base-learner algorithms, also known as level-0 estimators. Given the data set D, we randomly split it into K nearly equal parts, D_1, ..., D_K. Let D_k, k = 1, ..., K, and D^(-k) = D \ D_k be the test and training sets for the kth fold of K-fold cross-validation. Given the J base-learner algorithms, we separately invoke the jth algorithm on the data in the training set D^(-k) to induce a model M_j^(-k) for j = 1, ..., J. For each instance (X_i, Y_i) ∈ D_k, let g_j(X_i) denote the class probability estimated by the model M_j^(-k) for X_i. At the end of the cross-validation process, after applying the testing data set D_k for each k = 1, ..., K to each M_j^(-k) for j = 1, ..., J, the base-learner class probabilities form a meta-instance (g_1(X_i), ..., g_J(X_i), Y_i) together with the output variable for the original instance. A new data set
$$D_{CV} = \Big\{\big(g_1(\mathbf{X}_i), \ldots, g_J(\mathbf{X}_i), Y_i\big),\; i = 1, \ldots, N\Big\}$$
is assembled, also known as level-1 data. Note that the original X_i is replaced with the corresponding level-0 output vector (g_1(X_i), ..., g_J(X_i)). Now, at the second stage, referred to as the level-1 learning stage, we derive our final level-1 model, M̃, from D_CV. The level-1 model is constructed in the following form:
$$\tilde{M}(\mathbf{X}) = \sum_{j=1}^{J} \alpha_j\, g_j(\mathbf{X}),$$
where g_j(X) is the jth level-0 class probability and α_j is its corresponding weight.
The values of the α's are computed based on the corresponding level-0 model performances and are derived as
$$\alpha_j = K_j^2 \Big/ \sum_{j=1}^{J} K_j^2,$$
where K_j is the Kappa coefficient of the jth base learner, defined (dropping the subscript) as
$$K = \frac{\text{accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}},$$
with
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
and
$$\text{expected accuracy} = \frac{TP + FP}{N} \times \frac{TP + FN}{N} + \frac{TN + FP}{N} \times \frac{TN + FN}{N},$$
where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. Breiman [152] suggests that the non-negativity constraint α_j ≥ 0 provides consistently good results. Now, to make the final prediction, we use the models M_j, j = 1, ..., J, in conjunction with M̃. Given a new instance X_l, the models M_j produce a vector (g_1(X_l), ..., g_J(X_l)). This vector is then used as the input to the level-1 model M̃, whose output is the final prediction for that instance.

4.4 Application: Bladder Cancer Prediction

4.4.1 Data Description

To utilize the aforementioned method, we obtained data from Usuba et al. [69]. The goal of building this model is to identify important miRNA biomarkers that have an impact on bladder cancer development and can help in early detection of the disease. Moreover, the ensemble method is expected to enhance the predictive power of the proposed STEPWISE method taken separately. The data consist of 972 samples profiling 2565 miRNAs. Specifically, 392 serum samples were obtained from bladder cancer patients who were admitted or referred to the National Cancer Center Hospital (NCCH) between 2008 and 2016. A total of 580 serum samples from non-cancer individuals were collected from 2 independent cohorts: the first cohort included individuals whose serum samples were collected and stored by the National Center for Geriatrics and Gerontology (NCGG) Biobank between 2010 and 2012, and the second cohort included volunteers aged over 35 years who were recruited from the Yokohama Minoru Clinic in 2015. We defined the binary outcome variable to be 1 if the subject was a case and 0 otherwise. To proceed, we randomly divided the samples into training and testing sets. The training set consists of 80% of the original data (310 samples with bladder cancer and 468 non-cancer controls) and the testing set consists of the remaining 20% (82 samples with bladder cancer and 112 non-cancer controls). Table A.3 summarizes the characteristics of the samples used in the study.

4.4.2 Results

We perform the data analysis in two parts. First, we employ the STEPWISE procedure separately and evaluate its performance based on the obtained results. Then, we implement the multi-stage hybrid machine learning method and demonstrate its advantages over the former model. To begin with, we further split the training set into training and validation sets using 5-fold cross-validation in order to identify the best configuration of the η1 and η2 parameters. Specifically, we use a grid search approach and specify a set of values for each of these two parameters: η1 is searched over the grid {0, 0.25, 0.5, 0.75, 1} and η2 over {1, 2, 3, 3.5, 4, 4.5, 5}. The results from the cross-validation procedure are presented in Table 4.1. They show that the STEPWISE method with η1 = 0.5 and η2 = 3 performed the best, so this pair of values is used in the subsequent analysis.
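The grid search just described can be sketched generically in R. In the sketch below, stepwise_cv_auc() is a hypothetical placeholder standing for "fit STEPWISE on the training folds with the given tuning parameters and return the AUC on the held-out fold"; it is not part of the STEPWISE package interface, and train_dat stands for the training data:

# Sketch of the 5-fold grid search over (eta1, eta2); stepwise_cv_auc() and
# train_dat are hypothetical placeholders used only for illustration.
eta1_grid <- c(0, 0.25, 0.5, 0.75, 1)
eta2_grid <- c(1, 2, 3, 3.5, 4, 4.5, 5)
K <- 5
fold <- sample(rep(1:K, length.out = nrow(train_dat)))

cv_auc <- matrix(NA, length(eta1_grid), length(eta2_grid),
                 dimnames = list(eta1_grid, eta2_grid))
for (i in seq_along(eta1_grid)) {
  for (j in seq_along(eta2_grid)) {
    aucs <- sapply(1:K, function(k)
      stepwise_cv_auc(train = train_dat[fold != k, ],
                      valid = train_dat[fold == k, ],
                      eta1  = eta1_grid[i], eta2 = eta2_grid[j]))
    cv_auc[i, j] <- mean(aucs)                 # average AUC over the 5 folds
  }
}
which(cv_auc == max(cv_auc), arr.ind = TRUE)   # best (eta1, eta2) configuration

The configuration maximizing the average held-out AUC is the one carried forward, as in Table 4.1.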
Next, we applied the proposed STEPWISE method to the training set with n = 778 and p = 2565. With no prior biological information available, we started with an empty set. In the forward step, STEPWISE selected the miRNAs miR-6087, miR-5100, miR-1914-3p, miR-6831-5p, miR-2110, miR-6717-5p, miR-1343-3p, miR-6069, miR-6780b-5p, and miR-1343-5p, and the backward step eliminated miR-6780b-5p and miR-1343-5p. The STEPWISE procedure produced the following final predictive model:
logit^{-1}(88.33 − 8.08 × miR-6087 + 2.53 × miR-5100 − 3.54 × miR-1914-3p + 1.22 × miR-6831-5p − 1.57 × miR-2110 + 2.26 × miR-6717-5p − 2.51 × miR-1343-3p + 0.75 × miR-6069),
where logit^{-1}(x) = exp(x)/(1 + exp(x)).

Table 4.1: Results of the 5-fold cross-validation procedure for the STEPWISE method

                           η2
            1      2      3      3.5    4      4.5    5
     0      0.885  0.893  0.893  0.895  0.895  0.905  0.905
     0.25   0.885  0.893  0.893  0.895  0.895  0.905  0.905
η1   0.5    0.917  0.917  0.931  0.925  0.925  0.911  0.911
     0.75   0.900  0.900  0.895  0.895  0.883  0.879  0.879
     1      0.875  0.870  0.870  0.863  0.861  0.861  0.859

Note: values for η1 and η2 were searched over the grids {0, 0.25, 0.5, 0.75, 1} and {1, 2, 3, 3.5, 4, 4.5, 5}, respectively. The optimal configuration of the parameters was identified by comparing values of the area under the receiver operating characteristic curve (AUC-ROC); the pair of parameter values that maximized the AUC was selected for further analysis.

In the testing data set, the model had an AUC of 0.92 and achieved an accuracy of 0.91, with sensitivity, specificity, and precision of 0.93, 0.86, and 0.90, respectively. Finally, we repeatedly applied the sampling procedure, splitting the data into training and testing sets 100 times, and obtained the average accuracy, sensitivity, specificity, precision, and AUC. The results are presented in Table 4.2.

Table 4.2: Assessment of the proposed STEPWISE procedure using the bladder cancer data set

              Accuracy  Sensitivity  Specificity  Precision  AUC
Training set  0.92      0.94         0.92         0.92       0.94
Testing set   0.91      0.93         0.90         0.89       0.92

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits.

In order to develop the final multi-stage hybrid machine learning model, we first built the other base-learner models included in the stacked method, namely RF, SVM, XGBoost, LASSO, and ANN. After carrying out a 5-fold cross-validation procedure, the following sets of hyperparameters were identified for each of these methods: 600 trees were selected for the RF model, along with 5 instances in the terminal nodes and 250 randomly selected predictors in each tree; SVM achieved its best results with parameter C = 0.01 and a polynomial kernel of the second order; XGBoost performed best with 300 trees, tree depth and learning rate equal to 8 and 0.1, respectively, and a minimum of 5 samples in the terminal nodes; ANN constructed its model with two hidden layers containing 70% and 35% of the number of predictors as nodes, respectively, a learning rate of 0.1, and a dropout rate of 0.6; LASSO selected its penalty parameter to be 0.0361. The results of their performance on the training and testing sets are summarized in Table 4.3. The weights, α's, for the super learner combiner are computed based on the base learners' performance achieved during the first stage of modeling. Specifically, 0.17, 0.17, 0.14, 0.18, 0.18, and 0.16 are the weights assigned to STEPWISE, RF, SVM, XGBoost, ANN, and LASSO, respectively. Table 4.4 presents the results obtained from evaluating the multi-stage hybrid machine learning model. It can be observed that the hybrid model markedly improved the performance of the STEPWISE method and also outperformed the other methods included in the model.
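To make the Kappa-based weighting of Section 4.3.2 concrete, the following sketch computes Kappa statistics from confusion-matrix counts and turns them into normalized weights for the weighted-sum super learner. The counts and class probabilities below are hypothetical and used only to illustrate the computation; they are not the values obtained in this analysis:

# Kappa-based weights for the weighted-sum super learner (illustrative counts only).
kappa_stat <- function(TP, FP, TN, FN) {
  N       <- TP + FP + TN + FN
  acc     <- (TP + TN) / N
  exp_acc <- (TP + FP) / N * (TP + FN) / N + (TN + FP) / N * (TN + FN) / N
  (acc - exp_acc) / (1 - exp_acc)
}

kappas <- c(kappa_stat(TP = 290, FP = 30, TN = 430, FN = 28),   # base learner 1
            kappa_stat(TP = 280, FP = 45, TN = 415, FN = 38),   # base learner 2
            kappa_stat(TP = 295, FP = 25, TN = 435, FN = 23))   # base learner 3

alpha <- kappas^2 / sum(kappas^2)   # weights alpha_j = K_j^2 / sum_j K_j^2

# Weighted-sum prediction for a new instance, given its level-0 class probabilities
g <- c(0.90, 0.75, 0.85)            # hypothetical base-learner class probabilities
ensemble_prob <- sum(alpha * g)     # super learner (level-1) prediction

Because the Kappa statistics are squared before normalization, base learners with stronger level-0 performance receive disproportionately larger weights.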
Finally, we performed a sensitivity analysis to quantify the relationship between the model performance and the weights assigned to the base learners. Specifically, we tried out 7 model settings with different weight configurations and compared them with our existing model: one model assigns equal weights to each method, while each of the remaining 6 models assigns a high weight of 0.8 to one of the base learners and keeps the other weights equal to 0.04. The results are summarized in Table 4.5 and illustrate the advantage of our model over the other competing model settings. The proposed multi-stage hybrid model outperformed Models 2-7 in all evaluation metrics; Model 1 achieved comparable results, as expected, since its assigned weights were similar to ours.

Table 4.3: Comparison of base-learner methods included in the multi-stage hybrid machine learning model over 100 independent splits of the bladder cancer data into training and testing sets

Training set  Accuracy  Sensitivity  Specificity  Precision  AUC
RF            0.95      0.95         0.96         0.94       0.95
SVM           0.91      0.93         0.92         0.93       0.92
XGBoost       0.95      0.94         0.96         0.97       0.96
ANN           0.95      0.96         0.94         0.97       0.96
LASSO         0.94      0.93         0.95         0.94       0.95

Test set      Accuracy  Sensitivity  Specificity  Precision  AUC
RF            0.93      0.94         0.93         0.92       0.93
SVM           0.89      0.92         0.90         0.87       0.89
XGBoost       0.94      0.93         0.95         0.93       0.95
ANN           0.94      0.95         0.93         0.91       0.95
LASSO         0.92      0.91         0.92         0.89       0.94

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits; RF - Random Forest; SVM - Support Vector Machine; XGBoost - Extreme Gradient Boosting Machine; ANN - Artificial Neural Network; LASSO - Least Absolute Shrinkage and Selection Operator.

Table 4.4: Evaluation of the proposed multi-stage hybrid machine learning model with the bladder cancer data set

              Accuracy  Sensitivity  Specificity  Precision  AUC
Training set  0.99      1.00         0.99         0.99       1.00
Testing set   0.98      0.98         0.98         0.97       0.99

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits.

Table 4.5: Comparison of various model configurations included in the sensitivity analysis

Training set  Accuracy  Sensitivity  Specificity  Precision  AUC
Model 1       0.98      0.98         0.97         0.96       0.97
Model 2       0.96      0.96         0.96         0.95       0.95
Model 3       0.93      0.94         0.94         0.92       0.93
Model 4       0.97      0.98         0.97         0.97       0.95
Model 5       0.97      0.97         0.96         0.97       0.95
Model 6       0.94      0.95         0.93         0.94       0.93
Model 7       0.94      0.94         0.93         0.94       0.93

Note: values of accuracy, sensitivity, specificity, precision, and AUC were averaged over 100 splits; Model 1 corresponds to the equal-weights scenario; Models 2-7 correspond to the scenarios with a high weight of 0.8 assigned to one of the base learners while keeping the other weights equal to 0.04; the high weight was assigned to the methods in the following order: Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting Machine (XGBoost), Artificial Neural Network (ANN), Least Absolute Shrinkage and Selection Operator (LASSO), STEPWISE procedure.

4.5 Web Application

An R-Shiny web application was developed to enable users to employ the proposed multi-stage hybrid machine learning method in practice. The main goal of this app is to help users analyze their own data and build predictive models according to our algorithm. The web application can be accessed online at Multi-Stage_Hybrid_ML_Method. Specifically, it is aimed at solving classification problems and has the following features. First, users have the option to either upload their own data sets or use pre-built sets.
The pre-built option includes two well-known data sets: Iris and Abalone. Once the data are uploaded or selected from the given options, users can split the data into training and testing sets. For instance, specifying a validation split of 0.8 will split the data into training and testing sets with a 4:1 ratio. In addition, one can apply pre-processing steps to the data, which is an important part of data analysis. As of now, three options are available: standardizing numerical features (making each feature have mean zero and variance one), imputing missing values via the K-Nearest Neighbor technique (a method that takes into account the relationships among predictors), and removing zero- and near-zero-variance variables (features that have no or minimal effect on the response). Figure A.1 illustrates this step; a brief R sketch of the data-splitting and pre-processing steps is given at the end of this section.

Next, users can tune the hyperparameters of the base-learner methods. Almost all machine learning methods have hyperparameters that can be tuned, and tuning them can improve model performance and reduce the chances of overfitting. In order to accomplish this task, the web application employs a grid search technique: users specify a set of values for each of the hyperparameters provided in the menu bar. Two resampling methods are available: k-fold cross-validation and repeated k-fold cross-validation. Once all necessary items are selected, the web application tunes these parameters and displays numeric and visual results in the "Numerical Results" and "Visualize Results" tabs, respectively. Figure A.2 illustrates the hyperparameter tuning procedure for the Random Forest method.

Lastly, after the hyperparameter tuning step is complete and a set of values for these parameters is selected, users can proceed further and build their final predictive model. At this step, they can indicate the parameter values for each base learner obtained from the previous step and set weights for them. The app will train the model and display the results, which include model performance metrics (computed for both training and testing sets), feature importance, and final predictions. Figure A.3 illustrates the output for the final predictive model, including model evaluations for both the training and testing sets.
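As referenced above, the data-splitting and pre-processing options offered by the app can be sketched in R with the caret package. The package, the simulated data frame, and the column names are assumptions made for illustration only; the app's internal implementation is not shown in this thesis:

# Sketch of the data-splitting and pre-processing options (caret assumed for illustration).
library(caret)

set.seed(1)
dat <- data.frame(a = rnorm(100), b = rnorm(100), c = c(rep(0, 97), 1, 2, 3))
dat$a[1:5] <- NA                                   # a few missing values
dat$y <- factor(rbinom(100, 1, 0.5))               # binary outcome

idx      <- createDataPartition(dat$y, p = 0.8, list = FALSE)   # roughly 4:1 split
train_df <- dat[idx, ]
test_df  <- dat[-idx, ]

pp <- preProcess(train_df[, c("a", "b", "c")],
                 method = c("center", "scale",   # standardize: mean 0, variance 1
                            "knnImpute",         # impute missing values via k-nearest neighbors
                            "nzv"))              # drop zero- and near-zero-variance features

train_pp <- predict(pp, train_df[, c("a", "b", "c")])   # estimated on the training set ...
test_pp  <- predict(pp, test_df[, c("a", "b", "c")])    # ... and applied to the testing set

Estimating the transformations on the training set and then applying them unchanged to the testing set avoids leaking information from the held-out data.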
CHAPTER 5

CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH

In this thesis we have proposed to apply STEPWISE to produce final models in ultrahigh-dimensional settings, without resorting to a pre-screening step. We have shown that the method identifies or includes the true model with probability going to 1, and produces consistent coefficient estimates, which are useful for properly interpreting the actual impacts of risk factors. The theoretical properties of STEPWISE are established under mild conditions, which are worth discussing. Because in practice covariates are often standardized for various reasons, Condition (2) is assumed without loss of generality. Conditions (3) and (4) are generally satisfied under common GLM models, including Gaussian, Binomial, Poisson, and Gamma distributions. Condition (5) is also often satisfied in practice, and Proposition 2 in Zhang et al. [121] may be used as a tool to verify it. Conditions (1) and (6), which involve the unknown true model size |M| and the minimum signal strength n^{-α}, must be taken in good faith in practice. The "irrepresentable" condition (6) is strong and may not hold in some real data sets (see, e.g., [153, 154]). However, the condition holds under some commonly used covariance structures, including AR(1) and the compound symmetry structure [153].

As shown in the simulation studies and real data analyses, STEPWISE tends to generate models as predictive as the other well-known methods, with fewer variables (Figure 3.2). Parsimonious models are useful for biomedical studies, as they explain the data with a small number of important predictors and offer practitioners a realistic list of biomarkers to investigate. With categorical outcome data frequently observed in biomedical studies (e.g., histology types of cancer), STEPWISE can be extended to accommodate multinomial classification, with more involved notation and computation. We will pursue this elsewhere.

As shown and discussed in the previous chapters of the thesis, the STEPWISE procedure controls both false positives and false negatives in high-dimensional settings. This is achieved by employing different stopping criteria in the forward and backward selection steps, which adds flexibility to our algorithm. Specifically, the stopping criterion in the forward selection step avoids false negatives by allowing some false positives in the early stages of model building, while the stopping criterion in the backward elimination step removes the potential false positives from the selected variables. In addition, the two extra parameters η1 and η2 involved in computing the stopping criteria determine how restrictive the variable screening process should be. Specifically, large values of η1 in the forward selection step will recruit fewer variables, and vice versa. Similarly, large values of η2 in the backward elimination step will remove more features. Thus, this framework can address different needs. For instance, if controlling false positives is the priority, then we recommend applying large values for the parameters, and if it is more meaningful to control false negatives, then small values should be used. It is worth noting that our method includes forward selection as a special case when the parameter value is equal to 0, making it even more flexible.

Moreover, in this thesis we prove that, under a sparsity assumption of the true model, the proposed STEPWISE approach can discover all of the relevant predictors within a finite number of steps. Sparse models are common in high-dimensional settings: among the hundreds or thousands of predictors involved in model development, only a handful have a significant relationship with the response variable. Including too many predictors in the model may result in overfitting, while keeping too few variables may lead to high bias and low predictive accuracy. Thus, correctly identifying the true signals and significant predictors and including them in the final predictive model is a crucial step in the model building process.

Finally, we developed a multi-stage hybrid machine learning method to boost predictive accuracy and improve the performance of the proposed method. It carries out a stacking technique and combines model-free and model-based methods, including the proposed STEPWISE method. Ting and Witten [155] suggested that users of the stacking method have a free choice of base-learner models. Therefore, we have selected heterogeneous machine learning methods (e.g., boosting, bagging, neural nets, and model-based methods) that have different strengths and disadvantages.
Having a diverse set of base learners makes our method applicable in various scenarios. In addition, Ting and Witten claimed and demonstrated, with empirical examples, that successful stacked generalization requires using class probabilities rather than class predictions. We also adopted this technique in our model. Ueda [149] defined several combination types that can be used to combine base-learner outputs: weighted sum (WS), class-dependent weighted sum (CSW), and linear stacked generalization (LSG). Erdogan and Sen [156] showed that none of these methods is superior to the others and that performance is data-driven and data-dependent. We have selected the WS technique to be implemented in the super learner. A Kappa statistic (K) was used to estimate and assign weights to the individual base-learner outputs, as it is considered a more accurate metric for model evaluation [157, 158]. These weights reflect the base learners' performance on the level-0 data: greater weights are assigned to base learners with stronger performance, and vice versa. This weight assignment method is believed to be more effective as it incorporates the significance of each method included in the model [159]. Finally, Breiman [152] reported that a non-negativity constraint on the assigned weights provides consistently good results; this constraint was added to our model as well. The numerical examples we provided have vividly demonstrated the improved predictive power of the proposed method. Moreover, we performed a sensitivity analysis to illustrate the superiority of the weight assignment technique used in the model over the other competing techniques. Lastly, we proposed and developed a web application that enables users to employ the proposed multi-stage hybrid machine learning method in practice.

There are several open questions. First, in our numerical experiments, we used cross-validation to choose values for η1 and η2, which seemed to work well. However, more in-depth research is needed to find their optimal values and strike a balance between false positives and false negatives. Second, despite our consistent estimates, drawing inference based on them remains challenging. Statistical inference, which accounts for uncertainty in estimation, is key for properly interpreting analysis results and drawing appropriate conclusions. Our asymptotic results, nevertheless, are a stepping stone toward this important problem. Third, although the proposed STEPWISE procedure is designed to deal with binary classification, it can be extended to accommodate multinomial classification, a commonly observed problem in biological or biomedical research. Most multinomial classification methods rely on sequential binary classification by way of one-versus-all or direct pairwise comparison [160], which requires selecting a reduction method from multiclass to binary. Further investigation will be needed to identify such methods, as this is not a trivial task and must be handled on a case-by-case basis.

APPENDICES

APPENDIX A

SUPPLEMENTARY MATERIALS

A.1 Supplementary Materials

An R package, STEPWISE, was developed and is available at https://github.com/AlexPijyan/STEPWISE, along with the examples shown in the dissertation.
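A typical installation call, assuming the repository above follows the standard R package layout, is:

# Install the STEPWISE package from the GitHub repository listed above
# (assumes a standard R package layout).
# install.packages("devtools")                  # if devtools is not yet installed
devtools::install_github("AlexPijyan/STEPWISE")
library(STEPWISE)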
A.2 Additional Results in the Real Data Analysis

Table A.1: Comparison of genes selected by each competing method from the mammalian eye data set

            STEPWISE  FR  LASSO  SIS+LASSO  SC  dgLARS
STEPWISE    3         3   2      2          2   0
FR                    4   2      2          2   0
LASSO                     16     5          2   0
SIS+LASSO                        9          2   0
SC                                          4   0
dgLARS                                          7

Note: Diagonal and off-diagonal elements of the table represent the model sizes for each method and the number of overlapping genes selected by the two methods corresponding to the row and column, respectively.

Table A.2: Selected miRNAs for the ESCC training data set

Methods      Selected miRNAs
STEPWISE     miR-4783-3p; miR-320b; miR-1225-3p
FR           miR-4783-3p; miR-320b; miR-1225-3p; miR-6789-5p
SC           miR-4783-3p; miR-320b; miR-1225-3p; miR-6789-5p
LASSO        miR-6789-5p; miR-6781-5p; miR-1225-3p; miR-1238-5p; miR-320b; miR-6794-5p; miR-6877-5p; miR-6785-5p; miR-718; miR-195-5p
SIS+LASSO    miR-6785-5p; miR-1238-5p; miR-1225-3p; miR-6789-5p; miR-320b; miR-6875-5p; miR-6127; miR-1268b; miR-6781-5p; miR-125a-3p
dgLARS       miR-891b; miR-6127; miR-151a-5p; miR-195-5p; miR-3688-5p; miR-125b-1-3p; miR-1273c; miR-6501-5p; miR-4666a-5p; miR-514a-3p

Note: LASSO, SIS+LASSO, and dgLARS selected 20, 17, and 33 miRNAs, respectively, and we only reported the top 10 miRNAs.

Table A.3: Clinicopathologic characteristics of participants in the bladder cancer study

Covariates                      Training Set n1 (%)   Testing Set n2 (%)
Bladder cancer patients
  Total number of patients      310                   82
  Age, median (range)           68 (32-90)            70 (34-93)
  Gender: Male                  233 (74.9%)           54 (65.8%)
          Female                77 (25.1%)            28 (34.2%)
  Tumor Stage (%):