MACHINE LEARNING AIDED FEATURE SELECTION FOR ULTRAHIGH DIMENSIONAL DATA: THEORY AND APPLICATIONS By Arkaprabha Ganguli A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy 2023 ABSTRACT Feature selection methods for ultra-high dimensional datasets have gained significant popularity in the field of statistical machine learning due to their wide applicability across various scientific domains. These methods aim to uncover the true sparsity pattern by identifying a small subset of features that are truly associated with the response variable. However, traditional feature selection algorithms may suffer from high false discovery rates, limiting their ability to provide meaningful insights into the underlying relationships. To address this issue, this thesis focuses on the development and study of two novel feature selection methods that incorporate False Discovery Rate (FDR) control. These methods are specifically applied to real-world diffusion magnetic resonance imaging (DMRI) tractography data, demonstrating their effectiveness in addressing several challenging issues in ultrahigh dimensional datasets. In the first chapter, we propose a p-value-free FDR controlling method for feature selection. Most of the state-of-the-art methods in the literature for controlling FDR rely on p-value, which depends on specific assumptions on the data distribution and may be questionable in some high-dimensional settings. To surpass this problem, we propose a ‘screening & cleaning’ strategy consisting of assigning importance scores to the predictors, followed by constructing an estimate of the FDR. We study the theoretical properties of the method and demonstrate its superior performance compared to existing methods in an extensive simulation study. Finally, we apply the method to a gene expression dataset and identify important genes associated with drug sensitivity. In the second chapter, We extend the feature selection method from a linear model to a non-linear and non-parametric setting by utilizing the Deep Learning (DL) framework. The DL has been at the center of analytics in recent years due to its impressive empirical success in analyzing complex data objects. Despite this success, most existing tools behave like black-box machines, thus the increasing interest in interpretable, reliable, and robust deep learning models applicable to a broad class of applications. Feature-selected deep learning has emerged as a promising tool in this realm. However, the recent developments do not accommodate ultra-high dimensional and highly correlated features or high noise levels. In this article, we propose a novel screening and cleaning method with the aid of deep learning for a data-adaptive multi- resolutional discovery of highly correlated predictors with a controlled FDR. Extensive empirical evaluations over a wide range of simulated scenarios and several real datasets demonstrate the effectiveness of the proposed method in achieving high power while keeping the false discovery rate at a minimum. In the third and final chapter, we apply the proposed feature selection methods to the brain imaging tractography dataset. Our motivation comes from the evidence from studies of dementia which shows that some older adults continue to maintain their cognitive abilities despite signs of ongoing neuropathological diseases. 
Commonly referred to as cognitive reserve, this phe- nomenon has unclear neurobiological substrates and a current understanding of corresponding markers is lacking. This study aims at investigating the immense system of structural connec- tions between brain regions constituting subcortical white matter (WM) as potential markers of cognitive reserve. Diffusion MRI tractography is an established computational neuroimaging method to model WM fiber organization throughout the brain. Standard statistical analyses ca- pable of leveraging the high dimensionality of tractography data face additional methodological complications beyond those encountered in typical feature selection problems. Our proposed methodology is specifically tailored for addressing these concerns. Extensive simulation studies on synthetic datasets mimicking the real tractography dataset demonstrate a substantial gain in power with minimal false discoveries, compared with state-of-the-art methods for feature selection. Our application to predicting cognitive reserve in a clinical aging neuroimaging trac- tography dataset produces anatomically meaningful discoveries in brain regions associated with risk and resilience to neurodegeneration. Overall, this thesis presents novel and effective methods for feature selection in ultrahigh dimensional settings. Our proposed framework would benefit the researchers and professionals who encounter the difficulty of choosing pertinent variables from correlated and vast datasets in diverse fields, ranging from finance and social sciences to biology. Copyright by ARKAPRABHA GANGULI 2023 I dedicate this thesis to my parents, Rita Ganguli and Malay Ganguli, whose unwavering love, support, and encouragement have been the foundation of my academic journey. v ACKNOWLEDGEMENTS I would like to express my deep gratitude to my advisors, Dr. Tapabrata Maiti and Dr. David Todem, for their invaluable guidance, continuous support, and encouragement throughout my graduate studies. Their expertise, wisdom, and generosity have been instrumental in shaping my research and shaping me as a statistician. I am also grateful to my thesis committee members, Dr. Haolei Weng, Dr. Lyudmila Sakha- nenko, and Dr. Andrew Bender, for their time, feedback, and valuable insights. Their immense help is a critical part of the theory and application part of this thesis. Their constructive criticism and thoughtful suggestions have greatly enriched my research and improved the quality of my thesis. I am grateful to the faculty and staff in the Department of Statistics and Probability at Michi- gan State University for providing a stimulating academic environment throughout my studies. The coursework I undertook in the department expanded my knowledge and equipped me with valuable skills for my future endeavors. I will always cherish the interactions with fellow students and faculty, and I am thankful for the unwavering support from the staff. I would like to extend my deepest gratitude to my family, friends, and loved ones for their constant support, love, and inspiration that has been instrumental throughout my academic journey. I am especially grateful to my wife, Sikta, for her presence, solace, and joy that she has brought into my life. I am profoundly grateful to my parents and sister for their unwavering support, unconditional love, and countless sacrifices they have made throughout my academic journey. They have been my pillars of strength. 
I am also thankful to my friends at MSU, including Anirban, Sampriti, Mookyong, Tathagata, Sumegha, Hema, and many others, for their companionship and support. Their presence has made my journey even more memorable. I extend my heartfelt wishes to all of you for a bright and prosperous future filled with happiness and success. To all of you who have been an integral part of my journey, I am deeply thankful for making it a truly remarkable and fulfilling experience. vi TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Mathematical formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 The basic model selection methods . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 Model free methods - Knockoffs and Deep Learning based approaches . . . . 4 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 CHAPTER 2 A P-VALUE-FREE FDR CONTROL METHOD FOR HIGH DIMENSIONAL VARIABLE SELECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Theoretical properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 APPENDIX A PROOF OF THEOREM 2.3.6 . . . . . . . . . . . . . . . . . . . . . . 37 APPENDIX B PROOF OF THEOREM 2.3.2 . . . . . . . . . . . . . . . . . . . . . . 40 CHAPTER 3 SCIDNET: ERROR CONTROLLED FEATURE SELECTION FOR ULTRA HIGH DIMENSIONAL AND HIGHLY CORRELATED FEATURE SPACE USING DEEP LEARNING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.3 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 APPENDIX A THEORETICAL STUDY . . . . . . . . . . . . . . . . . . . . . . . . . 73 APPENDIX B ADDITIONAL TECHNICAL DETAILS . . . . . . . . . . . . . . . . . 80 CHAPTER 4 DEEP LEARNING-AIDED FEATURE SELECTION FOR COGNITIVE RESERVE WITH HIGHLY CORRELATED TRACTOGRAPHY DATA . . . . . 88 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.2 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.3 Real Data Analysis - UM-MAP Tractography Data . . . . . . . . . . . . . . . . . 101 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 APPENDIX A MADC TRACTOGRAPHY DATA - MORE INFORMATION . . . . . . 
111
APPENDIX B LIST OF MAJOR WM TRACTS . . . . . . . . . . . . . . . . . . . . . 114
CHAPTER 5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

CHAPTER 1
INTRODUCTION

High-dimensional data analysis has become increasingly popular in various fields, such as biology, finance, social sciences, and engineering. In these fields, it is common to have datasets with a large number of features or variables, but a relatively small sample size. The main challenge in these situations is to identify the relevant features that are associated with the response variable while discarding the irrelevant ones, a task called variable selection.

1.1 Mathematical formulation
Within the framework of supervised learning, we denote a continuous response variable Y and a set of p continuous covariates X = (X_1, . . . , X_p). The cumulative distribution function (CDF) of the response variable Y is denoted by F_y(·), while the CDF of the k-th predictor X_k is denoted by F_k(·). We consider an ultrahigh-dimensional setting with a sample size of n and p = O(exp(n^τ)) for some τ > 0. To induce sparsity, we assume the existence of a subset S_0 ⊂ {1, 2, . . . , p} with |S_0| = O(1) such that, conditional on the features in S_0, the response Y is independent of the features in S_0^c. In other words, S_0 can be defined as {k : f(y|X) depends on X_k}, where f(y|X) is the conditional density of y given X. For the rest of this thesis, we call the relevant features in S_0 important or nonnull features, and the irrelevant features in S_0^c unimportant or null features. Our objective is to identify the sparsity structure by estimating S_0.

1.2 The basic model selection methods
Variable selection has a long history in statistics, and various methods have been proposed for this purpose. The basic model selection approaches aim to select the most parsimonious model among a set of candidate models based on a criterion such as the Akaike Information Criterion (AIC) Bozdogan (1987) or the Bayesian Information Criterion (BIC) Chen and Chen (2008). These basic model selection methods are computationally feasible and work well for low-dimensional feature spaces. However, for ultra-high dimensional feature spaces, they become computationally intractable. As a solution, one might use regularization methods, which impose a penalty on the model complexity to avoid overfitting. A detailed review of this literature can be found in Fan and Lv (2010). Under the linear model framework, suppose the response y is associated with X through a linear model:
$$ y_i = \mu + \sum_{j=1}^{p} X_{ij}\beta_j + \epsilon_i, \quad i = 1, 2, \ldots, n, $$
where, owing to the sparsity of the model, there exists a subset S_0 ⊂ {1, 2, . . . , p} for which β_j ≠ 0 for all j ∈ S_0 and β_j = 0 for all j ∉ S_0. The most popular penalized regression methods, such as the Lasso Tibshirani (1996) and the MCP Zhang (2010), minimize the following objective function:
$$ \hat{\beta} \in \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2n}\|Y - X\beta\|_2^2 + p_\lambda(\beta). $$
The first part of this objective function minimizes the mean squared error (MSE), and the second part imposes a penalty on the number of features in the model, thus increasing the model parsimony. For example, for the lasso, $p_\lambda(\beta) = \lambda\|\beta\|_1$; for the MCP,
$$ p_\lambda(\beta) = \sum_{j=1}^{p} \left\{ \left(\lambda|\beta_j| - \frac{\beta_j^2}{2a}\right) 1\{|\beta_j| \le a\lambda\} + \frac{a\lambda^2}{2}\, 1\{|\beta_j| > a\lambda\} \right\}, \quad a > 1; $$
and for SCAD,
$$ p_\lambda(\beta) = \sum_{j=1}^{p} \begin{cases} \lambda|\beta_j| & \text{if } |\beta_j| \le \lambda, \\ \dfrac{2a\lambda|\beta_j| - \beta_j^2 - \lambda^2}{2(a-1)} & \text{if } \lambda < |\beta_j| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |\beta_j| > a\lambda, \end{cases} \quad a > 2. $$
These sparsity-inducing regularized methods typically result in a set of features for which the estimated coefficients are non-zero.
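To make the three penalties above concrete, the following is a minimal numpy sketch of the coordinate-wise penalty functions p_λ(·) for the lasso, MCP, and SCAD as written above; the function names and the default values of a are illustrative choices, not prescriptions from the thesis.

```python
import numpy as np

def lasso_penalty(beta, lam):
    """Lasso penalty: lam * |beta_j|, summed over coordinates."""
    return lam * np.sum(np.abs(beta))

def mcp_penalty(beta, lam, a=3.0):
    """MCP penalty, applied coordinate-wise (requires a > 1)."""
    b = np.abs(beta)
    small = lam * b - b ** 2 / (2 * a)        # branch for |beta_j| <= a*lam
    large = a * lam ** 2 / 2                  # constant branch for |beta_j| > a*lam
    return np.sum(np.where(b <= a * lam, small, large))

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty, applied coordinate-wise (requires a > 2)."""
    b = np.abs(beta)
    p1 = lam * b                                                # |beta_j| <= lam
    p2 = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))  # lam < |beta_j| <= a*lam
    p3 = (a + 1) * lam ** 2 / 2                                 # |beta_j| > a*lam
    return np.sum(np.where(b <= lam, p1, np.where(b <= a * lam, p2, p3)))

if __name__ == "__main__":
    beta = np.array([0.0, 0.3, 1.5, -4.0])
    for f in (lasso_penalty, mcp_penalty, scad_penalty):
        print(f.__name__, round(f(beta, lam=0.5), 4))
```

Unlike the lasso, the MCP and SCAD penalties flatten out for large coefficients, which is what reduces the bias on strong signals.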
The selected set of relevant features is then Ŝ_n = {j ∈ {1, 2, . . . , p} : β̂_j ≠ 0}. To assess the performance of a feature selection method, one would
• maximize the Power = E(|Ŝ_n ∩ S_0| / |S_0|), the expected proportion of relevant features that are correctly identified, and
• minimize the FDR = E(|Ŝ_n ∩ S_0^c| / |Ŝ_n|), the expected proportion of falsely identified features among all the identified features.
As in the multiple testing scenario, a trade-off exists between power and FDR in high-dimensional variable selection. Among the various methods available, model selection consistent algorithms are the most desirable, as they asymptotically uncover the true sparsity pattern and have an asymptotic power of 1 with FDR decreasing to zero. However, these algorithms rest on stringent assumptions on the design matrix, which are typically not satisfied by modern high-dimensional datasets. For instance, the Lasso is model selection consistent only under the irrepresentable condition, as described in Zhao and Yu (2006), which requires the noise variables to be weakly correlated with the signal variables. As a result, the Lasso is inconsistent for variable selection in most modern studies. To relax these assumptions, we can first control the associated error and then try to maximize the power given the controlled error. For example, the Lasso can attain asymptotic power 1 under the Restricted Eigenvalue (RE) and beta-min conditions; however, such an approach does not offer any control over the associated error. Next, we discuss these two approaches: the sure screening property and FDR control.

1.2.1 Feature screening methods with sure screening property
A feature selection method enjoys the sure screening property if its output Ŝ_n satisfies the condition
$$ P(S_0 \subset \hat{S}_n) \to 1 \quad \text{as } n \to \infty. $$
This implies that all the relevant features are retained in the selected set of features. Hence, the asymptotic power converges to one; however, due to the lack of error control, these methods may result in a higher FDR. The sure screening property was first introduced by Fan and Lv (2008) in the context of variable selection for linear regression models. They proposed the sure independence screening (SIS) method, which selects a subset of features based on their marginal correlation with the response variable. Under certain conditions, SIS identifies all relevant features with high probability, and it has been shown to have good empirical performance in many applications. Relaxing the linearity assumption, several model-free feature screening methods have been developed over the years. For example, Xue and Liang (2017) developed a procedure based on two steps: (1) transforming the data (Y, X) to a Gaussian-distributed array by the nonparanormal transformation Liu et al. (2009), and then (2) performing a pairwise marginal independence test for a bivariate Gaussian distribution on the transformed variables (Y, X_j), for each j = 1, 2, . . . , p. This method enjoys the sure screening property under mild regularity conditions on the dimension of the feature space and the minimum signal strength. However, due to the inherent structure of the testing procedure, this method only handles a continuous response variable and is therefore applicable to regression tasks only. In a broader context, the feature screening method proposed by Zhou et al. (2018) can be employed for classification tasks. We discuss this in detail in Chapters 2 and 3 of this thesis.
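As an illustration of the SIS idea described above, here is a minimal sketch of marginal-correlation screening; the screening size d = ⌊n / log n⌋ is a common default in the SIS literature, and the function name is ours.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Sure independence screening: rank features by absolute marginal
    correlation with the response and keep the top d of them."""
    n, p = X.shape
    if d is None:
        d = int(np.floor(n / np.log(n)))        # common default screening size
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
    yc = (y - y.mean()) / y.std()
    marginal_corr = np.abs(Xc.T @ yc) / n       # |corr(X_j, y)| for each j
    keep = np.argsort(marginal_corr)[::-1][:d]
    return np.sort(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 2000
    X = rng.standard_normal((n, p))
    y = 2 * X[:, 5] - 1.5 * X[:, 17] + rng.standard_normal(n)
    print(sis_screen(X, y)[:10])   # indices 5 and 17 should typically be retained
```

The screened set is then passed to a second-stage method (penalized regression, knockoffs, or the cleaning step of Chapter 2) that performs the actual error-controlled selection.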
1.2.2 Feature screening methods with FDR control In addition to variable selection, controlling the False Discovery Rate (FDR) has also become an important issue in high-dimensional data analysis. As mentioned above, the FDR is the proportion of false positives among all the rejected null hypotheses. FDR-controlled methods aim to restrict the FDR at a pre-specified level while allowing some false positives. FDR control has been extensively studied in the multiple testing framework, and various methods have been proposed for this purpose, starting from the famous Benjamini-Hochberg (BH) method Benjamini and Hochberg (1995). Despite being widely used across scientific domains, the BH procedure relies on stringent assumptions regarding p-values, which can be challenging to satisfy in real-world datasets. As a solution, many other methods have been proposed along this line, see Tansey et al. (2018); Xia et al. (2017); Li and Barber (2019); Lei and Fithian (2018) for a more detailed overview. However, generating interpretable p-values for nonlinear models on high-dimensional data remains an unresolved research problem. To overcome this limitation, the knockoff framework was proposed by Candès et al. (2018). Essentially, this is a model-free variable selection algorithm with provable FDR control, assuming prior knowledge on the predictors’ distribution is available. In the next section, we discuss in detail the knockoff-based approaches. 1.3 Model free methods - Knockoffs and Deep Learning based approaches Knockoff-based methods have emerged as an alternative approach to address the challenges of variable selection in high-dimensional data. The Knockoff framework, introduced by Candès 4 et al. (2018), constructs a set of knockoff variables to mimic the correlation structure of the original variables while setting the knockoff features unassociated with the response y. The key idea is to use these knockoff variables as a reference to estimate the false discovery rate (FDR) of the selected variables, allowing for a controlled selection of features. The Knockoff approach has been extended to various regression settings and has been shown to outperform many other popular variable selection methods, especially in settings with complex nonlinear relationships between the response and the features. One advantage of the Knockoff framework is that it does not rely on any specific distributional assumptions, and thus is applicable to a wide range of data types. However, to generate the knockoff variables, one needs to know or estimate the distribution of the features which might be a daunting task in practice, especially for ultra-high dimensional data. In some cases, it may be possible to have prior knowledge of the correlation pattern among the features. For example, in genetics studies, there is a common notion of linkage disequilibrium, which helps to specify the dependency pattern among the alleles at polymorphisms (Sesia et al., 2018). However, this information is typically unavailable in many other domains. Recently, Barber et al. (2020) demonstrated that the knockoff framework can yield inflation in false discoveries, consistent with the error incurred in estimating the predictor’s distribution. This problem is further exacerbated by highly correlated features. An empirical illustration is provided in Figure 1.1, demonstrating how the model-X knockoff (Candès et al., 2018) typically fails to control FDR under a simplistic setting with high multicollinearity. 
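The knockoff construction just described can be sketched in a few lines for Gaussian features with known covariance. The code below generates Gaussian model-X knockoffs with the equicorrelated choice of the diagonal s, computes lasso coefficient-difference statistics, and applies the knockoff+ threshold; it is a simplified, illustrative stand-in for the procedure of Candès et al. (2018), not their reference implementation, and it mirrors the covariate design of Figure 1.1 with an illustrative noise level.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def gaussian_knockoffs(X, Sigma, rng):
    """Equicorrelated Gaussian model-X knockoffs for X ~ N(0, Sigma)."""
    n, p = X.shape
    s = min(2 * np.linalg.eigvalsh(Sigma).min(), 1.0) * (1 - 1e-6)
    D = s * np.eye(p)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    mu = X - X @ Sigma_inv_D                       # conditional mean of knockoffs
    V = 2 * D - D @ Sigma_inv_D                    # conditional covariance
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))
    return mu + rng.standard_normal((n, p)) @ L.T

def knockoff_select(X, X_tilde, y, q=0.1):
    """Lasso coefficient-difference statistics plus the knockoff+ threshold."""
    p = X.shape[1]
    beta = LassoCV(cv=5).fit(np.hstack([X, X_tilde]), y).coef_
    W = np.abs(beta[:p]) - np.abs(beta[p:])
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p, rho = 400, 100, 0.5
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = 0.5 * X[:, [19, 39, 59, 79, 99]].sum(axis=1) + rng.normal(0, 1, n)
    print(knockoff_select(X, gaussian_knockoffs(X, Sigma, rng), y, q=0.1))
```

Replacing the true Sigma with an estimate from the data is exactly where, as discussed above, the FDR guarantee can break down under strong multicollinearity.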
In recent years, deep learning (DL) based methods have also gained popularity in high- dimensional data analysis due to their ability to automatically extract relevant features from raw data. As a consequence, DL-based flexible knockoff generating algorithms have been proposed (Liu and Zheng, 2019; Jordon et al., 2019; Romano et al., 2020); however, they are trained in a typical big-n-small-p setting, and it is unclear how they will perform when the sample size n is significantly smaller than the dimension of the covariates p, and the predictors are highly correlated. We discuss this issue in detail in Chapter 3. 5 Figure 1.1 How multicollinearity affects Knockoffs - A demonstration using simplistic simulation setting: We simulate n = 400 iid copies of (y ∈ R, X ∈ R 100 ), where the outcome y is generated from a linear model: y = 0.5(X 20 + X 40 + X 60 + X 80 + X 100 ) + ϵ, ϵ ∼ N (0, 20) and the features X ∼ N100 (0, Σ), (Σ)i j = ρ |i − j | . We implement Model-X knockoff in two ways: (1) Model-X estimated: method proposed in Candès et al. (2018), generating knockoffs from the estimated distribution of the features from the data, and (2) Model-X True: generating the knockoff from the true distribution of X . For a higher autocorrelation ρ, which is a well-known difficult setting for traditional feature selection methods, the knockoffs-estimated loses its FDR control. This is because of the error in estimating the distribution of the features as knockoffs-true successfully maintains the power-FDR balance even for higher correlation. On the other hand, Deep learning-based methods have gained significant attention in recent years for feature selection and prediction tasks in high-dimensional data. Deep learning models are designed to automatically learn feature representations from the data, without requiring explicit distributional assumptions. In the context of feature selection, deep learning models can be used to extract relevant features from high-dimensional data, which can then be used as inputs to downstream statistical models for prediction or classification tasks. Chen et al. (2021) proposed an L 0 -norm based penalized neural network, called Deep Feature Selection (DFS). It enjoys exact model recovery under some mild assumptions on the underlying functional relationship between the response and the features. However, one challenge with deep learning-based feature selection is the lack of sufficient training data; which is quite common in many modern biological datasets. Recent efforts have been made to address this challenge, by developing methods to interpret 6 the learned features and to relate them to known biological or physical processes. To address this issue, Liu et al. (2017) proposed an ensemble-based method utilizing random dropouts, especially for high dimensional low sample size data. In this thesis, we propose an ensemble-based feature selection method for ultrahigh dimen- sional data with FDR control. Our method combines the strengths of regularization and model selection methods and achieves better performance than existing methods in terms of both variable selection and FDR control. We also provide a theoretical analysis of our method and show that it has desirable statistical properties. 7 BIBLIOGRAPHY Barber, R. F., Candès, E. J., and Samworth, R. J. (2020). Robust inference with knockoffs. The Annals of Statistics, 48(3):1409–1431. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. 
Journal of the Royal statistical society: series B (Method- ological), 57(1):289–300. Bozdogan, H. (1987). Model selection and Akaike’s Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370. Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: ‘model-x’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 80(3):pp. 551–577. Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771. Chen, Y., Gao, Q., Liang, F., and Wang, X. (2021). Nonlinear variable selection via deep neural networks. Journal of Computational and Graphical Statistics, 30(2):484–492. Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911. Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101. Jordon, J., Yoon, J., and van der Schaar, M. (2019). KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In International Conference on Learning Representations. Lei, L. and Fithian, W. (2018). Adapt: an interactive procedure for multiple testing with side infor- mation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):649–679. Li, A. and Barber, R. F. (2019). Multiple testing with the structure-adaptive benjamini–hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):45– 74. Liu, B., Wei, Y., Zhang, Y., and Yang, Q. (2017). Deep neural networks for high dimension, low sample size data. In IJCAI, pages 2287–2293. Liu, H., Lafferty, J., and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(10). 8 Liu, Y. and Zheng, C. (2019). Deep latent variable models for generating knockoffs. Stat, 8(1):e260. Romano, Y., Sesia, M., and Candès, E. (2020). Deep knockoffs. Journal of the American Statistical Association, 115(532):1861–1872. Sesia, M., Sabatti, C., and Candès, E. J. (2018). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1):1–18. Tansey, W., Wang, Y., Blei, D., and Rabadan, R. (2018). Black box FDR. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4867–4876. PMLR. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288. Xia, F., Zhang, M. J., Zou, J. Y., and Tse, D. (2017). Neuralfdr: Learning discovery thresholds from hypothesis features. In NIPS. Xue, J. and Liang, F. (2017). A robust model-free feature screening method for ultrahigh- dimensional data. Journal of Computational and Graphical Statistics, 26(4):803–813. Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563. 
9 CHAPTER 2 A P-VALUE-FREE FDR CONTROL METHOD FOR HIGH DIMENSIONAL VARIABLE SELECTION 2.1 Introduction Variable selection is a fundamental problem in high-dimensional statistical analysis, where the goal is to select a subset of relevant variables from a large pool of potential predictors to build a parsimonious model. In modern applications in almost all scientific domains, such as genomics, brain imaging, and finance, the number of potential predictors can be much larger than the sample size, which makes traditional variable selection methods such as stepwise regression or Akaike’s information criterion Bozdogan (1987) inefficient or impractical. Therefore, recent years have witnessed the development of various ultrahigh-dimensional variable selection methods that can handle a large number of variables relative to the sample size. These methods play a crucial role in improving the statistical accuracy and model interpretability while reducing the computational complexity at the same time. Penalized regression, which applies a penalty term to the likelihood function or objective function, is one of the most popular methods for variable selection in high-dimensional data analysis. Examples of these methods include Lasso and relevant algorithms Tibshirani (1996), SCAD Xie and Huang (2009), the elastic net Zou and Hastie (2003), adaptive Lasso Zou (2006) , and many more. Theoretical findings concerning parameter estimation, model selection, prediction, and oracle properties have been formulated across various model contexts. A comprehensive review of this literature is available in Fan and Lv (2010), and therefore, it is excluded from this discussion. One of the main challenges in ultrahigh-dimensional variable selection is controlling the false discovery rate (FDR), which is the proportion of falsely selected variables among all selected variables. FDR control is essential for ensuring the validity and reproducibility of the results, especially in large-scale studies involving thousands or millions of variables. For example, in genome-wide association studies (GWAS), researchers need to consider hundreds of thousands of genetic markers to identify the variants associated with a particular trait or disease. Since the 10 cost of false discoveries is high, as each selected variant requires a costly follow-up experiment, it is crucial to limit the number of false discoveries. Hence, researchers are interested in developing methods that can model the dependence structure of the data while ensuring an upper bound on the false discovery rate (FDR). The False Discovery Proportion (FDP) can be represented as a random variable denoted by F DP , where e0 F DP = N+ ∧ 1 Here, e 0 denotes the number of falsely selected variables, N+ denotes the total discoveries, and a ∧ b = max(a, b) tackles the situation where there is no discovery, i.e. N+ = 0. The FDR is defined as F DR = E (F DP ). Estimating this expectation is a challenging task for the variable selection problem, and researchers have attempted to tackle this issue from multiple perspectives. Traditional FDR controlling methods, such as the Benjamini-Hochberg procedure Benjamini and Hochberg (1995), are based on p-values, which require an assumption of normality and are often not robust to non-normality or heavy-tailed distributions. Moreover, the validity of p-values depends on several assumptions such as independence or positive regression dependency on a subset (PRDS), which may be difficult to justify or examine in practice. 
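For concreteness, the Benjamini-Hochberg step-up rule discussed above can be written in a few lines; this is the standard procedure, shown only as a reference point for the p-value-free alternative developed in this chapter.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Benjamini-Hochberg step-up procedure at target FDR level q.
    Returns the indices of the rejected hypotheses."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = q * np.arange(1, m + 1) / m
    below = np.nonzero(pvals[order] <= thresholds)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    k = below.max()                    # largest k with p_(k) <= k*q/m
    return np.sort(order[: k + 1])
```

The procedure presupposes valid p-values for every feature, which is precisely the ingredient that is unavailable in the high-dimensional regression settings considered here.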
Additionally, as noted in Candès et al. (2018), while maximum-likelihood theory can derive asymptotic p-values for low-dimensional generalized linear models (GLMs), it is unclear how to obtain p-values for high-dimensional models where the number of predictors exceeds the number of observations (n < p). Given these challenges, there is a need to develop a new FDR controlling method that does not rely on p-values and is suitable for both linear models (LMs) and GLMs. As a p-value-free FDR controlling method, The Model-X knockoffs Candès et al. (2018) are widely used offering guaranteed control of the FDR and the flexibility to employ arbitrary predictive models. To create the distinction between the null and nonnull features, it generates the ’knockoff’ features mimicking the dependence structure of the feature space while being independent of the response. To generate these knockoffs one needs to know or estimate the predictor’s distribution, which is a daunting task in practice, especially for ultra-high dimensional 11 feature space. However, even with knowledge of the underlying feature distribution, this method is infeasible unless the feature distribution is either a finite mixture of Gaussians Gimenez et al. (2019) or has a known Markov structure Bates et al. (2020). To address some of these complexities, Dai et al. (2022) proposed an FDR control method based on multiple data splitting (MDS) and combining different test statistics from multiple splits to construct the estimate of the FDR. Due to the name, we further call this method "MDS".MDS can theoretically control the FDR while avoiding the limitations of traditional approaches that rely on assumptions about the distribution of the data. It has been shown to perform well in simulations and real data analyses, and it represents a promising avenue for future research in the field of high-dimensional statistical inference. The method works well in general, however, is computationally intensive due to fitting a prediction model multiple splits of an ultra-high dimensional dataset. Also, from our experience, this method is too conservative to discover any features in many practical settings, thus reducing the power of the method. To address these challenges, we present a novel p-value-free FDR controlling method for ultra- high dimensional datasets. The proposed method consists of two steps: screening and cleaning. In the screening step, we simply reduce the dimension of the feature space by eliminating some of the null features by utilizing a feature screening method with the sure screening property. Then in the cleaning step, we further clean out the null features from the selected set of screened features by constructing a p-value-free estimate of the FDR. We distinguish the null features from the set of relevant features by effectively utilizing the adaptive penalization on the repeatedly perturbed lasso. The idea of screening and cleaning is not new in statistics literature and was first used in Wasserman and Roeder (2009). Our method is easy to implement and does not require any tuning parameters, making it suitable for a wide range of applications. Under some mild regularity assumptions, we study the theoretical properties of the proposed method. For an empirical evaluation, we demonstrate that our method outperforms existing methods in terms of both FDR control and variable selection accuracy. 
We also provide insights into the behavior of our method under different scenarios and show that it can handle various types of data and noise structures. Additionally, we apply our method to a real-world gene expression dataset for several drugs, where we demonstrate its practical usefulness and interpretability. In summary, our proposed method provides a promising solution for controlled variable selection in ultrahigh dimensional problems. We believe that our method has significant practical applications in fields such as genomics, imaging, and natural language processing, where the number of features can be orders of magnitude larger than the number of samples. The remainder of this paper is organized as follows. In Section 2.2, we introduce our p-value-free FDR controlling method and describe its implementation details. Next, in Section 2.3, we study the asymptotic properties of the proposed method and show its FDR control guarantee. In Section 2.4, we present simulation studies to evaluate the performance of our method and compare it with existing methods. In Section 2.5, we apply our method to two real-world datasets and demonstrate its practical usefulness. Finally, we conclude the paper with a discussion and future research directions in Section 2.6.

2.2 Methodology
2.2.1 Model setup and assumptions
In the context of a supervised learning regression framework, we have n independent and identically distributed (i.i.d.) copies of (Y_i, X_i), i = 1, 2, . . . , n, where Y is the continuous response variable and X is the set of p continuous covariates, denoted by X = (X^{(1)}, X^{(2)}, . . . , X^{(p)}). We consider the ultrahigh dimensional setting, allowing p = p_n → ∞ as n → ∞. For any square matrix C, let φ(C) and Φ(C) denote the smallest and largest eigenvalues of C. Also, if k is an integer, define φ_n(k) = min_{M:|M|=k} φ((1/n) X_M' X_M) and Φ_n(k) = max_{M:|M|=k} Φ((1/n) X_M' X_M). In the following sections, we assume the linear model and the basic key assumptions below:

A1 $Y_i = X_i'\beta^* + \epsilon_i$, where the $\epsilon_i$ are independent and follow $N(0, \sigma^2)$.
A2 The design matrix $X \in \mathbb{R}^{n \times p_n}$, allowing $p_n \to \infty$ as $n \to \infty$ with $p_n \le c_1 e^{n^{c_2}}$ for some $c_1 > 0$ and $0 \le c_2 < 1$.
A3 $S = \{j : \beta_j^* \ne 0\}$ with $s = |S| = O(1)$ and $\psi = \min\{|\beta_j^*| : j \in S\} > 0$; the coefficients $\{\beta_j^*, j \in S\}$ are assumed to be fixed.
A4 For any weight vector $w \in (0, 1]^p$, there exists $\kappa > 0$ with
$$ P\left(\liminf_{n\to\infty}\; \min_{\Delta \in C} \frac{\Delta' \left(\tfrac{X'X}{n}\right) \Delta}{\|\Delta\|_2^2} \ge \kappa > 0\right) = 1, \quad \text{where } C = \left\{\Delta \in \mathbb{R}^p : \|(2W_{S^c} - I_{p-s})\Delta_{S^c}\|_1 \le \|(2W_S + I_s)\Delta_S\|_1\right\}. $$
A5 There exist positive constants $c_2$, $c_3$ and $c_4$ such that $P(\limsup_{n\to\infty} \Phi_n(n) \le c_2) = 1$, $P(\liminf_{n\to\infty} \varphi_n(c_3 \log n) \ge c_4) = 1$, and $P(\varphi_n(n) > 0) = 1$ for all $n$.
A6 We consider standardized covariates: $E(X_{ij}) = 0$ and $E(X_{ij})^2 = 1$. Also, there exists a constant $B \in (0, \infty)$ such that $P(|X_{ij}| < B) = 1$.

These assumptions can be relaxed at the expense of more intricate proofs. Our objective is to learn the sparsity structure by estimating the true index set S through the selection of a feature set D̂_n that ensures control of the associated false discovery rate (FDR) under a predefined threshold q. Specifically, we aim for $FDR = E\!\left(\frac{|\hat{D}_n \cap S^c|}{|\hat{D}_n|}\right) < q$. Additionally, while maintaining FDR control, we strive to maximize the $Power = E\!\left(\frac{|\hat{D}_n \cap S|}{|S|}\right)$ in order to strike a stable trade-off between type-I and type-II errors, thereby enhancing the overall performance of the proposed methodology.
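Since the empirical FDR and power defined above are the yardsticks used throughout the chapter, the following small helper (ours, used in the later simulation sketches) computes the false discovery proportion and power of any selected set against a known truth.

```python
def fdp_and_power(selected, true_support):
    """Empirical FDP = |selected \\ S| / max(|selected|, 1) and
    power = |selected ∩ S| / |S| for one replication."""
    sel, S = set(selected), set(true_support)
    true_pos = len(sel & S)
    fdp = (len(sel) - true_pos) / max(len(sel), 1)
    power = true_pos / len(S)
    return fdp, power
```

Averaging these two quantities over replications gives the Monte Carlo estimates of FDR and power reported in Section 2.4.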
Next, we provide a detailed description of the multiple steps involved in our proposed methodology.

2.2.2 Screening step
Assuming that the cardinality of the set S is much smaller than the dimension of the feature space p, the majority of the features lie in the complement of S_0, denoted S_0^c. Hence, during the screening step, our primary focus is on identifying an active set Ŝ_n with a much smaller cardinality than p, such that the probability of S being a subset of Ŝ_n approaches 1 as n approaches infinity. This is known as the sure screening property, first introduced in Fan and Lv (2008), which guarantees that all relevant predictors are retained in Ŝ_n. The remaining predictors, denoted X_j, j ∈ Ŝ_n^c, are removed from further analysis. This is purely a dimension reduction step. We proceed as follows (a code sketch is given at the end of this subsection):
• First, we randomly split the data into two groups D_1 and D_2, with n_1 and n_2 observations respectively. In D_1, we fit the lasso model with tuning parameter λ ∈ Λ, where Λ is some index set, and obtain
$$ \hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p_n}} \; \frac{1}{2n_1}\| Y_{D_1} - X_{D_1}\beta \|_2^2 + \lambda \|\beta\|_1. \tag{2.1} $$
We further denote the set of selected variables as a function of the tuning parameter λ: S̃_n(λ) = {j : |β̂_j(λ)| ≠ 0}.
• In D_2, we minimize the squared error loss to obtain the optimum value of the tuning parameter λ*, i.e., λ* = argmin_{λ∈Λ} L̂(λ), where L̂(λ) = (1/n_2) Σ_{i∈D_2} (Y_i − X_i'β̂(λ))².
• Therefore our active set Ŝ_n is the set of variables corresponding to the cross-validated λ, i.e., Ŝ_n = S̃_n(λ*).
By assumptions (A1)-(A6) in Section 2.2.1, it can be shown as in Wasserman and Roeder (2009) that this Ŝ_n enjoys the sure screening property, i.e., P(Ŝ_n ⊃ S) → 1 as n → ∞. Hence, although the dimension is reduced, some null features may still be retained in the active set Ŝ_n, and we can clean them out further. This is our goal in the next step.
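As referenced above, here is a minimal sketch of the screening step using scikit-learn's lasso path; the split proportion, the λ grid, and the function name are illustrative choices, not prescriptions from the theory.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def screening_step(X, y, n1=None, rng=None):
    """Split the data, fit the lasso path on D1 (eq. 2.1), pick lambda* by
    prediction error on D2, and return the active set S_hat (a sketch)."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    n1 = n1 or n // 2
    idx = rng.permutation(n)
    d1, d2 = idx[:n1], idx[n1:]
    # center with D1 statistics so the intercept can be ignored
    xm, ym = X[d1].mean(axis=0), y[d1].mean()
    X1, y1 = X[d1] - xm, y[d1] - ym
    X2, y2 = X[d2] - xm, y[d2] - ym
    lambdas, coefs, _ = lasso_path(X1, y1, n_alphas=100)       # decreasing lambda grid
    losses = ((y2[:, None] - X2 @ coefs) ** 2).mean(axis=0)    # L_hat(lambda) on D2
    best = int(np.argmin(losses))
    S_hat = np.flatnonzero(coefs[:, best] != 0)
    return S_hat, lambdas[best]
```

The returned S_hat plays the role of Ŝ_n: it is typically much smaller than p but, by the sure screening property, still contains S with high probability.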
In order to achieve this, we consider the ordered sequence of tuning parameters λ1 ≤ λ2 ≤ · · · ≤ λr ≤ λr +1 ≤ · · · ≤ λ2r . To understand q l og (p) the behavior of lasso, we set the smallest r λ-values (i.e. λ1 , λ2 , . . . , λr ) in order of n . Now in order to understand the behavior of the weighted lasso, we set the largest r λ-values in a p much higher level o( n γ ) for γ > 0. Specifically, in practice, to select our sequence of tuning parameters λ to further calculate the importance score, we proceed as follows: 1. First, a sequence of r − 1 values for λ1 , . . . , λr −1 is generated in a logarithmic scale; such as q l og (p) r −i λi = 2 n ∗ φ r , ∀i = {1, 2, . . . , r − 1}. 2. Then, we set the λr at λr = max |X j′ y|/n. j ∈{1,2,...,p} 3. Next, for the last and largest r values of the sequence, we start with setting λ2r = max |X j′ y|/n + c̃n γ , where c̃ is a small constant, typically set as c̃ = 0.0001 and j ∈{1,2,...,p} γ < 12 . 4. Finally, we generate the remaining r − 1 largest λ-values (i.e. λr +1 , . . . , λ2r ) using a similar r −i logarithmic scale such as: λr +i = λ2r ∗ φ r , ∀i = {1, 2, . . . , r − 1}. 16 The value of φ is typically set at 0.001, and r is set to 100. Despite its apparent complexity, this construction of the sequence of penalty parameters directly emerges from the asymptotic analysis of the weighted lasso along its regularization path, as we have extensively discussed in Section 2.3. Using this λ-sequence, we bound the unweighted importance scores below 1; i.e. I juw < 1, ∀ j ∈ {1, 2, . . . , p}. The underlying idea is that the important predictors should survive longer even for the higher value of the tuning parameter λ. This importance score has great potential in identifying the association strength of the covariates with the response by utilizing adaptive penalization. Suppose, the importance score for the j-th variable obtained from the unweighted lasso (i.e. taking w j = 1, ∀ j ) is denoted by I juw . Then for a typical simulated dataset, the following Figure 2.2.3 demonstrates the behavior of weighted and unweighted importance scores. We can particularly note that the weighted importance scores for true non-null variables are significantly greater than the whole bootstrap distribution of the unweighted importance scores. On the other hand, the weighted importance score for the true null variable is comparable to their unweighted counterparts. This disparity between the unweighted and weighted version of the importance score makes it apt for replacing the p-values in the FDR estimation in this cleaning step. we proceed as follows. 1. We want to select the variables with high-importance scores. So, for any cutoff ∆, we define, ³ ´ N+ (∆) =number of total discovered variables = j ∈Ŝ n 1 I jw ≥ ∆ and P ³ ´ e 0 (∆) = number of falsely discovered variables = j ∈Ŝ n 1 I jw ≥ ∆, j ̸∈ S P 2. While the random variable N+ (∆) is observable, we need to estimate somehow the unob- served e 0 (∆) by ê 0 (∆) and thus we estimate the False Positive Rate (FPR) as ³ ´ uw I ∆ P ê 0 (∆) j ∈Ŝ n 1 j ≥ F Pˆ R(∆) = = P ³ ´ N+ (∆) 1 I w ≥ ∆ j ∈Ŝ n j 3. 
To select the final set of discoveries, we proceed as follows (a code sketch is given at the end of this subsection):
1. We want to select the variables with high importance scores. For any cutoff ∆, define
N_+(∆) = number of discovered variables = Σ_{j∈Ŝ_n} 1(I_j^w ≥ ∆), and
e_0(∆) = number of falsely discovered variables = Σ_{j∈Ŝ_n} 1(I_j^w ≥ ∆, j ∉ S).
2. While the random variable N_+(∆) is observable, the unobserved e_0(∆) has to be estimated by some ê_0(∆); we estimate the False Positive Rate (FPR) as
$$ \widehat{FPR}(\Delta) = \frac{\hat{e}_0(\Delta)}{N_+(\Delta)} = \frac{\sum_{j \in \hat{S}_n} 1\left(I_j^{uw} \ge \Delta\right)}{\sum_{j \in \hat{S}_n} 1\left(I_j^{w} \ge \Delta\right)}. $$
3. We calculate the optimum cutoff ∆*, for which the estimated FPR is controlled at a pre-specified threshold q:
$$ \Delta^* = \min\left\{ \Delta > 0 : 0 < \frac{\hat{e}_0(\Delta)}{N_+(\Delta)} \le q \right\}. \tag{2.5} $$
The minimum is taken over ∆ > 0 ranging over the set {I_1^w, I_2^w, . . . , I_p^w}.
4. Finally, the selected set of variables is
$$ \hat{D}_n = \left\{ j \in \hat{S}_n : I_j^{w} \ge \Delta^* \right\}. \tag{2.6} $$
Intuitively, the weighted and unweighted importance scores behave similarly for the null variables, and the proposed method exploits this characteristic to identify the null features. In the next section, we show theoretically that, under mild assumptions, the proposed procedure controls the FDR below the user-specified threshold q.

2.2.4 Obtaining appropriate weights
The proposed method relies on obtaining suitable weights: smaller values are assigned to the relevant predictors in S to lighten their penalty, while larger values are assigned to the null features in S^c to increase their penalization. To this end, we introduce a perturbation bootstrap approach in this section. The perturbation bootstrap is a resampling technique used to estimate the sampling variability and uncertainty associated with a statistical model or estimator by perturbing the observed data. It generates new datasets by introducing small perturbations to the original data, allowing the stability and robustness of the statistical analysis to be assessed. A more detailed review of the perturbation bootstrap can be found elsewhere, e.g., Minnier et al. (2011); Das and Lahiri (2019). In our setting, we define the perturbation bootstrap as
$$ \hat{\beta}^b(\lambda) = \operatorname*{argmin}_{\beta \in \mathbb{R}^{|\hat{S}_n|}} \; \frac{1}{2n}\sum_{i=1}^{n} u_i \left(Y_i - X_{\hat{S}_n,i}'\beta_{\hat{S}_n}\right)^2 + \lambda \|\beta_{\hat{S}_n}\|_1, \tag{2.7} $$
where {u_1, u_2, . . . , u_n} is a set of positive i.i.d. samples generated from a bounded distribution such as Uniform(1, 2). These random samples stochastically perturb the objective function, and repeating the perturbation a large number of times helps in recognizing patterns in the solution and assessing the underlying uncertainty. To generate the weights w_j, j ∈ Ŝ_n, we set the tuning parameter at the level λ_n^w = o(√(log(p)/n)). We randomly generate the i.i.d. samples U_b* = {u_{1b}, u_{2b}, . . . , u_{nb}} from a non-degenerate bounded positive-valued distribution and repeat the process B times to generate B sets {U_b*}_{b=1}^B. Let β̂^b(λ_n^w) be the perturbed lasso estimator (2.7) perturbed by U_b* with tuning parameter λ_n^w. In Section 2.3, we demonstrate that this perturbed lasso enjoys an asymptotically vanishing L_2 error ||β̂^b(λ) − β*||_2. Finally, we calculate the weight w_j for the variable X_j, j ∈ Ŝ_n, as
$$ w_j = \frac{1}{B}\sum_{b=1}^{B} 1\left(|\hat{\beta}^b_j(\lambda_n^w)| < \tau\right) + \frac{\tilde{c}}{n^{1/2-\gamma}}, \tag{2.8} $$
for some pre-specified small thresholds c̃, τ > 0 and γ < 1/2 as mentioned in Section 2.2.3.
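Putting the pieces of the cleaning step together, the sketch below first builds the weights via the perturbation bootstrap of (2.7)-(2.8) (with Uniform(1,2) perturbations folded into the lasso through row rescaling, one of several equivalent ways to implement a perturbed quadratic loss), then computes the estimated FPR, the cutoff ∆* of (2.5), and the selected set D̂_n of (2.6). It reuses lambda_sequence and importance_scores from the previous sketch; the value of λ_n^w and all function names are illustrative, not part of the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def perturbation_weights(X, y, B=100, tau=1e-4, c_tilde=1e-4, gamma=0.25, rng=None):
    """Weights (2.8): fraction of bootstrap fits in which |beta^b_j| < tau,
    plus the small offset c_tilde / n^(1/2 - gamma)."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    lam_w = 0.5 * np.sqrt(np.log(p) / n)          # level o(sqrt(log p / n)); constant is illustrative
    drop_count = np.zeros(p)
    for _ in range(B):
        root_u = np.sqrt(rng.uniform(1.0, 2.0, size=n))   # bounded positive perturbations
        beta_b = Lasso(alpha=lam_w, fit_intercept=False,
                       max_iter=5000).fit(X * root_u[:, None], y * root_u).coef_
        drop_count += np.abs(beta_b) < tau
    return drop_count / B + c_tilde / n ** (0.5 - gamma)

def clean_select(X, y, q=0.1, r=100):
    """Cleaning step: importance scores, estimated FPR, cutoff Delta* (2.5), D_hat (2.6)."""
    lambdas = lambda_sequence(X, y, r=r)
    w = perturbation_weights(X, y)
    I_w, I_uw = importance_scores(X, y, w, lambdas)
    for delta in np.sort(np.unique(I_w[I_w > 0])):        # candidate cutoffs from {I_j^w}
        n_plus = np.sum(I_w >= delta)
        e0_hat = np.sum(I_uw >= delta)
        if n_plus > 0 and 0 < e0_hat / n_plus <= q:
            return np.flatnonzero(I_w >= delta)           # smallest feasible cutoff Delta*
    return np.array([], dtype=int)
```

Here X and y denote the screened data (Y, X_{Ŝ_n}), so the indices returned by clean_select refer to positions within the active set Ŝ_n.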
By selecting a suitably small value for τ, the consistency of the perturbed lasso guarantees that the weights w_j, j ∈ Ŝ_n ∩ S, approach zero, implying essentially no penalization for the true features, whereas the null weights w_j, j ∈ Ŝ_n ∩ S^c, tend to 1, indicating heavier penalization. However, to maintain their rate of convergence, we add the small decreasing sequence c̃/n^{1/2−γ}. We discuss and prove this in detail in Section 2.3.

2.3 Theoretical properties
In this section, we establish the asymptotic properties of the proposed method, including the guarantee of FDR control. For ease of exposition, we divide the theoretical study into the following parts: Section 2.3.1 focuses on the asymptotic properties of the weights, while Section 2.3.2 provides a detailed analysis of the weighted lasso. We examine the differences between the weighted and unweighted lasso for an asymptotically increasing sequence of tuning parameters, which leads to the attainment of FDR control. Additionally, our analysis demonstrates that the method achieves asymptotic power approaching unity. Moreover, in accordance with the assumptions stated in Section 2.2.1, our screening step directly utilizes the framework proposed in Wasserman and Roeder (2009), which has been proven to satisfy the sure screening property, P(Ŝ_n ⊃ S) → 1 as n → ∞. Therefore, conditional on the screening step, our focus shifts to the analysis of the FDR and power of the proposed method on the reduced-dimensional data (Y, X_{Ŝ_n}). For notational simplicity, throughout this section the symbol X refers to the active features X_{Ŝ_n} obtained from the screening step unless stated otherwise; consequently, p̃ = |Ŝ_n| and β = β_{Ŝ_n}.

2.3.1 Asymptotic properties of the weights
We start from the perturbed lasso estimator defined in (2.7):
$$ \hat{\beta}^b(\lambda) \in \operatorname*{argmin}_{\beta \in \mathbb{R}^{\tilde{p}}} \; \frac{1}{2n}\sum_{i=1}^{n} u_i (Y_i - X_i'\beta)^2 + \lambda\|\beta\|_1 \;\in\; \operatorname*{argmin}_{\beta \in \mathbb{R}^{\tilde{p}}} \; \frac{1}{2n}\|\tilde{U}_b Y - \tilde{U}_b X\beta\|_2^2 + \lambda\|\beta\|_1, $$
where Ũ_b = diag{U_b*} = diag{u_{1b}, u_{2b}, . . . , u_{nb}}.

Lemma 2.3.1 (L_2 error bound of the perturbed lasso). Under the assumptions stated in Section 2.2.1 and with bounded perturbation variables U_b* ∈ [c_L, c_U]^n, the L_2 estimation error of the perturbed lasso converges to zero; i.e.,
$$ \|\hat{\beta}^b - \beta\|_2 \le o\!\left(\sqrt{\frac{\log \tilde{p}}{n}}\right). \tag{2.9} $$
Hence, continuing from equation 2.3.1, cL κ b λh i ||β̂ − β∗ ||22 ≤ ||3(β̂bS − β∗S )||1 − ||β̂bSc ||1 2 2 λh i ≤ ||3(β̂bS − β∗S )||1 2 (2.13) λh p b ∗ i ≤ 3 s||β̂S − βS )||2 2 λh p i ≤ 3 s||β̂b − β∗ )||2 2 p 3λ s Hence, ||β̂b − β∗ ||2 ≤ cL κ , when λ ≥ 2|| n1 ϵ′Ũ ′Ũ X ||∞ . Now we note that as by assumption A6, 1 P (|X i j | < B ) = 1, =⇒ ||Ũ X j ||22 ≤ cU 2 2 B (2.14) n 21 ³ 2 ´ Now, let Z j = n1 ϵ′Ũ X j ∼ N 0, σn ||Ũ X j ||22 , j = 1, 2, . . . , p̃, conditional on Ũ and X . Then || n1 ϵ′Ũ X ||∞ = max1≤ j ≤p̃ |Z j |. This implies, p̃ p̃ 2 2 t − t n2 2 µ ¶ µ ¶ 2 ′ X X 8||Ũ X j || σ P ||ϵ Ũ X ||∞ ≥ t ≤ P |Z j | > ≤ 2e 2 n j =1 2 j =1 t 2 n2 − 8σ2 c 2 B 2 n ≤ 2p̃e U (2.15) 2 − 2t 2n 2 8σ c B = 2p̃e U nδ2 = 2e − 8 →0 µq ¶ 8 log(p̃) by setting t = cU B σ n +δ Extending this result, we can show that with probability tending to 1, p s  3λ s l og (p̃)  ||β̂b − β∗ ||2 ≤ if, λ ≥ o  (2.16) cL κ n µq ¶ l og (p̃) Lemma 2.3.1 establishes that with λ = o n , the L 2 error bound ||β̂b −β∗ ||2 ≤ o P (1).Next, we utilize this result to show that our proposed weights mentioned in Section 2.2.4 achieve the desired properties. First, we observe that by the error bound in Lemma 2.3.1, for any j ∈ S, we can write, |β∗j | = |β∗j − β̂bj + β̂bj | ≤ |β̂bj | + |β̂bj − β∗j | and that implies, |β̂bj | ≥ |β∗j | − |β̂bj − β∗j | ≥ ψ + o P (1). Conse- q b w l og (p) quently, for any j ∈ S, |β̂ j | ≤ o P (1). Hence, with sufficiently small τ < ψ and λn = o( n ), ³ ´ ³ ´ P P 1 |β̂bj (λnw )| < τ → 0. Similarly, for j ∈ Sc , 1 |β̂bj (λnw )| < τ → 1, as n → ∞. For 0 < γ < 21 , we recall, 1 XB ³ ´ c̃ b w wj = 1 |β̂ j (λn )| < τ + 1 (2.17) B b=1 n 2 −γ ³ ´ 1 PB b w Lets denote, w̃ j = B b=1 1 | β̂ j (λn )| < τ . by Markov inequality, we see that, for any α > 0 and 22 j ∈ S, ³ ´ 2 E (w̃ j ) P |β̂bj (λnw )| < τ − n(δ 8+α) P (w̃ j > 2e )< 2 = n(δ2 +α) − n(δ 8+α) 2e 2e − 8 nδ2 2e − 8 ≤ n(δ2 +α) 2e − 8 µ ¶ n(δ2 +α) 1 for δ > 0. Hence, w j = o e − 8 + 1 for j ∈ S. n 2 −γ For j ̸∈ S, for any α > 0, similarly as above w j → 1 as n → ∞ E (1 − w̃ j ) µ ¶ µ ¶ n(δ2 +α) n(δ2 +α) P w̃ j < 1 − 2e − 8 = P 1 − w̃ j > 2e − 8 ≤ n(δ2 +α) 2e − 8 nδ2 (2.18) 2e − 8 ≤ n(δ2 +α) as n → ∞ 2e − 8 µ ¶ n(δ2 +α) 1 Hence, |1−w j | < o e − 8 + 1 for j ̸∈ S. Consequently, the proposed weights offer valuable n 2 −γ proxy information regarding β . They tend to approach zero for relevant features in S, indicating ∗ their amplified significance, while converging towards 1 for null features in S c . This capability to discern between null and non-null features renders them suitable for utilization in adaptive penalization strategies. 2.3.2 Asymptotic FDR control and power analysis We start by expressing the FPR in the following way: Pp ³ ´ P ³ ´ P ³ ´ P ³ ´ w w p uw w j =1 1 I j ≥ ∆ ∗ , j ∈ ̸ S j ∈ ̸ S 1 I j ≥ ∆ ∗ j =1 1 I j ≥ ∆ ∗ j ∈ ̸ S 1 I j ≥ ∆ ∗ FPR = Pp ³ ´ =P ³ ´= P ³ ´ ·P ³ ´ p p p j =1 1 I w j ≥ ∆ ∗ j =1 1 I j w ≥ ∆ ∗ j =1 1 I w j ≥ ∆ ∗ j =1 1 I uw j ≥ ∆ ∗ ≤ q · R(∆∗ ) ³ ´ (2.19) w j ̸∈S 1 I j ≥∆ P where R(∆) = Pp ³ ´. The last inequality holds by the construction of ∆∗ in equation 3. j =1 1 I juw ≥∆ Now in order to show the FDR control, we only need to show that, lim E R(∆∗ ) ≤ 1 ¡ ¢ n→∞ Now, decomposing the E (R(∆∗ )) we get  ³ ´  w I ∆ ∗ P j ̸∈S 1 j ≥ E R(∆∗ ) = E  P ¡ ¢ ³ ´ P ³ ´ j ∈S 1 I j ≥ ∆ + j ̸∈S 1 I j ≥ ∆ uw ∗ uw ∗ 23 Following Theorem 2.3.2 first establishes the L 2 error bound for the weighted lasso in equation 2.2.3. The proof is relegated to Appendix B. Theorem 2.3.2 (L 2 error bound of weighted Lasso). 
Assume that the weights corresponding to the true features are among the first m elements of the ordered list of weights defined as w or d er = {w (1) , w (2) , . . . , w (p̃) } and let’s denote the set of m ordered weights as w T = {w ( j ) , 1 ≤ j ≤ m}. Under the basic assumptions mentioned in Section 2.2.1 and with positive constants c and c ′ , the weighted lasso maintains the following L 2 error bound p  s  λ m l og (p̃) 1  ||β̂w (λ) − β∗ ||2 ≤ ||w T ||2 , where λ ≥ cσ  p + (2.20) κ n||w T ||2 n w (m+1) q q ||w T ||2 l og (p̃) m when w (m+1) n + n = o(1). This further implies ||β̂w − β∗ ||∞ ≤ o P (1) Theorem 2.3.2 demonstrates how the proper utilization of weights in adaptive penalization improves the accuracy. We can retrieve the traditional L 2 bound for unweighted lasso by setting w j = 1, ∀1 ≤ j ≤ p̃. Next, in the following lemma, we demonstrate some key characteristics for our proposed weighting scheme in Section 2.2.4. Lemma 2.3.3 (Key characteristics for our proposed weighting scheme in Section 2.2.4). Suppose the weights are generated using our proposed weighting scheme in Section 2.2.4. Then, with high probability, (1) The w S = {w j , j ∈ S} belongs to to the first s elements in the ordered list w or d er ; 1 i.e.,P (|m − s| > ϵ) → 0, and (2) P (w̃ mi n < 1 − 1 )→0 n 2 +γ Proof. We prove these statements one by one. Claim 1: The w S belongs to to the first s elements in the ordered list w or d er ; i.e.,P (|m −s| > ϵ) → 0 Proof of Claim 1: This is true as by lemma 2.3.1, for any α > 0 n(δ2 +α) 1 n(δ2 +α) 1 nα P (sup w j > e − P (w j > e − ) ≤ 2se − X 8 + 1 )≤ 8 + 1 8 and j ∈S n 2 −γ j ∈S n 2 −γ n(δ2 +α) 1 n(δ2 +α) 1 nα P ( infc w j < (1 − e − P (w j < 1 − e − ) ≤ 2p̃e − X 8 + 1 )) ≤ 8 + 1 8 j ∈S n 2 −γ j ∈S c n 2 −γ After the screening, the dimension of the active set |Ŝ n | = p̃ = o(n 3 ) Wasserman and Roeder (2009) and by our assumption in Section 2.2.1, s = O(1); hence, both the probabilities tend to 24 zero with a high convergence rate. The weight w S corresponds to the first s elements in the ordered list w or d er , denoted as w (1) , w (2) , . . . , w (p̃) . Consequently, for any ϵ > 0, P (|m − s| > ϵ) ≤ n(δ2 +α) n(δ2 +α) nα 1 1 P (sup w j > e − 8 + 1 ) + P ( infc w j < (1 − e − 8 + 1 )) ≤ 4p̃e − 8 → 0. j ∈S n 2 −γ j ∈S n 2 −γ 1 1 Claim 2: Define w̃ mi n = min w ( j ) . Then, P (w̃ mi n < 1− 1 ) → 0 and P (||w (1:m) ||2 > 1 )→ j =m+1,...,p̃ n 2 +γ n 2 −γ 0. Proof of Claim 2: The proof directly follows from the proof of claim 1. First, we note that n(δ2 +α) 1 n(δ2 +α) 1 P (w̃ (m+1) < 1 − e − 8 + 1 ) = P (w (s+1) < 1 − e − 8 + 1 ) + P (m > s) n 2 −γ n 2 −γ n(δ2 +α) 1 ≤ P ( infc w j < (1 − e − 8 + 1 )) + P (m > s) j ∈S n 2 −γ nα ≤ 6p̃e − 8 Therefore, these weights fulfill all the assumptions of theorem 2.3.2, making the bound directly applicable to the proposed weighting scheme. Particularly, we note that, n(δ2 +α) 1 e− + s s 8 1 ||w T ||2 l og (p̃) n 2 −γ l og (p̃) ≤ 2 = o(1) w (m+1) n − n(δ 8+α) 1 n 1−e + 1 n 2 −γ Next, in the following lemma 2.3.4, we discuss the order of the sequence of tuning parameters λ1 < λ2 , . . . , λ2r which we specifically designed in Section 2.2.3 for the importance score. Addi- tionally, we show the order of the penalty level λmax after which the unweighted lasso returns an empty model with all estimated coefficients β̃uw (λmax ) = 0 Lemma 2.3.4 (Order of the designed tuning parameters). 
Among the sequence of tuning parameters considered, which has length $2r$, the first $r$ (smallest) tuning parameters are of order $o\big(\sqrt{\log(\tilde{p})/n}\big)$, while the last $r$ (largest) tuning parameters are $o(n^\gamma)$ for $\gamma < \frac{1}{2}$. Additionally, $\lambda_{max} < o(n^\gamma)$, where $\lambda_{max} = \inf\{\lambda : \max_{j\in\{1,2,\ldots,\tilde{p}\}} |\tilde{\beta}_j^{uw}(\lambda)| = 0\}$.

Proof. By construction, the first $r$ smallest tuning parameters are of order $o\big(\sqrt{\log(\tilde{p})/n}\big)$. On the other hand, the sequence of the $r$ largest tuning parameters exceeds the order $\frac{1}{n^{\frac{1}{2}-\gamma}}$. Hence,

$$\frac{\sqrt{m}}{\sqrt{n}\,\|w_T\|_2} \;\le\; \frac{\sqrt{m}}{\sqrt{n}\cdot\frac{1}{n^{\frac{1}{2}-\gamma}}} \;=\; o(n^{\gamma}),$$

and our chosen $\lambda$-sequence is applicable for the $L_2$ bound. Next, to study $\lambda_{max}$, we first note that it is the penalty level at which the first variable enters the lasso model along the regularization path. Hence,

$$\lambda_{max} = \max_{j\in\{1,2,\ldots,\tilde{p}\}} |X_j'y|/n \le \frac{1}{n}\max_{j} \|X_j\|_2\|y\|_2 \le \frac{B\sqrt{n}}{n}\|y\|_2 = \frac{B}{\sqrt{n}}\|y\|_2. \qquad (2.21)$$

Now, $\|y\|_2^2 = \sum_{i=1}^n y_i^2$, where $y_i \sim N(X_i'\beta^*, \sigma^2)$ independently. Hence, $\|y\|_2^2 = \sigma^2\sum_{i=1}^n u_i^2$, where $u_i^2 = y_i^2/\sigma^2 \sim \chi^2_1\big((X_i'\beta^*)^2/\sigma^2\big)$, which implies

$$E\|y\|_2^2 = \sigma^2\sum_{i=1}^n\Big(1 + (X_i'\beta^*)^2\Big) \le \sigma^2\sum_{i=1}^n\Big(1 + \big(\textstyle\sum_{j\in S} B\beta^*_j\big)^2\Big) \;\Rightarrow\; \frac{E\|y\|_2^2}{n^{1+\gamma}} \to 0 \;\Rightarrow\; \frac{E\|y\|_2}{n^{\frac{1+\gamma}{2}}} \to 0 \qquad (2.22)$$

for any $\frac{1}{2} > \gamma > 0$. Hence, by Markov's inequality, $P\big(\|y\|_2 > n^{\frac{1+\gamma}{2}}\big) \le \frac{E\|y\|_2}{n^{\frac{1+\gamma}{2}}} \to 0$. This implies $\lambda_{max} \le \frac{B}{\sqrt{n}}\|y\|_2 \le B\, n^{\frac{\gamma}{2}} \le \lambda_{r+1}$ with probability tending to one.

Lemma 2.3.4 highlights several crucial characteristics. First, it establishes the asymptotic ordering of our carefully designed sequence of penalty parameters, i.e., $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_r \le \lambda_{r+1} \le \cdots \le \lambda_{2r}$. Second, it guarantees that the unweighted importance scores $I_j^{uw}$ always remain strictly below 1, since our chosen sequence of tuning parameters mostly exceeds $\lambda_{max}$. It also demonstrates that, by construction, $\lambda_{r+1} \le \cdots \le \lambda_{2r}$ exceed the order $\frac{\sqrt{s}}{\sqrt{n}\,\|w_S\|_2}$. The following Lemma 2.3.5 demonstrates how the error bound in Theorem 2.3.2 further characterizes the behavior of the null importance scores defined in 2.4. It shows that, for the null features, the weighted and unweighted importance scores become asymptotically similar.

Lemma 2.3.5 (Asymptotic behaviour of the importance scores). For any cutoff $\Delta > 0$, as $n \to \infty$,

$$P\left(\sum_{j\in S^c}\big|1\big(I_j^w \ge \Delta\big) - 1\big(I_j^{uw} \ge \Delta\big)\big| > \epsilon\right) \to 0. \qquad (2.23)$$

Proof. For any $j \in S^c$, $|\hat{\beta}_j^w| \le o_P(1)$ for the whole sequence of tuning parameters $\lambda_1, \ldots, \lambda_{2r}$. Hence, for any $\Delta > 0$,

$$P\big(I_j^w > \Delta\big) = P\left(\sum_{i=1}^{2r} 1\big(|\hat{\beta}_j^w(\lambda_i)| > \tau\big) > \Delta\right) \le \frac{\sum_{i=1}^{2r} P\big(|\hat{\beta}_j^w(\lambda_i)| > \tau\big)}{\Delta} \le \frac{2r\, e^{-c_5 n}}{\Delta}$$

for some $c_5 > 0$. Similarly,

$$P\big(I_j^{uw} > \Delta\big) = P\left(\sum_{i=1}^{2r} 1\big(|\hat{\beta}_j^{uw}(\lambda_i)| > \tau\big) > \Delta\right) \le P\left(\sum_{i=1}^{r} 1\big(|\hat{\beta}_j^{uw}(\lambda_i)| > \tau\big) > \frac{\Delta}{2}\right) + P\left(\sum_{i=r+1}^{2r} 1\big(|\hat{\beta}_j^{uw}(\lambda_i)| > \tau\big) > \frac{\Delta}{2}\right) \le \frac{r\, e^{-c_5 n}}{\Delta}.$$

So, with a fixed $\Delta \in (0,1)$, for $j \in S^c$ we have $1\big(I_j^w \ge \Delta\big) \xrightarrow{P} 0$ and $1\big(I_j^{uw} \ge \Delta\big) \xrightarrow{P} 0$. Hence, for a null feature, the weighted and unweighted importance scores behave similarly. Additionally, due to the monotonicity with respect to $\Delta$, these convergences are uniform for $\Delta \in (0,1)$. So, for any $j \in S^c$, $\Delta \in (0,1)$ and $\epsilon > 0$,

$$P\Big(\big|1\big(I_j^w \ge \Delta\big) - 1\big(I_j^{uw} \ge \Delta\big)\big| > \epsilon\Big) \le P\Big(1\big(I_j^w \ge \Delta\big) + 1\big(I_j^{uw} \ge \Delta\big) > \epsilon\Big) \le P\Big(1\big(I_j^w \ge \Delta\big) > \frac{\epsilon}{2}\Big) + P\Big(1\big(I_j^{uw} \ge \Delta\big) > \frac{\epsilon}{2}\Big) \le \frac{6r\, e^{-c_5 n}}{\epsilon\Delta}.$$

Furthermore, for a sequence $u_n = o(\frac{1}{n})$,

$$P\left(\sum_{j\in S^c}\big|1\big(I_j^w \ge \Delta\big) - 1\big(I_j^{uw} \ge \Delta\big)\big| > u_n\right) \le \tilde{p}\,\max_{j\in S^c} P\Big(\big|1\big(I_j^w \ge \Delta\big) - 1\big(I_j^{uw} \ge \Delta\big)\big| > u_n\Big) \to 0,$$

since the basic convergences hold at an exponential rate in $n$ and $\tilde{p} = o(n^3)$.
Theorem 2.3.6 (Asymptotic FDR control guarantee of the proposed method). With $\Delta^*$ as the data-dependent optimum cutoff defined in eq. 3,

$$\lim_{n\to\infty} FDR = \lim_{n\to\infty} E\left[\frac{\sum_{j=1}^{p} 1\big(I_j^w \ge \Delta^*,\ j \notin S\big)}{\sum_{j=1}^{p} 1\big(I_j^w \ge \Delta^*\big)}\right] = 0.$$

Also, the asymptotic power with the optimum cutoff $\Delta^*$ approaches 1; i.e.,

$$\lim_{n\to\infty} \mathrm{Power} = \lim_{n\to\infty} E\left[\frac{\sum_{j=1}^{p} 1\big(I_j^w \ge \Delta^*,\ j \in S\big)}{|S|}\right] = 1.$$

The proof is relegated to Appendix A.

Figure 2.2 Correlation-wise comparison: the effect size and sparsity levels are fixed.

2.4 Simulation Studies

To evaluate the performance of the proposed variable selection method, we conducted a comprehensive simulation study. We simulated data under different scenarios, including varying correlation structures, effect sizes, sparsity, and noise levels. Specifically, we consider the following high-dimensional linear regression setup: $Y_{n\times 1} \sim N_n(X_{n\times p}\beta_{p\times 1}, \sigma^2 I_n)$ with $p = 1000$ and $n = 600$. The true index set $S$ is randomly generated maintaining $|S| = s$, with $\beta_{S^c} = 0$ and the values of $\beta_S$ drawn from $N(\beta_0, 0.1)$. For the design matrix, each $X_i$, $i = 1, 2, \ldots, n$, is randomly drawn from $N_p(0, \Sigma)$, where $\Sigma_{ij} = \rho^{|i-j|}$ denotes the covariance between the $i$-th and $j$-th features.

1. Correlation-wise comparison: we fix the effect size $|\beta_0| = 1$ and vary the correlation parameter $\rho \in \{0.2, 0.4, 0.6, 0.8\}$, keeping $s = 20$.

2. Effect-size-wise comparison: we fix $\rho$ at 0.5 and $s = 20$, and vary the effect size with $|\beta_0| \in \{0.6, 0.8, 1, 1.2\}$.

3. Sparsity-wise comparison: we fix $\rho$ at 0.5 and $\beta_0$ at 1, then gradually increase $s \in \{10, 20, 30, 40, 50\}$.

We compared the performance of our proposed method with several existing methods, including the Model-X knockoff (Candès et al. (2018)) and the multiple data splitting (MDS) approach (Dai et al. (2022)) for FDR control. As discussed in Section 2.1, the Model-X knockoff is a variable selection method that controls the FDR by constructing a set of "knockoff" variables that mimic the correlation structure of the original variables. The MDS, on the other hand, is a recently developed method based on random data splitting that assesses the stability of selected features by repeating the variable selection process on multiple random subsamples of the data. These two methods represent a wide class of feature selection models in the current literature, and hence we chose them as the competing methods for the empirical study.

Figure 2.3 Effect size-wise comparison: the autocorrelation and sparsity levels are fixed.

We evaluated the performance of these methods using the FDR as the primary metric, with a pre-specified FDR control level of q = 0.10. We also computed the true positive rate (TPR) to assess the power of the methods. We repeated each simulation scenario 100 times to ensure statistical reliability and calculated the mean and standard deviation of the FDR and power across replications. For the implementation of the knockoff procedure, we adopt a two-stage approach. First, we randomly split the data into two halves. On the first part, we apply the screening process of Wasserman and Roeder (2009), which removes most of the null variables while retaining the sure screening property; on the second part, we apply the knockoff procedure to clean out the remaining noise variables with FDR control. For the knockoff variables, we generate second-order Gaussian knockoffs using the estimated model parameters with the full semidefinite programming (SDP) construction. Despite its computational cost, the SDP construction is statistically preferable, yielding higher power than the alternative constructions of knockoff variables. As the test statistic for the knockoff approach, we considered the signed maximum statistic $W_j = \max(Z_j, \tilde{Z}_j)\cdot \mathrm{sign}(Z_j - \tilde{Z}_j)$, where $Z_j$ and $\tilde{Z}_j$ are the maximum values of $\lambda$ at which the $j$-th variable and its knockoff, respectively, enter the (generalized) linear model. To implement the MDS method, we use the R code provided with Dai et al. (2022).
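To make the knockoff comparison above concrete, the sketch below generates one simulated replicate, constructs second-order Gaussian knockoffs, and computes the signed maximum statistic and the knockoff+ selection threshold. It is a minimal, hedged illustration only: it uses the simpler equi-correlated choice of the diagonal matrix rather than the full SDP construction used in our experiments, omits the screening stage, and all variable names are illustrative rather than taken from our implementation.

```python
import numpy as np
from scipy.linalg import toeplitz, cholesky
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p, s, rho, q = 600, 1000, 20, 0.5, 0.10

# Simulated data as in Section 2.4: AR(1) (Toeplitz) feature covariance, sparse beta.
Sigma = toeplitz(rho ** np.arange(p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
S_true = rng.choice(p, size=s, replace=False)
beta[S_true] = rng.normal(1.0, np.sqrt(0.1), size=s)
y = X @ beta + rng.standard_normal(n)

# Second-order Gaussian knockoffs with the equi-correlated diagonal D = s_diag * I
# (a simplification of the SDP construction): X_ko | X ~ N(X(I - Sigma^{-1}D), 2D - D Sigma^{-1} D).
Sigma_inv = np.linalg.inv(Sigma)
s_diag = 0.999 * min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min())
D = s_diag * np.eye(p)
mean_ko = X - X @ Sigma_inv @ D
cov_ko = 2 * D - D @ Sigma_inv @ D
L = cholesky(cov_ko + 1e-10 * np.eye(p), lower=True)
X_ko = mean_ko + rng.standard_normal((n, p)) @ L.T

# Signed maximum statistic: Z_j / Z_j-tilde are the largest lambdas at which each
# column enters the lasso path fitted on the augmented design [X, X_ko].
alphas, coefs, _ = lasso_path(np.hstack([X, X_ko]), y)
entry = np.array([alphas[np.nonzero(c)[0]].max() if np.any(c) else 0.0 for c in coefs])
Z, Z_tilde = entry[:p], entry[p:]
W = np.maximum(Z, Z_tilde) * np.sign(Z - Z_tilde)

# Knockoff+ threshold at nominal level q, and the resulting selected set.
ts = np.sort(np.abs(W[W != 0]))
valid = [t for t in ts if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q]
tau = min(valid) if valid else np.inf
selected = np.where(W >= tau)[0]
```

At this problem size the lasso path over the 2p augmented columns may take a minute or two; the structure of the computation, however, is exactly the screen-free version of the comparison described above.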
Figures 2.2, 2.3, and 2.4 illustrate the power and FDR comparisons for the correlation-wise, effect-size-wise, and sparsity-wise experiments, respectively. We note that the performance of the knockoffs depends on how well the feature distribution is estimated when generating the knockoff variables. To separate out this effect, we consider the "Model-X knockoff - True" variant, in which the true feature distribution is used to generate the knockoffs; this is possible here because, in the simulation setup, we know the true data-generating distribution. For a more realistic comparison, we also show "Model-X knockoff - Estimated", where the feature distribution is estimated from the data under a Gaussian assumption. Our simulation results indicate several interesting observations. First, the proposed method quickly gains power for moderate correlation or effect sizes, while successfully maintaining the FDR below the specified threshold of 10%. Second, in Figure 2.2 we see that the estimated Model-X knockoff achieves higher power, but it also loses its FDR control. The Model-X knockoff - True variant, in contrast, is not affected by the higher correlation and maintains FDR control at the cost of reduced power. This experiment further substantiates the claim of Barber et al. (2020) that, under high multicollinearity, the error in estimating the feature distribution increases, which in turn inflates the FDR. Third, across all the cases, the MDS method is highly conservative, although in this simulation setting its power is comparable to that of the proposed method. In conclusion, our simulation study demonstrates that the proposed method performs well in terms of controlling the FDR while achieving high power for variable selection in high-dimensional settings. These findings suggest that the proposed method can be a valuable tool for high-dimensional variable selection in various applications.

Figure 2.4 Sparsity-wise comparison: the autocorrelation and effect size are fixed.

Table 2.1 Drug-sensitive genes identified by the feature selection methods. For each drug, the table lists the genes selected by the proposed method, by the Model-X knockoff, and by MDS, together with confirming references; the number in parentheses is the number of cross-validation folds (out of 10) in which the gene was selected.

Topotecan SLFN11 (10), SF3A2 (10), RP1.199J3.5 (9), SLFN11 (9), MGST3 (8), SLFN11 (7), Barretina et al. (2012) RPL18 (7), AC018755.1 (5), RPL18 (8), CLCN4 (8), OXCT2P1 (7), RP1.199J3.5 (5) Li et al. (2012) THG1L (5), PSAP (5), GAREML (7), BHLHE40 (7), THG1L (7), THRB (5) AP000974.1 (7), AC018755.1 (7), DCHS1 (7), LRRC7 (6), SGK223 (6), MFSD5 (6), RP11.562A8.5 (6), SUSD2 (6), CA5B (6), FAM86EP (5), UFSP2 (5), PLIN2 (5), PC (5), SH3BP1 (5), CDKN2C (5), PSAP (5), SF3A2 (5), KANK3 (5) Irinotecan SLFN11 (7), WTAPP1 (6), ADORA2A (8), SLFN11 (7), TCEANC2 (7), No genes met Li et al.
(2012) FKBP2 (5), SYT13 (5), IFITM10 (7), CUEDC1 (6), LMNB1 (5), the selection IFITM10 (5), AC068580.5 (5) C7orf26 (5), SYT13 (5), C16orf71 (5), criteria of RRAD (5), AC068580.5 (5), RPL7AP66 (5), MDS THG1L (5), ARHGAP19 (5), MBNL2 (5), GOLGA5 (5), SLC48A1 (5), FKBP2 (5), ARL2 (5), WTAPP1 (5), CETN2 (5) 17-AAG MB21D1 (10), KCNK7 (10), KCNK7 (10), MB21D1 (10), DNAJC17 (10), No genes met Hadley and Hendricks (2014) , NQO1 (9), MMP24 (9), CSNK1E (10), FTSJ2 (9), ERLIN1 (9), the selection Barretina et al. (2012) SERPINF1 (9), DNAJC17 (9), SERPINF1 (9), MMP24 (9), CTDSP1 (8), criteria of CSNK1E (9), NAPG (8), RP11.442H21.2 (8), DHRS4.AS1 (8), GNPNAT1 (8), MDS FTSJ2 (8), RP11.442H21.2 (7), NQO1 (8), NAPG (8), RP11.218L14.4 (8), CTDSP1 (7), SEPW1 (6), DIMT1 (8), HOXA11 (8), CPED1 (8), SUSD4 (6), RP4.816N1.6 (6), LINC01006 (8), OSBPL9 (7), THRB (7), HOXA11 (6), DIMT1 (6), SLC12A7 (7), RP4.816N1.6 (7), WBP2 (7), RP11.143J12.3 (6), ERLIN1 (5), RP11.143J12.3 (7), SEPW1 (7), LYNX1 (7), SLC12A7 (5), RP13.15M17.1 (5), WWP2 (7), NEK6 (7), RP13.15M17.1 (6), TBC1D4 (5), LYNX1 (5), SUSD4 (6), ATP6V0E1 (6), SEMA4G (6), LINC01006 (5), CPED1 (5), TBC1D4 (6), ZNF571.AS1 (6), PFKFB1 (6), THRB (5), WBP2 (5), PLA2G4A (6), BICC1 (6), DGAT2 (6), DHRS4.AS1 (5), NEK6 (5) RNF121 (6), CLSTN2 (5), OTUD4 (5), NARS2 (5), ZNF420 (5), RP11.661A12.12 (5), RP11.432J22.2 (5), RP11.644F5.11 (5), C12orf73 (5), AMN (5), UACA (5), GPRC5B (5), ZNF506 (5) PaclitaXel ELOVL1 (10), ABCB1 (10), ELOVL1 (9), SUCO (9), ABCB1 (9), ABCB1 (5), Dorman et al. (2016) BCL2L1 (9), PRODH (9), SHANK2 (8), ARL6IP1 (8), STX10 (8), LRRC16B (5) Lee et al. (2016) STX10 (9), SHANK2 (9), PRODH (8), SDHAP3 (8), RUNDC3B (7), SDHAP3 (9), SUCO (8), C8orf46 (7), YTHDF1 (7), TYMP (7), PCDHGA2 (7), RUNDC3B (7), LRRC16B (7), NOS1AP (6), PCDHGA2 (6), TYMP (6), NMB (6), LSMEM1 (6), BUB1B (6), SPATA5L1 (6), C8orf46 (6), UQCRFS1P1 (6), RP11.862L9.3 (6), PRKX (6), UQCRFS1P1 (6), BUB1B (5), INHBA (5), BAK1 (6), RIMKLA (5), PDZK1IP1 (5), ANKRD36BP2 (5), NOS1AP (5), ANKRD36BP2 (5), ENC1 (5), INHBA (5), PRKX (5), SDF2 (5), RP11.644F5.11 (5), NMB (5), CSNK2A1 (5), SH3BP1 (5), LRRC16B (5), BCL2L1 (5), EXOC5P1 (5), SDF2 (5), RIMKLA (5) CETN2 (5), SH3BP1 (5) AEW541 IQGAP2 (8), SOAT1 (8), TSPYL5 (8), SLC44A1 (8), LINC00324 (8), TSPYL5 (5) Liang et al. (2018) KCNAB1 (8), DERL3 (7), DERL3 (8), SOAT1 (7), IRS2 (7), LINC00324 (7), SLC44A1 (7), TMEM101 (7), KCNAB1 (7), IQGAP2 (6), IRS2 (7), AC008132.12 (7), CYP1B1.AS1 (6), CASP10 (6), VILL (6), CASP10 (7), TMEM101 (6), AC004840.9 (6), ATP11B (6), AC008132.12 (6), TSPYL5 (5) ANO10 (5), MAF (5), UNC119 (5) 2.5 Real data analysis In this section, we present real data analysis to illustrate the utility of our proposed statistical method for high-dimensional data analysis. For this application, we choose the CCLE dataset that was generated by the Cancer Cell Line Encyclopedia project ( CCLE, link available here. The 31 dataset contains gene expression profiles of over 100 cancer cell lines across different cancer types, making it a valuable resource for studying the molecular basis of cancer and identifying potential drug targets. With the ultimate goal of improving precision medicine, the CCLE dataset has been extensively used in previous studies to investigate various aspects of cancer biology, such as identifying genetic and epigenetic markers associated with drug sensitivity or resistance, characterizing molecular subtypes of different cancer types, and exploring the genetic basis of cancer evolution and progression. 
More specifically, the CCLE dataset contains dose-response curves for 24 different drugs across more than n = 400 cell lines, with the expression data of p = 18,926 genes for each cell line used as features. To measure drug sensitivity, we used the activity area (Barretina et al. (2012)). Specifically, in this study we aim to identify the set of genes associated with the sensitivity of five specific anticancer drugs, namely Topotecan, 17-AAG, Irinotecan, Paclitaxel, and AEW541, which have been used to treat various cancer types including ovarian and lung cancer. Previous studies have already investigated these drugs and the related gene expression data; more details can be found in Barretina et al. (2012). We implemented the proposed method along with the two other competing feature selection methods, Model-X knockoffs (Candès et al. (2018)) and MDS (Dai et al. (2022)), on the CCLE dataset at the nominal FDR control level q = 0.2. Each method selected a set of genes flagged as highly associated with the five drugs above. However, in a real setting, assessing the performance of a feature selection method is difficult because the underlying data-generating mechanism is unknown. We therefore validate the results of these methods in two ways. First, we check whether the selected genes can be confirmed using existing domain knowledge. Second, for a more empirical validation, we perform 10-fold cross-validation and check the out-of-sample test accuracy using only the handful of genes selected by each method. The results are summarized below.

Table 2.1 reports the genes selected in at least five of the 10 cross-validation folds for each of the three methods considered. The selected genes for these drugs exhibit a high degree of consistency with current knowledge and appeared repeatedly across cross-validation runs. For instance, in the case of Topotecan and Irinotecan, the proposed method identifies SLFN11 as the top drug-sensitive gene in the majority of the validation iterations. This finding is consistent with prior research, as Barretina et al. (2012) and Zoppoli et al. (2012) previously reported that SLFN11 is highly predictive for both drugs. Similarly, for 17-AAG, the proposed method identifies NQO1 as the most important gene, which is known to be highly sensitive to 17-AAG (Hadley and Hendricks (2014)). In Table 2.1, the genes confirmed by previous research are highlighted in red. One observes that, for all the drugs, the proposed method selects these important genes more consistently across repeated validations than the competing methods. Although the FDR control level is the same for all three methods, the knockoff consistently selects a larger number of genes. This is explicable because the knockoff's FDR inflates (Barber et al. (2020)) when the feature distribution is not estimated accurately. Also, as in the simulation study, MDS is highly conservative and fails to discover any gene under the 20% FDR control.
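The fold-wise selection counts reported in Table 2.1, and the cross-validated prediction errors discussed next, can be tallied with a short routine such as the one sketched below. This is a minimal illustration only: `select_genes` is a hypothetical stand-in for any of the three feature selection methods, and the refit uses plain least squares on the selected genes; it is not our packaged implementation.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cv_selection_summary(X, y, select_genes, n_splits=10, min_count=5):
    """Count how often each gene is selected across CV folds and report the
    average out-of-sample MSE obtained using only the selected genes."""
    counts, mses = Counter(), []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        selected = list(select_genes(X[train], y[train]))   # indices of selected genes
        counts.update(selected)
        if selected:
            model = LinearRegression().fit(X[train][:, selected], y[train])
            pred = model.predict(X[test][:, selected])
        else:                                                # no discoveries in this fold
            pred = np.full(len(test), y[train].mean())
        mses.append(float(np.mean((y[test] - pred) ** 2)))
    stable = {g: c for g, c in counts.items() if c >= min_count}
    return stable, float(np.mean(mses))
```

With `select_genes` replaced by the proposed method, the knockoff filter, or MDS, the returned counts correspond to the parenthetical numbers in Table 2.1 and the averaged MSE to the entries of Table 2.2.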
For a more empirical validation, Table 2.2 reports the average number of genes selected across the 10 cross-validation runs and the average cross-validation test mean squared error (MSE). In addition to the Model-X knockoff and MDS, we also show results for the cross-validated lasso, which under mild regularity conditions enjoys the sure screening property. Table 2.2 shows that the proposed method consistently maintains prediction accuracy comparable to the other methods while selecting far fewer genes. This indicates that, even after substantial dimension reduction, the proposed method successfully retains the important genes with high predictive ability.

Table 2.2 Drug-sensitive gene selection and related prediction performance

                 Selected number of genes                          Test MSE
Drug         Lasso   Proposed   Model-X     MDS        Lasso   Proposed   Model-X     MDS
                     method     knockoff                        method     knockoff
Topotecan     96.3     36.2       50.4       3.8        1.08     1.12       1.61       1.56
17-AAG       103.6     40.1       71.7       1.02       0.87     0.94       1.07       1.14
Irinotecan    52.5     11.9       23.9       0.7        0.63     1.04       1.96       5.01
Paclitaxel   127.3     40.3       57.8       4.6        1.36     1.94       2.21       2.91
AEW541        48.6     16.2       30.4       1.3        0.33     0.35       0.37       0.36

2.6 Conclusion

In this paper, we have proposed a novel p-value-free FDR-controlling method for ultrahigh-dimensional variable selection. We effectively utilized the adaptive penalization framework to distinguish between the null and non-null features. Owing to the dimension reduction in the screening step, our method is computationally very efficient compared with the Model-X knockoff or the MDS method. Our empirical results show that the proposed method performs excellently in terms of FDR control while maintaining high power compared with other existing methods. Unlike the knockoff-based methods, the proposed methodology does not require much prior knowledge of the joint distribution of the features, making it conceptually simple and easy to implement. We have provided theoretical guarantees for FDR control under mild assumptions on the design matrix and the response variable. There are several promising directions for future development. The proposed method is applicable to both linear and generalized linear models. Since the idea of adaptive penalization can be extended to more complex nonparametric models, it would be interesting to investigate the potential use and theoretical properties of the proposed methodology for neural networks and other nonlinear models, particularly for more intricate data types such as natural language and images. Additionally, since L1 penalization has been used for longitudinal data (Barber et al. (2017)), another potential direction is to extend the methodology to longitudinal or hierarchically structured datasets, which are common in finance and imaging studies.

BIBLIOGRAPHY

Barber, R. F., Candès, E. J., and Samworth, R. J. (2020). Robust inference with knockoffs. The Annals of Statistics, 48(3):1409–1431.

Barber, R. F., Reimherr, M., and Schill, T. (2017). The function-on-scalar LASSO with applications to longitudinal GWAS. Electronic Journal of Statistics, 11(1):1351–1389.

Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V., Sonkin, D., et al. (2012). The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603–607.

Bates, S., Candès, E., Janson, L., and Wang, W. (2020). Metropolized knockoff sampling. Journal of the American Statistical Association, 116(535):1413–1427.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Bozdogan, H. (1987). Model selection and Akaike's Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370.

Candès, E., Fan, Y., Janson, L., and Lv, J. (2018).
Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577.

Dai, C., Lin, B., Xing, X., and Liu, J. S. (2022). False discovery rate control via data splitting. Journal of the American Statistical Association, pages 1–18.

Das, D. and Lahiri, S. N. (2019). Distributional consistency of the lasso by perturbation bootstrap. Biometrika, 106(4):957–964.

Dorman, S. N., Baranova, K., Knoll, J. H. M., Urquhart, B. L., Mariani, G., Carcangiu, M. L., and Rogan, P. K. (2016). Genomic signatures for paclitaxel and gemcitabine resistance in breast cancer derived by machine learning. Molecular Oncology, 10(1):85–100.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.

Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101.

Gimenez, J. R., Ghorbani, A., and Zou, J. (2019). Knockoffs for the mass: new feature importance statistics with false discovery guarantees.

Hadley, K. E. and Hendricks, D. T. (2014). Use of NQO1 status as a selective biomarker for oesophageal squamous cell carcinomas with greater sensitivity to 17-AAG. BMC Cancer, 14(1):1–8.

Lee, H. J., Hanibuchi, M., Kim, S.-J., Yu, H., Kim, M. S., He, J., Langley, R. R., Lehembre, F., Regenass, U., and Fidler, I. J. (2016). Treatment of experimental human breast cancer and lung cancer brain metastases in mice by macitentan, a dual antagonist of endothelin receptors, combined with paclitaxel. Neuro-Oncology, 18(4):486–496.

Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139.

Liang, F., Li, Q., and Zhou, L. (2018). Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972.

Minnier, J., Tian, L., and Cai, T. (2011). A perturbation method for inference on regularized regression estimates. Journal of the American Statistical Association, 106(496):1371–1382.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press.

Wang, S., Weng, H., and Maleki, A. (2017). Which bridge estimator is optimal for variable selection? arXiv preprint arXiv:1705.08617.

Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. Annals of Statistics, 37(5A):2178.

Xie, H. and Huang, J. (2009). SCAD-penalized regression in high-dimensional partially linear models.

Zoppoli, G., Regairaz, M., Leo, E., Reinhold, W. C., Varma, S., Ballestrero, A., Doroshow, J. H., and Pommier, Y. (2012). Putative DNA/RNA helicase Schlafen-11 (SLFN11) sensitizes cancer cells to DNA-damaging agents. Proceedings of the National Academy of Sciences, 109(37):15030–15035.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.

Zou, H. and Hastie, T. (2003). Regression shrinkage and selection via the elastic net, with applications to microarrays.
Journal of the Royal Statistical Society: Series B, 67:301–320.

APPENDIX A

PROOF OF THEOREM 2.3.6

We recall that, in order to show the FDR control, we only need to verify $\lim_{n\to\infty} E(R(\Delta^*)) \le 1$, where $\Delta^*$ is the data-dependent optimum cutoff in eq. 3 and

$$R(\Delta^*) = \frac{\sum_{j\notin S} 1\big(I_j^w \ge \Delta^*\big)}{\sum_{j\in S} 1\big(I_j^{uw} \ge \Delta^*\big) + \sum_{j\notin S} 1\big(I_j^{uw} \ge \Delta^*\big)}.$$

By the construction of $\Delta^*$, the denominator of $R(\Delta^*)$ is strictly positive. So, in order to show $\lim_{n\to\infty} E(R(\Delta^*)) \le 1$, we establish the following:

1. $P(\Delta^* \ge \tfrac{1}{2}) \to 1$ as $n \to \infty$;
2. $\sum_{j\in S} 1\big(I_j^{uw} \ge \Delta^*\big) > 0$; and
3. $P\Big(\big|\sum_{j\notin S} 1\big(I_j^w \ge \Delta^*\big) - \sum_{j\notin S} 1\big(I_j^{uw} \ge \Delta^*\big)\big| > u_n\Big) \to 0$ as $n \to \infty$, with $u_n = o(\tfrac{1}{n})$.

We show these one by one.

Part 1: By the error bound in Lemma 2.3.2, for any $j \in S$,

1. $P\big(|\hat{\beta}_j^{uw}| > \tau\big) > 1 - o(e^{-nc_6})$ for all $\lambda_1, \ldots, \lambda_r$, but
2. $P\big(|\hat{\beta}_j^{uw}| < \tau\big) > 1 - o(e^{-nc_6})$ for all $\lambda_{r+1}, \ldots, \lambda_{2r}$;
3. $P\big(|\hat{\beta}_j^{w}| > \tau\big) > 1 - o(e^{-nc_6})$ for all $\lambda_1, \ldots, \lambda_{2r}$.

Additionally, for any $j \in S^c$, $P\big(|\hat{\beta}_j^{w}| < \tau\big) > 1 - o(e^{-nc_6})$ for all $\lambda_1, \ldots, \lambda_{2r}$, and a similar inequality holds for the unweighted lasso. Hence, with sufficiently small $\tau < \psi$, $I_j^w = \frac{1}{2r}\sum_{i=1}^{2r} 1\big(|\hat{\beta}_j^w(\lambda_i)| > \tau\big) \xrightarrow{P} 1$, and similarly, for $j \in S$, $I_j^{uw} \xrightarrow{P} \frac{1}{2}$ as $n \to \infty$. Now, for any cutoff $\Delta$, we observe

$$\widehat{FPR}(\Delta) = \frac{\sum_{j=1}^{p} 1\big(I_j^{uw} \ge \Delta\big)}{\sum_{j=1}^{p} 1\big(I_j^{w} \ge \Delta\big)} = \frac{\sum_{j\in S} 1\big(I_j^{uw} \ge \Delta\big) + \sum_{j\notin S} 1\big(I_j^{uw} \ge \Delta\big)}{\sum_{j\in S} 1\big(I_j^{w} \ge \Delta\big) + \sum_{j\notin S} 1\big(I_j^{w} \ge \Delta\big)} = \frac{Z_1(\Delta) + Z_2(\Delta)}{Z_3(\Delta) + Z_4(\Delta)}. \qquad (A.1)$$

Now, for any $\epsilon > 0$ and $\Delta \le \frac{1}{2}$, since $I_j^{uw} \xrightarrow{P} \frac{1}{2}$ as $n \to \infty$ for $j \in S$,

$$P\big(Z_1(\Delta) < s - \epsilon\big) = P\big(\exists\ \text{at least one } j \in S \text{ with } I_j^{uw} < \Delta\big) \le \sum_{j\in S} P\big(I_j^{uw} < \Delta\big) \to 0.$$

Hence $Z_1(\Delta) \xrightarrow{P} s$ for fixed $\Delta \le \frac{1}{2}$. Further, since $I_j^{uw} \xrightarrow{P} 0$ at an exponential rate as $n \to \infty$ for $j \notin S$,

$$P\big(Z_2(\Delta) > \epsilon\big) = P\big(\exists\ \text{at least one } j \notin S \text{ with } I_j^{uw} \ge \Delta\big) \le \sum_{j\notin S} P\big(I_j^{uw} \ge \Delta\big) \to 0.$$

Hence $Z_2(\Delta) \xrightarrow{P} 0$ for fixed $\Delta \le \frac{1}{2}$. Similarly, $Z_3(\Delta) \xrightarrow{P} s$ and $Z_4(\Delta) \xrightarrow{P} 0$ for fixed $\Delta \le \frac{1}{2}$. Hence,

$$\frac{Z_1(\Delta) + Z_2(\Delta)}{Z_3(\Delta) + Z_4(\Delta)} \xrightarrow{P} 1 \;\Rightarrow\; \widehat{FPR}(\Delta) \xrightarrow{P} 1 \quad \text{for any } \Delta \le \tfrac{1}{2}.$$

The uniform convergence follows by observing that each process $Z_i(\Delta)$, $i = 1, 2, 3, 4$, is monotone in $\Delta$, and hence $\inf_{\Delta\le\frac{1}{2}} Z_1(\Delta) \xrightarrow{P} s$, $\sup_{\Delta\le\frac{1}{2}} Z_2(\Delta) \xrightarrow{P} 0$, $\inf_{\Delta\le\frac{1}{2}} Z_3(\Delta) \xrightarrow{P} s$, and $\sup_{\Delta\le\frac{1}{2}} Z_4(\Delta) \xrightarrow{P} 0$. These results further imply $P(\Delta^* \ge \frac{1}{2}) \to 1$ as $n \to \infty$.

Part 2: We note that $\sup_{\Delta\in(\frac{1}{2},1)} P\big(Z_1(\Delta) \ge Z_2(\Delta)\big) \to 1$ as $n \to \infty$. Hence, by the construction of $\Delta^*$, $\sum_{j=1}^{p} 1\big(I_j^{uw} \ge \Delta^*\big) > 0$ implies $P\big(\sum_{j\in S} 1\big(I_j^{uw} \ge \Delta^*\big) > 0\big) \to 1$.

Part 3: Following Lemma 2.3.2, we note that

$$\sup_{\Delta\in(\frac{1}{2},1)} P\left(\Big|\sum_{j\notin S} 1\big(I_j^w \ge \Delta\big) - \sum_{j\notin S} 1\big(I_j^{uw} \ge \Delta\big)\Big| > \epsilon\right) \le \sup_{\Delta\in(\frac{1}{2},1)} \sum_{j\in S^c} P\Big(\big|1\big(I_j^w \ge \Delta\big) - 1\big(I_j^{uw} \ge \Delta\big)\big| > \epsilon\Big) \to 0$$

as $n \to \infty$, due to the exponential bound in Lemma 2.3.2. Together, these results imply $\lim_{n\to\infty} E(R(\Delta^*)) \le 1$, guaranteeing the asymptotic FDR control.

Next, we study the asymptotic power. We start by observing that $\hat{e}_0(\Delta^*) = \sum_{j\in\hat{S}_n} 1\big(I_j^{uw} \ge \Delta^*\big) > 0$. Define $\tilde{\Delta} = \sup_{\Delta > 0}\{\Delta : \hat{e}_0(\Delta) > 0\}$ and $\breve{\Delta} = \inf\{I_j^w,\ j \in S\}$. By construction, $P(\tilde{\Delta} \le \frac{1}{2}) > 1 - o(e^{-nc_6})$. Additionally, for any $\epsilon > 0$,

$$P(\breve{\Delta} < 1-\epsilon) \le \sum_{j\in S} P\big(I_j^w < 1-\epsilon\big) \le \sum_{j\in S}\sum_{i=1}^{2r} P\big(|\hat{\beta}_j^w(\lambda_i)| < \tau\big) < o(e^{-nc_6}).$$

This implies

$$P(\breve{\Delta} > \tilde{\Delta}) \ge P\big(\breve{\Delta} \ge 1-\epsilon,\ \tilde{\Delta} \le \tfrac{1}{2}\big) \ge P\big(\breve{\Delta} \ge 1-\epsilon\big) + P\big(\tilde{\Delta} \le \tfrac{1}{2}\big) - 1 \ge 1 - 2\,o(e^{-nc_6}).$$

Since $\Delta^* \le \tilde{\Delta}$ by construction, with high probability the final discovered set satisfies $\hat{D}_n = \{j \in \hat{S}_n : I_j^w \ge \Delta^*\} \supseteq \{j \in \hat{S}_n : I_j^w \ge \breve{\Delta}\} \supseteq S$. Hence the asymptotic power converges to one.
APPENDIX B

PROOF OF THEOREM 2.3.2

Here we prove the main Theorem 2.3.2 on the $L_2$ error bound of the weighted Lasso. The proof consists of three parts. First, in Lemma B.0.1, we show that on two specifically designed events $A$ and $B$ the desired error bound holds on $A \cap B$. Then, in Lemmas B.0.2 and B.0.3, we show that the probabilities of the events $A$ and $B$ converge to one. We recall the following notation: $m = \max_{j\in S}\{\text{rank of } w_j \text{ (from smallest to largest)}\}$, the maximum rank of the weights of the true features, and $T = \{1 \le j \le p : \mathrm{rank}(w_j) \le m\}$. Consequently, we define $w_T$ as the subvector of $w = (w_1, w_2, \ldots, w_p)$ indexed by $T$, and $w_{min} = \min_{j\in S^c} w_j$. In Section 2.3 we show that our proposed weighting scheme satisfies all these assumptions, and hence enjoys the tight error bound for the weighted lasso even for an increasing sequence of tuning parameters $\lambda_n$.

Lemma B.0.1. Define the probability events

$$A = \left\{\lambda \ge 2\max\left\{\frac{\|\frac{1}{n}\epsilon'X_T\|_2}{\|w_T\|_2},\ \max_{j\in T^c}\Big|\frac{1}{n}\epsilon'X_j w_j^{-1}\Big|\right\}\right\}, \qquad (B.1)$$

$$B = \left\{\min_{\Delta\in C}\ \frac{\Delta'\frac{1}{n}X'X\Delta}{\|\Delta\|_2^2} \ge \kappa_n\right\}, \qquad (B.2)$$

where $C = \big\{\Delta \in \mathbb{R}^p : \sum_{j\in T^c} w_j|\Delta_j| \le 3\|w_T\|_2\|\Delta_T\|_2\big\}$. Then, on the event $A \cap B$, it holds that

$$\|\hat{\beta}^w - \beta^*\|_2 \le \frac{3}{2}\frac{\lambda}{\kappa_n}\|w_T\|_2. \qquad (B.3)$$

Proof. We first note that

$$\frac{1}{2n}\|Y - X\hat{\beta}^w\|_2^2 + \lambda\sum_{j=1}^{p} w_j|\hat{\beta}^w_j| \le \frac{1}{2n}\|Y - X\beta^*\|_2^2 + \lambda\sum_{j=1}^{p} w_j|\beta^*_j|$$

$$\Rightarrow\ 0 \le \frac{1}{2n}\|X\hat{\beta}^w - X\beta^*\|_2^2 \le \frac{1}{n}\epsilon'X(\hat{\beta}^w - \beta^*) + \lambda\sum_{j=1}^{p} w_j|\beta^*_j| - \lambda\sum_{j=1}^{p} w_j|\hat{\beta}^w_j| \qquad (B.4)$$

$$\le \frac{\lambda}{2}\Big[2\sum_{j\in T} w_j|\beta^*_j| - 2\sum_{j\in T} w_j|\hat{\beta}^w_j| - 2\sum_{j\in T^c} w_j|\hat{\beta}^w_j|\Big] + \frac{1}{n}\epsilon'X_T(\hat{\beta}^w - \beta^*)_T + \frac{1}{n}\epsilon'X_{T^c}(\hat{\beta}^w - \beta^*)_{T^c} \qquad (B.5)-(B.6)$$

$$\le \frac{\lambda}{2}\Big[\|w_T\|_2\|(\hat{\beta}^w - \beta^*)_T\|_2 + 2\sum_{j\in T} w_j|\hat{\beta}^w_j - \beta^*_j| - \sum_{j\in T^c} w_j|\hat{\beta}^w_j - \beta^*_j|\Big], \qquad (B.7)$$

since, by the construction of $\lambda$ on the event $A$,

1. $\frac{1}{n}\epsilon'X_T(\hat{\beta}^w - \beta^*)_T \le \|\frac{1}{n}\epsilon'X_T\|_2\,\|(\hat{\beta}^w - \beta^*)_T\|_2 \le \frac{\lambda}{2}\|w_T\|_2\|(\hat{\beta}^w - \beta^*)_T\|_2$, and
2. $\frac{1}{n}\epsilon'X_{T^c}(\hat{\beta}^w - \beta^*)_{T^c} \le \max_{j\in T^c}\big|\frac{1}{n}\epsilon'X_j w_j^{-1}\big|\ \sum_{j\in T^c} w_j|\hat{\beta}^w_j - \beta^*_j|$.

Hence, the error vector $\Delta = \hat{\beta}^w - \beta^*$ belongs to the cone $C = \{\Delta \in \mathbb{R}^p : \sum_{j\in T^c} w_j|\Delta_j| \le 2\sum_{j\in T} w_j|\Delta_j| + \|w_T\|_2\|\Delta_T\|_2 \le 3\|w_T\|_2\|\Delta_T\|_2\}$. Now, with the RE property in 2.2.1, we continue from eq. B.7:

$$\frac{1}{n}\Delta'X'X\Delta \le \frac{\lambda}{2}\Big[\|w_T\|_2\|\Delta_T\|_2 + 2\sum_{j\in T} w_j|\Delta_j|\Big] \qquad (B.8)$$

$$\Rightarrow\ \kappa_n\|\Delta\|_2^2 \le \frac{3}{2}\lambda\|w_T\|_2\|\Delta_T\|_2 \qquad (B.9)$$

$$\Rightarrow\ \|\Delta\|_2 \le \frac{3}{2}\frac{\lambda}{\kappa_n}\|w_T\|_2. \qquad (B.10)$$

To show that $A \cap B$ holds with high probability, we assume that each row of $X$ is i.i.d. from a mean-zero sub-Gaussian distribution (with bounded sub-Gaussian norm) with $E(X_iX_i') = \Sigma$ such that $0 < \rho_0 \le \lambda_{min}(\Sigma) \le \lambda_{max}(\Sigma) \le \rho_1 < \infty$. The assumptions we consider in Section 2.2.1 satisfy this sub-Gaussianity condition.

Lemma B.0.2. Recall $A = \Big\{\lambda \ge 2\max\big\{\frac{\|\frac{1}{n}\epsilon'X_T\|_2}{\|w_T\|_2},\ \max_{j\in T^c}|\frac{1}{n}\epsilon'X_j w_j^{-1}|\big\}\Big\}$. Further, choose $\lambda \ge \frac{2\tilde{\tau}\sigma}{\|w_T\|_2}\sqrt{\frac{m}{n}}$ with $\tilde{\tau} = 2\big[\sqrt{\rho_1} + \frac{2}{\sqrt{\rho_0}}\big(2 + \sqrt{\tfrac{m}{n}}\big)\big]$. Then $P(A) \ge 1 - p^{-c} - 2e^{-\tilde{c}n} - 2e^{-\tilde{c}m}$.

Proof. First, note that

$$P(A^c) \le \underbrace{P\Big(\big\|\tfrac{1}{n}\epsilon'X_T\big\|_2 \ge \tfrac{\lambda}{2}\|w_T\|_2\Big)}_{I} \;+\; \underbrace{P\Big(\max_{j\in T^c}\big|\tfrac{1}{n}\epsilon'X_j w_j^{-1}\big| \ge \tfrac{\lambda}{2}\Big)}_{II}.$$
n 2 j ∈T n 2 | {z } | {z } I II 41 Now, with ϵ̃ ∼ N (0, I n ), m µ ¶ 1 ′ I ≤ P || ϵ̃ X T ||2 ≥ τ n n r m 1 µ ¶ 1 ′ 1 1 ≤ p ||| ϵ̃ X T ||2 − || X T ||F | > (τ − c 1 ) , || X T ||2 | ≤ c 2 p n n n n n r ¶ µr m m µ ¶ 1 1 + P || X T ||F > c 1 +P > c2 p n n n n By theorem 6.3.2 in Vershynin (2018), we get, r ¶ m µ µ ¶ £ −2 2 ¤ 1 1 1 I ≤ 2exp −c − 3c 2 (τ − c 1 ) m + P || X T ||F > c 1 + P || X T ||2 > c 2 p n n n n ≤ 2exp −c − 3c 2−2 (τ − c 1 )2 m + 2exp(−c̃n) + 2exp(−c̃n) £ ¤ q m by setting c 1 = c 2 = p1ρ 0 (2 + c n ) and using theorem 5.19 in Vershynin (2010). Following similar arguments in the standard Lasso analysis,  s  1 l og (p)  I I ≤ P maxc | ϵ′ X j w −1 j | ≥ τσ j ∈T n n 1 ≤ 2exp(−cl og (p)) + P (maxc || X j ||22 > τ2 ) j ∈T n τ ≤ 2exp(−cl og (p)) + 2exp(−cn( p − 1)2 ) ρ1 ½ ¾ ∆′ n1 X ′ X ∆ Lemma B.0.3. Recall B = min∆∈C ||∆||22 ≥ κn , where ( ) p C = ∆∈R : w j |∆ j | ≤ 3||w T ||22 ||∆T ||22 X j ∈T c p Then, P (B) ≥ 1 − exp(− ||w T ||2 p w l og (p) − m) mi n 42 1 1 Proof. Setiing X̃ = X Σ− 2 , and ∆ ˜ = Σ 2 ∆, we start by observing, µ ¶ c 1 p P (B ) = P min p |X ∆||2 ≤ κn ∆∈C ,||∆||2 =1 n à ! p =P min ˜ 2 ≤ κn | X̃ ∆|| 1 1 ˜ ∆∈Σ 2 C ,||Σ− 2 ∆|| ˜ 2 =1 à ! p p p ≤P − max ˜ 2 − n||∆|| ||| X̃ ∆|| ˜ 2| + min ˜ 2 ≤ nκn n||∆|| 1 1 ∆∈C ,||∆||2 =1 ˜ ∆∈Σ 2 C ,||Σ− 2 ∆|| ˜ 2 =1 à ! p p p ≤P max ˜ 2 − n||∆|| ||| X̃ ∆|| ˜ 2 | ≥ n(pρ 0 − κn ) 1 1 ˜ ∆∈Σ 2 C ,||Σ− 2 ∆|| ˜ 2 =1 ≤ 2exp(−n 2 )   p p by setting κn = ρ 0 − pcn  sup ˜ 2 +E ||∆|| sup | < g,∆˜ > |, where g ∼ 1 1 1 1 ˜ ∆∈Σ 2 C ,||Σ− 2 ∆||˜ 2 =1 ˜ ∆∈Σ 2 C ,||Σ− 2 ∆|| ˜ 2 =1 N (0, I p ). Now, ˜ 2≤ p 1. sup ||∆|| ρ1. 1 1 ˜ ∆∈Σ 2 C ,||Σ− 2 ∆|| ˜ 2 =1 1 2. E sup | < g,∆ ˜ >|=E sup | < Σ2 g,∆ > | ˜ 1 1 2 C ,||Σ− 2 ∆|| ˜ 2 =1 ∆∈C ,||∆||2 =1 ∆∈Σ X g̃ ≤E sup | w j ∆j | + E sup | < g̃ T , ∆T > | ∆∈C ,||∆||2 =1 j ∈T c w j ∆∈C ,||∆||2 =1 1 ≤ E maxc |g̃ j | 3||w T ||2 + E ||g̃ T ||2 j ∈T w mi n ||w T ||2 q p ≤ l og (p) + m w mi n q q ||w T ||2 l og (p) m 1 This implies as long as w mi n n + n = o P (1), κn can be choosen to be 2 ρ 0 . Hence, combining lemmas B.0.1-B.0.3, choosing  s  r m 1 1 l og (p)  λ ≥ cσ  + n ||w T ||2 w mi n n 43 as long as s r ||w T ||2 l og (p) m + = o P (1), w mi n n n with high probability, 3 λ ||β̂w − β∗ ||2 ≤ ||w T ||2 2 κn 44 CHAPTER 3 SCIDNET: ERROR CONTROLLED FEATURE SELECTION FOR ULTRA HIGH DIMENSIONAL AND HIGHLY CORRELATED FEATURE SPACE USING DEEP LEARNING 3.1 Introduction In modern applications (e.g., genetics and imaging studies), the investigator is often inter- ested in uncovering the true pattern between a quantitative response and a large number of features. The key working assumption, oftentimes, is that there is an underlying sparsity pattern buried in the high dimensional data setting. Selecting the essential features aids in further scientific investigations by offering improved interpretability and explainability, reduced compu- tational cost for prediction and estimation, and less memory usage due to lower dimensional manifolds of the feature space being estimated. Under the linear model (LM) framework, this problem has been extensively studied over the past few decades producing popular algorithms such as Lasso, Elastic net, SCAD, and MCP. A detailed review of this literature can be found in Fan and Lv (2010) and thus is omitted here. However, regardless of their ubiquitous applications, the LM has limited usage, especially when the underlying mechanism is highly nonlinear, with potential interaction effects. 
Relaxing the linearity assumption, artificial neural network (ANN) models are well known for efficiently approximating complicated functions. From an information-theoretic viewpoint, Elbrächter et al. (2021) established that deep neural networks (DNNs) provide an optimal approximation of a nonlinear function, covering a wide range of functional classes used in signal processing. This property has promoted the use of deep learning (DL) models for feature selection, an approach that has generated much research interest over the past few years. A major caveat, however, is that DL models are often used as black boxes in applications. Following the arguments in Rudin (2019), caution must be exercised regarding the use of DL models for decision-making in real-world problems. Employing only the relevant predictors to construct a predictive model is a step in the right direction toward explainable machine learning. However, as suggested in Ghorbani et al. (2019), the feature importances in DL-based algorithms often vary drastically under small perturbations of the input or in the presence of added noise.

As a solution to this problem, we focus on reproducible nonlinear variable selection using DL models with error control. We adopt the False Discovery Rate (FDR), first proposed by Benjamini and Hochberg (1995) and known to be well suited for large-scale multiple testing problems. To formally define the FDR, we consider the random variable FDP, the False Discovery Proportion:

$$FDP = \frac{e_0}{N_+} \wedge 1,$$

where $e_0$ is the number of falsely selected variables and $N_+$ is the total number of discoveries. The FDR is then defined as $FDR = E(FDP)$. Estimating this expectation poses a unique challenge for the model-free variable selection problem, which many authors have approached from various perspectives. For example, p-value-based approaches have been proposed as feature importance criteria in the multiple testing literature; see Tansey et al. (2018), Xia et al. (2017), Li and Barber (2019), and Lei and Fithian (2018) for a more detailed overview. However, for DL models, generating interpretable p-values is still an open research problem. To circumvent this limitation, the knockoff framework was proposed by Candès et al. (2018). Essentially, this is a model-free variable selection algorithm with provable FDR control, assuming one has prior knowledge of the predictors' distribution. Lu et al. (2018) further proposed the DeepPINK algorithm, integrating the knockoff framework with a DL architecture to improve the explainability of DL models. In real-world applications, however, the predictors' distribution must be estimated in order to generate the knockoff variables, which adds another layer of uncertainty to the analysis. Recently, Barber et al. (2020) showed that the knockoff framework may yield an inflation in false discoveries consistent with the error incurred in estimating the predictors' distribution. This problem is exacerbated by highly correlated features. An empirical illustration is provided in Appendix 4.2.2, showing how the model-X knockoff (Candès et al., 2018) typically fails to control the FDR under a simplistic setting with high multicollinearity. In some cases, it may be possible to have prior knowledge of the correlation pattern among the features. For example, in genetics studies, the notion of linkage disequilibrium helps to specify the dependency pattern among alleles at polymorphisms (Sesia et al. (2018)).
However, this information is typically unavailable in many other domain sciences.

Figure 3.1 How multicollinearity affects feature selection: a demonstration using a simplistic simulation setting. We simulate n = 400 i.i.d. copies of $(y \in \mathbb{R}, X \in \mathbb{R}^{100})$, where the outcome $y$ is generated from a linear model $y = 0.5(X_{20} + X_{40} + X_{60} + X_{80} + X_{100}) + \epsilon$, $\epsilon \sim N(0,1)$, and the features $X \sim N_{100}(0, \Sigma)$ with $\Sigma_{ij} = \rho^{|i-j|}$. We implement the Lasso algorithm for 500 Monte Carlo replications of the data, and the y-axis shows the proportion of times each feature is selected out of the 500 replications. For higher autocorrelation $\rho$, the selection probability of the true features is significantly reduced, whereas the null features associated with true features are selected more frequently.

Hence, any model-specific knockoff generation (Candès et al., 2018; Sesia et al., 2018) would be inefficient in those contexts. Recently, DL-based flexible knockoff-generating algorithms have been proposed (Liu and Zheng, 2019; Jordon et al., 2019; Romano et al., 2020); however, they are trained in a typical big-n-small-p setting, and it is unclear how they perform when the sample size n is much smaller than the dimension of the covariates p and the predictors are highly correlated. We next discuss the multicollinearity issue in detail. In many modern high-dimensional datasets arising in genetics and imaging studies, the other challenge is extreme multicollinearity: the predictors are typically correlated among themselves in a complex manner, often with pairwise sample correlations exceeding 0.99. In the simplistic setting of a linear model, Figure 3.1 shows how increased autocorrelation reduces the selection probability of the true non-null features. Because extremely correlated features become almost indistinguishable, it would be unrealistic to claim that one particular feature from a cluster is associated with the outcome in a regression setup. Hence, accounting for this uncertainty, it is pragmatic to aim for group-level variable selection and to claim that at least one variable from a densely correlated group is important for the outcome. This approach is not entirely new; as argued in Brzyski et al. (2017), in genetics the discovery of a specific genomic region is treated as equivalent to a particular variant-wise discovery in that location. In this context, the term "true discovery" means that the selected cluster can serve as a good proxy for at least one element of the true index set of significant features. A complication of this approach, however, is that the notion of FDR becomes non-trivial. For this reason, following Siegmund et al. (2011), we adopt the cluster version of the FDR as the expected value of the "proportion of clusters that are falsely declared among all declared clusters". We denote this by cFDR henceforth. Looking at the extreme multicollinearity problem from a slightly different angle, several algorithms have been proposed in the hierarchical testing literature, including CAVIAR (Hormozdiari et al., 2014), SUSIE (Wang et al., 2020), and KnockoffZoom (Sesia et al., 2019). While the knockoff-based procedures have the limitation of generating knockoffs from an unknown distribution with a very small sample size, the other methods lack applicability in nonlinear-nonparametric setups because they typically depend on p-values.
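The demonstration summarized in Figure 3.1 is straightforward to reproduce. The sketch below is a minimal, hedged illustration of that experiment (with fewer Monte Carlo replications than the 500 used for the figure); the helper names are ours and not part of SciDNet.

```python
import numpy as np
from scipy.linalg import toeplitz
from sklearn.linear_model import LassoCV

def selection_frequency(rho, n=400, p=100, n_rep=100, seed=0):
    """Proportion of Monte Carlo replications in which each feature
    receives a nonzero cross-validated Lasso coefficient."""
    rng = np.random.default_rng(seed)
    Sigma = toeplitz(rho ** np.arange(p))
    chol = np.linalg.cholesky(Sigma)
    true_idx = np.array([19, 39, 59, 79, 99])      # X_20, X_40, ..., X_100 (0-based)
    counts = np.zeros(p)
    for _ in range(n_rep):
        X = rng.standard_normal((n, p)) @ chol.T   # X ~ N_p(0, Sigma)
        y = 0.5 * X[:, true_idx].sum(axis=1) + rng.standard_normal(n)
        coef = LassoCV(cv=5).fit(X, y).coef_
        counts += (coef != 0)
    return counts / n_rep

freq_low, freq_high = selection_frequency(0.2), selection_frequency(0.95)
```

Under high autocorrelation, the frequencies at the five true indices drop while neighbouring null features are picked up instead, which is exactly the pattern displayed in Figure 3.1.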
Our contribution. To address the complications in variable selection described above and the unexplored gap in applying DL, we propose SciDNet (Screening & Cleaning Incorporated Deep Neural Network), a novel method for reproducible high-dimensional nonlinear-nonparametric feature selection with highly correlated predictors. The screening step is a dimension reduction step: we screen out most of the null features and create a set of multi-resolution clusters that collectively contain all the proxy variables needed to cover the truly significant features with high probability. In the cleaning step, using a properly tuned DL model under an appropriate resampling scheme, we propose an estimator of the FDR. Finally, we select clusters of highly correlated predictors by controlling the estimated FDR. To this end, "FDR observed for SciDNet" implicitly refers to the value of the cFDR discussed above. Our major contributions can be summarized as follows:

• The proposed method SciDNet is based on a combination of techniques from statistical machine learning and sparse modeling. We introduce a resampling-based FDR estimation scheme, which allows us to identify the most relevant features while discarding irrelevant ones in an FDR-controlled setting. Additionally, the proposed algorithm is specifically tailored for highly correlated features, which have proved problematic for traditional feature selection methods.

• The proposed approach relies on minimal modeling assumptions and is entirely p-value-free, unlike existing state-of-the-art methods, providing a better understanding of the sparse relationship between the outcome and the high-dimensional predictors. Our theoretical study consolidates the empirical results by establishing SciDNet's FDR control guarantee in an asymptotic setting.

• To the best of our knowledge, in a high-dimensional setting no other method in the literature accommodates the multicollinearity issue via data-adaptive cluster formation followed by nonlinear-nonparametric, error-controlled feature selection integrated with DL. The results from our extensive simulations and real data analyses demonstrate the proposed method's validity as a proof of concept, achieving higher power, controlled FDR, and higher prediction accuracy. Additionally, we conducted an ablation study that provides a systematic analysis of the contribution of each individual step to the overall performance of the feature selection method.

Overall, our contributions offer a powerful tool for researchers and practitioners who face the challenge of selecting relevant variables from highly correlated, ultrahigh-dimensional datasets in a variety of fields, from biology to finance to the social sciences. For the rest of the article: in Section 3.2 we describe the proposed screening and cleaning method, followed by an extensive simulation study in Section 4.2 and an analysis of two real-world gene expression datasets in Section 3.4. Finally, Section 4.4 concludes with a summary and future directions.

3.2 The Algorithm

3.2.1 Notation and assumptions

Under the supervised learning framework, let $Y$ denote a continuous response variable, and let $X = (X_1, \ldots, X_p)$ denote $p$ continuous covariates. Let $F_y(\cdot)$ denote the CDF of the response variable $Y$, and let $F_k(\cdot)$ denote the CDF of the predictor $X_k$. Assuming a sample size $n$, we consider the ultrahigh-dimensional setting where $p = O(\exp(n^\tau))$, $\tau > 0$.
We assume no specific functional relationship between the outcome $Y$ and the predictors $X$, but we impose a high-level assumption on the distribution of $X$. In the spirit of Liu et al. (2009), we assume that the predictors follow the nonparanormal distribution; i.e., there exist unknown differentiable functions $f(X) = \{f_j(X_j),\ j \in \{1, 2, \ldots, p\}\}$ such that $f(X) \sim N(\mu_{p\times 1}, \Sigma_{p\times p})$. The nonparanormal distribution covers a wide range of parametric families, and its main appeal is that $f(X)$ preserves the conditional dependency structure of the original variables $X$. Maintaining the sparsity condition, we assume that there exists a subset $S_0 \subset \{1, 2, \ldots, p\}$, $|S_0| = O(1)$, such that, conditional on the features in $S_0$, the response $Y$ is independent of the features in $S_0^c$. In other words, $S_0 = \{k : f(y|X) \text{ depends on } X_k\}$, where $f(y|X)$ is the conditional density of $y$ given $X$. Our goal is to learn the sparsity structure by estimating $S_0$.

3.2.2 Screening Step

Under the assumption that the cardinality of $S_0$ is much smaller than the feature space dimension $p$, most features belong to $S_0^c$. Hence, in the screening step, we focus primarily on finding an active set $\hat{S}_n$ with $|\hat{S}_n| \ll p$ such that $P(S_0 \subset \hat{S}_n) \to 1$ as $n \to \infty$. This property is called the sure screening property (Fan and Lv, 2008); it ensures that all the significant predictors are retained in $\hat{S}_n$, while the other predictors $\{X_j, j \in \hat{S}_n^c\}$ are eliminated from the remaining analysis. As these active variables are highly correlated among themselves, in a second step we further cluster them by exploiting their conditional dependency structure.

3.2.2.1 Finding the active set of variables

To find the active set, we first apply the nonparanormal transformation to $(Y, X)$ and then perform the Henze–Zirkler (HZ) test on the transformed variables. The first step transforms all the variables to jointly Gaussian variables while maintaining their conditional covariance structure; the second step tests, pairwise, whether there is significant dependence between the transformed response and each transformed predictor. This workflow has been proposed by Liu et al. (2009), Henze and Zirkler (1990), and Xue and Liang (2017). The strategy proceeds as follows:

1. Nonparanormal transformation: we first consider the transformation $T_y(Y) = \Phi^{-1}(F_y(Y))$, $T_k(X_k) = \Phi^{-1}(F_k(X_k))$, $k = 1, 2, \ldots, p$, where $\Phi(\cdot)$ denotes the CDF of the standard Gaussian distribution. In practice, the CDFs of $Y$ and $X_k$ are unknown, so we estimate them by the truncated empirical CDF as suggested by Liu et al. (2009). Henceforth, let $(\tilde{T}_y(Y), \tilde{T}_k(X_k))$ denote the corresponding estimated transformations.

2. HZ test: by the basic properties of the CDF, $(T_y(Y), T_k(X_k))$ jointly follows a bivariate Gaussian distribution $N_2(0, I_2)$ if and only if $Y$ is independent of $X_k$. This can be tested using the HZ test (Henze and Zirkler, 1990), where the population-level statistic for predictor $X_k$ can be expressed as $w_k = \int_{\mathbb{R}^2} |\psi_k(t) - \exp(-\tfrac{1}{2}t't)|^2\, \varphi_\beta(t)\, dt$, $k = 1, 2, \ldots, p$, where $\psi_k(t)$ is the characteristic function of $(T_y(Y), T_k(X_k))$ and $\exp(-\tfrac{1}{2}t't)$ is the characteristic function of $N_2(0, I_2)$. It measures the disparity between the joint distribution of $(T_y(Y), T_k(X_k))$ and $N_2(0, I_2)$ and is expected to be large for the non-null predictors $X_j$, $j \in S_0$, indicating significant evidence against the independence of the transformed pair $(T_y(Y), T_k(X_k))$. Since in practice we work with $(\tilde{T}_y(Y), \tilde{T}_k(X_k))$, we compute the sample HZ test statistic

$$\tilde{w}^*_k = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} e^{-\frac{\beta^2}{2} d_{ij}} \;-\; \frac{2}{n(1+\beta^2)}\sum_{i=1}^{n} e^{-\frac{\beta^2}{2(1+\beta^2)} d_i} \;+\; \frac{1}{1+2\beta^2}, \qquad (3.1)$$

where $d_{ij} = (\tilde{T}_k(x_{ki}) - \tilde{T}_k(x_{kj}))^2 + (\tilde{T}_Y(y_i) - \tilde{T}_Y(y_j))^2$ and $d_i = \tilde{T}_k^2(x_{ki}) + \tilde{T}_Y^2(y_i)$. Consistent with the existing literature, we choose the smoothing parameter $\beta = \frac{(1.25n)^{1/6}}{\sqrt{2}}$, which corresponds to the optimal bandwidth of a nonparametric kernel density estimator with a Gaussian kernel (Henze and Zirkler (1990)). The observed test statistic $\tilde{w}^*_k$ converges to $w_k$, as shown in Xue and Liang (2017).

3. Next, we select the active set of predictors according to the larger values of $\tilde{w}^*_k$, i.e., $\hat{S}_n = \{1 \le k \le p : \tilde{w}^*_k > cn^{-\kappa}\}$, where $c$ and $\kappa$ are predetermined threshold values. This active set $\hat{S}_n$ contains all the predictors that are marginally and significantly correlated with the response.

Under very mild regularity conditions on the signal strength of the non-null predictors, namely $\min_{k\in S_0} w_k \ge 2cn^{-\kappa}$ with $c$ a constant and $0 \le \kappa \le \frac{1}{4}$, the screening process enjoys the sure screening property, i.e., $P(S_0 \subset \hat{S}_n) \to 1$ as $n \to \infty$. More details on the theoretical guarantee can be found in Xue and Liang (2017). A common practice is to set the active set size $|\hat{S}_n|$ at $\nu_n = [n/\log(n)]$. However, as we further cluster the active variables in the next step, our proposed method is fairly robust with respect to $|\hat{S}_n|$ as long as most of the significant variables are retained; we therefore propose to select a larger active set, with size proportional to $\nu_n$. A code sketch of this screening step is given below.
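The following sketch implements the two screening ingredients just described: the estimated nonparanormal transformation and the pairwise HZ statistic of equation 3.1. It is a simplified illustration, assuming a rank-based empirical CDF scaled by $1/(n+1)$ in place of the truncated empirical CDF of Liu et al. (2009); the function names are ours.

```python
import numpy as np
from scipy.stats import norm, rankdata

def nonparanormal(v):
    """Estimated nonparanormal transform: Phi^{-1} of a rank-based empirical CDF
    (a simplification of the truncated empirical CDF used by Liu et al. (2009))."""
    return norm.ppf(rankdata(v) / (len(v) + 1.0))

def hz_statistic(x_k, y):
    """Sample Henze-Zirkler statistic of eq. (3.1) for one transformed pair."""
    n = len(y)
    u, v = nonparanormal(x_k), nonparanormal(y)
    beta = (1.25 * n) ** (1.0 / 6.0) / np.sqrt(2.0)
    d_ij = (u[:, None] - u[None, :]) ** 2 + (v[:, None] - v[None, :]) ** 2
    d_i = u ** 2 + v ** 2
    term1 = np.exp(-0.5 * beta ** 2 * d_ij).sum() / n ** 2
    term2 = 2.0 / (n * (1 + beta ** 2)) * np.exp(-beta ** 2 / (2 * (1 + beta ** 2)) * d_i).sum()
    return term1 - term2 + 1.0 / (1 + 2 * beta ** 2)

def screen(X, y, size=None):
    """Rank the features by the HZ statistic and keep the top `size` of them."""
    stats = np.array([hz_statistic(X[:, k], y) for k in range(X.shape[1])])
    size = size if size is not None else int(len(y) / np.log(len(y)))
    return np.argsort(stats)[::-1][:size], stats
```

The `size` default mirrors the $\nu_n = [n/\log(n)]$ convention; as noted above, in practice a somewhat larger active set is preferable since the retained variables are clustered in the next step.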
3.2.2.2 Clustering the active predictors using the precision matrix

Algorithm 3.1 Finding clusters and the representatives

Input: $(X \in \mathbb{R}^{n\times p}, Y \in \mathbb{R}^{n})$; the active set $\hat{S}_n$, $|\hat{S}_n| = p_1 < p$.
1. Estimate the precision matrix $\hat{\Sigma}^{-1} = (\hat{\sigma}_{ij})_{i,j\in\hat{S}_n}$ using nodewise Lasso.
2. Define the clusters $C_i = \{j \in \hat{S}_n : \hat{\sigma}_{ij} \ne 0\}$, $i \in \hat{S}_n$.
3. For $1 \le i \le p_1$:
       for $1 \le j \le p_1$, $j \ne i$:
           define $\Omega_{ij} = corr(X_{C_i}, X_{C_j}) = \{\rho(X_l, X_{l'}),\ l \in C_i,\ l' \in C_j\} \in \mathbb{R}^{|C_i||C_j|}$;
           if $\max\{\Omega_{ij}\} \ge r$, then set $C_i = C_i \cup C_j$ and $C_j = \varnothing$.
4. Retain only the non-empty clusters: $\mathcal{C} = \{C_i : C_i \ne \varnothing,\ i \in \hat{S}_n\}$.
5. Find the cluster representatives $\tilde{S}_n = \{R_j,\ 1 \le j \le |\mathcal{C}| : R_j = \arg\max_{l\in C_j} \tilde{w}^*_l\}$.
Output: clusters $C_1, C_2, \ldots, C_{|\mathcal{C}|}$ and corresponding cluster representatives $\tilde{S}_n = \{R_j,\ 1 \le j \le |\mathcal{C}|\}$.

As the Henze–Zirkler test focuses on the (pairwise) marginal association between each predictor and the response, it typically also retains null predictors that are strongly associated with a significant predictor; the retained predictors are therefore highly correlated among themselves. Hence, to reduce the high correlation in the active set, our strategy is to exploit the conditional dependency structure and divide the active variables $\{X_j : j \in \hat{S}_n\}$ into $p_c\ (\ll p)$ non-overlapping clusters $C_1, C_2, \ldots, C_{p_c}$. By the sure screening property, with asymptotically high probability, $S_0 \subset \bigcup_{j=1}^{p_c} C_j$.
The use of a sparse precision matrix to understand the dependence structure of a high-dimensional feature space is well established in the statistics literature (e.g., Lauritzen (1996), Shojaie and Michailidis (2010)) because of its scalability. In some contexts, it is more informative than the analysis of a simple covariance matrix. For example, in the human brain two separate regions can be highly correlated with no direct relation, purely because of their strong interaction with a common third region. Understanding the conditional dependence structure and using it to cluster the brain regions is therefore more informative for studying functional connectivity in the human brain (Das et al., 2017); simple correlation-based clustering would instead produce huge clusters with less interpretable groups of brain regions. To this end, in order to estimate the precision matrix, we implement the nodewise Lasso algorithm (Van de Geer et al., 2014) on the transformed variables $(\tilde{T}_y(Y), \tilde{T}_k(X_k))$, $k \in \hat{S}_n$. Nodewise Lasso regression is commonly used to estimate a sparse precision matrix in the Gaussian graphical model by performing a separate Lasso regression of each predictor on the others; the tuning parameter in each nodewise Lasso is typically selected by cross-validation. More details on this algorithm and its theoretical guarantees can be found in Meinshausen and Bühlmann (2006). Let $\hat{\Sigma}^{-1}$ be the precision matrix estimated by the nodewise Lasso algorithm, and let $\rho(Z, Z')$ denote any correlation metric for two random variables $Z$ and $Z'$, e.g., Pearson's correlation. Algorithm 3.1 summarizes the clustering step. Here we not only cluster the active predictors but also select an appropriate representative from each cluster. First, for each active predictor $X_i$, $i \in \hat{S}_n$, we collect all the other active predictors conditionally dependent on $X_i$ to form the cluster $C_i$. Although clustering using conditional dependence produces smaller clusters, there may still be overlap owing to the complex association structure in the original predictor space. Hence, to reduce excessive inter-cluster correlation, we merge all clusters whose maximum pairwise correlation exceeds a pre-specified threshold $r$ (we typically set r = 0.9). Next, each cluster is updated by adding all the other features conditionally dependent on the existing cluster members. Finally, to find the appropriate cluster representatives, we use the HZ test statistic $\tilde{w}^*_k$ of equation 3.1, which quantifies the discrepancy between the distribution of each nonparanormally transformed pair and the null distribution $N_2(0, I_2)$. So, for cluster $C_i$, we select the variable $R_i = \arg\max_{j\in C_i}\{\tilde{w}^*_j\}$, indicating its strongest association with the response variable compared with the other predictors in the cluster.
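As a concrete companion to Algorithm 3.1, the sketch below estimates the conditional-dependence neighbourhoods by nodewise Lasso, merges highly correlated clusters, and picks a representative per cluster. It is a minimal sketch only: cross-validated Lasso fits and Pearson correlation are used as stand-ins for the choices discussed above, and the helper names are ours rather than part of SciDNet.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def cluster_active_set(Z, w_stats, r=0.9):
    """Z: n x p1 matrix of nonparanormally transformed active predictors;
    w_stats: their HZ screening statistics; r: merge threshold of Algorithm 3.1."""
    n, p1 = Z.shape
    # Nodewise Lasso: the nonzero coefficients of each regression define the
    # conditional-dependence neighbourhood of that node (Meinshausen-Buhlmann).
    neighbours = []
    for i in range(p1):
        others = np.delete(np.arange(p1), i)
        coef = LassoCV(cv=5).fit(Z[:, others], Z[:, i]).coef_
        neighbours.append(set(others[coef != 0]) | {i})

    corr = np.abs(np.corrcoef(Z, rowvar=False))
    clusters = [set(nb) for nb in neighbours]
    merged = True
    while merged:                       # merge clusters whose max correlation >= r
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] and clusters[j]:
                    block = corr[np.ix_(sorted(clusters[i]), sorted(clusters[j]))]
                    if block.max() >= r:
                        clusters[i] |= clusters[j]
                        clusters[j] = set()
                        merged = True
    clusters = [sorted(c) for c in clusters if c]
    reps = [max(c, key=lambda l: w_stats[l]) for c in clusters]
    return clusters, reps
```

The merge loop mirrors the double loop of Algorithm 3.1; in practice one would also symmetrize the neighbourhoods and reuse the screening statistics computed in the previous step.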
3.2.3 Cleaning with Deep Neural Network (DNN)

We start the cleaning step by modeling the response $Y$ on the cluster representatives $X_{\tilde{S}_n}$ obtained in Section 3.2.2.2. In order to perform error-controlled variable selection, each representative is assigned an importance score, followed by a resampling algorithm to control the FDR. While any generic sparsity-inducing DNN procedure could be adopted, here we focus on the LassoNet algorithm recently proposed by Lemhadri et al. (2021) for its elegant mathematical framework, which naturally sets the stage for nonlinear feature selection. To approximate the unknown functional relationship, LassoNet considers the class of all fully connected feed-forward residual neural networks, $\mathcal{F} = \{f \equiv f_{\theta,W} : x \mapsto \theta^T x + h_W(x)\}$. Here, $W$ denotes the network parameters, $K$ denotes the size of the first hidden layer, $W^{(0)} \in \mathbb{R}^{p\times K}$ denotes the first-hidden-layer parameters, and $\theta \in \mathbb{R}^p$ denotes the residual layer's weights. In order to minimize the reconstruction error, the LassoNet objective can be formulated as

$$\min_{\theta, W}\ L(\theta, W) + \lambda\|\theta\|_1 \quad \text{subject to}\ \|W_j^{(0)}\|_\infty \le M|\theta_j|,\ j = 1, 2, \ldots, p, \qquad (3.2)$$

with $L(\theta, W) = \frac{1}{n}\sum_{i=1}^{n} l\big(f_{\theta,W}(x_i), y_i\big)$ the empirical loss on the training data and $x_i$ the vector of cluster representatives observed for the $i$-th individual. While the feature sparsity is induced by the $L_1$ norm on the residual layer parameter $\theta$, the second constraint controls the total amount of nonlinearity allowed for each predictor. As noted in Lemhadri et al. (2021), LassoNet can be viewed as an extension of the celebrated Lasso algorithm to nonlinear variable selection. In an $L_1$ penalization framework, the importance of a specific feature is naturally embedded in the highest penalization level up to which it survives in the model. So, to measure the importance of each representative, the LassoNet algorithm is executed over a long range of tuning parameters $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_r$ on $(Y, X_{\tilde{S}_n})$. In practice, a small value is fixed for $\lambda_1$, at which all the variables are present in the model; the tuning parameter is then gradually increased up to $\lambda_r$, at which no variables remain in the model. Next, the importance score for the $j$-th cluster is defined as $\hat{\lambda}_j$, the maximum value of $\lambda$ up to which the $j$-th representative remains in the model, and the following rank statistic is computed: $I_j = \sum_{j'\ne j} 1\big(\hat{\lambda}_j \le \hat{\lambda}_{j'}\big)$ for $j = 1, 2, \ldots, C$. A lower $I_j$ means that the $j$-th cluster representative stays in the model up to a larger value of $\lambda$, implying its high potential as a significant cluster. In contrast, a higher $I_j$ indicates that the corresponding cluster leaves the model even for a small value of $\lambda$, as a consequence of being simply a collection of null features. Hence, we focus only on the clusters with lower ranks. Additionally, in order to control the FDR, understanding the behavior of the predictors under the null distribution is important. In traditional FDR-controlling algorithms, this is typically done through p-values. Here, as a p-value-free algorithm, we propose the following resampling-based approach (a code sketch follows the list):

1. Generate $B$ bootstrap versions of the data $\{Y^b, X^b_{\tilde{S}_n}\}_{b=1}^{B}$, considering only the cluster representatives $\tilde{S}_n$. For each bootstrap version, run the LassoNet algorithm in parallel, calculate the importance of each representative as $\hat{\lambda}^b_j$, the maximum value of $\lambda$ up to which the $j$-th predictor remains in the model for the $b$-th bootstrap version, and compute the ranks $I_j^b = \sum_{j'\ne j} 1\big(\hat{\lambda}^b_j \le \hat{\lambda}^b_{j'}\big)$. The averaged rank is $\bar{I}_j = \frac{1}{B}\sum_{b=1}^{B} I_j^b$.

2. For an arbitrary threshold $\delta$, we select the cluster representatives with averaged rank $\bar{I}_j$ below $\delta$; we define $N_+(\delta) = \sum_{j\in\tilde{S}_n} 1(\bar{I}_j \le \delta)$, the number of selected clusters with respect to the cutoff $\delta$.

3. Next, to estimate the expected number of falsely discovered clusters, define $R_b = \{j : I_j^b \le \delta\}$, the set of cluster representatives whose importance score in the $b$-th bootstrap version is high enough that the corresponding rank falls below the cutoff $\delta$. Additionally, define a neighbourhood $N(\bar{I}_j, \kappa) = \{l \in \{1, 2, \ldots, C\} : \bar{I}_j - l \le \kappa\}$ for some small number $\kappa$.

4. We then estimate the number of falsely discovered clusters, and hence an estimator of the FDR can be constructed as $\widehat{FDR}(\delta) = \frac{\hat{e}_0(\delta)}{N_+(\delta)}$, where

$$\hat{e}_0(\delta) = \frac{2}{B}\sum_{b=1}^{B}\sum_{j\in R_b} 1\big(I_j^b \notin N(\bar{I}_j, \kappa)\big). \qquad (3.3)$$

5. The $\widehat{FDR}$ is sequentially estimated with $\delta = \bar{I}_{(1)}, \bar{I}_{(2)}, \ldots, \bar{I}_{(C)}$, and the optimum threshold is $\Delta^* = \max\{\delta > 0 : \widehat{FDR}(\delta) < q\}$ for some pre-specified FDR control level $q$. The final selected set of clusters with controlled FDR is $\hat{D}_n = \{C_j,\ j = 1, 2, \ldots, C : \bar{I}_j \le \Delta^*\}$.
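The cutoff search in steps 2 through 5 reduces to simple bookkeeping once the matrix of bootstrap ranks is available. The sketch below is a minimal illustration under the assumption that `ranks` holds the values $I_j^b$ from $B$ bootstrap LassoNet fits computed elsewhere; it is not the packaged implementation.

```python
import numpy as np

def select_clusters(ranks, kappa, q=0.2):
    """ranks: B x C array of bootstrap ranks I_j^b (lower = more important).
    Returns the largest admissible cutoff and the clusters selected at level q."""
    B, C = ranks.shape
    avg = ranks.mean(axis=0)                    # averaged ranks I-bar_j
    best_delta, selected = None, np.array([], dtype=int)
    for delta in np.sort(avg):                  # candidate cutoffs I-bar_(1), ..., I-bar_(C)
        n_plus = np.sum(avg <= delta)
        # e0_hat counts bootstrap ranks below delta that fall outside N(I-bar_j, kappa),
        # i.e. ranks that are implausibly small relative to the feature's average.
        in_R = ranks <= delta
        outside = ranks < (avg[None, :] - kappa)
        e0_hat = 2.0 / B * np.sum(in_R & outside)
        if n_plus > 0 and e0_hat / n_plus < q:
            best_delta = delta                  # candidates are scanned in increasing order
            selected = np.where(avg <= delta)[0]
    return best_delta, selected
```

With the phase-transition choice of $\kappa$ described next, only bootstrap ranks that are implausibly good relative to a feature's average contribute to $\hat{e}_0(\delta)$, so stable non-null clusters are not penalized.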
Additionally, define a neighbourhood $N(\bar I_j, \kappa) = \{l \in \{1, 2, \ldots, C\}: |\bar I_j - l| \le \kappa\}$ for some specific small number $\kappa$.

4. Further, we estimate the number of falsely discovered clusters, and hence an estimator of the FDR can be constructed as $\widehat{FDR}(\delta) = \frac{\hat e_0(\delta)}{N_+(\delta)}$, where

$$\hat e_0(\delta) = \frac{2}{B}\sum_{b=1}^B \Big\{ \sum_{j \in R^b} 1\big(I^b_j \notin N(\bar I_j, \kappa)\big) \Big\}. \qquad (3.3)$$

5. The $\widehat{FDR}$ is evaluated sequentially at $\delta = \bar I_{(1)}, \bar I_{(2)}, \ldots, \bar I_{(C)}$, and the optimum threshold is $\Delta^* = \max\{\delta > 0: \widehat{FDR}(\delta) < q\}$ for some pre-specified FDR control level $q$. The final selected set of clusters with controlled FDR is given by $\hat D_n = \{C_j,\ j = 1, 2, \ldots, C: \bar I_j \le \Delta^*\}$.

The proposed method has a close resemblance to an FDR-controlling approach. The notion of false discovery is incorporated into the algorithm via the resampling: if a null predictor receives a relatively high importance score, it is most likely because that specific bootstrap version created a spurious relation, and this would generally not be consistent across the other bootstrap versions. On the other hand, all the bootstrap versions should consistently produce high importance scores for the significant predictors. As a consequence, the variability in the ranks of the importance scores will be much higher for the null predictors than for their non-null counterparts. This idea has been exploited in the statistics literature through bagging methods (Breiman, 1996; Bühlmann and Yu, 2002) for reducing the variance of a black-box prediction. The proposed method uses this phase transition in the feature selection framework to effectively identify the false discoveries; an empirical illustration is provided below. The theoretical investigation is relegated to Appendix A, where, in the spirit of Ng and Newton (2022), we use the idea of random-weighted group-Lasso penalization to mimic the resampling setup.

How to choose the hyperparameter κ: Choosing an appropriate value of κ has a significant effect on the performance of SciDNet. A higher value of κ may lead to weaker control over the inclusion of false discoveries, whereas a small κ imposes tighter error control and thus reduces power. However, we propose an effective way to tune κ with the assistance of the phase transition in the ranks of the importance scores $\bar I_j$ of the cluster representatives.

Figure 3.2 Illustration of phase transition using synthetic data: Features selected by SciDNet at q = 0.2 clearly have lower bootstrap variability compared to the other irrelevant features.

For an illustration, in Figure 3.2 we consider a single index model (see Section 4.2 for more details of the data-generating mechanism), and the features with the top 15 importance scores are shown along the x-axis. The first 5 representative features are the only relevant predictors (indicated by the vertical dotted red line). Along the y-axis, the center of the ellipse for each feature represents the rank of the importance scores $\bar I_j$ averaged over 50 bootstrap replications, and the area of each ellipse represents the bootstrap variability around the averaged score. One observes a clear phase transition in the bootstrap distribution of the ranks: for the significant features, the ranks are low with extremely precise estimates, whereas for the remaining null features the averaged ranks are much higher and come with huge variability, producing bigger ellipses. Hence, for a compact neighborhood $N(\bar I_j, \kappa)$ to capture only the small variability in the bootstrap ranks of the significant features, we simply fix $\kappa = K^*$ (in Figure 3.2), the phase transition point for the averaged rank. This phase transition property is further illustrated on real data in Appendix B.0.2.
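For readers who prefer pseudocode, the resampling steps above can be summarized in the following minimal Python sketch. The helper names (importance_scores, scidnet_cleaning) and all default values are ours, and scikit-learn's linear Lasso path is used purely as a stand-in for the LassoNet path; this illustrates the bookkeeping of steps 1-5, not the released implementation.

import numpy as np
from sklearn.linear_model import lasso_path

def importance_scores(X, y):
    """lambda_hat_j = largest penalty at which feature j is still in the model."""
    alphas, coefs, _ = lasso_path(X, y)        # coefs: (p, n_alphas), alphas decreasing
    active = np.abs(coefs) > 0
    return np.max(np.where(active, alphas[None, :], 0.0), axis=1)

def ranks(lam_hat):
    """I_j = number of other features with an importance score at least as large."""
    return np.array([np.sum(lam_hat[j] <= np.delete(lam_hat, j)) for j in range(len(lam_hat))])

def scidnet_cleaning(X, y, B=50, q=0.2, kappa=2, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    I_boot = np.empty((B, p))
    for b in range(B):                          # bootstrap the rows, re-rank each time
        idx = rng.integers(0, n, n)
        I_boot[b] = ranks(importance_scores(X[idx], y[idx]))
    I_bar = I_boot.mean(axis=0)
    best_delta, selected = None, np.array([], dtype=int)
    for delta in np.sort(I_bar):                # scan candidate cutoffs
        n_plus = np.sum(I_bar <= delta)
        # estimated false discoveries: low bootstrap ranks lying far from their own average
        e0 = 2.0 / B * np.sum((I_boot <= delta) & (np.abs(I_boot - I_bar) > kappa))
        if n_plus > 0 and e0 / n_plus < q:
            best_delta, selected = delta, np.where(I_bar <= delta)[0]
    return selected, best_delta

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, :3] @ np.array([3.0, -2.0, 2.0]) + rng.standard_normal(200)
print(scidnet_cleaning(X, y, B=20, q=0.2, kappa=2, rng=1)[0])

The returned indices are the cluster representatives whose averaged rank clears the largest cutoff with estimated FDR below q, exactly as in step 5.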
Figure 3.3 Illustration of the effect of multicollinearity on the feature selection methods.

3.3 Numerical Illustrations

In this section, we investigate the finite-sample performance of SciDNet across a wide spectrum of simulation scenarios. We compare SciDNet to several baseline methods that are widely used in practice. The simulation study is conducted on synthetic data generated from a high-dimensional regression problem.

Data Generation: We first consider the single index model as the data-generating mechanism, a straightforward yet flexible example of a nonlinear model. Here the response is related to a linear combination of the features through an unknown nonlinear, monotonic link function, i.e., $y = g(x'\beta) + \epsilon$. We choose the link function $g(x) = \frac{x^3}{10} + \frac{x}{3\cdot 10}$. We set $n = 400$ and $p = 5000$. The coefficient vector $\beta \in \mathbb R^p$ is sparse with true nonzero locations $S_0 = \{50, 150, 250, 350, 450\}$, $s = |S_0| = 5$, where $\beta_{S_0^c} = 0$ and $\beta_{S_0} \sim N_5(u\beta_0, 0.1 I_{5\times 5})$ with $u \in \{\pm 1\}^5$. The value of $\beta_0$ is set at $\beta_0 = 2, 4$ to incorporate varying signal strength. The random error is $\epsilon \sim N(0, \sigma^2)$, with three increasing noise levels $\sigma^2 = 1, 5, 10$. The high-dimensional predictors are generated from $X \sim N_p(0, \Sigma)$, where the covariance matrix $\Sigma$ is chosen as a Toeplitz matrix with $\Sigma_{ij} = \rho^{|i-j|}$. To check the effect of multicollinearity, we consider three different settings: $\rho = 0.1, 0.5, 0.95$ (a minimal sketch of this generator is given at the end of Section 3.3.1).

Evaluation Metrics: Let $\hat D_n$ denote the set of features selected by an algorithm. We use the following three metrics to evaluate the feature selection algorithms:

1. Power $= \frac{|\hat D_n \cap S_0|}{|S_0|}$, the proportion of relevant features that are correctly identified;

2. Empirical FDR $= \frac{|\hat D_n \cap S_0^c|}{|\hat D_n|}$, the proportion of falsely identified features among all the identified features;

3. $n\_var = |\hat D_n|$, the total number of features selected, and $n\_clust = |\tilde S_n|$, the number of clusters selected by a group feature selection method such as SciDNet.

The whole experiment is repeated over 100 Monte Carlo replications. We summarize the results from the empirical evaluations in the following three subsections: (1) the effect of multicollinearity on major existing feature selection methods, (2) the power vs. FDR balance of SciDNet, and (3) an ablation study showing the effectiveness of each step of SciDNet.

3.3.1 The effect of multicollinearity on major existing feature selection methods

We present a simulation study to evaluate the performance of our proposed algorithm in comparison with the following five baseline feature selection methods:

1. Lasso (Tibshirani, 1996): $L_1$-penalized linear regression to prevent overfitting and improve model interpretability.

2. Model-X knockoff (Candès et al., 2018): A theoretically guaranteed statistical method for FDR control in high-dimensional variable selection. It constructs "knockoff" variables that mimic the correlations among the original variables and their relationship with the response, allowing control of the FDR while identifying important variables.
3. SurvNet (Song and Li, 2021): A DNN-based FDR-controlling method for feature selection, applicable to high-dimensional, large datasets.

4. Deep Feature Selection (DFS) (Chen et al., 2021): A DNN method for feature selection in high-dimensional settings with complex nonlinear relationships, based on $L_0$ penalization.

5. LassoNet (Lemhadri et al., 2021): The nonlinear extension of the Lasso. It combines the advantages of $L_1$ penalization and neural network structures to identify the important features.

Although many other methods exist in the literature for ultra-high dimensional feature selection, we choose these five baselines because of their wide applicability and reliable theoretical guarantees. These five methods can also be viewed as representatives of different classes of algorithms: Model-X knockoff represents the large class of knockoff-based algorithms, whereas SurvNet, DFS, and LassoNet illustrate DL-based feature selection. Among the five baselines, Model-X knockoff and SurvNet are designed to control the FDR, and we use q = 0.2 as the FDR-control threshold. Lasso, LassoNet, and DFS require proper tuning of their $L_1$ or $L_0$ penalty parameters, which can be done via a grid search. For this purpose, we optimize a BIC-type criterion (Chen and Chen, 2008) to tune these hyperparameters, as suggested by the authors of Chen et al. (2021). For the practical implementation of these baseline methods, we used the code and hyperparameters provided by the authors in the respective papers.

Figure 3.3 reports the power, FDR, and n_var of all these methods under different correlation strengths and varying noise levels, fixing $\beta_0 = 2$. As SciDNet selects features as clusters, n_clust is also reported in numbers alongside n_var in the third row of Figure 3.3. One observes that the baseline methods perform poorly as the autocorrelation increases: they lose power and produce inflated false discoveries. The FDR-controlling methods, Model-X knockoff and SurvNet, fail to keep their FDR below the pre-specified threshold q = 0.2 under higher autocorrelation. This is somewhat expected, as these methods are not tailored to such strong multicollinearity. This demonstration empirically motivates the need for a feature selection method designed for highly correlated feature spaces. The DL-based methods, LassoNet, DFS, and SurvNet, additionally suffer from insufficient data under the current big-p-small-n setting. Our additional experiments, presented in the supplementary material, demonstrate how these DL-based models achieve a better power-FDR balance given sufficient training data and moderate correlation among the features. Compared with these baselines, SciDNet successfully maintains its power while controlling the FDR below the pre-specified threshold q = 0.2 irrespective of the correlation strength. Moreover, under moderate multicollinearity, when there is no need for clustering, most of the clusters selected by SciDNet are singleton sets. Under excessive multicollinearity, on the other hand, individual features become almost indistinguishable, and SciDNet addresses this added uncertainty by selecting larger clusters for higher autocorrelation, as demonstrated in the third row of Figure 3.3.
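For concreteness, the synthetic design and the evaluation metrics of this section can be produced along the following lines. This is a minimal sketch under the settings stated in the Data Generation paragraph; the function names and the reduced dimension in the demo call are ours, and the link follows the form stated above.

import numpy as np

def generate_data(n=400, p=5000, rho=0.95, beta0=2.0, sigma2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Toeplitz covariance Sigma_ij = rho^|i-j|, sampled through its Cholesky factor
    cov_chol = np.linalg.cholesky(rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p))))
    X = rng.standard_normal((n, p)) @ cov_chol.T
    S0 = np.array([50, 150, 250, 350, 450]) - 1      # 0-based indices of the true signals
    beta = np.zeros(p)
    u = rng.choice([-1.0, 1.0], size=S0.size)
    beta[S0] = rng.normal(u * beta0, np.sqrt(0.1))
    eta = X @ beta
    y = eta ** 3 / 10 + eta / 30 + rng.normal(0, np.sqrt(sigma2), n)   # link g from the text
    return X, y, S0

def power_fdr(selected, S0):
    selected = np.asarray(selected)
    tp = np.intersect1d(selected, S0).size
    power = tp / S0.size
    fdr = 0.0 if selected.size == 0 else (selected.size - tp) / selected.size
    return power, fdr

X, y, S0 = generate_data(p=500)       # smaller p only to keep the example fast
print(power_fdr([49, 149, 7], S0))    # two true positives out of three selections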
3.3.2 The Power vs. FDR balance of SciDNet

Figure 3.4 Power vs FDR balance for SciDNet.

One major validation for an FDR-controlling method is to check how it recovers its power as the FDR-control threshold q is gradually increased. Figure 3.4 shows this power-FDR trade-off for SciDNet. Section 3.3.1 also demonstrates that the baseline methods are not well suited for highly correlated feature spaces; as SciDNet is specifically designed for this setting, for ease of illustration we only present the performance of SciDNet at ρ = 0.95. Figure 3.4 illustrates how our method enjoys a quick recovery in power when the FDR-controlling threshold q is gradually increased from 0.01 to 0.20, while keeping the number of false discoveries below the required level. This illustration empirically validates SciDNet as an FDR-controlled feature selection method.

3.3.3 An ablation study

Figure 3.5 Ablation study for SciDNet.

The proposed method SciDNet is a multi-step procedure. Its screening step reduces the dimension while retaining the main important features and clustering the highly correlated features. Following this, the cleaning step uses a sparsity-inducing DL model and finally selects clusters of features by controlling the FDR via resampling. We aim to analyze the impact of these two steps on the overall performance of the model by adding them one by one and observing the change in the evaluation metrics. We also compare our proposed cleaning step (i.e., resampled LassoNet with FDR estimation) with possible alternatives such as using knockoffs for the cleaning. Hence, for the ablation study, we compare the following four methods under varying signal and noise strength:

1. Screening only;

2. Screening + LassoNet: LassoNet as the cleaning step;

3. Screening + Knockoff: Model-X knockoff as the cleaning step, with the FDR-controlling threshold q = 0.2;

4. Screening + resampled LassoNet, i.e., SciDNet, with two FDR-controlling thresholds q = 0.1 and 0.2.

The power, empirical FDR, and n_var are illustrated in Figure 3.5; in addition, the selected number of clusters, n_clust, is written in numbers in the third row of Figure 3.5. The study empirically establishes several interesting characteristics: (a) due to the sure-screening property of Section 3.2.2.1, the screening step selects a slightly larger set of features, resulting in high power but also high FDR, which necessitates further cleaning; (b) all the alternative cleaning steps further eliminate null features and reduce the FDR while maintaining the power of the screening step; (c) for the high-signal, low-noise case, SciDNet is comparable to the alternatives, but in difficult scenarios such as low signal and high noise, which are common in modern genomic and imaging datasets, SciDNet maintains its performance. One would notice that Model-X knockoff loses its FDR control at the nominal level q = 0.2 and produces inflated FDR in the presence of high noise. In contrast, by effectively using the added information from the resampling, SciDNet achieves the best performance in terms of the power-FDR tradeoff and continues to keep its FDR below the nominal level q. Overall, the ablation study shows that the cleaning step following the screening step contributes substantially to the overall performance of SciDNet and that removing either step leads to a significant drop in accuracy. The screening reduces the dimension of the feature space by removing a large chunk of irrelevant features, which in turn makes the cleaning step computationally very efficient.
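On the computational side, the B bootstrap fits that dominate the cost of the cleaning step are mutually independent and can be distributed across cores. A minimal sketch follows; joblib is our choice of backend, and fit_one_bootstrap merely stands in for one LassoNet path fit on a resampled data set.

import numpy as np
from joblib import Parallel, delayed

def fit_one_bootstrap(X, y, seed):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), len(y))
    # ... in the actual method: fit the LassoNet path on (X[idx], y[idx]) and return I^b ...
    return np.argsort(np.argsort(-np.abs(X[idx].T @ y[idx])))   # placeholder rank vector

X = np.random.default_rng(0).standard_normal((100, 20))
y = X[:, 0] + 0.1
I_boot = np.vstack(Parallel(n_jobs=-1)(delayed(fit_one_bootstrap)(X, y, b) for b in range(50)))
print(I_boot.shape)   # (B, p) matrix of bootstrap ranks, ready for the FDR estimate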
Additionally, one would notice that the different tasks in the screening and cleaning steps can easily be run in parallel. Specifically, across all the simulation studies SciDNet takes about 4.1 minutes to complete. In our experience, this is highly competitive with other methods such as DFS, especially when an exhaustive grid search has to be performed to optimize the hyperparameters of several baseline methods. We also conducted a further simulation study considering several nonlinear models as data-generating processes with Gaussian and non-Gaussian features. The results, presented in Appendix B.0.3, further substantiate the findings above: SciDNet maintains a satisfactory power-FDR balance for various complicated nonlinear models with and without interaction terms. The hyperparameter selection and further implementation details of SciDNet are relegated to Appendices B.0.1 and B.0.5, respectively.

3.4 Real Data Analysis

In addition to the simulation studies, we apply the proposed algorithm SciDNet to two publicly available gene-expression data sets: the CCLE dataset and the riboflavin dataset. We substantiate the findings in two ways. First, we provide supporting evidence from the domain literature. Second, as a more data-aligned validation, we demonstrate that several generic prediction models gain significantly in test accuracy when applied only to the few features selected by SciDNet, compared with predictions based on the whole feature space. For this purpose, consistent with other genomic studies, we use the prediction correlation Corr(Y_Pred, Y_Test) in addition to the test MSE as metrics of test performance. To overcome the added burden of low sample size and ultrahigh dimensionality in these data sets, we consider 50 independent replications in which the data are split into training and testing sets in an 8:2 ratio. The final estimate is obtained by averaging the test MSEs calculated over these replications; the correlation metric is treated analogously.

3.4.1 Selection of Drug Sensitive Genes using CCLE dataset

Figure 3.6 A snapshot of correlation strength for the first 100 genes considered for the drug Topotecan in the CCLE dataset (left) and the riboflavin dataset (right).

A recent large-scale pharmacogenomics study, the Cancer Cell Line Encyclopedia (CCLE, link available here), investigated multiple anticancer drugs over hundreds of cell lines. Its main objective is to untangle the response mechanism of anticancer drugs, which is critical to precision medicine. The data set consists of dose-response curves for 24 different drugs across over n = 400 cell lines. For each cell line, it contains the expression data of p = 18,926 genes, which we consider as features. For the response, we use the activity area (Barretina et al., 2012) to measure the sensitivity of a drug for each cell line. Here we seek to uncover the sets of genes associated with sensitivity to the following five anticancer drugs: Topotecan, 17-AAG, Irinotecan, Paclitaxel, and AEW541. These drugs have been used to treat ovarian cancer, lung cancer, and other cancer types. Previous research on these drugs and the related gene expression data can be found elsewhere (Barretina et al., 2012).
Table 3.1 Drug-sensitive genes identified by SciDNet and related prediction performance

Drug       | # genes (clusters), SciDNet | # genes, LassoNet | Test MSE, LassoNet | Test MSE, SciDNet+MLP | Test MSE, SciDNet+RT | Corr, LassoNet | Corr, SciDNet+MLP | Corr, SciDNet+RT
Topotecan  | 25 (9)  | 18469 | 1.25 (0.21) | 1.23 (0.14) | 0.81 (0.16) | 0.47 (0.11) | 0.58 (0.06) | 0.69 (0.07)
17-AAG     | 12 (8)  | 7152  | 1.04 (0.16) | 1.05 (0.09) | 0.83 (0.15) | 0.20 (0.16) | 0.33 (0.10) | 0.49 (0.10)
Irinotecan | 18 (7)  | 17727 | 0.93 (0.20) | 1.09 (0.18) | 0.61 (0.13) | 0.59 (0.10) | 0.63 (0.07) | 0.73 (0.08)
Paclitaxel | 18 (8)  | 16437 | 1.46 (0.33) | 1.46 (0.23) | 1.11 (0.24) | 0.44 (0.14) | 0.45 (0.11) | 0.59 (0.09)
AEW541     | 12 (10) | 15145 | 0.33 (0.06) | 0.39 (0.09) | 0.27 (0.05) | 0.30 (0.14) | 0.49 (0.10) | 0.47 (0.12)

SciDNet produces multi-resolutional clusters of genes for each of the five drugs considered, and the clusters are interpretable from the domain-science perspective. For example, SciDNet discovers SLFN11 as the top drug-sensitive gene for the drugs Topotecan and Irinotecan. This is consistent with previous findings, as Barretina et al. (2012) and Zoppoli et al. (2012) reported the gene SLFN11 to be highly predictive for both drugs. For the drug 17-AAG, SciDNet discovers the gene NQO1 as the most important gene, which is known to be highly sensitive to 17-AAG (Hadley and Hendricks, 2014). The full table containing all the genes selected by SciDNet at the q = 15% error-control level, together with relevant findings from previous genomic studies, is relegated to Appendix B.0.6.

Additionally, since in a real-data setting it is difficult to verify the performance of a feature selection algorithm directly, we present a more data-oriented statistical evaluation as further endorsement of SciDNet's discoveries. From the prediction perspective, one would expect a prediction model fitted only on the handful of features selected by a successful feature selection algorithm to maintain performance similar to a model fitted on the whole feature space; in some cases, it may even enhance accuracy. To validate this, we randomly split the data into training and testing sets in an 8:2 ratio. First, SciDNet is run on the training part, and then two separate prediction models are fitted using only the selected features: (1) an MLP, a feed-forward multi-layer perceptron with two hidden layers, and (2) a bagged regression tree (Breiman, 1996). These two experiments are henceforth called "SciDNet+MLP" and "SciDNet+RT". Next, the test data are used to assess the out-of-sample prediction accuracy. Furthermore, similar to the simulation study in Section 4.2, we separately fit LassoNet on the training data for its simultaneous sparsity-inducing, prediction-optimal characteristics. The results, summarized in Table 3.1, indicate several interesting points. First, in pursuit of a prediction-optimal result, LassoNet fails to capture the sparsity and discovers a huge number of genes. This is somewhat expected, as most prediction-optimal sparse methods tend to select a larger set of features to maintain prediction quality (Wasserman and Roeder, 2009). In contrast, SciDNet produces only about 10 clusters with an average cluster size of about 2.5. Even with this drastic dimension reduction, the gains in test MSE and Corr(Y_Pred, Y_Test) further demonstrate that SciDNet successfully retains all the significant predictors. One would further notice that SciDNet+RT achieves the most stable performance, consistent across all the drugs as well as in our simulation study.
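The repeated-split evaluation described above can be sketched as follows. The model choices mirror the text (a two-hidden-layer MLP and bagged regression trees), but the layer widths, iteration caps, and function names are ours, chosen only for illustration, and the columns listed in `selected` are assumed to come from a SciDNet run on the training split.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

def evaluate(X, y, selected, n_reps=50, seed=0):
    scores = {"MLP": [], "RT": []}
    for rep in range(n_reps):
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y, test_size=0.2,
                                                  random_state=seed + rep)
        models = {"MLP": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000),
                  "RT": BaggingRegressor()}         # bagged regression trees
        for name, model in models.items():
            y_hat = model.fit(X_tr, y_tr).predict(X_te)
            scores[name].append((mean_squared_error(y_te, y_hat),
                                 np.corrcoef(y_hat, y_te)[0, 1]))   # (test MSE, prediction corr)
    return {name: np.mean(vals, axis=0) for name, vals in scores.items()}

# toy usage with synthetic data in place of the CCLE expression matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 100))
y = X[:, [3, 7]] @ np.array([2.0, -1.5]) + rng.standard_normal(300)
print(evaluate(X, y, selected=[3, 7, 11], n_reps=5))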
These experiments demonstrate that a black-box predictive model produces more accurate results when applied to the features selected by SciDNet than when implemented on the whole feature space, indicating SciDNet's potential use in both feature selection and prediction.

3.4.2 Selection of associated genes in Riboflavin production data set

Table 3.2 Prediction performance of SciDNet for the riboflavin production data set

Algorithm          | Test MSE    | Corr(Y_Pred, Y_Test)
LassoNet           | 0.83 (0.18) | 0.36 (0.30)
SciDNet + LassoNet | 0.19 (0.15) | 0.89 (0.12)
SciDNet + RT       | 0.42 (0.16) | 0.74 (0.15)
SciDNet + MLP      | 2.64 (0.79) | 0.28 (0.37)

We further apply SciDNet to the riboflavin (vitamin B2) production data for Bacillus subtilis, a publicly accessible dataset available in the 'hdi' package in R. Here the continuous response is the logarithm of the riboflavin production rate, observed for n = 71 samples along with the logarithm of the expression levels of p = 4088 genes, which are treated as the predictors. Unlike the CCLE data, significant multicollinearity is present throughout most of the riboflavin data set, as demonstrated in Figure 3.6. In determining which genes are important for riboflavin production, SciDNet finds 9 clusters comprising a total of 160 correlated genes at the q = 15% FDR control level, giving an average cluster size of about 17.78, much larger than in the previous analysis. SciDNet discovers the gene YCIC_at as one of the expressive genes related to riboflavin production, a gene identified by Bühlmann et al. (2014) as causal in this context. The full list of gene clusters selected by SciDNet is relegated to Appendix B.0.8.

The results of the empirical validation of SciDNet's feature selection on the riboflavin dataset are presented in Table 3.2. Here SciDNet+MLP performs poorly: the inflated cluster dimensions make the input layer of the MLP comparatively large, while the number of training data points is only about 57. This calls for a sparse model, and we adopt the idea of the relaxed Lasso, first proposed by Meinshausen (2007): we fit LassoNet again on the features selected by SciDNet, which clearly improves the prediction accuracy. Consistent with the previous experiments, SciDNet+RT effectively maintains its prediction performance. This further underscores the need to apply an apt feature selection method before fitting a predictive model when an explainable research outcome is desired.

3.5 Discussion

While explainable AI is the need of the hour, statistical models coupled with cutting-edge ML techniques must push forward because of their solid theoretical foundation combined with principled algorithmic advancement. The proposed method SciDNet efficiently exploits several existing tools from the statistics and ML literature to circumvent complexities that a simple adaptation of current DL-based models fails to address appropriately. The basic intuition and encouraging empirical results of SciDNet on simulated and real datasets open avenues for further research. For example, one may be interested in developing a theoretical foundation for this 'screening' and 'cleaning' strategy with provable FDR control.
It would be worth mentioning that although we used the sure independence screening with HZ-test and LassoNet as the main tools, SciDNet puts forward a more generic framework and can be implemented with any other model-free feature screening method and sparsity-inducing DL algorithms like Feng and Simon (2017). In the screening part, a further methodological extension would consider relaxing the assumption of nonparanormally distributed features for a more flexible approach. Additionally, as the dimensionality is reduced after the screening step, it would be interesting to implement model-free knockoff generating algorithms like Romano et al. (2020) in the cleaning step as further algorithmic development. One limitation is that we mainly focus on the regression setup with the continuous outcome because of the requirements of the HZ sure Independence test used in the screening step. For a classification task, any model-free feature screening method like Zhou and Zhu (2018) can be applied in a more general framework. 68 BIBLIOGRAPHY Barber, R. F., Candès, E. J., and Samworth, R. J. (2020). Robust inference with knockoffs. The Annals of Statistics, 48(3):1409–1431. Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V., Sonkin, D., et al. (2012). The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603–607. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Method- ological), 57(1):289–300. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2):123–140. Brzyski, D., Peterson, C. B., Sobczyk, P., Candès, E. J., Bogdan, M., and Sabatti, C. (2017). Control- ling the Rate of GWAS False Discoveries. Genetics, 205(1):61–75. Bühlmann, P., Kalisch, M., and Meier, L. (2014). High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278. Bühlmann, P. and Yu, B. (2002). Analyzing bagging. The Annals of Statistics, 30(4):927 – 961. Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: ‘model-x’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 80(3):pp. 551–577. Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771. Chen, Y., Gao, Q., Liang, F., and Wang, X. (2021). Nonlinear variable selection via deep neural networks. Journal of Computational and Graphical Statistics, 30(2):484–492. Das, A., Sampson, A. L., Lainscsek, C., Muller, L., Lin, W., Doyle, J. C., Cash, S. S., Halgren, E., and Sejnowski, T. J. (2017). Interpretation of the Precision Matrix and Its Application in Estimating Sparse Brain Connectivity during Sleep Spindles from Human Electrocorticography Recordings. Neural Computation, 29(3):603–642. Das, D. and Lahiri, S. N. (2019). Distributional consistency of the lasso by perturbation bootstrap. Biometrika, 106(4):957–964. Dinh, V. C. and Ho, L. S. (2020). Consistent feature selection for analytic deep neural networks. Advances in Neural Information Processing Systems, 33:2420–2431. Dorman, S. N., Baranova, K., Knoll, J. H. M., Urquhart, B. L., Mariani, G., Carcangiu, M. L., and 69 Rogan, P. K. (2016). 
Genomic signatures for paclitaxel and gemcitabine resistance in breast cancer derived by machine learning. Mol. Oncol., 10(1):85–100. Elbrächter, D., Perekrestenko, D., Grohs, P., and Bölcskei, H. (2021). Deep neural network approximation theory. IEEE Transactions on Information Theory, 67(5):2581–2623. Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911. Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101. Feng, J. and Simon, N. (2017). Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592. Ghorbani, A., Abid, A., and Zou, J. (2019). Interpretation of neural networks is fragile. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3681–3688. Hadley, K. E. and Hendricks, D. T. (2014). Use of nqo1 status as a selective biomarker for oesophageal squamous cell carcinomas with greater sensitivity to 17-aag. BMC cancer, 14(1):1– 8. Henze, N. and Zirkler, B. (1990). A class of invariant consistent tests for multivariate normality. Communications in Statistics - Theory and Methods, 19(10):3595–3617. Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B., and Eskin, E. (2014). Identifying causal variants at loci with multiple signals of association. Genetics, 198(2):497–508. Jordon, J., Yoon, J., and van der Schaar, M. (2019). KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In International Conference on Learning Representations. Lauritzen, S. L. (1996). Graphical models, volume 17. Clarendon Press. Lee, H. J., Hanibuchi, M., Kim, S.-J., Yu, H., Kim, M. S., He, J., Langley, R. R., Lehembre, F., Regenass, U., and Fidler, I. J. (2016). Treatment of experimental human breast cancer and lung cancer brain metastases in mice by macitentan, a dual antagonist of endothelin receptors, combined with paclitaxel. Neuro-oncology, 18(4):486–496. Lei, L. and Fithian, W. (2018). Adapt: an interactive procedure for multiple testing with side infor- mation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):649–679. Lemhadri, I., Ruan, F., Abraham, L., and Tibshirani, R. (2021). Lassonet: A neural network with feature sparsity. Journal of Machine Learning Research, 22(127):1–29. 70 Li, A. and Barber, R. F. (2019). Multiple testing with the structure-adaptive benjamini–hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):45– 74. Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139. Liang, F., Li, Q., and Zhou, L. (2018). Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972. Liu, H., Lafferty, J., and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(10). Liu, Y. and Zheng, C. (2019). Deep latent variable models for generating knockoffs. Stat, 8(1):e260. Lu, Y., Fan, Y., Lv, J., and Stafford Noble, W. (2018). Deeppink: reproducible feature selection in deep neural networks. 
In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. Meinshausen, N. (2007). Relaxed lasso. Comput. Stat. Data Anal., 52:374–393. Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3):1436 – 1462. Ng, T. L. and Newton, M. A. (2022). Random weighting in lasso regression. Electronic Journal of Statistics, 16(1):3430–3481. Romano, Y., Sesia, M., and Candès, E. (2020). Deep knockoffs. Journal of the American Statistical Association, 115(532):1861–1872. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215. Sesia, M., Katsevich, E., Bates, S., Candès, E., and Sabatti, C. (2019). Multi-resolution localization of causal variants across the genome. bioRxiv. Sesia, M., Sabatti, C., and Candès, E. J. (2018). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1):1–18. Shojaie, A. and Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538. Siegmund, D. O., Zhang, N. R., and Yakir, B. (2011). False discovery rate for scanning statistics. Biometrika, 98(4):979–985. 71 Song, Z. and Li, J. (2021). Variable selection with false discovery rate control in deep neural networks. Nature Machine Intelligence, 3(5):426–433. Tansey, W., Wang, Y., Blei, D., and Rabadan, R. (2018). Black box FDR. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4867–4876. PMLR. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288. Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166– 1202. Wang, G., Sarkar, A., Carbonetto, P., and Stephens, M. (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv. Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. Annals of statistics, 37(5A):2178. Xia, F., Zhang, M. J., Zou, J. Y., and Tse, D. (2017). Neuralfdr: Learning discovery thresholds from hypothesis features. In NIPS. Xue, J. and Liang, F. (2017). A robust model-free feature screening method for ultrahigh- dimensional data. Journal of Computational and Graphical Statistics, 26(4):803–813. Zhang, A. R. and Zhou, Y. (2020). On the non-asymptotic and sharp lower tail bounds of random variables. Stat, 9(1):e314. e314 sta4.314. Zhou, Y. and Zhu, L. (2018). Model-free feature screening for ultrahigh dimensional datathrough a modified blum-kiefer-rosenblatt correlation. Statistica Sinica, 28(3):1351–1370. Zoppoli, G., Regairaz, M., Leo, E., Reinhold, W. C., Varma, S., Ballestrero, A., Doroshow, J. H., and Pommier, Y. (2012). Putative dna/rna helicase schlafen-11 (slfn11) sensitizes cancer cells to dna-damaging agents. Proceedings of the National Academy of Sciences, 109(37):15030–15035. 
APPENDIX A

THEORETICAL STUDY

Although the proposed method SciDNet is demonstrated with the LassoNet algorithm (Lemhadri et al., 2021), any other sparsity-inducing deep learning framework can be adopted in the cleaning step of SciDNet. Hence, for the theoretical study, we consider a broader framework with a general analytic DNN. Also, as the screening step is theoretically guaranteed by Xue and Liang (2017), we start the theoretical study directly from the cleaning step, assuming the iid data $\{Y_i \in \mathbb R, X_i \in \mathcal X\}_{i=1}^n \sim P_D$, where $\mathcal X$ is a bounded open set in $\mathbb R^p$, $p < n$, and the input density $p_x$ is positive and continuous on its domain $\mathcal X$. In the spirit of Dinh and Ho (2020), we consider the following general analytic neural network model $f_\alpha(x)$, an $L$-layer neural network with parameters $\alpha = (t_{in}, T_{in}, t_{out}, T_{out}, S)$ defined by

• Input layer: $h_1(x) = t_{in} + T_{in} x$;

• Hidden layers: $h_j(x) = \phi_{j-1}(S, h_{j-1}(x), h_{j-2}(x), \ldots, h_1(x))$, $j = 2, 3, \ldots, L-1$; and

• Output layer: $f_\alpha(x) = h_L(x) = t_{out} + T_{out}\, h_{L-1}(x)$,

with $d_i$ the size of the $i$-th layer, $d_1 = p$, $d_L = 1$, $T_{in} \in \mathbb R^{d_2 \times p}$, $t_{in} \in \mathbb R^{d_2}$, $T_{out} \in \mathbb R^{1 \times d_{L-1}}$, $t_{out} \in \mathbb R$, and $\phi_1, \phi_2, \ldots, \phi_{L-2}$ analytic functions parameterized by the hidden layers' parameter $S$. This general framework covers a wide range of models, including feed-forward networks, convolutional networks, and a major subclass of residual networks. For the sake of the theoretical study, we further assume that (1) the set of all feasible vectors $\alpha$ of the model is a hypercube $W = [-A, A]^{n_\alpha}$; (2) both $W$ and $\mathcal X$ are bounded; (3) $f_\alpha$ is analytic in the sense that there exist $C_1, C_2 > 0$ such that $|f_\alpha(x)| \le C_1$ and $\|\nabla_\alpha f_\alpha(x)\|_\infty \le C_2$ for all $\alpha \in W$, $x \in \mathcal X$, and these functions are Lipschitz continuous; and (4) $Y = f_{\alpha^*}(X) + \epsilon$ with $\epsilon \sim N(0, \sigma^2_\epsilon)$ and $\alpha^* \in W = [-A, A]^{n_\alpha}$, where the "true" model $f_{\alpha^*}(\cdot)$ depends on $x$ only through a subset of significant features $S \subset \{1, 2, \ldots, p\}$ while being independent of the features in $S^c$, with $|S| = s = O(1)$. To this end, a general group Lasso estimator (Feng and Simon, 2017; Dinh and Ho, 2020) is defined as

$$\hat\alpha_n = \arg\min_\alpha \Big\{ \frac{1}{n}\sum_{i=1}^n l(\alpha, X_i, Y_i) + \lambda_n \sum_{j=1}^p \|\alpha_{[:,j]}\| \Big\}, \qquad (A.1)$$

where $l(\alpha, x, y) = (y - f_\alpha(x))^2$ is the squared-error loss, $\lambda_n > 0$ is the associated penalty parameter, $\|\cdot\|$ is the standard Euclidean norm, and $\alpha_{[:,j]}$ is the vector of parameters associated with the $j$-th input feature. We now incorporate the notion of resampling/bootstrapping by introducing the random-weighted group Lasso, defined as

$$\hat\alpha^w_n = \arg\min_\alpha \Big\{ \frac{1}{n}\sum_{i=1}^n W_i\, l(\alpha, X_i, Y_i) + \lambda_n \sum_{j=1}^p \|\alpha_{[:,j]}\| \Big\}, \qquad (A.2)$$

where $W_i \stackrel{iid}{\sim} F_W$, independent of the data distribution $P_D$. The random-weighting scheme retains the flavor of the perturbation bootstrap (Das and Lahiri, 2019) and can thus be used for uncertainty quantification in the context of sparse models; Ng and Newton (2022) studied this in the context of the sparse linear model. Here, we first show that under appropriate random weights, $\hat\alpha^w_n$ converges to a set of well-behaved optimal hypotheses, $K = \{\alpha \in W: f_\alpha = f_{\alpha^*} \text{ and } \alpha_{[:,j]} = 0 \text{ for } j \in S^c\}$. Note that with the addition of the random weights the overall probability measure becomes $P = P_D \times P_W$, where $P_W$ is the probability measure of the triangular array of random weights. We define the following three sigma fields: $\mathcal F^w_n = \sigma\{W_1, W_2, \ldots, W_n\}$, $\mathcal F^x_n = \sigma\{X_1, X_2, \ldots, X_n\}$, and $\mathcal F^y_n = \sigma\{Y_1, Y_2, \ldots, Y_n\}$.
We further define, the risk function R(α) = E P D (Y − f α (X ))2 , the empirical risk function ¢2 R n (α) = n1 ni=1 Y i − f α (X i ) and the weighted empirical risk function adding the random P ¡ ¢2 weights as R nw (α) = n1 ni=1 Wi Y i − f α (X i ) . The equivalence class can be expressed as H ∗ = P ¡ {α ∈ W : R(α) = R(α∗ )}. For simplicity, in order to obtain a bounded weight, we assume FW is such that PW (C 2 < W < C 3 ) = 1, for some C 3 > 0. Lemma A.0.1 (Probabilistic Lipschitzness of the random-weighted empirical risk). For any δ > 0, there exists M δ > c 0 such that R nw (α) is an M δ -Lipschitz function with probability at least 1 − δ. 74 Proof. We note that, |R(α) − R(β)| = |E (Y − f α (X ))2 − (Y − f β (X ))2 | ¡ ¢ ¡ ¢¡ ¢ ≤ E | f α (X ) − f β (X ) 2Y − f α (X ) − f β (X ) | ≤ C 2 ||α − β||E | 2Y − f α (X ) − f β (X ) | ¡ ¢ ≤ C 2 ||α − β|| 2E |Y − f α∗ (X )| + E | f α (X ) + f β (X ) − 2 f α∗ (X ) | ¡ ¡ ¢¢ ≤ C 2 ||α − β||(2σ + 4C 1 ) Similarly, 1X n ³ ´ |R nw (α) − R nw (β)| = | Wi (Y i − f α (X i ))2 − (Y i − f β (X i ))2 | n i =1 1X n ³ ´³ ´ ≤ Wi | f α (X i ) − f β (X i ) 2Y i − f α (X i ) − f β (X i ) | n i =1 1X n ³ ³ ´´ ≤ C 2 ||α − β|| Wi 2|Y i − f α∗ (X i )| + | f α (X i ) + f β (X i ) − 2 f α∗ (X i ) | n i =1 ³ 2X n ´ ≤ C 2 ||α − β|| 4c 1 + Wi |ϵi | n i =1 Thus for all M δ > 4C 1C 2 , the proof is completed by noting the following ³ ´ P |R nw (α) − R nw (β)| ≤ M δ ||α − β||, ∀α, β ∈ W ³1 X n Mδ ´ h ³1 X n Mδ ´i ≥P Wi |ϵi | ≤ − 2C 1 = E PW P Wi |ϵi | ≤ − 2C 1 |W1 ,W2 , . . . ,Wn n i =1 2C 2 n i =1 2C 2 h ³1 X n Mδ ´i = E PW 1 − P Wi |ϵi | ≥ − 2C 1 |W1 ,W2 , . . . ,Wn n i =1 2C 2 1 Pn 1 i h n i =1 Wi E |ϵ | C 3 E |ϵ1 | ≥ E PW 1 − Mδ ≥ 1 − Mδ 2C 2 − 2C 1 2C 2 − 2C 1 Lemma A.0.2 (Generalization bound for the random-weighted empirical risk). For any δ > 0, ∃ ³ ´ l og (n) C 4 (δ) > 0 such that P |R nw (α) − R(α)| ≤ C 4 pn ≥ 1 − δ, ∀α ∈ W ³ ´2 P Proof. Note that nR nw (α) = ni=1 Wi Y i − f α (X i ) = ni=1 Wi Zi2 , where Zi = Y i − f α (X i ) which P ³ ´ follows N f α∗ (X i ) − f α (X i ), σ2ϵ , conditional on F X . 75 nR nw (α) Hence, conditional on F X , FW , σ2ϵ ∼ a weighted non-central χ2 distribution. We will use the following Lemma 3 to get a sharp tail bound for weighted non-central χ2 distribution. Hence, ³ t´ h ³ C n2t 2 ´i P |R nw (α) − R(α)| > ≤ E FX ,FW 2exp − ¢2 P 2 2n + 2 ni=1 Wi f α∗ (X i ) − f α (X i ) − ni=1 Wi P ¡ h ³ C n2t 2 ´i ≤ E FX ,FW 2exp − ¢2 2n + 2 ni=1 Wi f α∗ (X i ) − f α (X i ) P ¡ h ³ C n2t 2 ´i ≤ E FX ,FW 2exp − 2 ≤ 2exp(−c̃nt 2 ) 2n + 2nC 3 4C 1 The rest of the proof follows the direct proof of the Lemma 3.3 Dinh and Ho (2020). Theorem A.0.3 (Convergence of random-weighted Group Lasso). For any δ > 0, there exist C δ ,C ′ > 0 and Nδ > 0 such that for all n ≥ Nδ , ³ ν l og n ´ ν1 c l og n d (α̂nw , H ∗ ) ≤ C δ λnν−1 + p and ||α̂nw[:,S ] || ≤ 4C 4 p +C ′ d (α̂n , H ∗ ) n λn (n) where d (x, Z ) = infz∈Z ||x − z||. The proof directly follows from the above mentioned Lemma 1,2 and the theorem 3.3 from Dinh and Ho (2020). Theorem A.0.3 demonstrates the convergence of the random-weighted group-Lasso esti- c mates to a set of “well-behaved” optimal hypotheses K = {α ∈ W : f α = f α∗ and ||α[:,S ] || = 0} ⊂ H ∗ . However, this depends on the regularization parameter λn ; finding its optimum value is generally a daunting task in practice. As a solution, the importance score considering the whole regularization path would provide a more robust way to identify false discoveries. 
Following our proposed method in Section 3.2.3, here we consider the cleaning step based on the random-weighted group-Lasso penalized DNN. We repeat this step over $B$ independently generated random-weighting schemes and define the importance scores $\hat\lambda_j$ and the corresponding ranks $I_j$ for the $b$-th scheme as follows:

$\hat\lambda^b_j = \max\{\lambda_n : \|\hat\alpha^{w,[:,j]}_n\| \ne 0\}$ = the maximum value of $\lambda$ up to which the $j$-th feature exists in the model for the $b$-th random-weighting scheme, and

$I^b_j$ = the rank of the importance score = $\sum_{j' \ne j} 1(\hat\lambda^b_j \le \hat\lambda^b_{j'})$, with averaged rank $\bar I_j = \frac{1}{B}\sum_{b=1}^B I^b_j$.

Our basic strategy is to select the features whose high importance scores are consistent across all the bootstrap replications. Hence, for a cutoff $\delta$, define the number of selected features as $N_+(\delta) = \sum_{j=1}^p 1(\bar I_j \le \delta)$. We further propose the FDR estimate

$$\widehat{FDR}(\delta) = \frac{\hat e_0(\delta)}{N_+(\delta)}, \quad \text{where } \hat e_0(\delta) = \text{estimated number of false discoveries} = \frac{2}{B}\sum_{b=1}^B \sum_{j=1}^p 1\big(I^b_j \le \delta,\ I^b_j \notin N(\bar I_j, \kappa)\big),$$

and calculate the data-dependent optimum cutoff $\Delta^* = \max\{\delta \in \{\bar I_1, \bar I_2, \ldots, \bar I_p\}: \widehat{FDR}(\delta) < q\}$ for some pre-specified FDR control level $q$.

Theorem A.0.4 (Convergence of the estimated FDR). Using the data-dependent cutoff $\Delta^*$, the actual FDR is bounded by the user-specified level $q$; that is, $E\Big(\frac{\hat e_0(\Delta^*)}{N_+(\Delta^*)}\Big) < q$ as $n \to \infty$.

Proof. The feature importance scores $\bar I_j$ evidently carry the information on the survival of the feature $X_j$ over the whole regularization path. Under the random-weighted group-Lasso framework, this is uniquely monitored by the KKT condition: $\|\alpha_{[:,j]}\| = 0$ if

$$\Big\| \Big(\frac{\partial f_\alpha(X)}{\partial \alpha_{[:,j]}}\Big)^T_{p_j \times n} \mathrm{diag}(W_1, W_2, \ldots, W_n)\, \big(Y - f_\alpha(X)\big)_{n\times 1} \Big\| < \lambda \sqrt{p_j},$$

where $p_j$ is the total number of parameters associated with the $j$-th feature. Hence, this $L_2$ norm in the KKT condition can be treated as the importance score $\hat\lambda_j$ mentioned in Section 3.2.3. Now, for the null features $j \in S_0^c$, the derivative term is $o\big(n^{-1/4}\log(n)\big)$ by the continuity of $\nabla_\alpha f_\alpha(x)$ and the convergence of $\hat\alpha^w_n$ to $K$. Also, by Lemma A.0.2, the residual term is $o\big(n^{-1/2}\log(n)\big)$, and the random weights $W_i$ are bounded in $[C_2, C_3]$. Hence these $\hat\lambda_j$, $j = 1, 2, \ldots, p$, become exchangeable asymptotically. As a consequence of exchangeability, the ranks of the importance scores $\hat\lambda_j$ asymptotically follow a uniform distribution; for large $n$, $I^w_j = \sum_{j' \ne j} 1(\hat\lambda_j \le \hat\lambda_{j'}) \sim U(p - s + 1, p)$. Now, consider the random variable called the False Discovery Proportion (FDP), whose expectation is the FDR:

$$FDP = \frac{\sum_{j=1}^p 1\big(\bar I_j \le \Delta^*,\ j \in S_0^c\big)}{\sum_{j=1}^p 1\big(\bar I_j \le \Delta^*\big)}
= \underbrace{\frac{\frac{2}{B}\sum_{b=1}^B \sum_{j=1}^p 1\big(I^b_j \le \Delta^*,\ I^b_j \notin N(\bar I_j, \kappa)\big)}{\sum_{j=1}^p 1\big(\bar I_j \le \Delta^*\big)}}_{\le\, q}
\times
\underbrace{\frac{\sum_{j=1}^p 1\big(\bar I_j \le \Delta^*,\ j \in S_0^c\big)}{\frac{2}{B}\sum_{b=1}^B \sum_{j=1}^p 1\big(I^b_j \le \Delta^*,\ I^b_j \notin N(\bar I_j, \kappa)\big)}}_{R(\Delta^*)}.$$

The first factor is $\le q$ by the choice of $\Delta^*$; hence, to show the asymptotic FDR control, it remains to show that $\lim_{n\to\infty} E\big(R(\Delta^*)\big) \le 1$.
 Pp ¯j ≤ ∆∗ , j ∈ S c  I ¡ ¢ j =1 1 0 E (R(∆∗ )) = E  P nP o 2 B p ¯ B b=1 j ∈1 1(I j ≤ ∆ , I j ̸∈ N (I j , κ)) b ∗ b  Pp ³ ´  1 I¯j ≤ ∆∗ , {I b }B ∼ Uni f or m j =1 j b=1 =E P nP o 2 B p ¯ B b=1 j ∈1 1(I j ≤ ∆ , I j ̸∈ N (I j , κ)) b ∗ b  Pp ³ ´ 1 I ¯j ≤ ∆∗ , {I b }B ∼ Uni f or m, I b ∈ N (I¯j , κ), ∀b j =1 j b=1 j =E nP o + 2 B p ¯ 1(I j ≤ ∆ , I j ̸∈ N (I j , κ)) b ∗ b P B b=1 j ∈1  Pp ³ ´ 1 I ¯j ≤ ∆∗ , {I b }B ∼ Uni f or m, I b ̸∈ N (I¯j , κ), ∀b j =1 j b=1 j E nP o  2 B p ¯ 1(I j ≤ ∆ , I j ̸∈ N (I j , κ)) b ∗ b P B b=1 j ∈1  Pp ³ ´ 2 j =1 1 I¯j ≤ ∆∗ , {I jb }Bb=1 ∼ Uni f or m, I jb ̸∈ N (I¯j , κ), ∀b ≤E nP o  2 PB p ¯ B b=1 j ∈1 1(I j ≤ ∆ , I j ̸∈ N (I j , κ)) b ∗ b Pp ³ ´  2 j =1 1 I¯j ≤ ∆∗ , I jb ̸∈ N (I¯j , κ), ∀b  ≤E P nP o 2 B p ∗ , I b ̸∈ N (I¯ , κ)) B b=1 j ∈1 1(I b j ≤ ∆ j j nP o  P 2 B p b b ¯j , κ)) B b=1 j ∈1 1(I j ≤ ∆ ∗ , I j ∈ ̸ N ( I ≤E P nP o = 1 2 B p ∗ , I b ̸∈ N (I¯ , κ)) B b=1 j ∈1 1(I b j ≤ ∆ j j The third inequality is true by the fact that the neighborhood N (I¯j , κ) is much smaller than its complement region. The last inequality is true because of the fact that P (Z1 ≤ z, Z2 ≤ Pp z, . . . , Z p ≤ z) ≤ p1 i =1 P (Zi ≤ z), for any set of random variables Z1 , Z2 , . . . , Z p . Hence, the actual FDR is controlled at the user-specific bound q by selecting the features with ranks of their importance score greater than the data-dependent threshold ∆∗ . 78 Lemma 3: Sharp tail bound for weighted non-central χ2 distribution Consider a weighted non-central χ2 distributed random variable Y = ki=1 u i Zi2 , Zi ∼ N (µi , 1) P independently and ki=1 µ2i = λ. Then for the centralized random variable X = Y − ki=1 u i (1+µ2i ), P P the following sharp tail bound holds: there exists constants c,c’,C>0, such that ct2 k k ), ∀0 ≤ x ≤ c ′ (2k + 2 u i µ2i − X X P (X ≥ x) ≤ c exp(− Pk Pk ui ) i =1 u i µi − 2 2k + 2 i =1 u i i =1 i =1 Proof: The moment-generating function of X: h 2t 2 X k k k i φ X (t ) = exp u i µ2i − (l og (1 − 2t ) + 2t ) + t (k − X ui ) 1 − 2t i =1 2 i =1 2 2t Note, for 0 ≤ t ≤ 1/2, 2t 2 ≤ −l og (1 − 2t ) − 2t ≤ 1−2t . This implies, for 0 ≤ t ≤ 2/5, k k k k t 2 (2k + 2 u i µ2i − u i ) ≤ l og (φ X (t )) ≤ 5t 2 (2k + 2 u i µ2i − X X X X ui ) i =1 i =1 i =1 i =1 Next, the exact tail bound can be obtained by applying theorem 1 from Zhang and Zhou (2020). 79 APPENDIX B ADDITIONAL TECHNICAL DETAILS In this section, additional technical and implementation details on the proposed algorithm SciDNet are provided. B.0.1 Hyperparameter Selection Recently developed Deep Learning (DL) models are generally governed by several hyperpa- rameters and properly tuning them is necessary to get effective results. The proposed SciDNet relies on the following hyperparameters: (1) size of the active set Ŝ n , (2) the intracluster cor- relation bound r , (3) LassoNet tuning parameters λ and M and (4) κ used in neighbourhood selection in cleaning step. SciDNet is fairly robust to most of the associated hyperparameters. We discuss a practical way to tune all these hyperparameters here: 1. To choose the size of the active set, we propose to select a bigger active set with size proportional to νn = [n/l og (n)]. As we further cluster the active variables in the clustering step, a slightly bigger active set with boost up the confidence of sure screening property, See the section B.0.5 for an example. 2. After clustering, the intra-cluster correlation bound r should be fixed at some higher value (usually at 0.9 or 0.95) otherwise the cluster sizes will be inflated. 3. 
In the cleaning step, a thorough grid search has been done over λ considering λ1 ≤ λ2 ≤ · · · ≤ λr ; in practice, a small value is fixed for λ1 where all the variables are present in the model. Then the value of the tuning parameter gradually increased up to λr , where there are no variables present in the model. The other hyperparameter for LassoNet is the hierarchy coefficient M for which we follow the path considered in Lemhadri et al. (2021) and set M = 10. However, a more flexible approach would be a parallel grid search for M as well. 4. The neighborhood length κ can be chosen using the phase transition in the ranks of the 80 Table S1 Empirical power and observed FDR of SciDNet with standard error in parentheses for Gaussian features Nonlinear Additive Nonlinear with interaction Linear ρ snr q 0.01 0.05 0.1 0.15 0.01 0.05 0.1 0.15 0.01 0.05 0.1 0.15 Power 0.79 (0.19) 0.93 (0.11) 0.96 (0.09) 0.96 (0.09) 0.99 (0.06) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) snr = 9 : 1 FDR 0.00 (0.00) 0.00 (0.02) 0.01 (0.03) 0.02 (0.05) 0.00 (0.00) 0.02 (0.05) 0.03 (0.06) 0.04 (0.08) 0.00 (0.00) 0.01 (0.04) 0.01 (0.05) 0.01 (0.05) Power 0.59 (0.20) 0.82 (0.15) 0.86 (0.14) 0.87 (0.13) 0.84 (0.24) 1.00 (0.03) 1.00 (0.03) 1.00 (0.03) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) ρ = 0.9 snr = 8 : 2 FDR 0.00 (0.03) 0.00 (0.03) 0.03 (0.07) 0.04 (0.08) 0.00 (0.00) 0.01 (0.03) 0.02 (0.06) 0.04 (0.07) 0.01 (0.05) 0.02 (0.06) 0.02 (0.06) 0.02 (0.06) Power 0.42 (0.20) 0.65 (0.12) 0.77 (0.14) 0.81 (0.14) 0.72 (0.26) 0.94 (0.12) 0.95 (0.12) 0.96 (0.11) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) snr = 7 : 3 FDR 0.00 (0.00) 0.00 (0.00) 0.00 (0.03) 0.01 (0.05) 0.01 (0.04) 0.03 (0.06) 0.03 (0.07) 0.05 (0.09) 0.00 (0.02) 0.02 (0.05) 0.02 (0.05) 0.02 (0.06) Power 0.83 (0.18) 0.96 (0.08) 0.98 (0.06) 0.98 (0.06) 0.99 (0.04) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) snr = 9 : 1 FDR 0.00 (0.03) 0.02 (0.05) 0.03 (0.07) 0.04 (0.07) 0.00 (0.00) 0.04 (0.07) 0.08 (0.09) 0.11 (0.10) 0.00 (0.02) 0.06 (0.09) 0.08 (0.10) 0.11 (0.12) Power 0.60 (0.29) 0.82 (0.17) 0.87 (0.14) 0.89 (0.12) 0.98 (0.08) 0.99 (0.05) 0.99 (0.05) 0.99 (0.05) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) ρ = 0.95 snr = 8 : 2 FDR 0.00 (0.03) 0.02 (0.05) 0.02 (0.06) 0.03 (0.07) 0.01 (0.04) 0.04 (0.07) 0.08 (0.12) 0.12 (0.12) 0.01 (0.03) 0.05 (0.09) 0.07 (0.10) 0.09 (0.11) Power 0.32 (0.22) 0.61 (0.19) 0.79 (0.15) 0.82 (0.15) 0.80 (0.27) 0.96 (0.11) 0.98 (0.05) 0.98 (0.05) 0.93 (0.19) 0.96 (0.11) 0.97 (0.09) 0.97 (0.09) snr = 7 : 3 FDR 0.00 (0.00) 0.01 (0.04) 0.01 (0.05) 0.03 (0.08) 0.00 (0.02) 0.04 (0.07) 0.07 (0.09) 0.12 (0.10) 0.00 (0.03) 0.04 (0.08) 0.06 (0.08) 0.07 (0.09) importance scores, as described in Section 3.2.3. B.0.2 Phase transition observed for the CCLE data The main reason for the phase transition is that, for a null predictor X j , j ∈ S 0c , different bootstrap replicates reshuffle its feature importance each time, whereas, for a nonnull predictor X j , j ∈ S 0 , the feature importance is much stable in different bootstrap replicates. SciDNet effectively captures this characteristic to identify the null features. As a demonstration, here we present in Figure B.1, the bootstrap distribution of rank of the importance scores for the top 25 important cluster representatives via box plots. The green and purple colors respectively indicate if the cluster representatives are selected or rejected by the SciDNet. 
We can observe the phase transition consistently for all five drugs, and SciDNet selects only those important representatives with reduced variability over the bootstrap replicates. B.0.3 More Simulation Results Here we demonstrate finite sample performance of SciDNet under various linear and nonlin- ear models with varying multicollinearity level under different signal-to-noise-ratio. B.0.3.1 Using Gaussian Features For the high dimensional predictors, n i.i.d. copies are first generated from X ∼ N p (0, Σ), where n = 600, p = 5000 and the covariance matric Σ is chosen as a toeplitz matrix with Σi j = ρ |i − j | . The value of ρ is varied to explore different correlation strengths. We set the set of truly significant variables S = {100, 200, 300, 400, 500} with s = 5. The response y is generated from 81 Figure B.1 The phase transition property illustrated for the five anticancer drugs considered: (1) Topotecan, (2) 17-AAG, (3) Irinotecan, (4) Paclitaxel, and (5) AWE541, respectively (from top left). 82 Table S2 Empirical power and observed FDR of SciDNet with standard error in parentheses for non-gaussian features Nonlinear Additive Nonlinear with interaction Linear ρ snr q 0.01 0.05 0.1 0.15 0.01 0.05 0.1 0.15 0.01 0.05 0.1 0.15 Power 0.68 (0.12) 0.96 (0.15) 1.00 (0.09) 1.00 (0.07) 0.92 (0.05) 0.95 (0.02) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) snr = 9 : 1 FDR 0.00 (0.00) 0.00 (0.01) 0.04 (0.01) 0.07 (0.05) 0.00 (0.00) 0.01 (0.06) 0.04 (0.06) 0.06 (0.07) 0.00 (0.00) 0.01 (0.04) 0.01 (0.05) 0.01 (0.05) Power 0.56 (0.24) 0.86 (0.11) 0.94 (0.14) 0.95 (0.13) 0.76 (0.23) .93 (0.02) 1.00 (0.04) 1.00 (0.03) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) ρ = 0.9 snr = 8 : 2 FDR 0.00 (0.01) 0.00 (0.01) 0.06 (0.05) 0.09 (0.07) 0.00 (0.00) 0.00 (0.02) 0.04 (0.03) 0.07 (0.07) 0.01 (0.05) 0.02 (0.06) 0.02 (0.06) 0.02 (0.06) Power 0.42 (0.21) 0.77 (0.13) 0.91 (0.16) 0.94 (0.12) 0.73 (0.26) 0.94 (0.12) 0.95 (0.15) 0.96 (0.14) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) snr = 7 : 3 FDR 0.00 (0.00) 0.00 (0.01) 0.02 (0.03) 0.06 (0.04) 0.01 (0.05) 0.04 (0.06) 0.05 (0.07) 0.05 (0.09) 0.00 (0.02) 0.02 (0.05) 0.02 (0.05) 0.02 (0.06) Power 0.81 (0.19) 0.95 (0.07) 0.98 (0.06) 0.98 (0.07) 0.99 (0.04) 0.99 (0.03) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) snr = 9 : 1 FDR 0.00 (0.01) 0.03 (0.06) 0.03 (0.04) 0.05 (0.03) 0.00 (0.00) 0.04 (0.07) 0.08 (0.09) 0.09 (0.13) 0.00 (0.01) 0.03 (0.05) 0.07 (0.11) 0.10 (0.14) Power 0.65 (0.29) 0.84 (0.16) 0.89 (0.17) 0.89 (0.12) 0.94 (0.07) 0.97 (0.04) 0.99 (0.07) 0.99 (0.07) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) ρ = 0.95 snr = 8 : 2 FDR 0.00 (0.03) 0.01 (0.05) 0.04 (0.06) 0.05 (0.06) 0.01 (0.03) 0.04 (0.04) 0.07(0.14) 0.11 (0.11) 0.01 (0.02) 0.05 (0.09) 0.06 (0.14) 0.09 (0.11) Power 0.47 (0.22) 0.64 (0.17) 0.75 (0.19) 0.87 (0.11) 0.82 (0.27) 0.95 (0.10) 0.98 (0.04) 0.98 (0.02) 0.95 (0.10) 0.96 (0.11) 0.97 (0.08) 0.97 (0.06) snr = 7 : 3 FDR 0.00 (0.00) 0.02 (0.04) 0.02 (0.04) 0.04 (0.09) 0.00 (0.01) 0.04 (0.05) 0.09 (0.03) 0.13 (0.09) 0.00 (0.02) 0.04 (0.05) 0.05 (0.08) 0.08 (0.07) y = g (x) + ϵ. Here we entertain the following three models: 1. Linear: g (x) = x S βS with βS generated from N (2, 0.1) independently and βS c = 0, 2. Nonlinear additive: g (x) = 2x 100 +2x 200 3 +e x300 +6 sin x 400 +2ReLu(x 500 3 ), where ReLu(x)=max(x,0) 3. 
Nonlinear with interaction: $g(x) = 2x_{100} + 2x_{200}^3 + e^{x_{300}} + 6x_{400}x_{500}$.

In each case, the random noise $\epsilon$ is independently generated from $N(0, \sigma^2)$, where the value of $\sigma^2$ is chosen to maintain the signal-to-noise ratio at the desired level. To this end, we define the signal-to-noise ratio as $snr = \frac{var(g(x))}{\sigma^2}$ and consider three levels, snr = 9:1, 8:2, and 7:3. Table S1 shows that SciDNet continues to maintain satisfactory power while successfully controlling the FDR below the thresholds q = 0.01, 0.05, 0.1, 0.15. The average cluster size is observed to be 8.3 for ρ = 0.9 and 13.4 for ρ = 0.95.

B.0.3.2 Using Non-gaussian Features

To check SciDNet's performance under a non-Gaussian setup, n iid copies of the high-dimensional feature vector X are generated from a multivariate $t_p(5)$ distribution with the same correlation structure as in the previous section B.0.3.1, with n = 600 and p = 5000. The remaining simulation settings are consistent with that section. The performance of SciDNet is presented in Table S2 and is quite analogous to the results for Gaussian features.

B.0.4 Performance of existing feature selection methods in the presence of high multicollinearity

In this section, we present a numerical illustration of the performance of several recently proposed nonlinear FDR-controlled feature selection algorithms. The predictors are first generated from $X_i \sim N_p(0, \Sigma)$, $i = 1, 2, \ldots, n$, for multiple combinations of (n, p), and the covariance matrix Σ is chosen as a Toeplitz matrix with $\Sigma_{ij} = \rho^{|i-j|}$, ρ = 0.1, 0.5, and 0.9. Under this simple setting, the response y is generated from $y = x_S\beta_S + \epsilon$, $S = \{5, 10, \ldots, 50\}$, $|S| = 10$, with $\beta_S$ generated independently from $N(\beta_0, 0.1)$ and $\beta_{S^c} = 0$. The random noise is $\epsilon \sim N(0, 1)$. We focus on the Model-X knockoff (Candès et al., 2018), SurvNet (Song and Li, 2021), and DeepPINK (Lu et al., 2018). For a more rigorous analysis, we consider two different versions of the Model-X knockoff: (1) Model-X-Estimated, where the knockoffs are generated using an estimated multivariate Gaussian distribution, and (2) Model-X-True, where the knockoffs are generated using the true data-generating multivariate Gaussian distribution mentioned above. For the knockoff generation, we consider the equicorrelated construction using the R package knockoff: The Knockoff Filter for Controlled Variable Selection; a sketch of this construction is given at the end of this subsection. To implement SurvNet and DeepPINK, we use the code from the respective papers, Song and Li (2021) and Lu et al. (2018). We set q = 0.15 as the FDR-control threshold.

Table S3 reveals several interesting characteristics. Both Model-X-Estimated and Model-X-True maintain the power-FDR balance under a low-correlation setup. With higher multicollinearity, however, Model-X-Estimated fails to control the FDR below the specified threshold, whereas Model-X-True controls the FDR efficiently. This disparity indicates that the Model-X procedure inflates the false discoveries if the knockoffs are not generated properly in a 'difficult' situation. As expected, the DL-based algorithms, SurvNet and DeepPINK, work much better in big-n-small-p, low-correlation setups but typically fail in other cases, indicating their reduced effectiveness for ultrahigh dimensional data with small sample sizes.
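For completeness, the equicorrelated Gaussian construction referenced above can be sketched as follows for a known covariance Σ (the 'Model-X-True' scenario). The function name and the numerical jitter are ours; this mirrors, rather than reproduces, the construction implemented in the knockoff R package.

import numpy as np

def gaussian_knockoffs_equi(X, Sigma, rng=None):
    """Sample knockoffs X_tilde given X ~ N(0, Sigma), Sigma a correlation matrix."""
    rng = np.random.default_rng(rng)
    p = Sigma.shape[0]
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min())     # equicorrelated choice of s
    Sigma_inv_s = np.linalg.solve(Sigma, s * np.eye(p))     # Sigma^{-1} diag(s)
    mean = X - X @ Sigma_inv_s                              # E[X_tilde | X]
    cond_cov = 2.0 * s * np.eye(p) - s * Sigma_inv_s        # Cov[X_tilde | X]
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))    # small jitter for stability
    return mean + rng.standard_normal(X.shape) @ L.T

# toy usage with the Toeplitz design of this subsection
p, rho = 50, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = np.random.default_rng(0).standard_normal((200, p)) @ np.linalg.cholesky(Sigma).T
X_tilde = gaussian_knockoffs_equi(X, Sigma, rng=1)
print(X_tilde.shape)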
Table S3 Empirical power and observed FDR of various feature selection algorithms with standard error in parentheses. In each row, the first three entries correspond to ρ = 0.1, 0.5, 0.9 under β = 2 and the last three to ρ = 0.1, 0.5, 0.9 under β = 4.

Model-X-Estimated, (n, p) = (400, 1000)
Power: 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.13 (0.17) 0.12 (0.12) 0.27 (0.18) | 0.11 (0.19) 0.20 (0.18) 0.27 (0.20)

Model-X-True, (n, p) = (400, 1000)
Power: 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.08 (0.13) 0.09 (0.12) 0.14 (0.17) | 0.12 (0.14) 0.11 (0.13) 0.08 (0.12)

SurvNet, (n, p) = (400, 1000)
Power: 0.27 (0.20) 0.32 (0.22) 0.35 (0.24) | 0.49 (0.24) 0.52 (0.28) 0.58 (0.29)
FDR: 0.31 (0.36) 0.53 (0.30) 0.59 (0.23) | 0.21 (0.21) 0.53 (0.18) 0.60 (0.17)

SurvNet, (n, p) = (10000, 60)
Power: 0.99 (0.05) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.20 (0.15) 0.80 (0.02) 0.78 (0.07) | 0.14 (0.11) 0.80 (0.02) 0.56 (0.32)

DeepPINK, (n, p) = (400, 1000)
Power: 0.01 (0.02) 0.03 (0.04) 0.00 (0.00) | 0.03 (0.04) 0.01 (0.03) 0.02 (0.05)
FDR: 0.23 (0.40) 0.35 (0.42) 0.33 (0.47) | 0.45 (0.44) 0.24 (0.41) 0.24 (0.40)

DeepPINK, (n, p) = (10000, 60)
Power: 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.18 (0.04) 0.29 (0.13) 0.25 (0.11) | 0.17 (0.01) 0.24 (0.12) 0.24 (0.12)

B.0.5 Model implementation details and Sensitivity Analysis

In this section, we describe the implementation details of SciDNet used in the simulation study and the real data analysis. To select the size of the active set Ŝ_n in the screening step, in accordance with Xue and Liang (2017), we set |Ŝ_n| = [2n/ log(n)] and select the predictors with the top |Ŝ_n| values of the Henze-Zirkler test statistic w̃_k*, where [z] denotes the integer part of z. In all our simulation scenarios, we set r = 0.9, the hyperparameter for the intra-cluster correlation bound used to further integrate highly correlated, conditionally dependent clusters. In the cleaning step, a one-hidden-layer feed-forward neural network with 100 hidden units is used for LassoNet; a more detailed model architecture can be found in the appendix of Lemhadri et al. (2021). For creating the compact neighborhood in the cleaning step, each time we choose the value of κ utilizing the phase transition property mentioned in Section 2.2 of the main manuscript. The feature selection performance of SciDNet is demonstrated by calculating the average power and cFDR, along with their standard errors, observed over 50 Monte Carlo replications. Each data set is randomly divided into train, validation, and test sets with a 70-10-20 split. To assess the prediction performance, the test Mean Square Error (MSE) before and after the variable selection is reported as part of the simulation study. For the prediction model, a 40-dimensional two-hidden-layer feed-forward neural network with ReLU and linear activation functions, trained with the Adam optimizer, is considered. For the regression tree, we used bagging for further stabilization, as proposed in Breiman (1996). The number of leaves and nodes is chosen by minimizing the MSE on the validation set.
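As a concrete illustration of the prediction step described above, the 70-10-20 split and the two-hidden-layer network can be set up as in the sketch below; scikit-learn is used here purely as a stand-in for the actual implementation, and the function name and arguments are ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def evaluate_prediction(X, y, selected, seed=0):
    """Test MSE of a two-hidden-layer, 40-unit ReLU network (linear output, Adam optimizer),
    fit on the features in `selected` after a 70-10-20 train/validation/test split.
    The validation block is held out for tuning and not used for fitting here."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X[:, selected], y, test_size=0.3, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=2 / 3, random_state=seed)   # 10% validation, 20% test
    net = MLPRegressor(hidden_layer_sizes=(40, 40), activation="relu",
                       solver="adam", max_iter=2000, random_state=seed)
    net.fit(X_train, y_train)
    return mean_squared_error(y_test, net.predict(X_test))
```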
To assess the error bars for the sensitivity analysis, we generate a typical data set using the polynomial setup (Section 3 in the main manuscript, with β = 2 and σ² = 1), rerun SciDNet 50 times on the same data, and set q = 0.1 as the FDR-control threshold. The means and standard deviations over these 50 replications are as follows: power = 0.99 (0.01), observed FDR = 0.03 (0.03), test error by LassoNet = 1.048 (0.101), and test error by SciDNet+RT = 0.710 (0.002). To reduce computational complexity, after the screening step the bootstrapped LassoNet can be run in a parallel loop; we conduct all the experiments on a high-performance computing facility with Intel(R) Xeon(R) Platinum 8260 CPUs @ 2.40GHz and 4 Tesla V100S GPUs. The code is available at an anonymous repository (https://anonymous.4open.science/r/SciDNet-3CA8).
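The parallel bootstrap loop mentioned above can be organized as in the following sketch. For brevity, sklearn's lasso_path is used as a stand-in for the LassoNet regularization path (the actual cleaning step uses LassoNet, as described in the main text), and the helper names are ours.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import lasso_path

def one_bootstrap_selection_profile(X_rep, y, seed):
    """Refit a regularization path on one bootstrap resample and record, for each
    cluster representative, the fraction of the path on which it stays selected."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), len(y))             # bootstrap resample of the rows
    _, coefs, _ = lasso_path(X_rep[idx], y[idx])      # stand-in for the LassoNet path
    return (np.abs(coefs) > 0).mean(axis=1)           # per-feature selection frequency along the path

def bootstrap_importance(X_rep, y, B=1000, n_jobs=-1):
    """X_rep: n x |S~_n| matrix of cluster representatives after screening; y: response.
    Returns the bootstrap-aggregated importance used to build the FDR surrogate."""
    profiles = Parallel(n_jobs=n_jobs)(
        delayed(one_bootstrap_selection_profile)(X_rep, y, b) for b in range(B))
    return np.mean(profiles, axis=0)
```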
B.0.6 Important clusters of genes discovered by SciDNet

B.0.7 For the CCLE dataset

Table S4 below presents all the clusters of genes selected by SciDNet for the five anticancer drugs considered; the genes belonging to a single cluster are enclosed in braces "{}". Previous research on this gene-expression data has revealed several genes as biologically associated with the corresponding drugs, and SciDNet successfully discovers these genes among the top-most important genes associated with the drugs. In Table S4, the selected genes that are confirmed by previous domain research are highlighted, and the corresponding references are given in the third column.

Table S4 Drug-sensitive genes identified by SciDNet and confirming references

Topotecan. Selected clusters of genes: {SLFN11}, {TUFT1, THRB}, {CDT1, SF3A2, SNRPA}, {FTH1P10, FTH1}, {RPL18}, {KLF5}, {RPL11, RPL5P4, RPS8, RPL5, RPL10A, AL162151.3, RPS9, RPL3}, {KIF15, CCNA2, LMNB1, KIF22, AC009133.14}, {MATN2, HSPB8}. Confirming references: Barretina et al. (2012); Li et al. (2012).

17-AAG. Selected clusters of genes: {NQO1, CTD.2033A16.1}, {BAX}, {SLC16A3}, {PHPT1, SH3BP1}, {SPCS3, DCTD}, {CTD.2008A1.2, SORD}, {NSMCE4A}, {CSK}. Confirming references: Hadley and Hendricks (2014); Barretina et al. (2012).

Irinotecan. Selected clusters of genes: {SLFN11}, {KIF15, LMNB1, ARHGAP19}, {TCEANC2}, {KIF21B}, {SQSTM1}, {HDAC11}, {KHDRBS1, HNRNPA1P35, HMGB2, HNRNPA1, HNRNPA1L2, HNRNPA1P48, HNRNPA1P7, AC021224.1, HNRNPA1P10, RBMX}. Confirming references: Barretina et al. (2012); Li et al. (2012).

Paclitaxel. Selected clusters of genes: {PARP1, BCL2L1}, {MMP24}, {DIMT1}, {RP11.872D17.4, SSRP1, MTA2}, {DCUN1D3}, {RPL10AP6, RPL10A, EEF2, RPL3}, {ARHGAP11B, ARHGAP11A, BUB1B, CASC5}, {HCLS1, LCP1}. Confirming references: Dorman et al. (2016); Lee et al. (2016).

AEW541. Selected clusters of genes: {TCEAL4, MID2}, {E2F6}, {AC096772.6}, {SLC44A1}, {PGM1}, {ATP8B2, RNF122}, {RP11.1017G21.4}, {ETNK2}, {NHS}, {ATG13}. Confirming reference: Liang et al. (2018).

B.0.8 For the Riboflavin dataset

The Riboflavin production dataset contains a much more complicated correlation structure than the CCLE data; see Figure 1 in the main manuscript for a visual illustration. As a result, SciDNet has produced much larger clusters of genes compared to the cluster sizes from the CCLE dataset; for example, the average cluster sizes for the CCLE and Riboflavin datasets are 2.5 and 17.78, respectively. Table S5 below shows the 9 clusters of genes selected by SciDNet while the FDR is controlled at q = 0.15. Additionally, SciDNet discovered the gene YCIC_at as one of the expressive genes related to riboflavin production, which was identified by Bühlmann et al. (2014) as a causal gene in this context.

Table S5 Selected clusters of genes by SciDNet applied in the riboflavin gene data example

Cluster 1: EPR_at, IOLD_at, KAPB_at, PROJ_at, RPLQ_at, UREA_at, YCGB_at, YCGM_at, YCGN_at, YCSN_at, YCGO_at, YCGT_at, YDBM_at, YHXA_at, YKZC_at, YOAB_at, YPJB_at, YUSX_at, YVFH_at
Cluster 2: COMX_at, CSPC_at, HAG_at, MPR_at, YBDL_at, YDBM_at, YHCB_at, YJFB_at, YHFS_at, YOAB_at, YODF_at, YOAC_at, YONU_at, YOTL_at, YQKI_at, YQZH_at, YTEI_at, YUSV_at
Cluster 3: HIT_at, KATX_at, LICH_at, NASA_at, OPUCB_at, PHRG_i_at, PHRK_at, ROCB_at, ROCR_at, SACB_at, SPOIIE_at, TMRB_at, YACN_at, YBBJ_at, YBGB_at, YCBF_at, YFKJ_at, YHCS_at, YHXA_at, YJBF_at, YLBA_at, YLOU_at, YPUI_at, YQGY_i_at, YUKE_at, YVYD_at, YXLJ_at, YXZF_at
Cluster 4: APPA_at, BGLS_at, ccpB_at, MMR_at, SIGY_at, SOJ_at, TREA_at, YBGB_at, YDGF_at, YOPR_at, YQEB_at, YVCI_at, YVDR_at, YWBG_at, YWDE_at, YWFM_at, YXBB_at, YXIL_at, YXIO_at, YXIQ_at, YXJA_at, YXJN_at, YXLC_at, YXLD_at, YXLE_at, YXLF_at, YXLG_at, YXLJ_at, YXZF_at, YYBF_at
Cluster 5: LYTD_at, SQHC_at, XKDE_at, YFIG_at, YFIH_at, YFII_at, YFNC_at, YHDV_at, YIST_at, YJGA_at, YTCP_at, YTMP_at
Cluster 6: YCDH_at, YCDI_at, YCEA_at, YCIA_at, YCIB_at, YCIC_at, YDAR_at, YHZA_at, YRPE_at, YTGA_at, YTGB_at, YTGC_at, YTGD_at, YTIA_at, YVQH_at
Cluster 7: OPUBD_at, PHRE_at, SIPS_at, YBFF_at, YDEM_at, YNAB_i_at, YNAC_at, YNEK_at, YOBF_at, YOKG_at, YONX_at, YOPA_at, YOPR_at, YOTL_at, YPBB_at, YQZH_at, YRDA_at, YRKK_at, YRKL_at, YTGB_at, YUXI_at, YWCE_at, YWQK_at, YYDB_at, YYDF_i_at
Cluster 8: ARGB_at, ARGC_at, ARGD_at, ARGJ_at, CARA_at, CARB_at
Cluster 9: PROJ_at, RPLF_at, RPLJ_at, RPLL_at, RPSN_at, RPSP_at, YLQC_at

CHAPTER 4
DEEP LEARNING-AIDED FEATURE SELECTION FOR COGNITIVE RESERVE WITH HIGHLY CORRELATED TRACTOGRAPHY DATA

4.1 Introduction

4.1.1 A motivating case study and building the statistical framework

The common observation that some older adults maintain normal levels of cognitive function despite apparent brain pathology is referred to as cognitive reserve (Stern et al. (2020)). However, the neurobiological substrates, mechanisms, and potential neuroimaging markers of such resilience remain poorly understood. The immense system of structural connections between brain regions, which constitutes subcortical white matter (WM), is one promising source of neuroimaging correlates of cognitive reserve (Chang et al. (2021); Wang et al. (2020a)). Recent studies in Alzheimer's disease, inflammatory, and vascular pathologies report that WM pathways are both enhanced by cognitive and environmental enrichment and are among the earliest systems to be negatively impacted by these pathologies (McPhee et al. (2019); Uddin (2021)). This recent evidence highlights the brain's WM fiber pathways as a potential biological correlate of cognitive reserve.
Diffusion magnetic resonance imaging (DMRI) methods characterize microstructural tissue organization based on the differential movement of water molecules in the presence of barriers (Beaulieu (2002)). In the brain, the hydrophobic myelin sheath that surrounds long neuronal axons serves as such a barrier, making it useful for characterizing subcortical WM tissue microstructure. Sampling biologically informative rates of diffusion in multiple spatial directions permits representing diffusion using summary measures, such as a tensor. The tensor eigenvalues can be averaged to provide a measure of mean diffusivity (MD), or used to calculate parameters such as fractional anisotropy (FA), which reflects the uniformity of diffusion. However, diffusion tensor parameters only afford valid representations of WM microstructure in MRI volumetric pixels (i.e., voxels) that include a single spatial orientation of WM fibers. Because an estimated 60-90% of WM voxels include multiple orientations of fiber populations (Vos et al. (2011)), the standard methods for representing WM microstructure using the metrics estimated from tensor decomposition are at best questionable. In contrast to voxel-level scalar values like FA, DMRI tractography represents WM organization as continuous paths between adjacent voxels, or streamlines, computed using directional information from tensor or other diffusion models. These tractography streamlines most commonly serve as masks to delineate anatomically specific tracts for sampling from diffusion tensor (e.g., FA) or other image modalities in the same image space. Most extant applications of tractography methods collapse tract segments, retaining only a single mean DMRI metric and variance estimate for each tract and each subject, potentially ignoring a rich anatomical variation along the tracts. Consequently, such summarization reduces the sensitivity and specificity of this neuroimaging technique. As a solution, the along-tract workflow, which measures the DMRI metrics at several vertices spread evenly along the tract, has been proposed (Colby et al., 2012; Wasserthal et al., 2018a); however, these methods can only model variation within individual fiber tracts as the response. The University of Michigan Memory and Aging Project (UM-MAP) is a clinical cohort study of aging and dementia that includes data from multi-modal MRI neuroimaging, including DMRI scans. These data afford a more detailed study of the distribution of estimable DMRI metrics along the WM tracts in whole brain tractography data. In addition, UM-MAP participants include cognitively unimpaired older adults as well as those diagnosed with mild cognitive impairment (MCI) or dementias of the Alzheimer's type and other etiologies. In our study, we are interested in determining the most effective and representative parts of WM tracts to characterize cognitive reserve. To achieve this goal, we combine DMRI tractography (Wasserthal et al., 2018a) with the along-tract workflow (Colby et al., 2012) to systematically quantify the WM microstructure through a specified DMRI metric (e.g., fractional anisotropy (FA) or spherical harmonic peak amplitudes (sh-peaks)) observed at spatially equidistant vertices spread along the major WM tracts. In this work, we mainly considered the spherical harmonic peak values (sh-peaks, Bastiani et al.
(2017)) sampled from the fiber orientation distributions, which represent the diffusion maxima for a specific orientation of white matter fibers within a voxel. Thus, unlike FA, which integrates the diffusion signal from multiple directions, sh-peak values are not confounded by crossing fibers. Focusing on p_t = 50 major WM tracts in human brains, our main objective is to determine which parts of these WM tracts are associated with a neurogenetic phenotype such as cognitive reserve. To fix ideas, suppose y ∈ R^n is a vector of observed cognitive reserve values for n subjects in the cohort study and X ∈ R^{n×p}, p = p_t p_v, is, column by column, the matrix of DMRI metrics observed at p_v vertices of each of the p_t tracts. Formally, we are interested in selecting important features to characterize cognitive reserve among the p potential features using the nonparametric regression y = g(x_1, x_2, . . . , x_p) + ε, where g is an unknown link function and ε ~ N(0, σ²I_n) is the random noise. The feature selection with DMRI tractography data from UM-MAP faces additional methodological complications beyond those encountered in typical feature selection problems. These issues are further discussed in Section 4.1.2 and empirically evaluated in Section 4.2. Key points of our general analytic strategy are discussed in Section 4.1.3.

4.1.2 Statistical challenges and related literature

The analysis of tractography data from the UM-MAP study is complicated by the high dimensionality of the data that can be sampled from streamlines, especially considering the more limited sample size. Furthermore, the strong associations within and between tract-level measurements, and the potentially complicated functional relationship between these tract-measured data and the phenotype that precludes the linearity assumption, create further challenges. The extant methods in the statistical literature are either too restrictive in view of the modeling assumptions or rely on nontrivial extensions to account for the complex association among the high-dimensional predictors. Failure to adequately address these data complications may result in a higher rate of false discoveries. Although Chapter 3 extensively discusses these issues, we reiterate their significance here, focusing on the aforementioned motivating example based on tractography data.

Reproducible high-dimensional non-linear feature screening

The feature selection problem, under the linear model assumption, has been extensively studied over the last twenty years under various data setups. Popular algorithms include the Lasso, Elastic net, SCAD, and MCP; a full review of existing methods can be found elsewhere (e.g., Fan and Lv (2010)). Despite their successes, feature selection algorithms under the linear model assumption tend to have poor performance in settings where the underlying functional form deviates from linearity. To highlight this limitation, we conducted a basic implementation of the prediction-optimal elastic net (Zou and Hastie, 2005) and found that this method results in very poor performance (a coefficient of determination of roughly 20%), necessitating non-linear association approaches. As we discussed in detail in Chapter 3, Artificial Neural Network (ANN) models, which relax the linearity assumption, are well known for efficiently approximating complicated functions. This key feature of ANN models has motivated the use of Deep Learning (DL) models for feature selection in recent years.
Popular examples include Deep Feature Selection (Chen et al., 2021), DeepPINK (Lu et al., 2018), and SurvNet (Song and Li, 2021). Despite their popularity, many of the existing DL algorithms can be overly sensitive to noise. Recent works have shown that a small amount of added noise can drastically change the importance of variables in the model (Ghorbani et al., 2019). As a solution, reproducible variable selection with some form of error control has been advocated. In this realm, the control of the False Discovery Rate (FDR), first proposed by Benjamini and Hochberg (1995), has emerged as a major approach due to being less conservative and more powerful than the Family-Wise Error Rate (FWER), especially in large-scale multiple testing problems. However, estimating the expectation in the FDR poses a unique challenge for model-free variable selection methods and has been investigated in various outlets. Many of the existing approaches have focused on p-values as feature importance measures in a multiple testing context; see Xia et al. (2017), Li and Barber (2018), and Lei and Fithian (2018) for more details. However, generating interpretable p-values for DL models has proven to be complicated, prompting researchers to seek alternatives. One such alternative is the knockoff method proposed by Candès et al. (2018), which is essentially a model-free variable selection algorithm with provable FDR control, assuming a well-specified predictor distribution. Hence, to generate the knockoffs, it is necessary to specify or estimate a high-dimensional predictor distribution, which in many practical settings may be an overwhelming hurdle to overcome. Unlike in genetics studies, where the linkage disequilibrium assumption is often made to describe a consistent dependence pattern between the alleles at polymorphisms (Sesia et al. (2018)), with tractography data such a prior correlation pattern cannot be imposed. For example, Figure 4.1 shows that for some tracts the correlations among spatially distant vertices along a tract may be much higher, reflecting the spatial symmetry of the white matter representation. Hence, for the tractography data, any model-specific knockoff generation algorithm would be highly inefficient. Recently, DL-based flexible knockoff-generating algorithms have also been proposed (Jordon et al., 2019; Romano et al., 2020); however, these methods are often trained with large samples in big-n-small-p data settings. It is unclear how these methods will perform when the sample size n is significantly smaller than the dimension of the covariates p, as in the current tractography dataset with n = 210 subjects and p ≃ 5000 high-dimensional DMRI metrics. Additionally, in the context of hierarchical testing, several other competing algorithms have been proposed, including SUSIE (Wang et al., 2020b) and KnockoffZoom (Sesia et al., 2019). While the knockoff-based procedures have the limitation of generating knockoffs from an unknown complex distribution with a comparatively small sample size, most of the non-knockoff-based methods lack applicability in non-linear setups, as they typically depend on p-values.

Highly correlated predictors measured intermittently along the tracts

Basic exploratory analyses of UM-MAP data show that the DMRI metrics are highly correlated, with most of the local pairwise correlations exceeding 0.95 and even 0.99 for some tracts (see Figure 4.1).
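Such vertex-by-vertex correlation maps are straightforward to produce; below is a minimal sketch, assuming the sh-peak values of a single tract are stored as an n x p_v array (a random placeholder matrix is used here so the snippet runs, and should be replaced with the real tract data).

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for the real n x p_v matrix of sh-peak values along one tract
# (n subjects, p_v vertices); replace with the actual tract data.
tract_values = np.random.default_rng(0).standard_normal((210, 98))

corr = np.corrcoef(tract_values, rowvar=False)   # p_v x p_v vertex-by-vertex correlations
plt.imshow(corr, vmin=-1, vmax=1, cmap="RdBu_r")
plt.colorbar(label="Pearson correlation")
plt.xlabel("vertex index along tract")
plt.ylabel("vertex index along tract")
plt.title("Vertex-level correlation of sh-peaks (one tract)")
plt.show()
```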
Figure 4.1 Correlation heatmap of the sh-peaks metric observed along major WM tracts, including (1) CC-4, (2) ICP-left, (3) MCP, and (4) STR-left, showing distinguishable correlation structures.

This is a well-known complication for feature selection problems in many modern data sets arising in genetics and imaging studies. This extreme association is problematic for a typical variable selection analysis, as the highly correlated predictors become almost indistinguishable in view of the phenotype. Ignoring this extreme association and conducting a variable selection that focuses solely on individual variables is meaningless, as it does not account for the uncertainty due to highly correlated predictors. One can argue that it would be unjustifiable to specifically claim that one of the highly correlated predictors is associated with the response. Alternatively, an approach that accounts for high correlations is the group or cluster selection method, which has been applied in genetics and many other applications. A complication of this approach, however, is the ambiguity in how to define the clusters and subsequently perform the selection for the clusters. In recent works, Candès et al. (2018) considered a heuristic approach where the predictors are clustered according to their pairwise correlations. This method works reasonably well in general but may generate larger clusters, rendering the choice of the representative feature of each cluster nontrivial. More importantly, larger clusters may not be very informative from a substantive viewpoint. Cluster-based variable selection methods using conditional associations, which generate smaller clusters, have also been proposed, but their application to tractography data is critically lacking. In addition to the complexities related to high correlation, although the tractography streamlines provide a continuous representation of WM, the DMRI metrics were measured only at a few equidistant locations on each tract. This poses the generic problem of 'missing values' in statistical research, as it is unclear whether the observed measurements represent the true locations. Without a better understanding of the true DMRI metric locations, determining the true signal location is at best problematic. This problem is exacerbated in neuroscience research, where low sensitivity is a real concern for many feature selection exercises.

4.1.3 The general analytic strategy

The study of WM as a potential marker of cognitive reserve using UM-MAP data is hampered by various methodological and conceptual limitations in WM measurement and statistical modeling. The above discussion highlights the need to develop a methodology capable of accommodating variation across vertices while accounting for the high correlation both between and within tract value measurements. The present study sought to address and overcome these challenges by providing a novel quantitative neuroimaging approach for modeling WM pathways as a neural marker for predicting cognitive reserve. Specifically, it will help determine the most important regions of human WM tracts that are associated with cognitive reserve as well as formulate a general workflow for high-dimensional nonlinear variable selection with highly correlated predictors. Our methodological contribution relies essentially on a novel screening and cleaning method for reproducible high-dimensional nonlinear feature selection with highly correlated predictors.
As the spatial positions are inherently continuous while the DMRI metrics are observed at some discrete vertices on the tract, we consider here the cluster-level discovery of the positions by leveraging the high association among the vertices. To be concrete, let C_causal = {Z_1^c, Z_2^c, . . . , Z_s^c} denote the unknown true set of causal locations along the tracts that harbor the variability which influences the phenotype of interest. We also denote by C_observed = {Z_1^o, Z_2^o, . . . , Z_p^o} the locations at which the DMRI metrics are observed. Although there is no guarantee that these metrics are observed at the specific positions in C_causal, there are two possibilities for observing some true causal location Z_j^c ∈ C_causal: (1) it is observed, i.e., Z_j^c = Z_k^o for some k; or (2) although Z_j^c is not observed, it is fair to assume that a close proxy location Z_k^o is observed for some k due to the strong local dependency along each tract. For this reason, we seek to discover clusters of locations that can jointly serve as a good proxy for the locations in C_causal. With this in mind, we set our target to uncover the subset S_0 ⊂ {1, 2, . . . , p} such that, conditional on the features in S_0, the response Y_i is independent of the features in the complement set S_0^c. In other words, S_0 = {k : f(y|X) depends on X_k}, where X_k is the realization of the DMRI metric at the observed location Z_k^o and f(y|X) is the conditional density of y given X. The members of the subset S_0 can be interpreted as either true causal locations or the closest proxies of causal locations. To achieve this goal, we implement our proposed method SciDNet from Chapter 3, which divides the procedure into two parts: Screening and Cleaning. The screening step is a dimension reduction step: we screen out most of the null variables and select an active set of variables Ŝ_n which will surely contain all the proxy variables needed to cover the causal set C_causal, implying the sure screening property. To reduce the high amount of correlation in the active set, by exploiting their conditional dependency structure, we divide the active variables {X_j : j ∈ Ŝ_n} into p_c (<< p) spatially connected non-overlapping clusters C_1, C_2, . . . , C_{p_c} and select an appropriate cluster representative from each cluster. In the cleaning step, we develop an estimate of the number of false discoveries using a resampling technique, followed by a surrogate of the FDR. Finally, by controlling the surrogate FDR, we select clusters of highly correlated predictors. Here, a true discovery means that the selected cluster can serve as a good proxy for at least one element in the true causal set C_causal.
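Schematically, the screening and clustering stages can be organized as in the sketch below. This is a structural illustration only: the marginal correlation statistic and the adjacency-based merging rule are simplified stand-ins for the Henze-Zirkler-based screening and the conditional-dependency clustering of Chapter 3, and the cleaning step is only indicated in the closing comment.

```python
import numpy as np

def screen_and_cluster(X, y, n_keep, r=0.9):
    """Screening and clustering sketch: keep the n_keep features with the largest
    screening statistic (marginal |correlation| as a stand-in for the HZ-based
    statistic), then merge adjacent screened vertices whose pairwise correlation
    exceeds r into spatially connected clusters."""
    stat = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]))
    active = np.sort(np.argsort(stat)[-n_keep:])           # indices of the active set
    clusters, current = [], [active[0]]
    for j_prev, j in zip(active[:-1], active[1:]):
        if j == j_prev + 1 and np.corrcoef(X[:, j_prev], X[:, j])[0, 1] > r:
            current.append(j)                               # same spatially connected cluster
        else:
            clusters.append(current)
            current = [j]
    clusters.append(current)
    reps = [c[int(np.argmax(stat[c]))] for c in clusters]   # one representative per cluster
    return clusters, reps

# Cleaning (not shown): refit a sparse model (LassoNet in SciDNet) on bootstrap
# resamples of the representatives, aggregate selection frequencies into an FDR
# surrogate, and keep clusters while the estimated FDR stays below the target q.
```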
Our comprehensive empirical study utilizing DMRI metrics provides compelling evidence for the effectiveness of the proposed method as a proof of concept, as it achieves higher power while controlling the false discovery rate (FDR). While achieving theoretically guaranteed FDR control within a deep learning framework remains an active area of research, our study paves the way for further theoretical investigations into the method's generalizability and broader validity by utilizing SciDNet's theoretical guarantees on FDR control. The proposed approach stands out by not relying on strict modeling assumptions between the response and features and by being completely independent of p-values, distinguishing it from other state-of-the-art methods. Consequently, it facilitates a deeper understanding of micro-structural relationships in the human brain within the context of dementia and beyond. After providing a comprehensive discussion of assumptions and algorithms in Chapter 3, we now proceed directly to the numerical analysis involving tractography data. The notation used throughout this section remains consistent with that introduced in Chapter 3. We begin by presenting an extensive simulation study that specifically focuses on the tractography data in Section 4.2. Subsequently, we showcase the application of our methodology to the UM-MAP tractography dataset in Section 4.3. Finally, we conclude with a summary of our findings and outline future research directions in Section 4.4.

4.2 Numerical Illustrations

In this section, we extensively investigate the finite sample performance of the proposed method SciDNet, along with several other state-of-the-art FDR controlling algorithms, using a simulation study. First, Section 4.2.1 verifies the suitability of the nonparanormal transformation for the tractography data, thereby supporting our assumption that the distribution of the features belongs to the nonparanormal family. Second, in Section 4.2.2, we demonstrate how major FDR controlling approaches fail in the presence of severe multicollinearity in an ultrahigh-dimensional setting. Additionally, Section 4.2.3 presents the performance evaluation of the proposed method in a synthetic setting, where DMRI metrics (e.g., sh-peaks) from the tractography data serve as predictors and the response is generated using a non-linear function. We utilize two metrics to assess the feature selection performance of the algorithms: (1) Power = |D̂_n ∩ S_0| / |S_0| and (2) empirical FDR = |D̂_n ∩ S_0^c| / |D̂_n|.

4.2.1 Checking the nonparanormal transformation on tractography data

The nonparanormal transformation (Liu et al. (2009)) plays a crucial role in the proposed method SciDNet, particularly in its screening step. This step utilizes clustering to capture the conditional dependency structure after dimension reduction, ensuring the sure independence screening property (3.2.2.1) when the features belong to a nonparanormal family. Therefore, before applying SciDNet to the tractography data, it is essential to evaluate the effectiveness of the nonparanormal transformation on this dataset. To assess the efficiency of the nonparanormal transformation, we employ the following approach. We generate two sets of features, X_G^(i) ~ N(0, Σ_{p×p}) and X_t^(i) ~ t_p(5), i = 1, 2, . . . , n, from a multivariate Gaussian and a t_5 distribution, respectively, while maintaining a correlation structure defined by Σ_ij = 0.95^|i−j|. We set n = 220 and p = 5000 to mimic the dimensions of the tractography data. In addition, we consider the original DMRI tractography data X_tract ∈ R^{220×5000}. Next, we apply the nonparanormal transformation to the features in X_G, X_t, and X_tract. To evaluate the effectiveness of the transformation, we present Figure 4.2, which displays QQ-plots of the transformed features against the Gaussian distribution. The figure indicates a favorable alignment of the transformed features with the Gaussian distribution, demonstrating a pattern similar to the simulated settings. For ease of demonstration, we only depict the results for the 1st and 10th features.

Figure 4.2 Illustration of the nonparanormal transformation using simulated data and the tractography dataset.
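A rank-based (Gaussian-copula) version of the nonparanormal transformation, together with the QQ check just described, can be sketched as follows. The Winsorized empirical-CDF transform shown here is the standard estimate of Liu et al. (2009) and is given for illustration only, with independent t_5 margins in place of the correlated multivariate t used in the text.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def nonparanormal_transform(X, eps=None):
    """Marginal rank-based Gaussian transform: replace each column by
    Phi^{-1} of its Winsorized empirical CDF (Liu et al., 2009)."""
    n = X.shape[0]
    if eps is None:
        # truncation level delta_n = 1 / (4 n^{1/4} sqrt(pi log n)) from Liu et al. (2009)
        eps = 1.0 / (4.0 * n**0.25 * np.sqrt(np.pi * np.log(n)))
    ranks = stats.rankdata(X, axis=0) / n
    return stats.norm.ppf(np.clip(ranks, eps, 1 - eps))

# Heavy-tailed t_5 features mimicking the check described in the text
X_t = stats.t(df=5).rvs(size=(220, 5000), random_state=0)
Z = nonparanormal_transform(X_t)
stats.probplot(Z[:, 0], dist="norm", plot=plt)   # QQ-plot of the first transformed feature
plt.show()
```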
This experimental analysis verifies the validity of employing the nonparanormal transformation in the context of tractography data. Additionally, it confirms that the clusters generated by SciDNet effectively capture the conditional dependencies among the features.

4.2.2 Performance of existing feature selection methods in the presence of high multicollinearity

The predictors are first generated from X_i ~ N_p(0, Σ), i = 1, 2, . . . , n, for multiple combinations of (n, p), and the covariance matrix Σ is chosen as a Toeplitz matrix with Σ_ij = ρ^|i−j|, ρ = 0.1, 0.5, and 0.9. Under this simplistic setting, the response y is generated from y = x_S β_S + ε, S = {5, 10, . . . , 50}, |S| = 10, with β_S generated independently from N(β_0, 0.1) and β_{S^c} = 0. The random noise is ε ~ N(0, 1). We focus on the Model-X knockoff (Candès et al., 2018), SurvNet (Song and Li, 2021), and DeepPINK (Lu et al., 2018). For a more rigorous analysis, we consider two different versions of the Model-X knockoff: (1) Model-X-Estimated, where the knockoffs are generated using an estimated multivariate Gaussian distribution, and (2) Model-X-True, where the knockoffs are generated using the true data-generating multivariate Gaussian distribution mentioned above. For the knockoff generation, we consider the equicorrelated construction using the R package knockoff: The Knockoff Filter for Controlled Variable Selection. To implement SurvNet and DeepPINK, we use the code provided with the respective papers, Song and Li (2021) and Lu et al. (2018). We set q = 0.15 as the FDR control threshold. Table 4.1 reveals several interesting characteristics. Both Model-X-Estimated and Model-X-True maintain the power-FDR balance under a low correlation setup. However, with higher multicollinearity, Model-X-Estimated fails to control the FDR below the specified threshold while Model-X-True controls the FDR efficiently. This disparity indicates that the Model-X procedure induces an inflation in false discoveries if the knockoffs are not generated properly under a 'difficult' situation. As expected, the DL-based algorithms, such as SurvNet and DeepPINK, work much better in big-n-small-p and low correlation setups but typically fail in other cases, indicating their reduced effectiveness in ultrahigh dimensional data with small sample sizes.
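For reference, the empirical power and FDR reported in Table 4.1 and elsewhere can be computed from a selected set and the true support with a small helper such as the following (the function name is ours).

```python
def power_and_fdr(selected, true_support):
    """Empirical power |S_hat & S_0| / |S_0| and FDR |S_hat - S_0| / |S_hat| for one replicate."""
    selected, true_support = set(selected), set(true_support)
    power = len(selected & true_support) / len(true_support)
    fdr = len(selected - true_support) / max(len(selected), 1)
    return power, fdr

# Example: power_and_fdr({5, 10, 999}, set(range(5, 55, 5))) returns (0.2, 1/3)
```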
Table 4.1 Empirical power and observed FDR of various feature selection algorithms with standard error in parentheses. In each row, the first three entries correspond to ρ = 0.1, 0.5, 0.9 under β = 2 and the last three to ρ = 0.1, 0.5, 0.9 under β = 4.

Model-X-Estimated, (n, p) = (400, 1000)
Power: 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.13 (0.17) 0.12 (0.12) 0.27 (0.18) | 0.11 (0.19) 0.20 (0.18) 0.27 (0.20)

Model-X-True, (n, p) = (400, 1000)
Power: 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.08 (0.13) 0.09 (0.12) 0.14 (0.17) | 0.12 (0.14) 0.11 (0.13) 0.08 (0.12)

SurvNet, (n, p) = (400, 1000)
Power: 0.27 (0.20) 0.32 (0.22) 0.35 (0.24) | 0.49 (0.24) 0.52 (0.28) 0.58 (0.29)
FDR: 0.31 (0.36) 0.53 (0.30) 0.59 (0.23) | 0.21 (0.21) 0.53 (0.18) 0.60 (0.17)

SurvNet, (n, p) = (10000, 60)
Power: 0.99 (0.05) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.20 (0.15) 0.80 (0.02) 0.78 (0.07) | 0.14 (0.11) 0.80 (0.02) 0.56 (0.32)

DeepPINK, (n, p) = (400, 1000)
Power: 0.01 (0.02) 0.03 (0.04) 0.00 (0.00) | 0.03 (0.04) 0.01 (0.03) 0.02 (0.05)
FDR: 0.23 (0.40) 0.35 (0.42) 0.33 (0.47) | 0.45 (0.44) 0.24 (0.41) 0.24 (0.40)

DeepPINK, (n, p) = (10000, 60)
Power: 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) | 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
FDR: 0.18 (0.04) 0.29 (0.13) 0.25 (0.11) | 0.17 (0.01) 0.24 (0.12) 0.24 (0.12)

4.2.3 Performance of the proposed method on synthetic data

Prior to implementing the proposed method for cognitive reserve, we conduct a synthetic study to understand its behavior on the DMRI tractography data. As mentioned in Section 4.1.1, we consider p_t = 50 major WM tracts for n = 220 subjects. We measure the DMRI metric named spherical harmonic peak amplitudes (sh-peaks) at spatially equidistant p_v = 98 vertices on each of the tracts. Hence the feature vector x_i ∈ R^p, p = p_t p_v = 4900, contains all the sh-peak values measured at each of the vertices of the tracts for the i-th subject, i = 1, 2, . . . , 220. We further simulate the response y^(i) from x^(i) using a nonlinear model. For this purpose, single-index models are straightforward yet flexible examples of nonlinear models where the response is related to a linear combination of the features through an unknown nonlinear, monotonic link function, i.e., y^(i) = g(x^(i)′β) + ε. Here we choose the following two link functions as our data-generating process: (1) M1: g(x) = x^3/10 + 3x/10 and (2) M2: g(x) = log(10 + e^x). Similar to Section 4.2.2, the coefficient vector β ∈ R^p is sparse with true nonzero locations S = {50, 100, 150, 200, 250}, where β_{S^c} = 0 and β_S ~ N_S(uβ_0, 0.1) with signs u ∈ {±1}^s. The value of β_0 is set as β_0 = 1.5. The random error is ε ~ N(0, σ²), with the signal-to-noise ratio set at snr = 7:3 and 3:7. All the performance metrics are based on 50 Monte Carlo replications.
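To make the synthetic design of Section 4.2.3 concrete, the responses under the two single-index links can be generated as in the sketch below. The sh-peaks design matrix is replaced by a random placeholder so the snippet runs, and the form of M1 follows the reconstruction given above.

```python
import numpy as np

rng = np.random.default_rng(1)
p_t, p_v = 50, 98
n, p = 220, p_t * p_v
X = rng.standard_normal((n, p))               # placeholder for the n x p sh-peaks matrix

S = np.array([50, 100, 150, 200, 250]) - 1    # true nonzero locations (0-based)
beta = np.zeros(p)
signs = rng.choice([-1.0, 1.0], size=len(S))
beta[S] = signs * rng.normal(loc=1.5, scale=np.sqrt(0.1), size=len(S))   # beta_S ~ N(u * beta_0, 0.1)

index = X @ beta
g1 = index**3 / 10 + 3 * index / 10           # link M1 (as reconstructed above)
g2 = np.log(10 + np.exp(index))               # link M2

sigma = np.sqrt(np.var(g1) / (7 / 3))         # calibrate the noise to snr = 7:3
y = g1 + rng.normal(scale=sigma, size=n)      # response under M1
```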
Table 4.2 Power and empirical FDR of the proposed method with standard error in parentheses from the synthetic study. In each row, the five entries correspond to (1) Screening only, (2) Screening + Cleaning with LassoNet, (3) Screening + Model-X Knockoff (Linear), (4) Screening + Model-X Knockoff (RF), and (5) Screening + the proposed Cleaning (SciDNet).

M1, snr = 7:3
Power: 1.000 (0.00) 0.996 (0.03) 0.870 (0.34) 0.921 (0.39) 0.984 (0.05)
FDR: 0.992 (0.00) 0.525 (0.14) 0.178 (0.19) 0.182 (0.12) 0.098 (0.10)
n_var: 624 (0.00) 419.09 (44.56) 286.85 (121.50) 291.23 (37.41) 281.06 (60.59)
n_clust: 43.22 (2.60) 11.96 (5.91) 5.80 (2.87) 6.64 (0.62) 5.50 (0.73)

M1, snr = 3:7
Power: 1.000 (0.00) 0.992 (0.04) 0.788 (0.40) 0.841 (0.37) 0.970 (0.08)
FDR: 0.992 (0.00) 0.619 (0.14) 0.174 (0.24) 0.193 (0.21) 0.004 (0.02)
n_var: 624.00 (0.00) 405.56 (53.33) 236.90 (131.70) 248.76 (158.57) 285.54 (53.83)
n_clust: 49.44 (5.19) 14.94 (6.04) 5.78 (4.03) 6.92 (4.89) 4.87 (0.45)

M2, snr = 7:3
Power: 1.000 (0.00) 1.000 (0.00) 0.776 (0.42) 0.825 (0.29) 0.996 (0.03)
FDR: 0.992 (0.00) 0.656 (0.08) 0.176 (0.19) 0.183 (0.26) 0.021 (0.06)
n_var: 624 (0.00) 441.32 (48.80) 247.22 (140.54) 262.50 (88.01) 274.92 (45.48)
n_clust: 43.58 (4.01) 15.32 (3.89) 5.38 (3.41) 7.98 (2.75) 5.08 (0.34)

M2, snr = 3:7
Power: 0.991 (0.04) 0.960 (0.09) 0.427 (0.47) 0.436 (0.25) 0.533 (0.24)
FDR: 0.992 (0.00) 0.762 (0.10) 0.106 (0.21) 0.153 (0.17) 0.026 (0.09)
n_var: 624.00 (0.00) 357.73 (92.57) 98.93 (113.80) 99.46 (125.77) 115.04 (62.58)
n_clust: 49.18 (8.63) 24.33 (11.53) 3.16 (4.15) 3.22 (1.43) 2.71 (1.16)

Table 4.2 presents an ablation study where we compare the following methods: (1) Screening only, (2) Screening + Cleaning with LassoNet, (3) Screening + Cleaning with the Model-X knockoff (with linear and random forest statistics) at q = 20%, and (4) the proposed method, Screening + Cleaning with resampled LassoNet, at q = 10% and 20%, where q is the error-control cutoff. Focusing on the cluster-level discoveries, two additional performance metrics are included here: (1) n_var, the total number of features in the selected clusters, and (2) n_clust, the total number of selected clusters. Lower values of n_var and n_clust indicate better support recovery for a group-level feature selection method. In this context, Table 4.2 empirically consolidates several interesting characteristics: (a) the screening step maintains the sure screening property and thereby selects a slightly larger set of features, resulting in high power but also high FDR, which necessitates further cleaning; (b) all the cleaning steps aim to reduce the FDR while maintaining the power of the screening step; (c) the proposed method achieves the best performance (in terms of the power-FDR tradeoff) by effectively using the added information from the resampling. Although the power is comparatively low in the high-noise setup, the method still maintains the nominal rate of false discoveries. This empirical study sets the stage for the real data analysis in Section 4.3, where the simulated y is replaced by the cognitive reserve values from the UM-MAP study as the outcome variable.

4.3 Real Data Analysis - UM-MAP Tractography Data

As discussed in Section 4.1.1, the WM fiber tract segmentation facilitates a rigorous analysis of WM microstructural quantities and their relation to the cognitive performance of the human brain. In order to understand which segments of human WM tracts are associated with cognitive reserve, the proposed method has been implemented on the Michigan Alzheimer's Disease Research Center (MADRC) dataset.
We considered 50 major tracts that cover almost 90% of human WM (Wasserthal et al., 2018a); a full list of these tracts is relegated to Appendix B.

4.3.1 Results and Selected Tracts

We consider the cognitive reserve values calculated for n = 220 subjects as the outcome variable. Further, for the predictors, the sh-peaks metric is observed at p_v = 98 vertices spread in a spatially equidistant fashion along each of the p_t = 50 major WM tracts. Hence the feature space is of dimension p = p_t p_v = 4900. With this setting, we implement the proposed screening and cleaning method. In the screening step, we set the size of the active set to |Ŝ_n| = [7n/ log(n)] = 657, and the other 4243 vertices are rejected by the HZ-test and thereby discarded from the remaining analysis. The active set Ŝ_n is further divided into 137 disjoint clusters of spatially connected and highly correlated vertices. From each cluster, the representative is chosen as the member with the maximum value of the HZ-statistic, following which the set of cluster representatives S̃_n is created. Next, in the cleaning step, LassoNet is run in parallel for B = 1000 bootstrap replications. Each time, as described in Section 3.2.3, the L_1 penalty parameter λ is chosen in a sequential manner as λ_1 ≤ λ_2 ≤ · · · ≤ λ_r, where for λ_1 all the active predictors are present in the model. Then, with the gradual increase of λ, the representatives are eliminated from the model one after another, and finally λ_r produces the null model with no predictors. The proposed FDR estimate is then constructed by combining the variable importance measured over all the regularization paths for all the bootstrap replications. Controlling the estimated error rate at the threshold q = 0.15, the final set of clusters D̂_n is discovered.

Figure 4.3 Illustration of down-sampled diffusion tractography for streamlines that included significant clusters.

Demonstration of the selected tracts

The proposed method discovers several spatially connected parts of the following four major tracts: (1) the first (i.e., ventral genual) corpus callosum fiber bundle that serves as the inferior aspect of the forceps minor (CC1), (2) the right corticospinal tract (right CST), (3) the first subdivision of the right hemisphere superior longitudinal fasciculus (right SLF-I), and (4) the second subdivision of the left hemisphere superior longitudinal fasciculus (left SLF-II). Figure 4.3 demonstrates these selected tracts by showing three different orientations of the brain: coronal (top left), or as if facing a mirror; sagittal (top right), or viewed from the side; and axial (bottom left), or viewed from above. In order to display the mean trajectory of the tracts, the TractSeg tractometry algorithm (Wasserthal et al., 2018a) reduces multiple streamlines (e.g., 100 to 1000) originally generated for a given WM fiber tract down to a single representative streamline. Figure 4.3 shows four separate streamlines representing anatomically specific white matter fiber bundles. Each streamline is a curvilinear 3-dimensional object composed of thousands of linked vertices; however, we note that the illustration depicts 3-dimensional streamlines against 2-dimensional cross-sections of the brain. Data are shown in a radiological orientation where the left side of the image corresponds to the anatomical right and vice versa.
Additionally, we conducted a sensitivity analysis by implementing the proposed method on several subgroups of the whole data, taking both sh-peaks and FA as the DMRI metric. Table 4.3 shows the clusters of vertices on the selected tracts under the different subgroup selections and DMRI metric choices (sh-peaks and FA). These results reveal clusters of vertices in WM streamlines for the left and right superior longitudinal fasciculus (SLF) and the right hemisphere corticospinal tract (CST) where a higher sh-peak value significantly predicted higher cognitive reserve. In contrast, lower sh-peak values in the first (i.e., most anterior and ventral) subdivision of the corpus callosum (CC) predicted greater reserve. The SLF is commonly associated with working memory (Koshiyama et al. (2020)), whereas the CST is believed to be more related to motor function, albeit less strongly (Min et al. (2014)). Thus, greater maintenance of white matter in SLF-I and the right CST predicts greater preservation of memory, despite atrophy of medial temporal lobe gray matter regions. The CC1 fiber bundle is the inferior aspect of the genu, which connects the left and right ventral prefrontal cortices via the forceps minor. In contrast with the positive effect in the other tracts, the negative association with reserve in CC1 shows that a reduced sh-peak signal in this region predicts better cognitive maintenance. We note that these effects are observed in regions where fibers from the anterior cingulum bundle cross over the CC genu. How these two fiber systems may interact to inform cognitive reserve remains an area for future inquiry.

Table 4.3 Selected tracts and vertices. For each experiment (subgroup and DMRI metric), the selected clusters are listed with their tracts, vertices, and the sign of the association with cognitive reserve.

With all subjects, metric: sh-peaks
Cluster 1: CC-1, vertices 55, 56, 57, 58, 59, 60, 61, 62 (sign -)
Cluster 2: right SLF-I, vertex 98 (sign +); left SLF-II, vertices 1, 2, 3, 4, 5 (sign +)
Cluster 3: right CST, vertices 60, 61, 62, 63 (sign +)

Excluding AD_MD, metric: sh-peaks
Cluster 1: CC-1, vertices 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64 (sign -)
Cluster 2: left SLF-I, vertices 2, 3, 4, 5 (sign +)
Cluster 3: right CST, vertices 60, 61, 62, 63 (sign +)

With educ ≥ 15, metric: sh-peaks
Cluster 1: CC-1, vertices 56, 57, 58, 59, 60 (sign -)
Cluster 2: right SLF-I, vertex 98 (sign +); left SLF-II, vertices 1, 2, 3, 4, 5 (sign +)

With educ ≤ 17, metric: sh-peaks
Cluster 1: right SLF-I, vertex 98 (sign +); left SLF-II, vertices 1, 2, 3, 4, 5 (sign +)
Cluster 3: left ATR, vertices 8, 9, 10 (sign -)

With all subjects, metric: FA
Cluster 1: CC-4, vertices 39, 40, 41, 42, 58, 59, 60 (sign -)
Cluster 2: right ST-PREM, vertices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 (sign -)
Cluster 3: right AF, vertices 30, 31, 32, 33, 34, 35 (sign -)
Cluster 4: right SLF-II, vertices 33, 34, 35, 36, 37, 38 (sign -)

In addition, limiting the analysis to exclude those with dementia diagnoses resulted in a nearly identical pattern of selected vertices, although it was more sensitive to reserve, as reflected in a larger number of significant vertices in CC1. Similarly, the original pattern of results was largely maintained in participants with more years of formal education, although the vertices in the right CST were no longer significant predictors of reserve. Moreover, limiting the sample to those with less educational attainment revealed a negative effect of sh-peak values in the anterior thalamic radiation (ATR) of the left hemisphere on reserve. This tract also crosses CC1-2, the cingulum, and inferior frontal-occipital fibers, suggesting a similar cause as for CC1 in the overall model. The use of the sh-peak values estimated from the diffusion FODs provides a more orientationally specific estimate than voxel-level scalars like tensor-based estimates of anisotropy (FA) and diffusivity.
We also note that reserve is a counterintuitive construct, as high levels of reserve are apparent in those with indications of neurodegeneration (i.e., smaller brain volumes) combined with higher levels of memory performance than would be predicted from the linear relationship alone. Moreover, results from sampling and modeling along-tract FA values produced a markedly different pattern of results. All identified tract segments in the FA analysis were negatively associated with reserve, and all are in areas where FA values are confounded by crossing fibers (Douaud et al. (2011); Chad et al. (2021)). Because lower FA values in crossing fiber regions are associated with dementia risk, modeling sh-peaks and FA separately provides complementary insights into the WM tractography correlates of cognitive reserve.

4.4 Conclusion

In this work, we proposed a DL-based multi-resolutional feature selection algorithm tailored for highly correlated, ultra-high dimensional feature spaces. The contributions of our work are twofold. (1) From the statistical perspective, the proposed method efficiently combines several existing tools in the statistics and ML literature to circumvent some of the limitations of current DL-based models in handling complex data similar to that of the UM-MAP study. Specifically, it achieves significant dimension reduction while maintaining the type-I and type-II error trade-off by efficiently combining the added information from resampling. Due to the screening step, our method is scalable, and its resampling component can easily be implemented as parallel chains for faster computation. (2) From the application perspective, the proposed method addresses at least two critical shortcomings in the extant literature for handling tractography data in answering important questions about cognitive reserve. First, unlike existing approaches that treat diffusion tractography data as the response, the current analysis permits treating diffusion tractography data as the predictor. As prior methods have focused on binary or continuous predictors of differences in streamline-sampled values, the proposed method improves the validity of modeling streamline-sampled WM estimates as a predictor of behavior. Second, other methods for tractometry cannot model multiple streamlines together, much less the large number reported here. By integrating a more robust approach for controlling multiple tests and variable selection, this method permits simultaneous modeling of whole brain white matter tractography and discovers multiresolution clusters associated with neurogenetic disorders. Nevertheless, even this report utilized streamlines reduced to centroids with sampling points limited to 100 per WM tract. Future work is needed to capitalize on the considerably higher dimensionality of these data types as predictors of neurocognitive aging outcomes. The basic intuition, scalability, and exciting empirical results of the proposed method on simulated and real datasets encourage further research in multiple directions. From a theoretical perspective, we are actively working on developing a theoretical foundation for this 'screening' and 'cleaning' strategy with provable FDR control.
It would be worth mentioning that although we used the sure independence screening with HZ-test and LassoNet as the main tools, the proposed method puts forward a more generic framework and can be implemented with any other model- free feature screening method and sparsity-inducing DL algorithms like Feng and Simon (2017). One limitation we mainly focus on here is that the proposed method is developed considering regression setup with a continuous outcome because of the requirements of the Henze–Zirkler sure Independence test used in the screening step. Further research should be conducted for extending the proposed method to classification problems as well. From an application perspective, as the specific parts of the tracts selected by the proposed method have shown to have strong predictive ability, more interpretable nonlinear models (e.g. decision trees or random forest) can be entertained for further analysis. This will further provide deeper insights into the association of these selected tracts on the cognitive reserve, neurodegeneration, and beyond. 106 BIBLIOGRAPHY Bastiani, M., Cottaar, M., Dikranian, K., Ghosh, A., Zhang, H., Alexander, D. C., Behrens, T. E., Jbabdi, S., and Sotiropoulos, S. N. (2017). Improved tractography using asymmetric fibre orientation distributions. Neuroimage, 158:205–218. Beaulieu, C. (2002). The basis of anisotropic water diffusion in the nervous system–a technical review. NMR in Biomedicine: An International Journal Devoted to the Development and Application of Magnetic Resonance In Vivo, 15(7-8):435–455. Bender, A. R., Daugherty, A. M., and Raz, N. (2013). Vascular risk moderates associations between hippocampal subfield volumes and memory. Journal of cognitive neuroscience, 25(11):1851– 1862. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Method- ological), 57(1):289–300. Brandt, J. (1991). The hopkins verbal learning test: Development of a new memory test with six equivalent forms. Clinical Neuropsychologist, 5(2):125–142. Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: ‘model-x’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 80(3):pp. 551–577. Chad, J. A., Pasternak, O., and Chen, J. J. (2021). Orthogonal moment diffusion tensor decompo- sition reveals age-related degeneration patterns in complex fiber architecture. Neurobiology of Aging, 101:150–159. Chang, Y.-L., Chao, R.-Y., Hsu, Y.-C., Chen, T.-F., and Tseng, W.-Y. I. (2021). White matter network disruption and cognitive correlates underlying impaired memory awareness in mild cognitive impairment. NeuroImage: Clinical, 30:102626. Chen, Y., Gao, Q., Liang, F., and Wang, X. (2021). Nonlinear variable selection via deep neural networks. Journal of Computational and Graphical Statistics, 30(2):484–492. Colby, J. B., Soderberg, L., Lebel, C., Dinov, I. D., Thompson, P. M., and Sowell, E. R. (2012). Along-tract statistics allow for enhanced tractography analysis. Neuroimage, 59(4):3227–3242. Craft, S., Newcomer, J., Kanne, S., Dagogo-Jack, S., Cryer, P., Sheline, Y., Luby, J., Dagogo-Jack, A., and Alderson, A. (1996). Memory improvement following induced hyperinsulinemia in alzheimer’s disease. Neurobiology of aging, 17(1):123–130. Dhollander, T., Clemente, A., Singh, M., Boonstra, F., Civier, O., Duque, J. 
D., Egorova, N., Enticott, P., Fuelscher, I., Gajamange, S., et al. (2021). Fixel-based analysis of diffusion mri: methods, 107 applications, challenges and opportunities. Neuroimage, 241:118417. Dodge, H. H., Goldstein, F. C., Wakim, N. I., Gefen, T., Teylan, M., Chan, K. C. G., Kukull, W. A., Barnes, L. L., Giordani, B., Hughes, T. M., Kramer, J. H., Loewenstein, D. A., Marson, D. C., Mungas, D. M., Mattek, N., Sachs, B. C., Salmon, D. P., Willis-Parker, M., Welsh-Bohmer, K. A., Wild, K. V., Morris, J. C., Weintraub, S., and National Alzheimer’s Coordinating Center (NACC) (2020). Differentiating among stages of cognitive impairment in aging: Version 3 of the uniform data set (UDS) neuropsychological test battery and MoCA index scores. Alzheimers Dement. (N. Y.), 6(1):e12103. Douaud, G., Jbabdi, S., Behrens, T. E., Menke, R. A., Gass, A., Monsch, A. U., Rao, A., Whitcher, B., Kindlmann, G., Matthews, P. M., and Smith, S. (2011). Dti measures in crossing-fibre areas: Increased diffusion anisotropy reveals early white matter alteration in mci and mild alzheimer’s disease. NeuroImage, 55(3):880–890. Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101. Feng, J. and Simon, N. (2017). Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592. Ghorbani, A., Abid, A., and Zou, J. (2019). Interpretation of neural networks is fragile. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3681–3688. Jordon, J., Yoon, J., and van der Schaar, M. (2019). KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In International Conference on Learning Representations. Koshiyama, D., Fukunaga, M., Okada, N., Morita, K., Nemoto, K., Yamashita, F., Yamamori, H., Yasuda, Y., Matsumoto, J., Fujimoto, M., Kudo, N., Azechi, H., Watanabe, Y., Kasai, K., and Hashimoto, R. (2020). Association between the superior longitudinal fasciculus and perceptual organization and working memory: A diffusion tensor imaging study. Neuroscience Letters, 738:135349. Lei, L. and Fithian, W. (2018). Adapt: an interactive procedure for multiple testing with side infor- mation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):649–679. Li, A. and Barber, R. F. (2018). Multiple testing with the structur. . . adaptive benjaminihochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology). Liu, H., Lafferty, J., and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(10). Lu, Y., Fan, Y., Lv, J., and Stafford Noble, W. (2018). Deeppink: reproducible feature selection in deep neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, 108 N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. McPhee, G. M., Downey, L. A., and Stough, C. (2019). Effects of sustained cognitive activity on white matter microstructure and cognitive outcomes in healthy middle-aged adults: A systematic review. Ageing Res. Rev., 51:35–47. Min, Z.-G., Rana, N., Niu, C., Ji, H.-M., and Zhang, M. (2014). Does diffusion tensor tractography of the corticospinal tract correctly reflect motor function? Med. Princ. Pract., 23(2):174–176. Raffelt, D. A., Tournier, J.-D., Smith, R. E., Vaughan, D. 
N., Jackson, G., Ridgway, G. R., and Connelly, A. (2017). Investigating white matter fibre density and morphology using fixel-based analysis. Neuroimage, 144:58–73. Raz, N., Lindenberger, U., Rodrigue, K. M., Kennedy, K. M., Head, D., Williamson, A., Dahle, C., Gerstorf, D., and Acker, J. D. (2005). Regional brain changes in aging healthy adults: general trends, individual differences and modifiers. Cerebral cortex, 15(11):1676–1689. Romano, Y., Sesia, M., and Candès, E. (2020). Deep knockoffs. Journal of the American Statistical Association, 115(532):1861–1872. Sesia, M., Katsevich, E., Bates, S., Candès, E., and Sabatti, C. (2019). Multi-resolution localization of causal variants across the genome. bioRxiv. Sesia, M., Sabatti, C., and Candès, E. J. (2018). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1):1–18. Song, Z. and Li, J. (2021). Variable selection with false discovery rate control in deep neural networks. Nature Machine Intelligence, 3(5):426–433. Stern, Y., Arenaza-Urquijo, E. M., Bartrés-Faz, D., Belleville, S., Cantilon, M., Chetelat, G., Ewers, M., Franzmeier, N., Kempermann, G., Kremen, W. S., et al. (2020). Whitepaper: Defining and investigating cognitive reserve, brain reserve, and brain maintenance. Alzheimer’s & Dementia, 16(9):1305–1311. Tournier, J.-D., Calamante, F., and Connelly, A. (2012). Mrtrix: diffusion tractography in crossing fiber regions. International journal of imaging systems and technology, 22(1):53–66. Tournier, J.-D., Smith, R., Raffelt, D., Tabbara, R., Dhollander, T., Pietsch, M., Christiaens, D., Jeurissen, B., Yeh, C.-H., and Connelly, A. (2019). Mrtrix3: A fast, flexible and open software framework for medical image processing and visualisation. NeuroImage, 202:116137. Uddin, L. Q. (2021). Cognitive and behavioural flexibility: neural mechanisms and clinical considerations. Nature Reviews Neuroscience, 22(3):167–179. 109 Vos, S. B., Jones, D. K., Viergever, M. A., and Leemans, A. (2011). Partial volume effect as a hidden covariate in DTI analyses. Neuroimage, 55(4):1566–1576. Wang, F., Ren, S.-Y., Chen, J.-F., Liu, K., Li, R.-X., Li, Z.-F., Hu, B., Niu, J.-Q., Xiao, L., Chan, J. R., and Mei, F. (2020a). Myelin degeneration and diminished myelin renewal contribute to age-related deficits in memory. Nat. Neurosci., 23(4):481–486. Wang, G., Sarkar, A., Carbonetto, P., and Stephens, M. (2020b). A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv. Wasserthal, J., Maier-Hein, K. H., Neher, P. F., Northoff, G., Kubera, K. M., Fritze, S., Harneit, A., Geiger, L. S., Tost, H., Wolf, R. C., et al. (2020). Multiparametric mapping of white matter microstructure in catatonia. Neuropsychopharmacology, 45(10):1750–1757. Wasserthal, J., Neher, P., and Maier-Hein, K. H. (2018a). Tractseg - fast and accurate white matter tract segmentation. NeuroImage, 183:239–253. Wasserthal, J., Neher, P., and Maier-Hein, K. H. (2018b). Tractseg-fast and accurate white matter tract segmentation. NeuroImage, 183:239–253. Wasserthal, J., Neher, P. F., Hirjak, D., and Maier-Hein, K. H. (2019). Combined tract segmentation and orientation mapping for bundle-specific tractography. Medical image analysis, 58:101559. Xia, F., Zhang, M. J., Zou, J. Y., and Tse, D. (2017). Neuralfdr: Learning discovery thresholds from hypothesis features. In NIPS. Xie, L., Wisse, L. E. M., Das, S. R., Wang, H., Wolk, D. A., Manjón, J. V., and Yushkevich, P. A. (2016). 
Xie, L., Wisse, L. E. M., Das, S. R., Wang, H., Wolk, D. A., Manjón, J. V., and Yushkevich, P. A. (2016). Accounting for the confound of meninges in segmenting entorhinal and perirhinal cortices in T1-weighted MRI. Med. Image Comput. Comput. Assist. Interv., 9901:564–571.
Yushkevich, P. A., Pluta, J. B., Wang, H., Xie, L., Ding, S.-L., Gertje, E. C., Mancuso, L., Kliot, D., Das, S. R., and Wolk, D. A. (2015). Automated volumetry and regional thickness analysis of hippocampal subfields and medial temporal cortical structures in mild cognitive impairment. Hum. Brain Mapp., 36(1):258–287.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.

APPENDIX A

MADC TRACTOGRAPHY DATA - MORE INFORMATION

A.1 Study sample.

Data were drawn from the University of Michigan Memory and Aging Project (UM-MAP), the primary clinical cohort at the MADRC. The sample included 221 participants (67% women) from 51 to 89 years of age. A consensus diagnosis was made by neurologists, neuropsychologists, nurses, social workers, and other specialists during a consensus conference, following neuropsychological evaluation under the National Alzheimer's Coordinating Center (NACC) criteria. The sample was divided into three subgroups based on the last recorded diagnosis for each participant (Table 1): cognitively unimpaired (CU; n=117; 73% women), amnestic or non-amnestic MCI (MCI; n=62; 70% women), and multi-domain amnestic dementia (DAT; n=42; 55% women) consistent with Alzheimer's disease and mixed dementia.

A.1.1 MRI acquisition.

All neuroimaging data were acquired on a 3 Tesla General Electric Discovery Magnetic Resonance System equipped with a 32-channel receiving/transmitting head coil at the University of Michigan's Functional MRI Laboratory. T1-weighted structural images were collected with the following parameters: TR=3173.1 ms; TE=24.0 ms; inversion time=896 ms; flip angle=111°; FOV=220×220 mm; 43 axial slices with thickness=3 mm and no spacing; acquisition time=100 s. A diffusion-weighted 2D dual spin echo pulse sequence was acquired in 81 axial slices with voxel dimensions of 1.7 mm³; repetition time (TR)=4100 ms; echo time (TE)=2.5 ms; field of view (FOV)=240×240 mm. We acquired 96 volumes, including 6 without diffusion weighting (b=0), 30 volumes with weightings of b=700 s/mm², and 60 volumes with b=2000 s/mm², for a total of 96 gradient encoding directions. In addition, a 2D spin echo field map image was acquired with the same dimensions and was used to create a reverse phase encoded b0 image for distortion corrections.

A.2 MRI processing.

Image processing utilized the high-performance computing cluster at Michigan State University, with Intel Xeon Gold 6148 CPU cluster nodes. All dMRI pre-processing and processing closely followed the steps for multi-shell multi-tissue (MSMT) constrained spherical deconvolution and fixel-based analysis published in the MRtrix user manual. These included algorithmic preprocessing steps to mitigate thermal noise, Gibbs ringing artifacts, nonlinear distortions from motion and eddy currents, and intensity bias (Raffelt et al. (2017); Tournier et al. (2019, 2012)). All pre-processed data were upsampled to a voxel dimension of 1.25 mm³. Individual MSMT response functions were estimated for each upsampled volume using the Dhollander algorithm (Dhollander et al. (2021)).
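To make the processing chain above concrete, the following is a minimal Python sketch of the denoising, upsampling, response-function, FOD, and tensor/FA steps as MRtrix3 command-line calls. This is an illustration under assumptions rather than the exact pipeline used in the study: all file names are placeholders, the motion/eddy-current/distortion and bias-field corrections are only indicated by a comment, and the options shown may differ from those actually applied.

```python
import subprocess

def mrtrix(*args):
    """Run a single MRtrix3 command and fail loudly if it errors."""
    subprocess.run(list(args), check=True)

# Illustrative file names; the actual study data and options may differ.
dwi = "sub-001_dwi.mif"

# Thermal-noise and Gibbs-ringing mitigation.
mrtrix("dwidenoise", dwi, "dwi_den.mif")
mrtrix("mrdegibbs", "dwi_den.mif", "dwi_deg.mif")

# (Motion/eddy-current/distortion correction using the reverse phase-encoded
#  b0, and bias-field correction, would be run here.)

# Upsample to 1.25 mm isotropic voxels.
mrtrix("mrgrid", "dwi_deg.mif", "regrid", "-voxel", "1.25", "dwi_up.mif")

# Individual MSMT response functions via the Dhollander algorithm,
# followed by multi-shell multi-tissue CSD to obtain per-voxel FODs.
mrtrix("dwi2response", "dhollander", "dwi_up.mif", "wm.txt", "gm.txt", "csf.txt")
mrtrix("dwi2fod", "msmt_csd", "dwi_up.mif",
       "wm.txt", "wmfod.mif", "gm.txt", "gm.mif", "csf.txt", "csf.mif")

# Tensor fit and FA map, using dwi2tensor / tensor2metric as noted in the text.
mrtrix("dwi2tensor", "dwi_up.mif", "tensor.mif")
mrtrix("tensor2metric", "tensor.mif", "-fa", "fa.mif")

# Spherical-harmonic peaks of the WM FODs, later used as TractSeg input.
mrtrix("sh2peaks", "wmfod.mif", "peaks.nii.gz")
```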
The individual response functions were averaged, and the mean values were used to compute fiber orientation distributions (FODs) in each voxel. A subset of FODs from 59 cases, divided approximately equally between the diagnostic groups, was used to compute a sample-specific FOD template and a template mask; all cases were subsequently transformed nonlinearly to the template space. Next, we estimated the spherical harmonic peak amplitudes (sh-peaks) of the FODs from each template-registered case and from the FOD template. We used the dwi2tensor and tensor2metric commands in MRtrix to compute tensors in the upsampled, preprocessed dMRI volumes and to estimate individual FA images. The transformations previously computed for registering the FODs to the template were reused to warp the upsampled FA data to the group FOD template. Using the TractSeg 2.3 algorithm (Wasserthal et al. (2018b, 2019, 2020)), we estimated 72 white matter fiber bundles from the group template sh-peaks; the TractSeg tractometry function was used to resample each fiber bundle to a single centroid streamline and to distribute 100 equally spaced sampling points along the length of the representative streamline. For each representative streamline, we sampled the sh-peak and FA values under each sampling point for each participant.

Automated segmentation of hippocampal subfields (ASHS; Yushkevich et al. (2015)) is a multi-atlas label fusion method for determining anatomical boundaries and quantifying volumes in specific regions of the brain's medial temporal lobe. We used the ASHS-PMC-T1 atlas (Xie et al. (2016)) developed for segmenting bilateral volumes of the anterior and posterior hippocampi and entorhinal cortices from conventional T1-weighted structural MRI data. All segmentations were inspected for quality by trained laboratory personnel. The volume of the intra-cranial vault estimated by ASHS was used to statistically adjust output volumes for differences in participants' head sizes using the ANCOVA method (Bender et al. (2013); Raz et al. (2005)).

A.3 Cognitive testing.

Neuropsychological tests from the Uniform Data Set 3 (UDS3) included measures of delayed recall from the Hopkins Verbal Learning Test (HVLT) and the Craft Story Test (Brandt (1991); Craft et al. (1996)). Measures of delayed recall in these tests of episodic memory are sensitive to amnestic MCI and preclinical Alzheimer's disease (Dodge et al. (2020)).

A.4 Cognitive reserve.

We standardized and averaged the delayed-recall scores from the Craft Story and HVLT tasks to create a composite of episodic memory. Similarly, we averaged z-scores for the volumes of the bilateral entorhinal cortices and the anterior and posterior hippocampi to generate a brain volume composite score. The memory composite was then regressed on the brain volume composite, and the residuals served as the measure of cognitive reserve.
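As a concrete illustration of this residual-based definition of cognitive reserve, the following Python sketch standardizes the component scores, forms the two composites, and retains the residuals from an ordinary least-squares regression of the memory composite on the brain volume composite. The column names and the randomly generated table are hypothetical placeholders standing in for the actual UM-MAP variables.

```python
import numpy as np
import pandas as pd

def zscore(x):
    """Standardize a Series to mean 0 and unit standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical per-participant data mirroring the variables described above
# (delayed-recall scores and head-size-adjusted regional volumes); n=221 as in the study.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "craft_delayed_recall": rng.normal(size=221),
    "hvlt_delayed_recall": rng.normal(size=221),
    "entorhinal_volume": rng.normal(size=221),
    "anterior_hippocampus_volume": rng.normal(size=221),
    "posterior_hippocampus_volume": rng.normal(size=221),
})

# Episodic memory composite: average of standardized delayed-recall scores.
memory = df[["craft_delayed_recall", "hvlt_delayed_recall"]].apply(zscore).mean(axis=1)

# Brain volume composite: average of standardized regional volumes.
brain = df[["entorhinal_volume",
            "anterior_hippocampus_volume",
            "posterior_hippocampus_volume"]].apply(zscore).mean(axis=1)

# Regress memory on brain volume (with intercept); the residuals are the
# cognitive reserve measure: memory performance not explained by brain volume.
X = np.column_stack([np.ones(len(brain)), brain.to_numpy()])
beta, *_ = np.linalg.lstsq(X, memory.to_numpy(), rcond=None)
cognitive_reserve = memory.to_numpy() - X @ beta
```

In the study itself, the placeholder columns would be replaced by participants' actual test scores and ICV-adjusted volumes; the residualization step is unchanged.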
APPENDIX B

LIST OF MAJOR WM TRACTS

The list of the 50 major WM tracts considered in the current study is given below. Pictorial demonstrations of these tracts and related information can be found in Wasserthal et al. (2018a).

AF_left, AF_right, ATR_left, ATR_right, CC_1, CC_2, CC_3, CC_4, CC_5, CC_6, CC_7, CG_left, CG_right, CST_left, CST_right, FPT_left, FPT_right, ICP_left, ICP_right, IFO_left, IFO_right, ILF_left, ILF_right, MCP, OR_left, OR_right, POPT_left, POPT_right, SCP_left, SCP_right, SLF_I_left, SLF_I_right, SLF_II_left, SLF_II_right, SLF_III_left, SLF_III_right, STR_left, STR_right, UF_left, UF_right, T_PREM_left, T_PREM_right, T_PAR_left, T_PAR_right, T_OCC_left, T_OCC_right, ST_FO_left, ST_FO_right, ST_PREM_left, ST_PREM_right

CHAPTER 5

CONCLUSIONS

In this thesis, we have explored advanced statistical methodologies for feature selection in ultra-high dimensional datasets, with a specific focus on the analysis of tractography data in the context of the UM-MAP study. Our research has been guided by the need to address the challenges posed by high dimensionality, strong associations within and between tract-level measurements, and the complex functional relationship between these measurements and the phenotype.

Through an iterative research process, we started with a linear model-based approach and extended it to develop SciDNet, a deep learning-based method that combines the rigor of statistical inference with the flexibility of neural networks. Through extensive empirical studies and simulations, we have demonstrated that the proposed methods achieve higher power with a controlled false discovery rate (FDR) than other state-of-the-art methods. We have also examined the assumptions and algorithms underlying SciDNet, providing theoretical justification and practical insights into its application.

Our findings have significant implications for understanding microstructural relationships in the human brain, particularly in the context of dementia and beyond. By providing a method that does not rely on strict modeling assumptions or p-values, SciDNet contributes to a better understanding of the intricate connections between DMRI metrics and phenotype.

In conclusion, this thesis has addressed critical challenges in the analysis of tractography data by proposing and validating the SciDNet method. The insights gained from this research will pave the way for further theoretical investigations, generalizability studies, and broader applications. We are confident that the methodologies developed in this thesis will contribute to advancements in the field of high-dimensional statistics and enhance our understanding of complex brain connectivity patterns.

We would like to express our gratitude to all the individuals who have supported and encouraged us throughout this research endeavor. Their contributions, guidance, and belief in our abilities have been invaluable. This thesis marks the culmination of years of hard work, dedication, and collaboration, and we are grateful for the opportunity to contribute to the field of statistical analysis in such a meaningful way.

As we conclude this chapter, we look forward to future research and collaborations that will build upon the foundations laid in this thesis. The journey does not end here; it continues with new questions, challenges, and discoveries that will shape the future of statistical analysis in the realm of high-dimensional datasets. Thank you for joining us on this intellectual exploration and for being a part of this remarkable journey.