COMPREHENSIVE APPROACHES TO HIGH-DIMENSIONAL HETEROGENEOUS DATA: SEMIPARAMETRIC METHODS AND FEATURE SELECTION

By Sang Kyu Lee

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy

2024

ABSTRACT

The advent of new technologies has introduced a broad spectrum of data types, necessitating advanced methodologies beyond traditional linear models. In scenarios where covariates include both linear and unknown or non-linear relationships with the response variable, relying solely on either parametric or nonparametric methods can be inadequate. This dissertation addresses these complexities using a semiparametric approach called the partial linear trend filtering model. Instead of classical nonparametric methods, a more recent technique, trend filtering, is integrated into the high-dimensional partial linear model. Simulation studies indicate that these models handle heterogeneous data more effectively than traditional approaches, and rigorous theoretical results, including convergence rates for the estimators, are established.

Additionally, this dissertation explores high-dimensional partial linear quantile regression to assess the heterogeneous effects of covariates on different quantiles of the response variable. By applying trend filtering to partial linear quantile regression, the strengths of both quantile regression and trend filtering are combined, supported by rigorous theoretical and simulation results. For practical validation, the partial linear quantile trend filtering model is applied to the Environment and Genetics in Lung Cancer Etiology (EAGLE) study data, showcasing its applicability and effectiveness in real-world data analysis.

Furthermore, only a small number of works have addressed feature selection and False Discovery Rate (FDR) control within the high-dimensional quantile regression framework. Inspired by the model-X knockoff procedure [9], a new method is introduced for simultaneously controlling FDR and detecting important covariates via the regional quantile regression approach. This three-step procedure identifies signals within the quantile region of interest rather than at a specific quantile level, effectively controlling the FDR. Simulation studies demonstrate the utility of this method, which is also applied to National Health and Nutrition Examination Survey (NHANES) data to identify risk factors associated with high body mass index.

Copyright by SANG KYU LEE 2024

ACKNOWLEDGEMENTS

I am immensely thankful to my advisors, Drs. Hyokyoung Grace Hong and Haolei Weng, for their unwavering support, invaluable guidance, and expert mentorship throughout my thesis journey. Their encouragement, insights, and constructive feedback have been instrumental in shaping my research and academic growth. I would also like to extend my heartfelt appreciation to my committee members, Drs. Antonio Galvao and Yuehua Cui, for their valuable contributions and insights that enriched my research and thesis. Furthermore, I am deeply grateful to my parents for their unconditional love, unwavering support, and sacrifices that have been my foundation throughout my PhD program. Their belief in me has been a constant source of motivation. I am also profoundly thankful to my wife, Hyun Jin, for her unwavering support, understanding, and patience during this challenging period. Her encouragement and belief in my dreams have been a source of strength.
Special thanks are also due to my friends Nicholas Yi and Jong In Lim, among others, for their friendship, support, and understanding during the ups and downs of my academic journey. I am equally grateful to my New Hope Baptist Church family, including Pastor Jeong and Dr. Pyeon, for their prayers, encouragement, and unwavering support throughout my PhD program. Their faith and community have been a source of inspiration and strength.

Together, my advisors, committee members, parents, wife, friends, and church family have been my pillars of support, enabling me to navigate the academic challenges with determination and resilience. I am truly blessed to have such exceptional individuals in my life.

TABLE OF CONTENTS

CHAPTER 1 OVERVIEW
CHAPTER 2 HIGH-DIMENSIONAL PARTIAL LINEAR MODEL WITH TREND FILTERING
2.1 Introduction
2.2 Model Description and Assumptions
2.3 Main Theoretical Results
2.4 Simulation Studies
2.5 Conclusion and Discussion
2.6 Proofs
CHAPTER 3 HIGH-DIMENSIONAL PARTIAL LINEAR QUANTILE REGRESSION WITH TREND FILTERING
3.1 Introduction
3.2 Model Description and Assumptions
3.3 Main Theoretical Results
3.4 Simulation Studies
3.5 Real Data Analysis
3.6 Conclusion and Discussion
3.7 Proofs
CHAPTER 4 FALSE DISCOVERY RATE CONTROL VIA REGIONAL QUANTILE REGRESSION ON ULTRA-HIGH DIMENSION
4.1 Introduction
4.2 Proposed Screening Method
4.3 FDR Control via Regional QR-Knockoff
4.4 Simulation Results
4.5 Real Data Analysis
4.6 Conclusion and Discussion
4.7 Proofs
CHAPTER 5 FUTURE STUDIES
BIBLIOGRAPHY

CHAPTER 1 OVERVIEW

This dissertation presents a comprehensive exploration of advanced statistical methodologies for high-dimensional data analysis, addressing critical challenges in handling complex and diverse data structures. The research is structured into three interconnected chapters, each building upon and extending state-of-the-art techniques to enhance the accuracy, robustness, and applicability of data analysis in scenarios where traditional models prove inadequate.
The second chapter introduces an innovative hybrid modeling approach, integrating parametric and nonparametric regression techniques within high-dimensional partial linear models. This methodology synthesizes the strengths of least squares regression and nonparametric methods, offering a sophisticated solution for complex biological data, particularly in gene expression studies. The chapter leverages penalization techniques such as LASSO and the local adaptivity of trend filtering, establishing a robust framework for high-dimensional heterogeneous data. Rigorous theoretical proofs and convergence rate results for the estimators underscore the statistical validity of this approach.

The third chapter extends the research paradigm to partial linear quantile regression, incorporating trend filtering to address high-dimensional sparse heterogeneous data. This integration combines the local adaptivity and computational efficiency of trend filtering with the quantile-adaptive, sparsity-inducing properties of the quantile LASSO. The chapter provides a more nuanced understanding of covariate-response relationships by estimating conditional quantiles at various levels. Theoretical contributions include comprehensive proofs and convergence rates for the estimators, distinguishing this work from previous research in the field. Additionally, this chapter includes a real data analysis utilizing the Environment and Genetics in Lung Cancer Etiology (EAGLE) study, a large-scale, population-based case-control study. This practical application demonstrates the methodology's efficacy in analyzing complex, high-dimensional epidemiological data, particularly in the context of lung cancer research, thereby bridging the gap between theoretical advancements and real-world applicability.

The fourth chapter addresses the unique challenges posed by ultra-high dimensional data, where the number of covariates (p) grows exponentially relative to the number of observations (n). The research employs regional quantile regression, modeling quantile regression over an interval of quantile levels to enhance the stability and robustness of variable selection. A novel knockoff procedure is proposed, specifically designed for regional quantile regression, guaranteeing exact False Discovery Rate (FDR) control with finite samples. This innovative approach ensures robust variable selection across multiple quantile levels, significantly advancing the field of ultra-high dimensional data analysis with quantile regression.

Collectively, this dissertation presents a cohesive and progressive framework for tackling the multifaceted challenges of heterogeneous high-dimensional data analysis, offering both theoretical advancements and practical methodologies for researchers and practitioners in fields ranging from bioinformatics to econometrics. Throughout the dissertation, notation may vary between chapters to best suit the specific methodologies and concepts presented. To ensure clarity and consistency, each chapter provides clear definitions of its notation in its introductory sections, facilitating reader comprehension and maintaining the precision necessary for advanced statistical discourse.

CHAPTER 2 HIGH-DIMENSIONAL PARTIAL LINEAR MODEL WITH TREND FILTERING

2.1 Introduction

With the advent of new technologies, we are now dealing with a wide variety of data types. This diversity necessitates methodologies that go beyond the traditional linear model.
In this research, we focus on a scenario where the response variable y is influenced by two predictors, x and z. Specifically, the relationship between y and x is assumed to be linear, while the relationship between y and z is non-linear or unknown. This modeling approach is particularly prevalent in biological studies, such as those involving gene expression data, where nonlinear factors like age or other related variables play a significant role.

To address this type of data, it is essential to use techniques that separately account for the linear and non-linear components. Neglecting the non-linear part and using standard least squares regression (parametric) can lead to incorrect inferences. Conversely, applying non-linear regression (nonparametric) to both components results in an overly rough estimation and can also suffer from the curse of dimensionality when the data are high-dimensional. The partial linear model resolves this issue by incorporating both parametric and nonparametric regression simultaneously. This semiparametric approach has been extensively studied in low-dimensional settings [75, 23, 80], where the number of covariates for the linear part is smaller than the number of observations or fixed as a constant. However, these models are not directly applicable to high-dimensional data where the number of covariates significantly exceeds the number of observations.

For high-dimensional data involving only x, substantial work has been done using penalization techniques such as LASSO (Least Absolute Shrinkage and Selection Operator) by [67], SCAD (Smoothly Clipped Absolute Deviation) by [16], and Elastic Net by [90]. The LASSO, in particular, has been widely studied and has shown remarkable results in sparse contexts [8]. Recent advances have extended these approaches to high-dimensional semiparametric models based on the partial linear model framework [46, 38, 89, 84]. These models combine penalized techniques with classical nonparametric methods, such as smoothing splines, to achieve robust and interpretable results.

In recent years, there has been substantial interest and progress in the nonparametric method known as trend filtering, as proposed by [66, 31, 69]. [69] describes trend filtering as a discrete analog of the locally adaptive regression splines introduced by [44]. Locally adaptive regression splines solve the following optimization problem for n observations:
\[
\min_{f \in \mathcal{F}_k} \frac{1}{2} \sum_{i=1}^{n} \big(y_i - f(z_i)\big)^2 + \lambda \cdot TV(f^{(k)}). \tag{2.1}
\]
Here, \(\mathcal{F}_k\) denotes the function space defined as
\[
\mathcal{F}_k = \big\{ f : [0,1] \to \mathbb{R} : f \text{ is } k \text{ times weakly differentiable and } TV(f^{(k)}) < \infty \big\},
\]
where TV(f^(k)) denotes the total variation of the kth derivative of f. The discrete analog of (2.1) using trend filtering is
\[
\min_{\theta \in \mathbb{R}^n} \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \big\| D^{(z,k+1)} \theta \big\|_1, \tag{2.2}
\]
where bold symbols denote vectors, ∥·∥_1 denotes the ℓ_1-norm, and D^(z,k+1) is the discrete difference operator of order k + 1 based on z. Equation (2.2) is referred to as the univariate kth order trend filtering problem. The details of these notations and definitions are elaborated on in Section 2.2. Following the proofs of equivalence between (2.1) and (2.2) given by [69], trend filtering estimates combine the benefits of locally adaptive regression splines with the advantages of the banded difference operator structure in (2.2).
This dual benefit provides local adaptivity and computational efficiency, making trend filtering favorable compared to other nonparametric methods such as smoothing splines and B-spline regression, especially when the smoothness of the underlying function varies over its domain.

Trend filtering has been successfully applied to various statistical fields, including graphical models [78], additive models [56], quantile regression [41], and spatiotemporal models [49]. However, its application within the partial linear model framework has not been explored to date. Our contribution lies in integrating trend filtering with the partial linear model and utilizing the LASSO to effectively address high-dimensional data. This approach leverages the local adaptivity and computational efficiency of trend filtering along with the sparsity-inducing properties of the LASSO, providing significant advantages when dealing with complex and heterogeneous data. Furthermore, our work provides rigorous theoretical details and establishes convergence rates for the estimators.

The rest of the chapter is organized as follows. Section 2.2 presents the preliminaries, including assumptions and model descriptions. Section 2.3 provides the main theoretical results and their interpretations. Section 2.4 details the simulation studies and computational aspects. Section 2.5 discusses our results, and Section 2.6 includes all the proofs.

2.1.1 Notations

Before delving into the model description and assumptions, we first clarify the notation used throughout this chapter. We use bold capital letters (e.g., Σ) to denote matrices and bold small letters (e.g., x, y) to denote vectors. For a = (a_1, . . . , a_p) ∈ R^p, denote ∥a∥_q = (Σ_{i=1}^p |a_i|^q)^{1/q} for q ∈ [1, ∞) and ∥a∥_∞ = max_{1≤i≤p} |a_i|. For two vectors a, b ∈ R^n, we write ∥a∥²_n = (1/n) a′a and ⟨a, b⟩_n = (1/n) a′b. Given a square matrix A = (a_{ij}) ∈ R^{p×p}, λ_max(A) and λ_min(A) represent its largest and smallest eigenvalues, respectively. For a general matrix A = (a_{ij}) ∈ R^{p×q}, ∥A∥_2 denotes its spectral norm, ∥A∥_max = max_{ij} |a_{ij}|, and ∥A∥_F = (Σ_{i,j} a²_{ij})^{1/2}. For a, b ∈ R, a ∧ b = min(a, b) and a ∨ b = max(a, b). For a set A, 1_A(·) is the usual indicator function, and |A| denotes its cardinality. Moreover, a_n ≲ b_n (a_n ≳ b_n) means there exists some constant C > 0 such that a_n ≤ C b_n (a_n ≥ C b_n) for all n; thus a_n ≲ b_n (a_n ≳ b_n) is equivalent to a_n = O(b_n) (a_n = Ω(b_n)); a_n ≍ b_n if and only if a_n ≲ b_n and a_n ≳ b_n; a_n ≫ b_n means b_n = o(a_n). We put the subscript p on O and o for random variables. For i.i.d. samples {w_1, . . . , w_n} from a distribution Q supported on some space W, denote by Q_n the associated empirical distribution. The L_2(Q) and L_2(Q_n) norms for functions f : W → R are ∥f∥²_{L_2(Q)} = ∫_W f²(w) dQ(w) and ∥f∥²_{L_2(Q_n)} = (1/n) Σ_{i=1}^n f²(w_i). For simplicity we abbreviate subscripts and write ∥f∥ and ∥f∥_n for ∥f∥_{L_2(Q)} and ∥f∥_{L_2(Q_n)}, respectively, whenever Q is the underlying distribution of the covariates.

2.2 Model Description and Assumptions

In this section, we present a detailed introduction and description of the model under study. Following an initial overview of the model, we delineate additional conditions necessary for our proofs, accompanied by brief explanatory comments for each condition.

2.2.1 Model Description and Estimation Method

We consider a partial linear regression model as below:
\[
y = x'\beta_0 + g_0(z) + \varepsilon,
\]
where ε is independent of (x, z) ∈ R^{p+1}, β_0 ∈ R^p has support S = {j : β_{0j} ≠ 0} with |S| ≤ s, and g_0 : [0, 1] → R is a nonparametric function.
In general, z ∈ [0, 1] may not hold, but it can be achieved by a simple affine transformation, so we assume it for technical simplicity. We let {(y_i, x_i, z_i)}_{i=1}^n be n i.i.d. observations of (y, x, z), and without loss of generality we assume that these observations are re-ordered based on z_i, i = 1, . . . , n. Moreover, as discussed in the introduction, we focus on functions g_0 whose kth derivative has total variation bounded by a constant, that is, TV(g_0^{(k)}) ≤ L_g with L_g > 0. We consider the case where p ≫ n, indicating a high-dimensional linear model with an unknown function. This model assumption is prevalent in certain biological studies, such as those involving gene expression data, where nonlinear factors such as age or other related variables are present.

Expanding upon the univariate trend filtering of [31] and [69], for a given integer k ≥ 0, the kth order partial linear trend filtering (PLTF) estimate for the partial linear model is defined as
\[
(\hat{\beta}, \hat{\theta}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \theta \in \mathbb{R}^n} \frac{1}{2} \| y - X\beta - \theta \|_n^2 + \lambda \|\beta\|_1 + \gamma \| D^{(z,k+1)} \theta \|_1, \tag{2.3}
\]
where D^{(z,k+1)} ∈ R^{(n−k−1)×n} is the discrete derivative operator of order k + 1. When k = 0,
\[
D^{(z,1)} =
\begin{pmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix}
\in \mathbb{R}^{(n-1) \times n}. \tag{2.4}
\]
For k ≥ 1, the derivative operator is defined recursively, that is,
\[
D^{(z,k+1)} = D^{(z,1)} \cdot \mathrm{diag}\Big( \frac{k}{z_{k+1} - z_1}, \cdots, \frac{k}{z_n - z_{n-k}} \Big) \cdot D^{(z,k)},
\]
where in the above equation D^{(z,1)} takes the form of (2.4) with dimension (n − k − 1) × (n − k). When k = 0, this estimate is equivalent to the one-dimensional fused lasso [68], also known as one-dimensional total variation denoising in signal processing [55].

Before further discussing estimation methods, we briefly introduce the falling factorial basis. [69] and [77] demonstrate a connection between univariate trend filtering and falling factorial basis functions, showing that the trend filtering problem can be interpreted as a sparse basis regression problem using these functions. This finding also applies to our case since our model is a generalization of the univariate trend filtering model. For given knot points t_1 < · · · < t_n ∈ R, the kth order falling factorial basis functions are defined as
\[
q_i(t) = \prod_{l=1}^{i-1} (t - t_l), \quad i = 1, \ldots, k+1,
\]
\[
q_{i+k+1}(t) = \prod_{l=1}^{k} (t - t_{i+l}) \cdot 1\{ t > t_{i+k} \}, \quad i = 1, \ldots, n-k-1,
\]
where we write q_1(t) = 1 as a convention. With these definitions, the estimation in (2.3) is equivalent to
\[
(\hat{\beta}, \hat{\alpha}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \alpha \in \mathbb{R}^n} \frac{1}{2} \Big\| y - X\beta - \sum_{l=1}^{n} \alpha_l q_l(z) \Big\|_n^2 + \lambda \|\beta\|_1 + \gamma\, k! \sum_{l=k+2}^{n} |\alpha_l|, \tag{2.5}
\]
which is a lasso form. Indeed, an equivalent alternative expression of (2.5) is
\[
(\hat{\beta}, \hat{g}) \in \arg\min_{\beta \in \mathbb{R}^p,\, g \in \mathcal{H}^k_n} \frac{1}{2} \| y - X\beta - g(z) \|_n^2 + \lambda \|\beta\|_1 + \gamma\, TV(g^{(k)}), \tag{2.6}
\]
where \(\mathcal{H}^k_n\) is the span of the kth order falling factorial basis functions with knots z_{k+1} ≤ z_{k+2} ≤ · · · ≤ z_{n−1}. Throughout the chapter, we write \(\mathcal{H}_n\) for \(\mathcal{H}^k_n\) for notational simplicity. We omit the proofs of the equivalences among (2.3), (2.5), and (2.6), since they follow directly from Lemma 1 in [69] for (2.3) and (2.5), and from Lemma 2 in [69] and Lemma 1 in [56] for (2.5) and (2.6). Based on these equivalences, the estimate from (2.3) is the same as that from (2.6). By constructing an appropriate linear combination of falling factorial functions, the form in (2.5) enables interpolation and extrapolation for partial linear trend filtering. Consequently, this formulation enhances the practical applicability of partial linear trend filtering; a small illustrative sketch of these building blocks follows.
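To make the definitions above concrete, here is a minimal R sketch, not taken from the dissertation or its plmR package, that constructs the difference operator D^(z,k+1) by the recursion above and evaluates the falling factorial basis. The function names (diff_op1, diff_op, falling_factorial_basis) are ours and chosen only for illustration.

```r
# First-order difference operator of dimension (m - 1) x m, as in (2.4)
diff_op1 <- function(m) {
  D <- matrix(0, m - 1, m)
  D[cbind(1:(m - 1), 1:(m - 1))] <- -1
  D[cbind(1:(m - 1), 2:m)] <- 1
  D
}

# Recursive construction of D^(z,k+1) for inputs z sorted increasingly
diff_op <- function(z, k) {
  n <- length(z)
  D <- diff_op1(n)                            # D^(z,1), (n-1) x n
  if (k == 0) return(D)
  for (j in 1:k) {
    # row weights j / (z_{i+j} - z_i), i = 1, ..., n - j
    w <- j / (z[(j + 1):n] - z[1:(n - j)])
    D <- diff_op1(n - j) %*% diag(w) %*% D    # now (n - j - 1) x n
  }
  D
}

# kth order falling factorial basis q_1, ..., q_n (k >= 1 assumed here),
# evaluated at points t with knots z
falling_factorial_basis <- function(t, z, k) {
  n <- length(z)
  Q <- matrix(1, length(t), n)                # q_1(t) = 1 by convention
  for (i in 2:(k + 1)) {                      # pure polynomial terms
    for (l in 1:(i - 1)) Q[, i] <- Q[, i] * (t - z[l])
  }
  for (i in 1:(n - k - 1)) {                  # truncated, knot-based terms
    col <- i + k + 1
    for (l in 1:k) Q[, col] <- Q[, col] * (t - z[i + l])
    Q[, col] <- Q[, col] * (t > z[i + k])
  }
  Q
}

# Example: the penalty gamma * ||D^(z,k+1) theta||_1 appearing in (2.3)
set.seed(1)
z     <- sort(runif(20))
theta <- sin(2 * pi * z)
D     <- diff_op(z, k = 2)
pen   <- sum(abs(D %*% theta))
B     <- falling_factorial_basis(seq(0, 1, length.out = 50), z, k = 2)
```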
Our estimation approach in (2.3) is a doubly penalized least squares estimator, similar to those proposed by [46] and [84]. This approach involves two penalties: the first controls the sparsity of β̂, while the second regulates the variability of the function ĝ. The primary distinction between our approach and theirs lies in the estimation method for the function g. While both of their methods employ the smoothing spline technique, our method utilizes trend filtering. Given the model structure, where the estimation of g significantly influences that of β, accurately estimating g is critical. As discussed in the introduction, prior literature demonstrates that when the smoothness of g_0 is heterogeneous, the trend filtering method outperforms the smoothing spline method, in terms of local smoothness as well as estimation accuracy. We show the corresponding simulation results later in Section 2.4.

We now introduce the conditions on our model in (2.6) that are needed for the proofs of the lemmas and theorems throughout.

Condition 1 (Design Condition). The covariate x = (x_1, . . . , x_p) has sub-Gaussian coordinates: max_{1≤j≤p} ∥x_j∥_{ψ2} ≤ K_x, where ∥·∥_{ψ2} denotes the sub-Gaussian norm.

Condition 2 (Sub-Gaussian Condition). The noise ε is sub-Gaussian: E(ε) = 0, Var(ε) = σ², ∥ε∥_{ψ2} ≤ K_ε σ.

Conditions 1 and 2 are more general than those in [46], where the errors are assumed to be standard Gaussian and the maximum value of the design matrix is assumed to be bounded by a positive constant.

Condition 3 (Bounded Input Density). z has a continuous distribution supported on [0, 1]. Its density is bounded below by a constant ℓ_z > 0.

Condition 2 is the same as in [69], and Condition 3 is the same as in [56]. These two conditions are common in the trend filtering literature.

Condition 4 (Eigenvalue Condition). Define h(z) = (h_1(z), . . . , h_p(z)) = E(x | z) and x̃ = x − h(z). Assume λ_min(E x̃x̃′) ≥ Λ_min > 0 and λ_max(E h(z)h(z)′) ≤ Λ_max < ∞.

Condition 4 is common in the semiparametric literature [83, 46, 84]. This condition ensures that there is enough information in the data to identify the parameters in the linear part.

Condition 5. max_{1≤j≤p} TV(h_j^{(k)}) ≤ L_h.

Condition 6. (s + log p) s log p / n = o(1) and p → ∞ as n → ∞.

Condition 5 is similar to Condition 2.6 in [46] and Condition A.5 in [84]; note that L_h = 0 when z and x are independent. This condition is used in the proof of Theorem 2. Condition 6 is a technical condition used in the proofs of Theorems 1 and 2.

2.3 Main Theoretical Results

In this section, we present our main results, which consist of two parts: the result for g and the result for β. We begin with the convergence rate result for g.

Theorem 1. Assume Conditions 1-4 and 6. Choose λ = c_1 √(log p / n) and γ = c_2 (s log p / n + n^{−(2k+2)/(2k+3)}) with large enough constants c_1, c_2 > 0. Then, there exist constants c_3, c_4, n_0 > 0 such that any solution ĝ in (2.6) satisfies
\[
\| \hat{g}(z) - g_0(z) \|^2 \le c_3 \Big( \frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}} \Big), \qquad
\| \hat{g} - g_0 \|_n^2 \le c_3 \Big( \frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}} \Big)
\]
with probability at least 1 − p^{−c_4} − n^{−c_4}, as long as n ≥ n_0. The constants c_1, c_2, c_3, c_4, n_0 may depend on k, L_g, L_h, K_x, K_ε, ℓ_z, Λ_min, Λ_max, σ. A brief numerical sketch of these tuning scalings is given below.
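As a purely illustrative aside (not part of the dissertation), the following R lines evaluate the two terms in the Theorem 1 bound and the corresponding tuning scalings for example problem sizes, with the unspecified constants c_1, c_2 simply set to one.

```r
# Illustrative only: the two terms in the Theorem 1 bound and the tuning scalings,
# with the theorem's unspecified constants c1, c2 set to 1
rate_terms <- function(n, p, s, k) {
  c(lambda   = sqrt(log(p) / n),                        # lambda ~ c1 * sqrt(log p / n)
    gamma    = s * log(p) / n + n^(-(2 * k + 2) / (2 * k + 3)),
    linear   = s * log(p) / n,                          # s log p / n
    nonparam = n^(-(2 * k + 2) / (2 * k + 3)))          # n^{-(2k+2)/(2k+3)}
}
round(rate_terms(n = 500, p = 1000, s = 4, k = 2), 4)
# whichever of `linear` and `nonparam` is larger governs the bound on g-hat
```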
Theorem 1 implies that the rate for ĝ depends on two terms: s log p / n and n^{−(2k+2)/(2k+3)}. The first term indicates that the convergence rate of ĝ indeed depends on the convergence rate of β̂, which is discussed in Theorem 2. When the rate for the linear part is faster than the second term, that is, when s log p / n = o(n^{−(2k+2)/(2k+3)}), the convergence rate is governed by the second term, which implies that ∥ĝ(z) − g_0(z)∥² = O_p(n^{−(2k+2)/(2k+3)}) and ∥ĝ − g_0∥²_n = O_p(n^{−(2k+2)/(2k+3)}). This convergence rate is the same as the rate achieved by [46], and it also matches the minimax convergence rate achieved by [84]; both were shown in the Sobolev function space. In other situations, when the function g is sufficiently smooth or p is sufficiently large, the rate is governed by the first term. The proof of Theorem 1 can be found in Section 2.6.

We proceed to the convergence rate results for β.

Theorem 2. Assume Conditions 1-6. With the same choice of λ and γ as in Theorem 1, any solution β̂ in (2.6) satisfies
\[
\| \hat{\beta} - \beta_0 \|_2^2 \le c_3 \frac{s \log p}{n}, \qquad
\| x'(\hat{\beta} - \beta_0) \|^2 \le c_3 \frac{s \log p}{n}, \qquad
\| X(\hat{\beta} - \beta_0) \|_n^2 \le c_3 \frac{s \log p}{n}
\]
with probability at least 1 − p^{−c_4} − n^{−c_4}, as long as n ≥ n_0.

Theorem 2 implies two things about the estimator β̂: (1) an estimation result and (2) a prediction result. For the estimation rate and the prediction rates, both in-sample and out-of-sample, the estimator achieves the oracle rate, which is the same as in the standard lasso case where the function g is known. This rate coincides with results in classical high-dimensional research [81, 54, 74] as well as with the work based on smoothing splines [46, 84]. Note that the smoothness of g has no effect on the results for β, whereas the results for g are affected by the estimation of β. This one-way relationship is due to the orthogonal decomposition that we use in the proofs. The proof of Theorem 2 can be found in Section 2.6.

2.4 Simulation Studies

Through empirical experiments, we evaluate the performance of partial linear trend filtering (PLTF) in comparison to partial linear smoothing splines (PLSS), as presented by [46] and [84]. We also tested a B-spline basis for the nonparametric component, even though it has not been used for this model in the literature; its performance was significantly inferior to the other methods, so those results are not included here.

2.4.1 Computational Details

We use a block coordinate descent (BCD) algorithm, also known as the backfitting approach, for our problem in (2.3). Our approach involves two blocks: the first pertains to β and the second to θ. The algorithm iterates over these blocks, updating the estimate of each component at every step using the lasso for β and the univariate trend filtering algorithm for θ. The detailed steps of our algorithm are outlined in Algorithm 2.1, and an illustrative implementation sketch is given after it.

Algorithm 2.1 A BCD algorithm for high-dimensional partial linear trend filtering
Data: {y_i, x_i, z_i}, i = 1, . . . , n. Fixed (tuning) parameters: λ, γ, k.
1. Set t = 0 and initialize θ_i^{(0)} = 0, i = 1, . . . , n.
2. For the (t + 1)-th iteration, where t = 0, 1, 2, . . .:
   a) Block 1: Let y_i^{(t)*} = y_i − θ_i^{(t)}, and update β by fitting the lasso:
      β^{(t+1)} = argmin_β ∥y^{(t)*} − Xβ∥²_n + λ∥β∥_1.
   b) Block 2: Let y_i^{(t)**} = y_i − x_i′β^{(t+1)}, and update θ by fitting the univariate trend filtering:
      θ^{(t+1)} = argmin_θ ∥y^{(t)**} − θ∥²_n + γ∥D^{(z,k+1)}θ∥_1.
   c) If both β and θ have converged, stop the iteration; otherwise, continue until the predefined maximum number of iterations is reached.
3. Return β̂ and θ̂, the parameters at convergence.

The convergence of the BCD algorithm has been rigorously established by [71]: for a convex criterion that decomposes into a smooth term plus separable terms, the iterates of the BCD algorithm converge to an optimal solution. Therefore, we do not prove the convergence of the algorithm here, since our convergence result can be derived directly from this.
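To make Algorithm 2.1 concrete, the following is a minimal R sketch of the backfitting loop, assuming the glmnet package for the lasso block and a user-supplied univariate trend filtering solver tf_fit; for instance, glmgen::trendfilter could be wrapped to play that role. This is an illustration under those assumptions, not the authors' plmR implementation referenced in the next subsection.

```r
## Minimal illustrative sketch of Algorithm 2.1 (not the authors' plmR package).
## `tf_fit(z, y, k, gamma)` is a placeholder for any univariate trend filtering
## solver that returns fitted values theta at the observed z's.
library(glmnet)

pltf_bcd <- function(X, y, z, lambda, gamma, k = 2, tf_fit,
                     max_iter = 100, tol = 1e-6) {
  ord <- order(z)                           # re-order observations by z
  X <- X[ord, , drop = FALSE]; y <- y[ord]; z <- z[ord]
  n <- length(y)
  theta <- rep(0, n)                        # initialize theta^(0) = 0
  beta  <- rep(0, ncol(X))
  for (t in seq_len(max_iter)) {
    beta_old <- beta; theta_old <- theta
    # Block 1: lasso on the partial residual y - theta
    fit  <- glmnet(X, y - theta, alpha = 1, lambda = lambda,
                   intercept = FALSE, standardize = FALSE)
    beta <- as.numeric(coef(fit))[-1]       # drop the (zero) intercept
    # Block 2: univariate trend filtering on y - X %*% beta
    theta <- tf_fit(z, y - as.numeric(X %*% beta), k, gamma)
    # stop when both blocks have (numerically) converged
    if (max(abs(beta - beta_old)) < tol &&
        max(abs(theta - theta_old)) < tol) break
  }
  list(beta = beta, theta = theta, iters = t)
}
```

Note that glmnet's objective, (1/(2n))∥y − Xβ∥² + λ∥β∥_1, matches the scaling of the Block 1 subproblem, since (1/2)∥·∥²_n = (1/(2n))∥·∥².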
For the tuning parameters λ and γ, we use 5-fold cross-validation (CV) to find the tuned parameters that minimize the overall CV error. However, since there are two grids of tuning parameters, the search space is the product of the grid lengths for λ and γ, resulting in a multiplicative increase in computation time compared to an algorithm with a single tuning parameter. To address this, we utilize the warm start method [53] over our tuning parameter grids. Empirical evidence in that work shows that warm starting significantly reduces the number of iterations needed for convergence, thereby enhancing the speed of the algorithm.

Our algorithm is implemented using the R software [52]. For the lasso component, we utilize the R package glmnet [20], and for trend filtering, we employ the R package glmgen [1]. However, there is no publicly available software for the partial linear smoothing spline method, so we also use the BCD algorithm to compute the smoothing spline model, with the R function stats::smooth.spline used in the second block. The implemented BCD algorithms for the smoothing spline and trend filtering methods, named plmR, are available at https://sangkyustat.github.io/plmR/ for public use and reproducibility.

2.4.2 Simulation Setting and Results

We now proceed to demonstrate our method through simulations. We first generate X̃ = (x̃_1, . . . , x̃_{p+1}) from N_{p+1}(0_{p+1}, Σ), where Σ = (σ_{jk}) with σ_{jk} = 0.5^{|j−k|} and j, k = 1, . . . , p + 1. Then we set z = Φ(x̃_{25}), x_j = x̃_j for j = 1, . . . , 24 and x_j = x̃_{j−1} for j = 26, . . . , p + 1. The true models we consider are as follows:
\[
\text{Model 1 (Smooth function):} \quad y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + \sin(2\pi z_i) + \epsilon_i,
\]
\[
\text{Model 2 (Less smooth function):} \quad y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + e^{3 z_i}\sin(6\pi z_i)/7 + \epsilon_i,
\]
\[
\text{Model 3 (Doppler-type function):} \quad y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + \sin(4/z_i) + \epsilon_i,
\]
for i = 1, . . . , n, where ϵ_i ∼ N(0, σ²_ϵ). The actual forms of these functions are displayed in Figure 2.1. The values of β_j, j = 1, . . . , 4 are (0.5, 1, 1, 1.5). Various values of σ²_ϵ are used to vary the signal-to-noise ratio (SNR) of the models. Similar to [46] and [69], we define the total signal-to-noise ratio (tSNR) as
\[
\mathrm{tSNR} = \sqrt{ \frac{ \| x'\beta_0 + g_0(z) \|^2 }{ \sigma_\epsilon^2 } },
\]
where the expectation in the numerator can be approximated by Monte Carlo; an illustrative sketch of this data-generating process appears after the list of error metrics below. We vary the tSNR from 4 to 16 on a logarithmic scale and then calculate the metrics for each model.

Figure 2.1 Configurations of the function g(z) for Models 1 through 3. The structures of the functions become increasingly locally heterogeneous from Model 1 (left) to Model 3 (right).

Specifically, we consider three different error metrics, including the in-sample prediction mean squared error (MSE) and the ℓ_2-norm squared error, as follows:
1. ∥Xβ̂ + ĝ(z) − (Xβ_0 + g_0(z))∥²_n : overall MSE considering both β̂ and ĝ.
2. ∥β̂ − β_0∥²_2 : ℓ_2-norm squared error for β̂.
3. ∥ĝ(z) − g_0(z)∥²_n : MSE for ĝ.
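The following R sketch gives our reading of the data-generating process described above; it is not the authors' simulation code. In particular, we assume the 25th latent coordinate is used only to form z and is dropped from the design matrix, and g0 below is the Model 3 (Doppler-type) choice.

```r
## Illustrative sketch of the simulation design (our reading, not the authors' code)
library(MASS)

gen_pltf_data <- function(n = 500, p = 1000, sigma_eps = 1,
                          g0 = function(z) sin(4 / z)) {
  Sigma <- 0.5^abs(outer(1:(p + 1), 1:(p + 1), "-"))    # sigma_jk = 0.5^|j - k|
  Xt <- mvrnorm(n, mu = rep(0, p + 1), Sigma = Sigma)   # latent Gaussian design
  z  <- pnorm(Xt[, 25])                                 # z = Phi(x_tilde_25), in [0, 1]
  X  <- Xt[, -25]                                       # remaining p covariates (assumption)
  beta0 <- rep(0, p)
  beta0[c(6, 12, 15, 20)] <- c(0.5, 1, 1, 1.5)          # sparse linear signal
  signal <- as.numeric(X %*% beta0) + g0(z)
  y <- signal + rnorm(n, sd = sigma_eps)
  list(y = y, X = X, z = z, beta0 = beta0, signal = signal)
}

# Monte Carlo approximation of E||x'beta0 + g0(z)||^2, used to back out sigma_eps
# for a target tSNR = sqrt(E||x'beta0 + g0(z)||^2 / sigma_eps^2)
set.seed(1)
big <- gen_pltf_data(n = 20000, p = 200, sigma_eps = 1)
signal_second_moment <- mean(big$signal^2)
target_tSNR <- 8
sigma_eps   <- sqrt(signal_second_moment) / target_tSNR
```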
The metrics are computed over 150 repetitions of randomly generated datasets for each tSNR value, and the median of each metric is reported as the final result. We consider p equal to 100 and 1000 for the low- and high-dimensional cases, respectively, with n fixed at 500. For each method, the algorithm is fitted to the training dataset and the metrics are then calculated on the in-sample testing dataset. We present results using both optimally tuned parameters and CV-tuned parameters. The parameters are finely tuned for a fair comparison across all models. We fix k = 2 for the trend filtering method and use cubic smoothing splines for the smoothing spline method, tuning only λ and γ, for two reasons: (1) varying these settings has minimal impact on the results, and (2) computational efficiency. The simulation results for each model are displayed in Figures 2.2, 2.3, and 2.4.

Figure 2.2 Model 1 error comparisons over a grid of tSNR values from 4 to 16. n is set to 500 for all cases, and the repetition number for the simulation is 150.

Figure 2.3 Model 2 error comparisons over a grid of tSNR values from 4 to 16. n is set to 500 for all cases, and the repetition number for the simulation is 150.

Figure 2.4 Model 3 error comparisons over a grid of tSNR values from 4 to 16. n is set to 500 for all cases, and the repetition number for the simulation is 150.

Figure 2.2 demonstrates that when g is a smooth function, the performance of the PLTF method is comparable to that of the PLSS method, with no significant differences observed. However, as shown in Figure 2.3, when the function becomes more heterogeneous, the difference between the two methods becomes more pronounced. Nevertheless, there is still no definitive superiority of one method over the other in terms of overall mean squared error (MSE). In Figure 2.4, the MSE of PLTF is significantly lower than that of PLSS in both the low- and high-dimensional cases. This disparity becomes more apparent as the tSNR increases. This result indicates that PLTF performs much better than PLSS when the underlying function is more locally heterogeneous.

These findings suggest that for smooth functions, our method performs as well as the well-known smoothing spline method. However, as the function becomes more heterogeneous, our method significantly outperforms the smoothing spline approach in both low- and high-dimensional scenarios, particularly when the tSNR is high. This result aligns with the findings of [56], which reported similar outcomes for the Doppler-type function we considered.

2.5 Conclusion and Discussion

In this research, we have explored the integration of trend filtering with the partial linear model, incorporating the LASSO to address high-dimensional data structures. Our approach leverages the local adaptivity and computational efficiency of trend filtering along with the sparsity-inducing properties of the LASSO, providing a robust framework for handling complex and heterogeneous data.

Our work contributes to the field by extending the application of trend filtering to semiparametric models, particularly the partial linear model, where it had not previously been explored. This integration allows for more precise modeling of high-dimensional data, maintaining both local adaptivity and computational efficiency.

The results and discussions provided in this work underscore the potential of our method to enhance the analysis of complex data structures in various statistical fields.
Future re- search could further explore the practical applications of this approach and extend it to other types of semiparametric models. For future studies regarding methodology, our work can be expanded to the additive model like [56], which is the expansion version of the univariate trend filtering. This expansion allows us to put multivariate variables as z not only one. In terms of sparsity and variable selection, our work can be modified by using other nonconvex models such as SCAD instead of LASSO. We expect that this will give us more advantages in terms of the accuracy with variable selection of the covariates. 2.6 Proofs Throughout the proofs, we use C1, C2, . . . to denote universal constants and use D1, D2, . . . , ... to denote constants that may only depend on the constants k, Lg, Lh, Kx, Kϵ, ℓz, Λmin, Λmax from Conditions 1-5. An explicit (though not opti- mal) dependence of {Dj, j = 1, 2, . . .} on the aforementioned constants can be tracked down. However, since it does not provide much more insight, we will often not present the explicit forms of {Dj, j = 1, 2, . . .}, and this will greatly help streamline the proofs. The constants {Cj, Dj, j = 1, 2, . . .} may vary from lines to lines. 2.6.1 Proof of Theorem 1 2.6.1.1 Roadmap of the proof For a given function f (x, z) = x′β + g(z) with β ∈ Rp, T V (g(k)) < ∞, introduce the functional τδ0,R(f ) = λ∥β∥1 + γT V (g(k)) 20δ0R + ∥x′β + g(z)∥, δ0 ∈ (0, 1), R > 0, (2.7) where for notational simplicity we have suppressed the dependence of τδ0,R(f ) on the tuning parameters λ, γ > 0. This functional will serve as a critical measure for β and g(z). Define 19 the following events: T1(δ0, R) = T2(δ0, R) = (cid:40) (cid:40) (cid:40) sup τδ0,R(f )≤R sup τδ0,R(f )≤R (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) , (cid:12) (cid:12) (cid:12)⟨ε, f (X, Z)⟩n (cid:41) , (cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) . 22 log n ℓzn T3 = max 1≤i≤n−1 (z(i+1) − z(i)) ≤ The general proof idea is largely inspired by [46] (see also [84]). We first use a localization technique based on convexity to show that under the event T1(δ0, R) ∩ T2(δ0, R) ∩ T3, the bound on ∥(cid:98)g −g0∥ in Theorem 1 (bound on ∥(cid:98)g −g0∥n is derived similarly) holds. This is done in Lemmas 1-2. Then in Lemmas 3 and 4 we show that the event T1(δ0, R) ∩ T2(δ0, R) ∩ T3 happens with high probability. Along the way we need to choose R, λ, γ, δ0 in (2.7) properly to meet conditions in Lemmas 1-4 and to achieve the desirable error rate in Theorem 1. We complete this step in Section 2.6.1.3. 2.6.1.2 Key lemmas Lemma 1. Assuming Condition 4 and δ0 ≤ 1 100, δ0R2 ≥ D1( log2 n n2 + γ + λ2s), then on T1(δ0, R) ∩ T2(δ0, R) ∩ T3, there exists ¯g ∈ Hn such that (i) T V (¯g(k)) ≤ ak ·T V (g(k) 0 ), ∥¯g−g0∥∞ ≤ 22bk log n ℓzn ·T V (g(k) 0 ), where ak, bk > 0 are constants only dependent on k. (ii) τδ0,R(x′((cid:98)β − β0) + (cid:98)g(z) − ¯g(z)) ≤ R. Proof. According to Lemma 16 in [70] (see also Lemma 13 in [56]), there exists ¯g ∈ Hn such that result (i) holds on T3. We use such a ¯g in the rest of the proof. Consider the convex combination Accordingly, define (cid:101)β = t(cid:98)β + (1 − t)β0, (cid:101)g = t(cid:98)g + (1 − t)¯g, t ∈ [0, 1]. (cid:98)f (x, z) = x′ (cid:98)β + (cid:98)g(z), ¯f (x, z) = x′β0 + ¯g(z), (cid:101)f (x, z) = t (cid:98)f + (1 − t) ¯f = x′ (cid:101)β + (cid:101)g(z). 
20 For the choice of t = R R+τδ0,R( (cid:98)f − ¯f ) , it is straightforward to verify that τδ0,R( (cid:101)f − ¯f ) = R · τδ0,R( (cid:98)f − ¯f ) R + τδ0,R( (cid:98)f − ¯f ) ≤ R. (2.8) Hence, to prove result (ii) it is sufficient to show τδ0,R( (cid:101)f − ¯f ) ≤ R inequality due to the convex problem (2.6), 2 . We start with the basic 1 2 ∥y − X(cid:101)β − (cid:101)g(Z)∥2 n + λ∥(cid:101)β∥1 + γT V ((cid:101)g(k)) ≤ 1 2 ∥y − Xβ0 − ¯g(Z)∥2 n + λ∥β0∥1 + γT V (¯g(k)), which can be simplified to 1 ∥X(β0 − (cid:101)β) + g0(Z) − (cid:101)g(Z)∥2 2 1 2 ∥g0(Z) − ¯g(Z)∥2 ≤ n + λ∥(cid:101)β∥1 + γT V ((cid:101)g(k)) n + ⟨ε, X((cid:101)β − β0) + (cid:101)g(Z) − ¯g(Z)⟩n + λ∥β0∥1 + γT V (¯g(k)). Using the triangle inequality for ∥ · ∥1 and T V (·), we can continue from the above to obtain 1 ∥X(β0 − (cid:101)β) + g0(Z) − (cid:101)g(Z)∥2 2 1 2 ∥g0(Z) − ¯g(Z)∥2 ≤ n + λ∥(cid:101)β − β0∥1 + γT V ((cid:101)g(k) − ¯g(k)) n + ⟨ε, X((cid:101)β − β0) + (cid:101)g(Z) − ¯g(Z)⟩n + 2λ∥(cid:101)βS − β0 S∥1 + 2γT V (¯g(k)). Observing from (2.8) that (cid:101)f − ¯f = x′((cid:101)β − β0) + (cid:101)g(z) − ¯g(z) belongs to the set {f : τδ0,R(f ) ≤ R}, on T1(δ0, R) ∩ T2(δ0, R) ∩ T3 we can proceed with (2.9) ∥ (cid:101)f − ¯f ∥2 + 4λ∥(cid:101)β − β0∥1 + 4γT V ((cid:101)g(k) − ¯g(k)) ≤ ∥X(β0 − (cid:101)β) + ¯g(Z) − (cid:101)g(Z)∥2 ≤ 2∥X(β0 − (cid:101)β) + g0(Z) − (cid:101)g(Z)∥2 n + δ0R2 + 4λ∥(cid:101)β − β0∥1 + 4γT V ((cid:101)g(k) − ¯g(k)) n + δ0R2 + 4λ∥(cid:101)β − β0∥1 n + 2∥¯g(Z) − g0(Z)∥2 + 4γT V ((cid:101)g(k) − ¯g(k)) ≤ 4∥g0(Z) − ¯g(Z)∥2 n + 4⟨ε, X((cid:101)β − β0) + (cid:101)g(Z) − ¯g(Z)⟩n + 8λ∥β0 S − (cid:101)βS∥1 + 8γT V (¯g(k)) + δ0R2 ≤ 5δ0R2 + 4b2 kL2 g 222 log2 n √ zn2 + 8akLgγ + 8λ ℓ2 s∥(cid:101)β − β0∥2, (2.10) 21 where the third inequality holds due to (2.9), and in the last inequality we have used result (i) and Cauchy-Schwarz inequality for ∥β0 S − (cid:101)βS∥1. Now Condition 4 implies that √ 8λ s∥(cid:101)β − β0∥2 ≤ 8λ √ ≤ 32λ2sΛ−1 min + 1 2 sΛ−1/2 min ∥(cid:101)x′((cid:101)β − β0)∥ ∥(cid:101)x′((cid:101)β − β0)∥2 ≤ 32λ2sΛ−1 min + 1 2 ∥ (cid:101)f − ¯f ∥2. Here, the second inequality follows ab ≤ 1 2b2, and the third inequality is due to the 2a2 + 1 orthogonal decomposition ∥x′β + g(z)∥2 = ∥(cid:101)x′β∥2 + ∥h(z)′β + g(z)∥2. This together with (2.10) yields ∥ (cid:101)f − ¯f ∥2 + 8λ∥(cid:101)β − β0∥1 + 8γT V ((cid:101)g(k) − ¯g(k)) 222 log2 n zn2 + 16akLgγ + 64λ2sΛ−1 ℓ2 ≤ 10δ0R2 + 8b2 kL2 g min ≤ 16δ0R2, where the last inequality holds under the assumed condition on δ0R2 (with proper choice of constant D1). The above bound further implies ∥ (cid:101)f − ¯f ∥ ≤ 4 λ∥(cid:101)β−β0∥1+γT V ((cid:101)g(k)−¯g(k)) 20δ0R 10R, leading to the bound τδ0,R( (cid:101)f − ¯f ) ≤ R 5R since δ0 ≤ 1 . 10 + 2R δ0R ≤ 2 5 = R , and ≤ 1 √ 100 2 Lemma 2. Under the same conditions of Lemma 1, then on T1(δ0, R) ∩ T2(δ0, R) ∩ T3, it holds that ∥(cid:98)g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + 22bkLgℓ−1 z log n n . Moreover, define a subevent of T1(δ0, R): T0(δ0, R, (cid:101)R) = (cid:40) sup τδ0,R(f )≤ (cid:101)R (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) with (cid:101)R = (2 + it holds that (cid:113) Λmax Λmin )R + 22bkLgℓ−1 z log n n + (ak+1)Lgγ 20δ0R . Then on T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3, ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:16) 1 + (cid:112) δ0 + (cid:114) Λmax Λmin (cid:17) R + 22bkLgℓ−1 z log n n . Proof. Part (ii) in Lemma 1 together with Part (i) of Lemma 5 implies that ∥(cid:98)g(z) − ¯g(z)∥ ≤ (1 + )R. 
As a result, combining it with Part (i) of Lemma 1 gives (cid:113) Λmax Λmin ∥(cid:98)g(z) − g0(z)∥ ≤ ∥(cid:98)g(z) − ¯g(z)∥ + ∥¯g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + 22bkLg log n ℓzn . 22 To prove the second result, we first use results (i)(ii) of Lemma 1 to bound γT V ((cid:98)g(k) − g(k) 0 ) 20δ0R γT V ((cid:98)g(k) − ¯g(k)) 20δ0R Combining the above with the first result on ∥(cid:98)g(z) − g0(z)∥, we see that γT V (¯g(k) − g(k) 0 ) 20δ0R ≤ R + ≤ + (ak + 1)Lgγ 20δ0R . τδ0,R((cid:98)g − g0) = γT V ((cid:98)g(k) − g(k) 0 ) 20δ0R + ∥(cid:98)g(z) − g0(z)∥ ≤ (cid:101)R. Hence, on T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3, we can use the bound on ∥(cid:98)g(z) − g0(z)∥ to obtain the bound on ∥(cid:98)g(z) − g0(z)∥n: ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:112)∥(cid:98)g(z) − g0(z)∥2 + δ0R2 ≤ ∥(cid:98)g(z) − g0(z)∥ + (cid:112) δ0R. Lemma 3. Assume Conditions 1, 3, 4 hold. Define KF = (1 ∨ Cz,k) (cid:16)20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R, (2.11) where Cz,k > 0 is the constant in Lemma 5. Suppose that (cid:114) log p n ≤ 1, δ0 log p n R2 λ2 ≤ D1, D2δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, D3(1 + K 2 F )R2 log R−1 ≤ nλ2, D4(λ2nR −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n. Then, P(T1(δ0, R)) ≥ 1 − 2p−10 − 8pe−D4(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ − nR−1) − 8pe 0 R2n δ2 D5K2 F . Further assume 1 2δ0R2 ≥ D1γ, R ≥ log n n where this constant D1 is the one from Lemma 1. It holds that P(T0(δ0, R, (cid:101)R)) ≥ 1 − 2p−10 − 8pe−D4(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) − 8pe − δ2 0 R2n D5K2 F . Proof. For a given function f (x, z) = x′β + g(z), it is clear that ∥f ∥2 n − ∥f ∥2 = 1 n n (cid:88) ((x′ iβ)2 − ∥x′β∥2) + 1 n n (cid:88) (g2(zi) − ∥g(z)∥2) i=1 (x′ iβg(zi) − Ex′βg(z)), i=1 2 n + n (cid:88) i=1 23 such that sup τδ0,R(f )≤R (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) (cid:12) ≤ sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) 1 n n (cid:88) ((x′ (cid:12) iβ)2 − ∥x′β∥2) (cid:12) (cid:12) + sup τδ0,R(f )≤R + sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) i=1 1 n 2 n n (cid:88) i=1 n (cid:88) i=1 (cid:12) (g2(zi) − ∥g(z)∥2) (cid:12) (cid:12) (x′ (cid:12) iβg(zi) − Ex′βg(z)) (cid:12) (cid:12) := I + II + III. We now bound the above three terms separately. According to Lemma 5 Part (i), the β in any f (x, z) = x′β + g(z) with τδ0,R(f ) ≤ R satisfies β ∈ B := (cid:110) β : ∥β∥1 ≤ 20δ0R2λ−1, ∥x′β∥ ≤ (2 + (cid:114) Λmax Λmin (cid:111) . )R (2.12) Therefore, I ≤ sup β∈B (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 ((x′ (cid:12) iβ)2 − ∥x′β∥2) (cid:12) (cid:12) ≤ max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:12) (xijxik − Exjxk) (cid:12) (cid:12) · sup β∈B ∥β∥2 1, (2.13) where in the last inequality we have used the fact |a′Aa| ≤ ∥a∥2 1 ·∥A∥max, ∀a ∈ Rp, A ∈ Rp×p. are independent, zero-mean, sub-exponential Condition 1 implies that {xijxik − Exjxk}n i=1 random variables with the sub-exponential norm ∥xijxik − Exjxk∥ψ1 ≤ C1K 2 Bernstein’s inequality in Theorem 3 together with a simple union bound to obtain that when . We then use x log p n ≤ 1, max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (xijxik − Exjxk) (cid:12) (cid:12) ≤ C2K 2 (cid:12) x (cid:114) log p n holds with probability at least 1 − 2p−10. Putting this result together with (2.13) enables us to conclude P(I ≤ δ0R2/3) ≥ 1 − 2p−10, as long as δ0 (cid:113) log p n Now we bound II. Using Lemma 5 we have II ≤ supg∈F R2 λ2 ≤ D1. 
(cid:12) (cid:12)∥g∥2 n − ∥g∥2(cid:12) (cid:12), where F := (cid:110) g : ∥g∥ ≤ RF , T V (g(k)) ≤ 20δ0R2γ−1, ∥g∥∞ ≤ KF RF = (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R, KF = (1 ∨ Cz,k) (cid:16)20δ0R γ + 1 + (cid:111) , (cid:114) Λmax Λmin (cid:17) R. (2.14) 24 We aim to apply Theorem 5 to bound supg∈F Theorem 5, we first calculate J0(t, F): (cid:12) (cid:12)∥g∥2 n − ∥g∥2(cid:12) (cid:12). Adopting the notation in J0(t, F) ≤ C3t (cid:115) (cid:90) 1 0 C4 (cid:16) ut 2KF (cid:17)− 1 k+1 du = C3 (cid:124) (cid:112) C4K 2 1 2k+2 F (cid:123)(cid:122) ¯C3 1 2k+2 (2k + 2) 2k + 1 (cid:125) 2k+1 2k+2 , ·t (2.15) where the first inequality is due to the entropy bound of Corollary 1 in [56]. As a result, H(u) = sup t≥0 (cid:16)2k + 1 4k + 4 = (cid:0)ut − (J −1 0 (t, F))2(cid:1) ≤ sup t≥0 (cid:0)ut − ¯C − 4k+4 2k+1 3 4k+4 2k+1 (cid:1) t (cid:17) 4k+4 2k+3 2k + 3 2k + 1 4k+4 ¯C 2k+3 3 u 4k+4 2k+3 = D2K 2 2k+3 F 4k+4 2k+3 . u We are ready to invoke Theorem 5. The condition R2 F ≥ H(4KF / n) is reduced to √ R2 ≥ D3K 2 F n −2k−2 2k+3 , (2.16) and under this condition it holds with probability at least 1 − e−t that (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) (cid:12) ≤ D4 · (cid:16) K sup g∈F 2k+1 2k+2 2k+3 2k+2 F R √ n + √ KF R √ n t + K 2 F t n (cid:17) . Given the above, choosing t = δ2 and assuming a slightly stronger condition R2 ≥ 0 R2n D5K2 F −2k−2 2k+3 compared to (2.16), it is then straightforward to verify that P(II ≤ −4k−4 2k+3 D6δ 0 K 2 F n δ0R2/3) ≥ 1 − e − 0 R2n δ2 F . D5K2 Next we bound III. We first use Hölder’s inequality and Lemma 5 to obtain III ≤ sup 2∥β∥1 · τδ0,R(f )≤R 40δ0R2 λ · ≤ sup g∈F ,1≤j≤p (cid:12) (cid:12) (cid:12) i=1 n 1 (cid:88) n i=1 n (cid:88) (cid:13) (cid:13) (cid:13) 1 n (cid:13) (xig(zi) − Exg(z)) (cid:13) (cid:13)∞ (cid:12) (xijg(zi) − Exjg(z)) (cid:12) (cid:12). We then aim to use Theorem 6 to bound supg∈F | 1 j ≤ p. The quantities QF , (cid:98)QF in Theorem 6 become QF = supg∈F i=1(xijg(zi) − Exjg(z))| for each 1 ≤ (cid:113)E(x2 j g2(z)), (cid:98)QF = ijg2(zi). Let (cid:98)RF = supg∈F ∥g∥n, Xj = (x1j, . . . , xnj), Z = (z1, . . . , zn), supg∈F and ϵ1, . . . , ϵn be a Rademacher sequence independent of Xj, Z. Applying Dudley’s entropy (cid:113) 1 n i=1 x2 (cid:80)n n (cid:80)n 25 integral gives (cid:16) E 1 n | sup g∈F ≤C5 (cid:98)QF√ n (cid:90) 1 0 n (cid:88) i=1 (cid:113) (cid:12) (cid:17) (cid:12) (cid:12)Xj, Z ϵixijg(zi)| ≤ C5 (cid:98)QF√ n (cid:90) 1 0 (cid:113) H(u (cid:98)QF , F, ∥ · ∥w n )du, H(u (cid:98)QF ∥Xj∥−1 ∞ , F, ∥ · ∥n)du ≤ J0(2 (cid:98)QF ∥Xj∥−1 ∞ , F), C5∥Xj∥∞ √ n 2C3 2k+1 2k+2 (cid:98)R F 1 2k+2 F ∥Xj∥∞, ≤D7n−1/2K 1 2k+2 F 2k+1 2k+2 (cid:98)Q F 1 2k+2 ∥Xj∥ ∞ ≤ D7n−1/2K (2.17) where J0(·, F) is from (2.15) and the second to last inequality is due to (2.15); ∥ · ∥w n denotes the weighted empirical norm ∥g∥w n = ∥Xj∥∞∥g∥n and (cid:98)QF ≤ ∥Xj∥∞ (cid:98)RF . Now applying the second result in Theorem 6 yields ijg2(zi) and it is clear that ∥g∥w i=1 x2 n ≤ (cid:113) 1 n (cid:80)n (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ C6∥Xj∥∞ (cid:16) D7n−1/2K 1 2k+2 F 2k+1 2k+2 (cid:98)R F + (cid:98)RF (cid:112)t/n (cid:17)(cid:12) (cid:12) (cid:12)Xj, Z (cid:17) ≤ e−t. (2.18) Define the events A1 = (cid:110) ∥Xj∥∞ ≤ C7Kx((cid:112)log n + √ t) (cid:111) , A2 = (cid:110) sup g∈F |∥g∥2 n − ∥g∥2| ≤ δ0R2/3 (cid:111) . Condition 1 together with standard union bound for sub-Gaussian tails gives P(Ac δ2 0 R2n D5K2 (with a proper choice of C7). 
The previous result on term II shows P(Ac − 1) ≤ e−t F , and it 2) ≤ e holds on A2 that (cid:98)RF ≤ (cid:114) sup g∈F |∥g∥2 n − ∥g∥2| + sup g∈F ∥g∥2 ≤ (cid:113) δ0/3 + (1 + (cid:112)Λmax/Λmin)2R. (2.19) As a result, we can further bound (2.18) on A1 ∩ A2 to arrive at (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ D8((cid:112)log n + √ (cid:16) n−1/2K 1 2k+2 F R 2k+1 2k+2 + R(cid:112)t/n (cid:17)(cid:12) (cid:17) (cid:12) (cid:12)Xj, Z · t) We integrate out the above conditional probability to obtain the unconditional one: (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ D8((cid:112)log n + √ t)(cid:0)n−1/2K 1 2k+2 F R 2k+1 2k+2 + R(cid:112)t/n(cid:1)(cid:17) 1A1∩A2 ≤ e−t ≤ e−t + P(Ac 1 ∪ Ac 2) ≤ 2e−t + e − δ2 0 R2n D5K2 F . 26 It is then direct to verify that choosing t = D9(λ2nR −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n gives (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ (cid:17) λ 480 ≤ 2e−D9(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) + e − 0 R2n δ2 D5K2 F . (2.20) We will invoke the symmetrization result of Theorem 6 to transfer the above bound to (cid:80)n i=1(xijg(zi) − Exjg(z))|. In order to do so, we need to show 8 × 4802Q2 F ≤ nλ2. supg∈F | 1 n Note that Q2 F = sup g∈F Ex2 j g2(z) = sup g∈F (cid:0)Ex2 j g2(z)1(|xj| ≤ s) + Ex2 ≤ s2 sup g∈F ∥g∥2 + sup g∈F ∥g∥2 ∞ · Ex2 j 1(|xj| > s) ≤ s2(cid:16) j g2(z)1(|xj| > s)(cid:1) (cid:114) Λmax Λmiin (cid:17)2 1 + R2 + K 2 F · Ex2 j 1(|xj| > s). Using the sub-Gaussian tail bound for xj to bound Ex2 that Q2 F ≤ D10(1 + K 2 condition D12(1 + K 2 F )R2 log R−1 with the choice s = D11 F )R2 log R−1 ≤ nλ2, Theorem 6 together with (2.20) leads to j 1(|xj| > s), it is not hard to obtain (cid:112)log R−1. Therefore, under the (cid:16) P 1 n | sup g∈F n (cid:88) (xijg(zi) − Exjg(z))| ≥ i=1 (cid:17) λ 120 ≤ 8e−D9(λ2nR We follow via a union bound, −2k−1 k+1 K −1 k+1 F √ ∧λ − nR−1) + 4e 0 R2n δ2 D5K2 F . P(III ≥ δ0R2/3) ≤ P (cid:16) sup g∈F ,1≤j≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:12) (xijg(zi) − Exjg(z)) (cid:12) (cid:12) ≥ (cid:17) λ 120 ≤ p · max 1≤j≤p (cid:16) P 1 n | sup g∈F n (cid:88) i=1 (xijg(zi) − Exjg(z))| ≥ (cid:17) λ 120 ≤ 8pe−D9(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) + 4pe − 0 R2n δ2 D5K2 F . Finally, we collect the results on I, II, III to bound P(T1(δ0, R)) via P(T1(δ0, R)) ≥ 1−P(I ≥ δ0R2/3) − P(II ≥ δ0R2/3) − P(III ≥ δ0R2/3). Regarding the bound for P(T0(δ0, R, (cid:101)R)), first note that 2R ≤ (cid:101)R ≤ D12R under the . Moreover, {f : τδ0,R(f ) ≤ (cid:101)R} ⊂ {f : τδ0, (cid:101)R(f ) ≤ (cid:101)R}. 2δ0R2 ≥ D1γ, R ≥ log n assumption 1 n Hence, P(T0(δ0, R, (cid:101)R)) ≥ P (cid:16) sup τδ0, (cid:101)R(f )≤ (cid:101)R 27 (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) ≤ D−2 (cid:12) 12 δ0 (cid:101)R2(cid:17) . Essentially, all the previous arguments in bounding P(T1(δ0, R)) carry over into P(T0(δ0, R, (cid:101)R)), up to constants Dj’s. We only need to update those constants in the condi- tions and results, in a way so that P(T1(δ0, R)) and P(T0(δ0, R, (cid:101)R)) share the same constants (e.g., taking a minimum or maximum among constants). Lemma 4. Assume Conditions 1-4 hold. (i) When R2 ≥ D1δ 0 −4k−4 2k+3 n 3, then −2k−2 2k+3 (K 2 F +K 2 2k+3 F σ 4k+4 2k+3 ) where KF is defined in (2.11) of Lemma (cid:16) P T2(δ0, R) (cid:17) ≥ 1 − 2pe−C1n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) − 2e−D2σ−2δ2 − 0 nR2 − e 0 R2n δ2 D3K2 F . (ii) As long as 11 log n ℓzn < 1, then P(cid:0)T3 (cid:1) ≥ 1 − 2ℓzn−10. Proof. 
Recalling the sets B in (2.12) and F in (2.14) from the proof of Lemma 3, we have sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12)⟨ε, f (X, Z)⟩n (cid:12) (cid:12) (cid:12) ≤ sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) 1 n ≤ sup β∈B ∥β∥1 · (cid:13) (cid:13) (cid:13) n (cid:88) εix′ iβ (cid:12) (cid:12) (cid:12) + sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 εig(zi) (cid:12) (cid:12) (cid:12) n (cid:88) i=1 1 n i=1 xiεi (cid:13) (cid:13) (cid:13)∞ + sup g∈F (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:12) (cid:12) εig(zi) (cid:12) ≤ 20δ0R2 λ (cid:13) (cid:13) (cid:13) 1 n n (cid:88) xiεi (cid:13) (cid:13) (cid:13)∞ + sup g∈F (cid:12) (cid:12) (cid:12) 1 n n (cid:88) (cid:12) (cid:12) (cid:12) := I + II. εig(zi) i=1 We first bound term I. Using Bernstein’s inequality in Theorem 3 we obtain that i=1 P(| 1 n (cid:80)n i=1 εixij| ≥ λ 40) ≤ 2e−C1n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) . This combined with a union bound gives P(I ≤ δ0R2/2) ≥ P (cid:16)(cid:13) (cid:13) (cid:13) 1 n n (cid:88) i=1 xiεi (cid:13) (cid:13) (cid:13)∞ ≤ (cid:17) λ 40 ≥ 1 − 2pe−C1n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) . To bound term II, let (cid:98)RF = supg∈F ∥g∥n. Conditional on z1, . . . , zn (which are independent from ε1, . . . , εn), Theorem 7 implies that the following holds with probability at least 1−2e−t: II ≤ C2Kεσ · (cid:32) (cid:98)RF√ n (cid:90) 1 0 (cid:113) H(u (cid:98)RF , F, ∥ · ∥n)du + (cid:98)RF (cid:33) , (cid:114) t n ≤ C2Kεσ · (cid:16) D1n−1/2K 1 2k+2 F 2k+1 2k+2 (cid:98)R F + (cid:98)RF (cid:17) (cid:112)t/n , 28 where the last inequality follows as in (2.17). Integrating over z1, . . . , zn, the above holds marginally with probability at least 1 − 2e−t as well. Moreover, from (2.19) we already know that, P( (cid:98)RF ≤ D3R) ≥ 1 − e − δ2 0 R2n D4K2 F , when R2 ≥ D2δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 . Combining these two results via a union bound gives us (cid:16) P II ≤ D5σ (cid:16) n−1/2K 1 2k+2 F R 2k+1 2k+2 + R(cid:112)t/n (cid:17)(cid:17) ≥ 1 − 2e−t − e − 0 R2n δ2 D4K2 F . Choosing t = D6σ−2δ2 0nR2 and assuming R2 ≥ D7δ −4k−4 2k+3 0 n direct to confirm −2k−2 2k+3 (K 2 F + K 2 2k+3 F 4k+4 2k+3 ), it is σ P(II ≤ δ0R2/2) ≥ 1 − 2e−D6σ−2δ2 − 0 nR2 − e 0 R2n δ2 D4K2 F . Combining the bounds for I, II gives the bound for P(T2(δ0, R)) ≥ P(I ≤ δ0R2/2, II ≤ δ0R2/2) ≥ 1 − P(I > δ0R2/2) − P(II > δ0R2/2). Finally, the bound on P(cid:0)T3 (cid:1) is directly taken from Lemma 5 in [77]. Lemma 5. Consider any f (x, z) = x′β + g(z) that satisfies τδ0,R(f ) ≤ R. (i) Under Condition 4 it holds that ∥β∥2 ≤ Λ−1/2 (cid:16) ∥g∥ ≤ 1 + min R, ∥β∥1 ≤ (cid:114)Λmax Λmin (cid:17) R, T V (g(k)) ≤ 20δ0R2 γ . 20δ0R2 λ , ∥x′β∥ ≤ (cid:16) 2 + (cid:114) Λmax Λmin (cid:17) R, (ii) Under additional Condition 3, it holds that ∥g∥∞ ≤ Cz,k (cid:16)20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R, where Cz,k > 0 is a constant only dependent on k and ℓz. Proof. The bound on ∥β∥1 and T V (g(k)) is clear from the definition of τδ0,R(f ) in (2.7). Using the orthogonal decomposition: ∥x′β + g(z)∥2 = ∥(cid:101)x′β∥2 + ∥h(z)′β + g(z)∥2 ≤ R2, 29 we have ∥β∥2 ≤ Λ−1/2 min ∥(cid:101)x′β∥ ≤ Λ−1/2 min R, and ∥g(z)∥ ≤ ∥h(z)′β + g(z)∥ + ∥h(z)′β∥ ≤ R + Λ1/2 max∥β∥2 ≤ (cid:16) 1 + (cid:114)Λmax Λmin (cid:17) R. Having the bound on ∥g∥, we can further obtain ∥x′β∥ ≤ ∥x′β + g(z)∥ + ∥g(z)∥ ≤ (2 + (cid:113) Λmax Λmin It remains to bound ∥g∥∞. Define g∗(z) = ∗ ) ≤ 1, ∥g∗∥ ≤ 1 , then T V (g(k) )R. g(z) ( 20δ0R γ +1+ (cid:113) Λmax Λmin )R due to the derived bounds on ∥g∥ and T V (g(k)). 
Hence it is sufficient to show that sup ∗ )≤1,∥g∗∥≤1 g∗:T V (g(k) ∥g∗∥∞ ≤ Cz,k. To prove the above, we first decompose g∗ = p∗ + q∗, where p∗ is a polynomial of degree k and q∗ is orthogonal to all polynomials of degree k (with respect to the L2 inner product (cid:82) 1 0 p(z)q(z)dz). Note that T V (q(k) ∗ ) = T V (g(k) ∥q∗∥∞ ≤ ck for a constant ck > 0 only depending on k. Now write p∗(z) = (cid:80)k ∗ ) ≤ 1. Then Lemma 5 in [56] implies that j=0 ajzj. We have (cid:112) a′V0a = ∥p∗∥ ≤ ∥g∗∥ + ∥q∗∥ ≤ 1 + ck, where a = (a0, . . . , ak) and V0 = E(v(z)v(z)′) with v(z) = (1, z, z2, . . . , zk). Thus, we obtain ∥p∗∥∞ ≤ ∥a∥1 ≤ √ k + 1∥a∥2 ≤ √ k + 1(1 + ck)λ−1/2 min (V0), which implies ∥g∗∥∞ ≤ ∥p∗∥∞ + ∥q∗∥∞ ≤ min (V0) + ck. Finally, we need to show λmin(V0) is bounded below by a constant only depending on k and ℓz. Under Condition k + 1(1 + ck)λ−1/2 √ 3 we have λmin(V0) = min ∥b∥2=1 E(b′v(z))2 ≥ ℓz · min ∥b∥2=1 E(b′v((cid:101)z))2, (cid:101)z ∼ U nif (0, 1) = ℓz · E(b′ ∗v((cid:101)z))2, where b∗ is the minimizer; E(b′ ∗v((cid:101)z))2 is a constant only depending on k, and it is positive because the roots of any polynomial (not identically zero) form a set of Lebesgue measure zero. 30 2.6.1.3 Completion of the proof We are in the position to complete the proof of Theorem 1. In fact, we shall prove Theorems 1 and 2 altogether. We will choose R2, λ, γ, δ0 such that the conditions in Lemmas 1-4 (for Theorem 1) and Lemmas 6-7 (for Theorem 2) are all satisfied. As a result, we will then conclude that the bounds on (cid:98)g(z) in Lemma 2 and bounds on (cid:98)β in Lemma 6 hold with probability at least P(T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3 ∩ S1 ∩ S2 ∩ S3). Towards this end, we first list the conditions to meet2: (i) Conditions from Lemmas 1-2: δ0 ≤ 1 n2 + γ + λ2s). 100, δ0R2 ≥ D1( log2 n (cid:113) log p n (ii) Conditions from Lemma 3: log p n ≤ 1, δ0 F )R2 log R−1 ≤ nλ2, D5(λ2nR R2 λ2 ≤ D2, D3δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n, 1 2δ0R2 ≥ D4(1 + K 2 D1γ, R ≥ log n n . Recall that KF is defined in (2.11). (iii) Conditions from Lemma 4: R2 ≥ D6δ 0 −4k−4 2k+3 n −2k−2 2k+3 (K 2 F + K 2 2k+3 F 4k+4 2k+3 ), σ 11 log n ℓzn < 1. (iv) Conditions from Lemma 6: δ0 ≤ 1 100, δ0R2 ≥ D1( log2 n n2 +γ +λ2s), λ ≥ D7(R2λ−1 log2 n n2 + γ). (v) Conditions from Lemma 7: log p n ≤ 1, δ0 F )R2 log R−1 ≤ nλ2, D11(λ2nR (cid:113) log p n R2 λ2 ≤ D8, D9δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n, log n n (1 + D10(1 + K 2 KF + σ + R2λ−1) + σ−1R2 ≤ D12λ. We choose δ0 = 1 100, λ = Dσ (cid:113) log p n , γ = δ0R2 2D1 , where D > 0 is any given constant that may only depend on k, Lg, Lh, Kx, Kϵ, ℓz, Λmin, Λmax. It remains to specify R2. We will consider R2 ≤ 1, which together with γ = δ0R2 2D1 implies KF = (1 ∨ Cz,k) (cid:16) 20δ0R γ + 1 + (cid:114)Λmax Λmin (cid:17) R ≤ (1 ∨ Cz,k) (cid:16) 40D1 + 1 + (cid:114) Λmax Λmin (cid:17) := D13. Define R2 0 = 2D1 δ0 (cid:16)log2 n n2 + λ2s (cid:17) + δ −4k−4 2k+3 0 −2k−2 2k+3 n (cid:16) (D3 + D9)D2 13 + D6(D2 13 + D 2 2k+3 13 σ 4k+4 2k+3 ) (cid:17) . 2With a bit abuse of notation, the subscripts for constants Dj’s we use here may be different from the ones in the lemmas. 31 For any R2 ∈ [R2 the conditions involving n 0, 1], it is straightforward to verify that Conditions from Lemmas 1-2 and all 2k+3 are satisfied. To meet other conditions, we will pick R2 = cR2 0 −2k−2 for any constant c ≥ 1. 
We verify the remaining conditions by showing that under the scaling = o(1), p → ∞ as n → ∞ (cid:0)so that λ ≍ (s+log p)s log p n o(λ)(cid:1): (cid:113) log p n , R2 ≍ γ ≍ s log p n + n −2k−2 2k+3 , R2 = (ii) Conditions from Lemma 3: log p n = o(1), (cid:113) log p n o(1) ≪ log p ≍ nλ2, λ √ nR−1 ≫ nR ≳ n √ R2 λ2 = O(R2/λ) = o(1), R2 log R−1 = . Moreover, 2k+3 ≫ log n n 4k+6 ≫ log n, R ≳ n −k−1 1 λ2nR −2k−1 k+1 ≳ n holds that 2k+1 2k+3 · log p ≫ log n when s log p n ≤ n −2k−2 2k+3 ; when s log p n > n −2k−2 2k+3 , it still λ2nR −2k−1 k+1 ≳ (cid:16) n s log p (cid:17) 2k+1 2k+2 · log p ≫ s 2k+1 2k+2 · log p ≳ n 2k+1 (2k+2)(2k+3) · (log p) 1 2k+2 ≫ log n. (iii) Conditions from Lemma 4: n log n = o(1). (iv) Conditions from Lemma 6: R2λ−1 log2 n n2 + γ ≍ R2 ≪ λ. (v) Conditions from Lemma 7: Most conditions have been verified for conditions in Lemma 3. It remains to see that log n n (1 + R2λ−1) + R2 ≲ log n n + R2 ≪ λ. Finally, given what we have obtained, it is straightforward to further evaluate the high probability (by choosing the constant D in λ and c in R2 large enough): P(T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3 ∩ S1 ∩ S2 ∩ S3) ≥ 1 − p−D14 − n−D15, when n is large enough under the scaling s log2 p n = o(1), p → ∞. 32 2.6.2 Proof of Theorem 2 2.6.2.1 Roadmap of the proof Recall the sets B in (2.12) and F in (2.14) from the proof of Lemma 3. Define the following set and events: (cid:110) (cid:101)F = S1 = g : ∥g∥∞ ≤ 22bk(Lg + Lh)ℓ−1 z n−1 log n (cid:13) (∆(Z) + (cid:101)X)′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ (cid:111) , λ, ∀β ∈ B, g ∈ F ∪ (cid:101)F, ∆j ∈ (cid:101)F, 1 ≤ j ≤ p (cid:111) , (cid:110)(cid:13) (cid:13) (cid:13) 1 n 1 8 (cid:110)(cid:12) (cid:12)∥ (cid:101)Xβ∥2 (cid:12) ≤ S2 = n − ∥(cid:101)x′β∥2(cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B (cid:111) , S3 = (cid:110)(cid:12) (cid:12)∥Xβ∥2 (cid:12) n − ∥x′β∥2(cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B (cid:111) . The proof of Theorem 1 already yields an error rate for (cid:98)β (see Lemmas 1 and 5), however, the rate is not optimal yet. We will base on this “slow rate" result to perform a finer analysis of (cid:98)β to obtain the “fast rate". This is again inspired by [46]. Specifically, we prove in Lemma 6 that the “fast rate" results on (cid:98)β in Theorem 2 hold by intersecting with another event S1 ∩ S2 ∩ S3. We then show in Lemma 7 that this additional event S1 ∩ S2 ∩ S3 also has large probability. 2.6.2.2 Key lemmas Lemma 6. Assume Conditions 3-5 and δ0 ≤ 1 100 , δ0R2 ≥ D1(n−2 log2 n + γ + λ2s), λ ≥ D2(R2λ−1n−2 log2 n + γ), where the constant D1 is the same as in Lemma 1. Then on T1(δ0, R)∩T2(δ0, R)∩T3 ∩S1 ∩S2, ∥(cid:98)β − β0∥2 2 ≤ 16Λ−2 minsλ2, ∥x′((cid:98)β − β0)∥2 ≤ 32(Λmin + Λmax)Λ−2 minsλ2. Further intersecting with the event S3, we have ∥X(cid:98)β − Xβ0∥2 n ≤ 16(3Λmin + 2Λmax)Λ−2 minsλ2. Proof. Recall j ) ≤ Lh where h(z) = (h1(z), . . . , hp(z)) = E(x|z). According to Lemma 16 in [70], there exist ¯hj ∈ Hn for all that under Condition 5, max1≤j≤p T V (h(k) 33 1 ≤ j ≤ p such that T V (¯h(k) j ) ≤ akLh, ∥¯hj − hj∥∞ ≤ bkLh · max (z(i+1) − z(i)), 1≤i≤n−1 where ak, bk > 0 are constants only depending on k, and Hn is the span of the kth order falling factorial basis functions with knots z(k+1) ≤ z(k+2) . . . ≤ z(n−1). Define ¯h(z) = (¯h1(z), . . . , ¯hp(z)), ˇg(z) = (cid:98)g(z) + ¯h′(z)((cid:98)β − β0). Note that ˇg ∈ Hn since (cid:98)g, ¯hj ∈ Hn. 
Therefore, we can write down a basic inequality, 1 2 ∥y − X(cid:98)β − (cid:98)g(Z)∥2 n + λ∥(cid:98)β∥1 + γT V ((cid:98)g(k)) ≤ 1 2 ∥y − Xβ0 − ˇg(Z)∥2 n + λ∥β0∥1 + γT V (ˇg(k)), which can be reformulated as 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β∥1 ≤ 1 2 ∥(h(Z) − ¯h(Z))((cid:98)β − β0)∥2 n + γ(T V (ˇg(k)) − T V ((cid:98)g(k)))+ (cid:10)g0(Z) − (cid:98)g(Z) − h(Z)((cid:98)β − β0) + ε, (h(Z) − ¯h(Z) + (cid:101)X)((cid:98)β − β0)(cid:11) n + λ∥β0∥1. Using triangle inequality for T V (·) and Hölder’s inequality, we can proceed to obtain 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β∥1 ≤ ∥(h(Z) − ¯h(Z))′(h(Z) − ¯h(Z))∥max · ∥(cid:98)β − β0∥2 1 1 2n + γakLh∥(cid:98)β − β0∥1 + (cid:13) (cid:13) 1 n (h(Z) − ¯h(Z) + (cid:101)X)′(g0(Z) − (cid:98)g(Z) − h(Z)((cid:98)β − β0) + ε)(cid:13) (cid:13)∞ · ∥(cid:98)β − β0∥1 + λ∥β0∥1. (2.21) To further simplify the above inequality, we observe the following: (i) 1 n ∥(h(Z) − ¯h(Z))′(h(Z) − ¯h(Z))∥max ≤ maxj ∥hj − ¯hj∥2 ∞ ≤ b2 kL2 h222ℓ−2 z n−2 log2 n on T3 (ii) Conditions in Lemma 6 imply the conditions in Lemma 1 and Lemma 5. Hence, combining Lemmas 1 and 5 shows (cid:98)β − β0 ∈ B, ¯g − (cid:98)g ∈ F, g0 − ¯g ∈ (cid:101)F. 34 Based on these two results and assuming 4840b2 kL2 hδ0ℓ−2 z R2λ−1n−2 log2 n + γakLh ≤ λ/4, we continue from (2.21) to achieve that on T1(δ0, R) ∩ T2(δ0, R) ∩ T3 ∩ S1, 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β∥1 ≤(4840b2 kL2 hδ0ℓ−2 z R2λ−1n−2 log2 n + γakLh + λ/4) · ∥(cid:98)β − β0∥1 + λ∥β0∥1 λ 2 ∥(cid:98)β − β0∥1 + λ∥β0∥1. ≤ A standard argument in the literature of Lasso (e.g., Lemma 6.3 in [8]) simplifies the above to n + λ∥(cid:98)β − β0∥1 ≤ 4λ∥(cid:98)βS − β0 S∥1 ≤ 4 sλ∥(cid:98)β − β0∥2 √ ∥ (cid:101)X((cid:98)β − β0)∥2 √ ≤4 sλΛ−1/2 min ∥(cid:101)x′((cid:98)β − β0)∥ ≤ 8sλ2Λ−1 min + 1 2 ∥(cid:101)x′((cid:98)β − β0)∥2 ≤8sλ2Λ−1 min + 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ 2 ∥(cid:98)β − β0∥1, where the second to last equality is due to ab ≤ a2/2 + b2/2, and the last inequality holds on S2. Rearranging the terms leads to ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β − β0∥1 ≤ 16Λ−1 minsλ2. (2.22) Now (2.22) combined with the condition from S2 implies ∥(cid:98)β − β0∥2 2 ≤ Λ−1 min · ∥(cid:101)x′((cid:98)β − β0)∥2 ≤ Λ−1 min · (∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β − β0∥1) ≤ 16Λ−2 minsλ2, ∥x′((cid:98)β − β0)∥2 = (∥(cid:101)x′((cid:98)β − β0)∥ + ∥h(z)′((cid:98)β − β0)∥)2 ≤ 2∥(cid:101)x′((cid:98)β − β0)∥2 + 2∥h(z)′((cid:98)β − β0)∥2 ≤ 32Λ−1 minsλ2 + 2Λmax∥(cid:98)β − β0∥2 2 ≤ 32(Λmin + Λmax)Λ−2 minsλ2. Further intersecting with S3, we obtain ∥X((cid:98)β − β0)∥2 n ≤ ∥x′((cid:98)β − β0)∥2 + λ∥(cid:98)β − β0∥1 ≤ 32(Λmin + Λmax)Λ−2 minsλ2 + 16Λ−1 minsλ2. Lemma 7. Assume Conditions 1-5 hold. 35 (2.23) (2.24) (2.25) (i) Suppose D1δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, D2(1 + K 2 F )R2 log R−1 ≤ nλ2, D3(λ2nR −2k−1 k+1 K √ −1 k+1 F ∧ λ nR−1) ≥ log n, n−1 log n(1 + KF + σ + R2λ−1) + σ−1R2 ≤ D4λ, where KF is defined in (2.11) of Lemma 3. Then, P(S1) ≥1 − 6pe−n − 2(p + p2)e−D5n min(σ−2λ2,σ−1λ) 0 R2n δ2 D7K2 F . − nR−1) − 4pe − 8pe−D6(λ2nR −1 k+1 F k+1 K −2k−1 ∧λ √ (ii) Suppose log p n ≤ 1, δ0 (cid:113) log p n R2 λ2 ≤ D8. Then, P(S2) ≥ 1 − 2p−10, P(S3) ≥ 1 − 2p−10. Proof. For ∆(z) = (∆1(z), . . . , ∆p(z)) with ∆j ∈ (cid:101)F, 1 ≤ j ≤ p, (cid:13) (cid:13) (cid:13) 1 n ∆(Z)′(g(Z) − h(Z)β + ε) (cid:13) (cid:13) (cid:13)∞ ≤ ∥∆(Z)∥max · (cid:16) 1 1 n n (cid:16) ≤ 22bk(Lg + Lh)ℓ−1 z n−1 log n · ∥g(Z)∥1 + ∥h(Z)β∥1 + (cid:17) ∥ε∥1 1 n ∥g∥∞ + ∥β∥1 · max j ∥hj(Z)∥1/n + ∥ε∥1/n (cid:17) . 
Using Hoeffding’s inequality in Theorem 3, we obtain3 P(∥ε∥1/n ≤ C3σKε) ≥ 1 − 2e−n, P(max j ∥hj(Z)∥1/n ≤ C4Kx) ≥ 1 − 2pe−n. Thus with probability at least 1 − 2(p + 1)e−n, sup β∈B,g∈F∪ (cid:101)F ,∆j ∈ (cid:101)F (cid:13) (cid:13) (cid:13) 1 n (cid:13) ∆(Z)′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ ≤D1n−1 log n · (cid:0)KF + n−1 log n + R2λ−1 + σ(cid:1). (2.26) 3Given that ∥xj∥ψ2 ≤ Kx, it holds that ∥hj(z)∥ψ2 ≤ C1Kx, ∥(cid:101)xj∥ψ2 ≤ C2Kx. 36 Next we have (cid:13) (cid:13) (cid:13) 1 n (cid:13) (cid:101)X′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ ≤ (cid:13) (cid:13) (cid:13) 1 n ≤ max 1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + ∥β∥1 · max 1≤j,k≤p + (cid:13) (cid:13) 1 (cid:101)X′g(Z) (cid:13) (cid:13) (cid:13)∞ (cid:13) n n (cid:88) (cid:101)xijhk(zi)(cid:12) (cid:12) (cid:12) 1 n i=1 (cid:101)X′h(Z)β (cid:12) + (cid:13) (cid:13) 1 n (cid:13) (cid:13) (cid:13)∞ n (cid:88) + (cid:13) (cid:13) (cid:13) 1 n (cid:101)X′ε (cid:13) (cid:13) (cid:13)∞ (cid:101)xiεi (cid:13) (cid:13)∞. i=1 Note that we have used Bernstein’s inequality to bound (cid:13) (cid:13) 1 n (cid:80)n i=1 xiεi (cid:13) (cid:13)∞ in the proof of Lemma 4. The same result holds here, P (cid:16)(cid:13) (cid:13) (cid:13) 1 n n (cid:88) i=1 (cid:101)xiεi (cid:13) (cid:13) (cid:13)∞ ≤ (cid:17) λ 40 ≥ 1 − 2pe−C5n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) . are independent, zero-mean, sub-exponential random variables with i=1 Since {(cid:101)xijhk(zi)}n ∥(cid:101)xijhk(zi)∥ψ1 ≤ ∥(cid:101)xij∥ψ2 · ∥hk(zi)∥ψ2 ≤ C6K 2 1 n (cid:101)xijhk(zi)(cid:12) max 1≤j,k≤p n (cid:88) P (cid:16) (cid:12) (cid:12) x i=1 , we use again Bernstein’s inequality to obtain (cid:12) ≤ λ/(20σ) (cid:17) ≥ 1 − 2p2e −C7n min (cid:0) λ2 xσ2 , λ K2 K4 xσ (cid:1) . Hence with probability at least 1 − 2(p + p2)e −C8n min (cid:0) λ2 xσ2(K2 K2 ε +K2 x) , λ Kxσ(Kε+Kx) (cid:1) , sup β∈B,g∈F∪ (cid:101)F ,∆j ∈ (cid:101)F (cid:13) (cid:13) (cid:13) 1 n (cid:13) (cid:101)X′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ ≤ sup g∈g∈F∪ (cid:101)F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + σ−1δ0R2 + λ 40 . We continue in the following way, sup g∈F∪ (cid:101)F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) ≤ sup g∈F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + sup g∈ (cid:101)F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) ≤ sup g∈F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + 22bk(Lg + Lh) log n ℓzn · max 1≤j≤p As in bounding maxj ∥hj(Z)∥1/n earlier, we bound (cid:16) P max 1≤j≤p 1 n n (cid:88) i=1 |(cid:101)xij| ≤ C9Kx (cid:17) ≥ 1 − 2pe−n. (2.27) 1 n n (cid:88) i=1 |(cid:101)xij| (2.28) Moreover, applying the same arguments for bounding supg∈F | 1 the proof of Lemma 3, we have that under the conditions (2.23)-(2.24), n (cid:80)n i=1(xijg(zi) − Exjg(z))| in (cid:16) P sup g∈F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) ≥ (cid:17) λ 120 ≤ 8pe−D2(λ2nR 37 −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) + 4pe − δ2 0 R2n D3K2 F . (2.29) Combining (2.25)-(2.29) completes the bound for P(S1). Regarding the bound for P(S2) and P(S3), the proof follows the same arguments used for bounding term I in the proof of Lemma 3. 2.6.2.3 Completion of the proof See Section 2.6.1.3. 2.6.3 Reference materials Theorem 3. (General Hoeffding’s inequality). Let x1, . . . , xn ∈ R be independent, zero- mean, sub-gaussian random variables. 
Then for every t ≥ 0, we have P (cid:18)(cid:12) (cid:12) (cid:12) n (cid:88) i=1 (cid:19) (cid:12) (cid:12) (cid:12) ≥ t xi ≤ 2 exp (cid:16) − ct2 i=1 ∥xi∥2 ψ2 (cid:80)n (cid:17) , where c > 0 is an absolute constant, and ∥ · ∥ψ2 inf{t > 0 : Eex2/t2 ≤ 2}. is the sub-gaussian norm defined as ∥x∥ψ2 = (Bernstein’s inequality). Let x1, . . . , xn ∈ R be independent, zero-mean, sub-exponential random variables. Then for every t ≥ 0, we have P (cid:18)(cid:12) (cid:12) (cid:12) n (cid:88) i=1 (cid:19) (cid:12) (cid:12) (cid:12) ≥ t xi (cid:34) ≤ 2 exp − c min (cid:18) t2 i=1 ∥xi∥2 ψ1 (cid:80)n , t maxi ∥xi∥ψ1 (cid:19)(cid:35) , where c > 0 is an absolute constant, and ∥ · ∥ψ1 ∥x∥ψ1 = inf{t > 0 : Ee|x|/t ≤ 2}. is the sub-exponential norm defined as The above two results are Theorem 2.6.2 and Theorem 2.8.1, respectively in [73]. Theorem 4. Let x1, . . . , xp ∈ R be sub-gaussian random variables, which are not necessarily independent. Then there exists an absolute constant c > 0 such that for all p > 1, E max 1≤i≤p |xi| ≤ c(cid:112)log p max 1≤i≤p ∥xi∥ψ2. The above result can be found in Lemma 2.4 of [7]. 38 Theorem 5. (Uniform convergence of empirical norms). Denote ∥g∥2 n = 1 n i.i.d. samples zi ∈ Z. For a class F of functions on Z, let J0(t, F) = C0t (cid:82) 1 0 sup{z1,...,zn}⊆Z (cid:112)H(ut/2, F, ∥ · ∥n)du, t > 0, where C0 > 0 is some universal constant and H is the metric entropy. Let RF := supg∈F ∥g∥, KF := supg∈F ∥g∥∞, and H(t) be the convex conjugate of i=1 g(zi)2 for (cid:80)n G(t) := (J −1 0 (cid:32) (t, F))2. Then, for all R2 n − ∥g∥2(cid:12) (cid:12) (cid:12) ≥ C1 (cid:12) (cid:12)∥g∥2 (cid:12) P sup g∈F F ≥ H(4KF / n) and all t > 0, (cid:18) 2KF J0(2RF , F) + KF RF √ t (cid:19)(cid:33) + K 2 F t n ≤ e−t, √ √ n where C1 > 0 is a universal constant. The above is Theorem 2.2 from [21]. Theorem 6. (Symmetrization and concentration). Let Y = (y1, . . . , yn) be i.i.d. samples in some sample space Y and let F be a class of real-valued functions on Y. Define QF = supf ∈F ∥f ∥, (cid:98)QF = supf ∈F ∥f ∥n. Then, (cid:16) P (cid:16) P (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 n 1 n sup f ∈F sup f ∈F n (cid:88) i=1 n (cid:88) i=1 (cid:12) (f (yi) − Ef (yi)) (cid:12) (cid:12) ≥ 4QF (cid:17) (cid:114) 2t n (cid:16) ≤ 4P (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) (cid:12) ≥ QF ϵif (yi) (cid:17) (cid:114) 2t n , ∀t ≥ 4, (cid:12) (cid:12) (cid:12) ≥ C2 · ϵif (yi) (cid:16) (cid:16) E (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) ϵif (yi) (cid:12) (cid:12) (cid:12) (cid:12)Y (cid:17) + (cid:98)QF (cid:112)t/n (cid:17)(cid:12) (cid:12) (cid:12)Y (cid:17) ≤ e−t, ∀t > 0, where ϵ1, . . . , ϵn is a Rademacher sequence independent of Y and C2 > 0 is a universal constant. The first result is Lemma 16.1 in [72] and the second result is implied by Theorem 16.4 in [72]. Theorem 7. (Dudley’s integral tail bound). Let (Xt)t∈T be a separable random process on a metric space (T, d) with sub-Gaussian increments: ∥Xt − Xs∥ψ2 ≤ Kd(t, s), ∀t, s ∈ T . Then, for every u > 0, the event |Xt − Xs| ≤ CK sup t,s∈T (cid:104) (cid:90) diam(T ) 0 (cid:105) (cid:112)H(ε, T, d)dε + u · diam(T ) holds with probability at least 1 − 2e−u2. Here, C > 0 is a universal constant, H is the metric entropy, and diam(T ) = sups,t∈T d(s, t). The result above is Theorem 8.1.6 in [73]. 
CHAPTER 3
HIGH-DIMENSIONAL PARTIAL LINEAR QUANTILE REGRESSION WITH TREND FILTERING

3.1 Introduction

This research addresses the challenges posed by diverse data types in modern analytics, focusing on scenarios where the response variable y is influenced by both linear (x) and non-linear (z) predictors. Such complex relationships, common in biological studies such as gene expression analysis, require specialized methodologies that go beyond traditional linear models. The study emphasizes the importance of accounting for the linear and non-linear components separately in order to avoid potential pitfalls: using standard least squares regression alone may lead to incorrect inferences, while applying non-linear regression to both components can produce overly rough estimates and run into the curse of dimensionality in high-dimensional datasets. This approach aims to optimize model accuracy and efficiency when handling complex, heterogeneous data structures.

The partial linear model addresses this methodological challenge by integrating parametric and nonparametric regression techniques, thereby offering a more flexible and robust approach to complex data structures. This semiparametric approach has been studied extensively in both low-dimensional and high-dimensional settings by combining classical nonparametric approaches with various types of linear models [75, 23, 80, 46, 38, 89, 84]. For high-dimensional data, where the number of covariates significantly exceeds the number of observations, several penalization techniques are employed, such as the LASSO [67], SCAD (Smoothly Clipped Absolute Deviation) [16], and MCP (Minimax Concave Penalty) [85].

Nonetheless, the impact of covariates on the central tendency of the conditional distribution may differ significantly from their effects on the distribution's extremities. Consequently, an exclusive focus on the conditional mean function can lead to misinterpretation. Quantile regression, introduced by [32], provides a solution by estimating conditional quantiles at various levels, offering a more nuanced understanding of the relationship between covariates and the response variable. Integrating quantile regression into the partial linear model framework allows the strengths of both methods to be harnessed, resulting in a more robust and comprehensive analytical approach.

Partial linear quantile regression has been applied in various lines of research. In low-dimensional settings, starting from [24], it has been used for predicting high-cost patients [42], for data with dropouts [43], and in deep learning applications [88], among others. In high-dimensional settings, it has been applied to datasets with missing covariates [61] and to functional predictors [39], among others. Notably, [63] explored the high-dimensional partial linear framework with various penalty functions, combining LASSO, SCAD, and MCP with the B-spline method and providing solid theoretical properties for their estimators. We use their results as the main benchmark for comparison.

In recent years, there has been substantial interest in and progress on the nonparametric method known as trend filtering, proposed by [66, 31, 69]. [69] describes trend filtering as a discrete analog of the locally adaptive regression splines introduced by [44]. Locally adaptive regression splines solve the following optimization problem for n observations:
\[
\min_{f \in \mathcal{F}_k} \; \frac{1}{2} \sum_{i=1}^{n} \big(y_i - f(z_i)\big)^2 + \lambda \cdot TV\big(f^{(k)}\big). \tag{3.1}
\]
Here, $\mathcal{F}_k$ denotes the function space
\[
\mathcal{F}_k = \big\{ f : [0,1] \to \mathbb{R} : f \text{ is } k \text{ times weakly differentiable and } TV(f^{(k)}) < \infty \big\},
\]
where $TV(f^{(k)})$ denotes the total variation of the $k$th derivative of $f$. The discrete analog of (3.1) using trend filtering is
\[
\min_{\theta \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \|D^{(z,k+1)}\theta\|_1, \tag{3.2}
\]
where bold symbols denote vectors, $\|\cdot\|_1$ denotes the $\ell_1$-norm, and $D^{(z,k+1)}$ is the $(k+1)$th order discrete difference operator based on $z$. Problem (3.2) is referred to as univariate $k$th order trend filtering. The details of these notations and definitions are elaborated in Section 3.2. Following the equivalence between (3.1) and (3.2) established by [69], trend filtering estimates combine the benefits of locally adaptive regression splines with the advantages of the banded difference operator structure in (3.2). This dual benefit provides local adaptivity and computational efficiency, giving trend filtering an advantage over other nonparametric methods such as smoothing splines and B-spline regression.

Trend filtering has been successfully applied in various statistical areas, including graphical models [78], additive models [56], and spatiotemporal models [49], among others. In particular, [41] studies several univariate quantile trend filtering estimators and their theoretical properties. The quantile regression analog of (3.2) considered in their work is
\[
\min_{\theta \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - \theta_i) + \lambda \|D^{(z,k+1)}\theta\|_1,
\]
where $\rho_\tau(x) = x(\tau - I(x \leq 0))$ is the check loss. However, their risk and error bounds are stated in terms of a specific Huber-type loss [27] rather than the squared loss. Therefore, even though their work concerns quantile regression, their theoretical contributions and derivations differ substantially from ours, since our error bounds are stated in terms of the squared loss. Moreover, the adaptation of quantile trend filtering within the partial linear model framework has not been explored to date, which makes our theoretical contribution distinct.

Our contribution lies in integrating trend filtering with the partial linear quantile regression model to effectively address high-dimensional, sparse, heterogeneous data. Our approach leverages the local adaptivity and computational efficiency of trend filtering together with the quantile-adaptive, sparsity-inducing properties of the quantile LASSO, providing significant advantages when dealing with complex and heterogeneous data. Furthermore, our work provides a rigorous theoretical analysis and establishes convergence rates for the estimators, with the error measured in the squared loss, which distinguishes our theory from the previous work on quantile trend filtering. We also address the additional challenges induced by this integration.

The rest of this chapter is organized as follows. Section 3.2 presents the preliminaries, including assumptions and model descriptions. Section 3.3 provides the main theoretical results and their interpretations. Section 3.4 details the simulation studies and computational aspects. Section 3.5 presents the real data analysis. Section 3.6 discusses our results, and Section 3.7 contains all the proofs.

3.1.1 Notations

The notation for this chapter is the same as in the previous chapter, so we omit the description.
3.2 Model Description and Assumptions

3.2.1 Model Description and Estimation Method

We consider the following partial linear quantile regression model:
\[
Q_{y\mid x,z}(\tau) = x'\beta^0 + g_0(z),
\]
where $\beta^0 \in \mathbb{R}^p$ has support $S = \{j : \beta^0_j \neq 0\}$ with $|S| \leq s$, and $g_0 : [0,1] \to \mathbb{R}$ is a nonparametric function. Here $Q_{y\mid x,z}(\tau) = \inf\{t : \mathbb{P}(y \leq t \mid x, z) \geq \tau\}$ denotes the $\tau$-th conditional quantile of $y$ given $x$ and $z$. For notational convenience, we suppress $\tau$ in $\beta^0$, $g_0$, and the corresponding estimators whenever it is clear from the context. In general $z \in [0,1]$ may not hold, but it can be achieved by a simple affine transformation, so we assume it for technical simplicity. We let $\{(y_i, x_i, z_i)\}_{i=1}^n$ be $n$ i.i.d. observations of $(y, x, z)$, and without loss of generality we assume that these observations are re-ordered according to $z_i$, $i = 1, \ldots, n$. Moreover, as discussed in the introduction, we focus on functions $g_0$ whose $k$th derivative has bounded total variation, that is, $TV(g_0^{(k)}) \leq L_g$ with $L_g > 0$.

Expanding upon the univariate trend filtering of [31, 69] and the quantile trend filtering of [41], for a given integer $k \geq 0$, the $k$th order partial linear quantile trend filtering (PLQTF) estimator for the partial linear quantile regression model is defined as
\[
(\widehat{\beta}, \widehat{\theta}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \theta \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta - \theta_i) + \lambda \|\beta\|_1 + \gamma \|D^{(z,k+1)}\theta\|_1, \tag{3.3}
\]
where $D^{(z,k+1)} \in \mathbb{R}^{(n-k-1) \times n}$ is the discrete difference operator of order $k+1$. When $k = 0$,
\[
D^{(z,1)} =
\begin{pmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
 & & & \ddots & & \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix} \in \mathbb{R}^{(n-1) \times n}. \tag{3.4}
\]
For $k \geq 1$, the difference operator is defined recursively as
\[
D^{(z,k+1)} = D^{(z,1)} \cdot \mathrm{diag}\!\left(\frac{k}{z_{k+1} - z_1}, \ldots, \frac{k}{z_n - z_{n-k}}\right) \cdot D^{(z,k)},
\]
where in this equation $D^{(z,1)}$ has the form of (3.4) with dimension $(n-k-1) \times (n-k)$. When $k = 0$, this estimator is equivalent to the one-dimensional quantile fused lasso [29].

Before discussing estimation methods further, we briefly introduce the falling factorial basis. [69] and [77] demonstrate a connection between univariate trend filtering and falling factorial basis functions, showing that the trend filtering problem can be interpreted as a sparse basis regression problem in these functions. This finding also applies to our case, since our model generalizes the univariate trend filtering model. For given knot points $t_1 < \cdots < t_n \in \mathbb{R}$, the $k$th order falling factorial basis functions are defined as
\[
q_i(t) = \prod_{l=1}^{i-1} (t - t_l), \quad i = 1, \ldots, k+1,
\qquad
q_{i+k+1}(t) = \prod_{l=1}^{k} (t - t_{i+l}) \cdot 1\{t > t_{i+k}\}, \quad i = 1, \ldots, n-k-1,
\]
where we write $q_1(t) = 1$ as a convention. With these definitions, the estimation in (3.3) is equivalent to
\[
(\widehat{\beta}, \widehat{\alpha}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau\!\Big(y_i - x_i'\beta - \sum_{l=1}^{n} \alpha_l q_l(z_i)\Big) + \lambda \|\beta\|_1 + \gamma\, k! \sum_{l=k+2}^{n} |\alpha_l|, \tag{3.5}
\]
which is of lasso form. Indeed, an equivalent alternative expression of (3.5) is
\[
(\widehat{\beta}, \widehat{g}) \in \arg\min_{\beta \in \mathbb{R}^p,\, g \in \mathcal{H}_n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta - g(z_i)) + \lambda \|\beta\|_1 + \gamma\, TV(g^{(k)}), \tag{3.6}
\]
where $\mathcal{H}^k_n$ is the span of the $k$th order falling factorial basis functions with knots $z_{k+1} \leq z_{k+2} \leq \ldots \leq z_{n-1}$. Throughout the chapter we write $\mathcal{H}_n$ for $\mathcal{H}^k_n$ for notational simplicity. We omit the proofs of the equivalences between (3.3), (3.5) and (3.6), since they can be taken directly from Lemma 1 in [69] for (3.3) and (3.5), and from Lemma 2 in [69] and Lemma 1 in [56] for (3.5) and (3.6). Based on these equivalences, the estimate from (3.3) is the same as that from (3.6). By constructing an appropriate linear combination of falling factorial functions, the form in (3.5) enables interpolation and extrapolation for partial linear quantile trend filtering, which enhances its practical applicability. A small numerical sketch of the difference operator and the falling factorial basis is given below.
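To make the construction above concrete, the following R sketch builds the first-order operator in (3.4), the recursion for $D^{(z,k+1)}$, and the falling factorial basis, and then checks that $D^{(z,k+1)}$ annihilates degree-$k$ polynomials evaluated at $z$. This is a minimal illustration only, not the implementation used in this dissertation; the function names are ours.

# Minimal R sketch (not the plmR implementation) of the building blocks above.

# First-order difference operator of size (m-1) x m, as in (3.4)
diff_op1 <- function(m) {
  D <- matrix(0, m - 1, m)
  D[cbind(1:(m - 1), 1:(m - 1))] <- -1
  D[cbind(1:(m - 1), 2:m)] <- 1
  D
}

# Recursive construction of D^(z,k+1) for sorted, unevenly spaced z (ord = k + 1)
diff_op <- function(z, ord) {
  n <- length(z)
  D <- diff_op1(n)                                   # D^(z,1)
  if (ord == 1) return(D)
  for (k in 1:(ord - 1)) {
    # diagonal weights k / (z_{i+k} - z_i), i = 1, ..., n - k
    W <- diag(k / (z[(k + 1):n] - z[1:(n - k)]))
    D <- diff_op1(n - k) %*% W %*% D                 # D^(z,k+1) = D^(z,1) W D^(z,k)
  }
  D
}

# kth order falling factorial basis with knots z (this sketch assumes k >= 1)
falling_factorial_basis <- function(z, k) {
  n <- length(z)
  B <- matrix(1, n, n)                               # first column: q_1(t) = 1
  for (j in 2:(k + 1))                               # pure polynomial terms
    B[, j] <- apply(outer(z, z[1:(j - 1)], "-"), 1, prod)
  for (i in 1:(n - k - 1))                           # truncated (knot-based) terms
    B[, i + k + 1] <- apply(outer(z, z[(i + 1):(i + k)], "-"), 1, prod) * (z > z[i + k])
  B
}

# Check: D^(z,k+1) theta = 0 when theta is a degree-k polynomial evaluated at z
z <- sort(runif(30))
theta <- 1 + 2 * z - 3 * z^2                         # degree 2, so take k = 2
max(abs(diff_op(z, ord = 3) %*% theta))              # numerically zero
dim(falling_factorial_basis(z, k = 2))               # 30 x 30 basis matrix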
3.2.2 Assumptions

We introduce the conditions on our model in (3.6) that are needed for the proofs of the lemmas and theorems throughout.

Condition 7 (Design Condition). The covariates $x = (x_1, \ldots, x_p)$ are bounded: $\max_{1 \leq j \leq p} |x_j| \leq K_x$.

Condition 7 is the same as Condition 2 in [63]. This condition is common in high-dimensional frameworks.

Condition 8 (Bounded Conditional Response Density). The derivative $\frac{\partial}{\partial y} f_{y\mid x,z}(y)$ of the conditional density is bounded in absolute value by a constant $\bar{f}'$ uniformly in $(y, x, z)$. Moreover, the conditional density evaluated at the conditional quantile is bounded away from zero, i.e. $f_{y\mid x,z}(x'\beta^0 + g_0(z)) \geq \underline{f} > 0$.

Condition 8 is the same as in [5, 63, 41], among others. This condition is common in the quantile regression literature, and it is less restrictive than the conditions for linear models, which assume specific error distributions.

Condition 9 (Bounded Input Density). $z$ has a continuous distribution supported on $[0, 1]$, with density bounded below by a constant $\ell_z > 0$.

Condition 9 is the same as in [69, 56]. This condition is common in the trend filtering literature.

Condition 10 (Eigenvalue Condition). Define $h(z) = (h_1(z), \ldots, h_p(z)) = E(x \mid z)$ and $\widetilde{x} = x - h(z)$. Assume $\lambda_{\min}(E\,\widetilde{x}\widetilde{x}') \geq \Lambda_{\min} > 0$ and $\lambda_{\max}(E\,h(z)h(z)') \leq \Lambda_{\max} < \infty$.

Condition 10 is common in the semiparametric literature [83, 46, 84]. It ensures that there is enough information in the data to identify the parameters in the linear part.

Condition 11. $\max_{1 \leq j \leq p} TV(h_j^{(k)}) \leq L_h$.

Condition 12. $\frac{s^2 \log p}{n} = o(1)$, $p \to \infty$, as $n \to \infty$.

Condition 11 is similar to Condition 2.6 in [46] and Condition A.5 in [84], and $L_h = 0$ when $z$ and $x$ are independent. This condition is used in the proof of Theorem 9, and this type of condition is common in the smoothing spline and trend filtering literature. Condition 12 is a technical condition used in the proofs of Theorems 8 and 9.

3.3 Main Theoretical Results

In this section we present our main results, which consist of two parts: the result for $g$ and the result for $\beta$. We begin with the convergence rate for $g$.

Theorem 8. Assume Conditions 7-10 and 12. Choose $\lambda = c_1 \sqrt{\frac{\log p}{n}}$ and $\gamma = c_2\big(\frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}}\big)$ with large enough constants $c_1, c_2 > 0$. Then there exist constants $c_3, c_4, n_0 > 0$ such that any solution $\widehat{g}$ in (3.6) satisfies
\[
\|\widehat{g}(z) - g_0(z)\|^2 \leq c_3 \Big(\frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}}\Big), \qquad
\|\widehat{g} - g_0\|_n^2 \leq c_3 \Big(\frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}}\Big),
\]
with probability at least $1 - p^{-c_4} - n^{-c_4}$, as long as $n \geq n_0$. The constants $c_1, c_2, c_3, c_4, n_0$ may depend on $k, L_g, L_h, K_x, \ell_z, \Lambda_{\min}, \Lambda_{\max}, \underline{f}, \bar{f}'$.

Theorem 8 implies that the rate of $\widehat{g}$ is driven by two terms: $s \log p / n$ and $n^{-\frac{2k+2}{2k+3}}$. This means that the convergence rate of $\widehat{g}$ depends on the convergence rate of $\widehat{\beta}$, which is discussed in Theorem 9. When the rate for the linear part is faster than the second term, that is when $s \log p / n = o\big(n^{-\frac{2k+2}{2k+3}}\big)$, the convergence rate is governed by the second term, which implies that $\|\widehat{g}(z) - g_0(z)\|^2 = O_p\big(n^{-\frac{2k+2}{2k+3}}\big)$ and $\|\widehat{g} - g_0\|_n^2 = O_p\big(n^{-\frac{2k+2}{2k+3}}\big)$. This phenomenon also appears in [63], which uses a B-spline approximation instead, and the resulting convergence rate is similar to theirs: the nonparametric term is the same, and only the other term differs, owing to their more restrictive conditions and different penalty terms. Their nonparametric rate is obtained over a Hölder class, whereas our rate is obtained over the total variation function class, on which linear smoothers are suboptimal. This implies that our rate is faster within the total variation class, especially when $p$ does not grow quickly relative to $n$ and $g$ is heterogeneous. The proof of Theorem 8 can be found in Section 3.7. A small numerical comparison of the two rate terms is given below.
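As a quick numerical illustration of this trade-off, one can evaluate the two terms directly; the values of n, p, s, and k below are our own illustrative choices, not taken from the dissertation.

# Illustrative comparison of the two terms in the rate of Theorem 8
rate_terms <- function(n, p, s, k) {
  c(parametric    = s * log(p) / n,
    nonparametric = n^(-(2 * k + 2) / (2 * k + 3)))
}
rate_terms(n = 300, p = 600, s = 4, k = 2)    # parametric term is larger here
rate_terms(n = 5000, p = 50, s = 1, k = 0)    # nonparametric term is larger here
# The nonparametric term governs the rate of g-hat only when s*log(p)/n is of
# smaller order than n^{-(2k+2)/(2k+3)}.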
We proceed to the convergence rate results for $\beta$.

Theorem 9. Assume Conditions 7-12. With the same choice of $\lambda, \gamma$ as in Theorem 8, any solution $\widehat{\beta}$ in (3.6) satisfies
\[
\|\widehat{\beta} - \beta^0\|_2^2 \leq c_3 \frac{s \log p}{n}, \qquad
\|x'(\widehat{\beta} - \beta^0)\|^2 \leq c_3 \frac{s \log p}{n}, \qquad
\|X(\widehat{\beta} - \beta^0)\|_n^2 \leq c_3 \frac{s \log p}{n},
\]
with probability at least $1 - p^{-c_4} - n^{-c_4}$, as long as $n \geq n_0$.

Theorem 9 gives two types of guarantees for the estimator $\widehat{\beta}$: (1) estimation results and (2) prediction results. Both the estimation and prediction rates, for in-sample and out-of-sample data, achieve the oracle rate. This rate is consistent with findings in classical high-dimensional quantile regression research [5]. It is noteworthy that the smoothness of $g$ does not affect the results for $\beta$, whereas the results for $g$ are influenced by $\beta$. This one-way relationship arises from the orthogonal decomposition used in the proofs, a phenomenon also observed in [63]. The proof of Theorem 9 is provided in Section 3.7.

3.4 Simulation Studies

Through empirical experiments, we evaluate the performance of partial linear quantile trend filtering (PLQTF) in comparison to partial linear quantile B-splines (PLQBS), as presented by [63]. Additionally, we report the performance of the partial linear quantile smoothing spline (PLQSS) as a side-by-side comparison, even though it is not a main object of study.

3.4.1 Computational Details

We use a combination of the Blockwise Coordinate Descent (BCD) algorithm, also known as the backfitting approach, and the Majorization-Minimization (MM) algorithm for our problem in (3.3); we call the resulting algorithm qBCD. Our approach involves two blocks: the first pertains to $\beta$ for the linear part and the second to $\theta$ for the trend filtering part. The algorithm iterates over these blocks, updating each component at every step using the quantile LASSO for $\beta$ and the univariate trend filtering algorithm for $\theta$. Within each block, however, the non-smooth check loss requires care in choosing the optimization strategy. For the quantile LASSO part, the R package rqPen [62] provides a linear programming approach. For the trend filtering part, no publicly available software handles the check loss. Therefore, we deploy the re-iteratively weighted approximation [57, 48, 58], which has been shown to be a type of MM algorithm [34]. Instead of minimizing the check loss from (3.3), in the trend filtering block we minimize the weighted least squares approximation
\[
\sum_{i=1}^{n} w^{(r)}\big(y_i, x_i, \beta, \theta_i^{(r)}, \tau\big)\, \big(y_i - x_i'\beta - \theta_i\big)^2, \tag{3.7}
\]
where, for a sufficiently small $\eta > 0$,
\[
w^{(r)}\big(y_i, x_i, \beta, \theta_i^{(r)}, \tau\big) = \frac{\tau\, I\big(y_i > x_i'\beta + \theta_i^{(r)}\big) + (1-\tau)\, I\big(y_i \leq x_i'\beta + \theta_i^{(r)}\big)}{\sqrt{\big(y_i - x_i'\beta - \theta_i^{(r)}\big)^2 + \eta^2}}.
\]
A small numerical illustration of this reweighting is given below.
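The following R sketch illustrates the mechanics of the reweighting in (3.7) on a toy problem: with the weights above held at the current iterate, each inner step is a weighted least squares problem, and iterating the weighted update for a single location parameter drives it to (approximately) the tau-th sample quantile. This is a minimal illustration under our own simplifications (a pure location problem, no penalty), not the implementation used in this chapter.

# Minimal sketch of the reweighting in (3.7) on a toy location problem
set.seed(1)
check_loss <- function(r, tau) r * (tau - (r <= 0))

# Weight evaluated at the current residual r0, as in (3.7)
mm_weight <- function(r0, tau, eta = 1e-4) {
  (tau * (r0 > 0) + (1 - tau) * (r0 <= 0)) / sqrt(r0^2 + eta^2)
}

# Estimate the tau-th quantile of y by repeated weighted least squares:
# each step solves min_m sum_i w_i (y_i - m)^2, i.e. a weighted mean.
y <- rnorm(500, mean = 2, sd = 3)
tau <- 0.8
m <- mean(y)
for (it in 1:500) {
  w <- mm_weight(y - m, tau)
  m_new <- sum(w * y) / sum(w)
  if (abs(m_new - m) < 1e-10) break
  m <- m_new
}
c(reweighted = m, sample_quantile = unname(quantile(y, tau)))  # close agreement
# With the weights held at the current iterate, the weighted quadratic
# sum(w * (y - m)^2) matches the check loss sum(check_loss(y - m, tau)) up to
# the eta-smoothing, which is what makes each block update tractable.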
The big advantage of this approximation is that the problem becomes a weighted least squares problem. The descent property of the MM algorithm ensures the convergence and numerical stability of this optimization; for further details on the MM algorithm we refer to [28]. The convergence of the Block Coordinate Descent (BCD) algorithm has been rigorously established by [71], who proved that for a convex criterion decomposable into a smooth term and separable terms, the iterates of the BCD algorithm converge to an optimal solution. Consequently, we do not need to prove the convergence of the algorithm again in this study: the convergence of our qBCD algorithm is guaranteed by this existing result. The detailed steps of the entire algorithm are outlined in Algorithm 3.1.

For the tuning parameters $\lambda$ and $\gamma$, we employ 5-fold cross-validation (CV) to determine the parameters that minimize the overall CV error. However, because there are two tuning parameters, the search space is the product of the lengths of the $\lambda$ and $\gamma$ grids, leading to a substantial increase in computation time compared to an algorithm with a single tuning parameter. Furthermore, the additional inner loop within the blocks further slows down convergence. To mitigate this, we use the warm start method [53] over our tuning parameter grids. Empirical evidence in [53] demonstrates that warm starts significantly reduce the number of iterations required for convergence, thereby speeding up the entire algorithm (a schematic sketch of this cross-validation with warm starts is given after Algorithm 3.1 below).

Our algorithm is implemented in R [52]. For the quantile lasso component we use the R package rqPen [62], and for trend filtering we use the R package glmgen [1] with the weight approximation from (3.7). However, since there is no publicly available software for the partial linear smoothing spline method, we also use the qBCD algorithm for the smoothing spline model in our computations. The R function stats::smooth.spline is used for this procedure in the second block, again with the weight approximation from (3.7).

Algorithm 3.1 A qBCD algorithm for high-dimensional partial linear quantile trend filtering

Data: $\{y_i, x_i, z_i\}$, $i = 1, \ldots, n$
Fixed (tuning) parameters: $\lambda$, $\gamma$, $k$

1. Set $t = 0$ and initialize $\theta_i^{(0)} = 0$, $i = 1, \ldots, n$.
2. For the $(t+1)$-th iteration, where $t = 0, 1, 2, \ldots$:
   a) Block 1: Let $y_i^{(t)*} = y_i - \theta_i^{(t)}$, and update $\beta^{(t)}$ by fitting the quantile lasso:
      $\beta^{(t+1)} = \arg\min_{\beta} \frac{1}{n}\sum_{i=1}^{n} \rho_\tau\big(y_i^{(t)*} - x_i'\beta\big) + \lambda\|\beta\|_1$.
   b) Block 2: Let $y_i^{(t)**} = y_i - x_i'\beta^{(t+1)}$, set $r = 0$ and $\theta^{(t+1,0)} = \theta^{(t)}$.
      For the $(r+1)$-th inner iteration, where $r = 0, 1, 2, \ldots$:
      i. Using the approximation with $w(\cdot)$, update $\theta$ by fitting the re-iteratively weighted univariate trend filtering model
         $\theta^{(t+1,r+1)} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} w_i^{(r)} \big(y_i^{(t)**} - \theta_i\big)^2 + \gamma\|D^{(z,k+1)}\theta\|_1$,
         where $w_i^{(r)} = w^{(r)}\big(y_i, x_i, \beta^{(t+1)}, \theta_i^{(t+1,r)}, \tau\big)$ as in (3.7).
      ii. If $\theta$ converges, stop the inner iteration over $r$; otherwise continue until the predefined maximum number of inner iterations is reached.
      iii. Set $\theta^{(t+1)}$ to the value at convergence.
   c) If both $\beta$ and $\theta$ converge, stop; otherwise continue until the predefined maximum number of outer iterations is reached.
3. Return $\widehat{\beta}$ and $\widehat{\theta}$ as the parameters at convergence.
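To make the tuning strategy concrete, the sketch below shows the structure of 5-fold cross-validation over the (lambda, gamma) grid with warm starts. It is purely schematic: fit_plqtf() is a hypothetical placeholder for one qBCD fit (it is not the plmR interface), and its stub body exists only so the sketch runs end to end.

# Schematic 5-fold CV over the (lambda, gamma) grid with warm starts.
# fit_plqtf() is a hypothetical placeholder, NOT the plmR API.
check_loss <- function(r, tau) r * (tau - (r <= 0))

fit_plqtf <- function(X, y, z, tau, lambda, gamma, k = 2, init = NULL) {
  # Placeholder: a real implementation would run Algorithm 3.1, starting its
  # iterations from `init` (the warm start). Here we return a null fit of the
  # right shape so that the sketch is executable.
  list(beta = if (is.null(init)) rep(0, ncol(X)) else init$beta,
       g = local({ m <- mean(y); function(znew) rep(m, length(znew)) }))
}

cv_plqtf <- function(X, y, z, tau, lambdas, gammas, k = 2, nfolds = 5) {
  folds  <- sample(rep_len(1:nfolds, length(y)))
  cv_err <- matrix(0, length(lambdas), length(gammas))
  for (f in 1:nfolds) {
    tr  <- folds != f
    fit <- NULL
    # traverse the grid from the largest penalties downward, reusing the
    # previous solution as the warm start for the next grid point
    for (i in order(lambdas, decreasing = TRUE)) {
      for (j in order(gammas, decreasing = TRUE)) {
        fit  <- fit_plqtf(X[tr, ], y[tr], z[tr], tau, lambdas[i], gammas[j],
                          k = k, init = fit)
        pred <- drop(X[!tr, ] %*% fit$beta) + fit$g(z[!tr])
        cv_err[i, j] <- cv_err[i, j] + mean(check_loss(y[!tr] - pred, tau)) / nfolds
      }
    }
  }
  best <- which(cv_err == min(cv_err), arr.ind = TRUE)[1, ]
  list(lambda = lambdas[best[1]], gamma = gammas[best[2]], cv_error = cv_err)
}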
The implemented qBCD algorithms for the smoothing spline and trend filtering methods, named plmR, are available at https://github.com/SangkyuStat/plmR for public use and reproducibility.

Figure 3.1 Configurations of the function g(z) for Models 1 and 2. The structures of the functions become increasingly locally heterogeneous from left to right.

3.4.2 Simulation Setting and Results

We now demonstrate our method through simulations under several scenarios and settings. We first generate $\widetilde{X} = (\widetilde{x}_1, \ldots, \widetilde{x}_{p+1})$ from $N_{p+1}(0_{p+1}, \Sigma)$, where $\Sigma = (\sigma_{jk})$ with $\sigma_{jk} = 0.5^{|j-k|}$ and $j, k = 1, \ldots, p+1$. We then set $z = \Phi(\widetilde{x}_{25})$, $x_j = \widetilde{x}_j$ for $j = 1, \ldots, 24$, and $x_{j-1} = \widetilde{x}_j$ for $j = 26, \ldots, p+1$. The true models we consider are as follows:

Model 1 (Less heterogeneous):
\[
y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + e^{3z_i}\sin(6\pi z_i)/7 + \epsilon_i,
\]
Model 2 (More heterogeneous):
\[
y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + 2\min(z_i, 1 - z_i)^{0.2}\sin\!\left(\frac{2.85\pi}{0.3 + \min(z_i, 1 - z_i)}\right) + \epsilon_i,
\]
for $i = 1, \ldots, n$. We vary the error settings to create different scenarios:

Scenario 1 (Normal): $\epsilon_i \sim \sigma_\epsilon N(0, 1)$
Scenario 2 (Heavy Tail): $\epsilon_i \sim \sigma_\epsilon T_{df=3}$
Scenario 3 (Heteroskedastic): $\epsilon_i = \sigma_\epsilon x_{i1}\zeta_i$, where $\zeta_i \sim T_{df=3}$

The actual forms of these functions are displayed in Figure 3.1. Various values of $\sigma_\epsilon$ are used to vary the signal-to-noise ratio (SNR) across the models and scenarios; a similar simulation design is adopted in [5] for quantile regression. We define the total signal-to-noise ratio (tSNR) for a quantile level $\tau$ as
\[
\mathrm{tSNR} = \sqrt{\frac{\|x'\beta^0 + g_0(z)\|^2}{\mathrm{Var}(\epsilon)}},
\]
where the expectation in the numerator and the variance in the denominator can be approximated by Monte Carlo (a small sketch of this calculation is given below). We vary the tSNR level from 4 to 16 on a logarithmic scale and then calculate the metrics for each model. Specifically, we consider three error metrics similar to those of [63]: the in-sample mean of quantile error (MQE), the $\ell_2$-norm error, and the mean absolute deviation error (MADE), defined as follows:

1. $n^{-1} \sum_{i=1}^{n} \rho_\tau\big(x_i'\widehat{\beta} + \widehat{g}(z_i) - (x_i'\beta^0 + g_0(z_i))\big)$: overall MQE accounting for both $\widehat{\beta}$ and $\widehat{g}$.
2. $\|\widehat{\beta} - \beta^0\|_2$: $\ell_2$-norm error for $\widehat{\beta}$.
3. $n^{-1} \sum_{i=1}^{n} |\widehat{g}(z_i) - g_0(z_i)|$: MADE for $\widehat{g}$.

The metrics are computed over 150 repetitions of randomly generated datasets for each tSNR value, and the median of each metric is reported as the final result. We consider $p = 100$ and $p = 600$ for the low- and high-dimensional cases, respectively, with $n$ fixed at 300. The values of $\beta$ are $(0.8, 1.6, 1.6, 2.4)$ in both cases. For each method, the algorithm is fitted to the training dataset and the metrics are then calculated on the in-sample testing dataset. We present results using both optimally tuned and CV-tuned parameters; the optimally tuned parameters are those with the smallest error along the tuning path, and the parameters are tuned finely to allow a fair comparison across all models. We fix $k = 2$ for the trend filtering method, use cubic smoothing splines for the smoothing spline method, and use cubic B-splines for the B-spline analysis. We tune only $\lambda$ and $\gamma$ for the smoothing spline and trend filtering methods, and only $\lambda$ and the number of knots for the B-spline model, for two reasons: (1) varying the remaining settings has minimal impact on the results, and (2) computational efficiency.
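As a small illustration of how the tSNR can be approximated, the R sketch below draws Monte Carlo samples from the design described above (Model 1, Scenario 1), estimates the signal norm, and backs out the value of sigma_eps that attains a target tSNR. The Monte Carlo sample size and the target tSNR are our own illustrative choices.

# Monte Carlo approximation of tSNR for Model 1 / Scenario 1 (illustrative values)
set.seed(42)
p <- 100
N <- 5e4                                              # Monte Carlo sample size
Sigma <- 0.5^abs(outer(1:(p + 1), 1:(p + 1), "-"))
Xt <- matrix(rnorm(N * (p + 1)), N) %*% chol(Sigma)   # draws from N_{p+1}(0, Sigma)
z <- pnorm(Xt[, 25])                                  # z = Phi(x~_25)
X <- Xt[, -25]                                        # remaining p columns form x
beta <- c(0.8, 1.6, 1.6, 2.4)
signal <- X[, 6] * beta[1] + X[, 12] * beta[2] + X[, 15] * beta[3] +
  X[, 20] * beta[4] + exp(3 * z) * sin(6 * pi * z) / 7
signal_norm2 <- mean(signal^2)        # approximates ||x'beta^0 + g_0(z)||^2
# Scenario 1: eps = sigma_eps * N(0, 1), so Var(eps) = sigma_eps^2 and
# tSNR = sqrt(signal_norm2) / sigma_eps.  Solve for sigma_eps at a target tSNR:
target_tSNR <- 8
sigma_eps <- sqrt(signal_norm2) / target_tSNR
c(signal_norm2 = signal_norm2, sigma_eps = sigma_eps)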
The simulation results for each model and case are displayed in Figures 3.2, 3.3, 3.4, and 3.5. Figures 3.2 and 3.4 demonstrate that when g is a less heterogeneous function, the performance of the PLQTF method is comparable to that of PLQSS, with no significant differences observed. PLQBS, however, does not perform as well as the other methods, especially when the tSNR is large. For the more heterogeneous function on low-dimensional data (Figure 3.3), there is a significant difference between PLQTF and the other models in Scenarios 1 and 2, and this gap widens as the tSNR grows; for Scenario 3, the performance of PLQTF and PLQSS is similar across the tSNR levels. In Figure 3.5, the high-dimensional case, PLQTF performs better than the other methods across the tSNR levels except in Scenario 3, where trend filtering outperforms the other methods only as the tSNR grows large. In these scenarios we also observe that PLQBS is unstable in terms of MQE, which appears as a marked gap between the CV-tuned and optimally tuned errors.

Figure 3.2 Model 1 low-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

Figure 3.3 Model 2 low-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

Figure 3.4 Model 1 high-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

Figure 3.5 Model 2 high-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

3.5 Real Data Analysis

3.5.1 Environment and Genetics in Lung Cancer Etiology (EAGLE) Study

The EAGLE study, a comprehensive molecular epidemiological investigation, employed a case-control design to examine lung cancer in the Lombardy Region of Italy from 2002 to 2005. This large-scale, multicenter research initiative incorporated a significant biospecimen component. The study enrolled a cohort exceeding 2,000 newly diagnosed lung cancer patients, encompassing both sexes and spanning ages 35 to 79. These cases, representing all histological types of verified lung cancer, were matched with an equivalent number of healthy population-based controls; the matching criteria included age, gender, and residence. Patient recruitment occurred across 13 hospitals, while control subjects were randomly selected from the same residential areas as the cases. This approach ensured a representative sample and minimized potential geographical biases. The research protocol involved extensive data collection through two primary methods: an interview-based computer-assisted questionnaire and a self-administered questionnaire. These instruments gathered comprehensive information on various factors, including demographic characteristics, detailed smoking history, family history of lung cancer and other cancers, previous lung diseases, medications, diet, alcohol, attempts at quitting smoking, anxiety, depression, personality scores, occupation, and reproductive and residential history. This multifaceted approach to data collection facilitated a thorough examination of potential risk factors and confounding variables associated with lung cancer development.
Further details of the study can be found in [33].

3.5.2 Real Data Analysis Results

For our real data analysis, the dataset comprises 132 tumor samples from the EAGLE study, each characterized by whole genome sequencing data, DNA methylation profiles from the EPIC array, and detailed clinical information regarding smoking history. The model aims to identify the relationship between DNA methylation in tumors and the effect of tobacco smoking on mutational signatures. Specifically, the outcome variable is the mutational signature, with a focus on the tobacco-related SBS4 signature, quantified by the number of mutations attributed to this signature in the whole genome sequencing data of the EAGLE study. The number of mutations is log-transformed for the data analysis. The methylation data, represented by real values ranging from 0 to 1 for each probe, serve as covariates. To manage computational demands, the analysis is restricted to 1,713 CpG probes that show significant differential methylation in tumors compared to normal tissues. Smoking history is treated as a factor that has a nonlinear relationship with the mutational signature, and we use the total period (in years) during which the subject smoked cigarettes regularly, provided this duration is at least one year. After removing missing observations, we have n = 125 observations and p = 1,713 variables. As discussed, z is set to be the total duration of smoking. We examine the heterogeneous effect of smoking and the selected features across quantile levels, considering three quantile levels, $\tau = 0.2$, $0.5$, and $0.8$.

The aim of this analysis is twofold: to examine the variables selected by PLQTF, and to compare the prediction errors of all methods. We applied 5-fold cross-validation (CV) to the full dataset and present the results.

Figure 3.6 Estimated $\widehat{g}$ for the real data analysis with different quantile levels, $\tau = 0.2$, $0.5$, and $0.8$. Here y is set to be the number of mutational signatures, and z is set to be the total duration of smoking (in years).

The fitted nonparametric term $\widehat{g}$ is shown in Figure 3.6, and its effect varies across quantile levels. While the effects at the 0.2 and 0.5 quantiles appear similar, the effect at the 0.8 quantile shows a notably larger increment. Although the estimated function shapes are similar across quantiles, their magnitudes differ, and at each quantile level the function clearly deviates from linearity. For all three quantile levels, the mutational signature increases until the smoking duration reaches around 20 years, after which it stabilizes. This implies that, for people who have smoked for 20 years or less, mutations accumulate rapidly during this period and then level off. Notably, the 0.8 quantile level experiences a greater increase than the others.
Quantile Level | Selected Variables (from the largest coefficient to the smallest)
τ = 0.2 | cg13906823, cg10792987, cg27275023, cg12492087, cg23497016, cg18786873
τ = 0.5 | cg07207982, cg07197230, cg10420527, cg10792987, cg04878000, cg09295202, cg13692543, cg27275023, cg02018277, cg18786873, cg00716257, cg07226481, cg08358166, cg07405182, cg09281805, cg13906823, cg24676817, cg06173663, cg17698295, cg06623698
τ = 0.8 | cg07197230, cg18451588, cg24127719, cg00558689, cg19080053, cg12091642, cg06947913, cg07981013, cg06796713, cg21207665, cg07686872

Table 3.1 List of selected CpG probes at different quantile levels of the number of mutations.

The selected features for the three quantile levels are presented in Table 3.1. Among these, we highlight specific features of interest. Notably, cg10420527 (LRP5 gene), selected at the median, has previously been reported as related to lung cancer, particularly in the context of smoking history [86]; this aligns our results with those findings, which were based on mean regression. Additionally, cg07197230 (CECR2 gene) is selected at both the 0.5 and 0.8 quantile levels and has been associated with alcohol use disorder, a condition closely linked to smoking [26]. cg07207982, selected at the median, has been reported to be associated with cholangiocarcinoma, which can metastasize to the lung [45]. cg10792987 is selected at the 0.2 and 0.5 quantiles, and cg13906823 at the 0.2 quantile; these findings have not yet been documented in the related literature, suggesting that they could represent new discoveries. Other features have smaller coefficients than those mentioned, and since we did not apply false discovery control, some of them may be false discoveries.

For the prediction error calculation, we partition the full dataset into training and test sets, with 115 observations for training and 10 for testing, randomly selected in each repetition. Tuning parameters are optimized using 5-fold cross-validation on the training data. This procedure is repeated 50 times to calculate the prediction error. For each repetition, the quantile-adaptive prediction error of each method is calculated as
\[
\mathrm{PE}(\tau) = \frac{\sum_{i=1}^{n} 1\{\text{subject } i \text{ in testing set}\}\, \rho_\tau\big(y_i - x_i'\widehat{\beta} - \widehat{g}(z_i)\big)}{\sum_{i=1}^{n} 1\{\text{subject } i \text{ in testing set}\}},
\]
where $\widehat{\beta}$ and $\widehat{g}$ are the estimates from the method considered. We report the median of the PE values. Because the smoothing spline method performs worse than the other methods, we only report results for PLQTF and PLQBS. The results are presented in Table 3.2. PLQTF outperforms PLQBS at every quantile level; however, since the standard errors are fairly large, the improvement is significant only at the 0.2 quantile.

Quantile Level | PLQTF | PLQBS
τ = 0.2 | 0.690 (0.084) | 1.180 (0.078)
τ = 0.5 | 0.792 (0.056) | 0.807 (0.063)
τ = 0.8 | 0.431 (0.033) | 0.460 (0.025)

Table 3.2 Median prediction error comparison over 50 repetitions. Numbers in parentheses denote standard errors.

3.6 Conclusion and Discussion

This study addresses the complexities of high-dimensional data influenced by both linear and non-linear predictors. We introduce a novel approach that integrates trend filtering with partial linear quantile regression, providing a robust and efficient method for analyzing such data.
Our model leverages the strengths of both parametric and nonparametric regression tech- niques, along with the comprehensive insights offered by quantile regression. Trend filtering enhances local adaptivity and computational efficiency, making our approach superior to traditional nonparametric methods like smoothing splines and B-spline regressions. Empirical results and theoretical analysis demonstrate that our model outperforms exist- ing methods in both accuracy and efficiency. We established rigorous theoretical properties, including convergence rates for the estimators, distinguishing our work from previous ap- proaches. In summary, our contributions include integrating parametric and nonparametric regres- sion within a partial linear framework, using quantile regression to capture the impact of covariates across the conditional distribution, and enhancing model performance with trend filtering for local adaptivity and computational efficiency. We have demonstrated that our model performs better than existing models. Future work could explore extending this methodology to different data structures and domains, as well as developing more efficient computational algorithms. 62 3.7 Proofs Throughout the proofs, we use C1, C2, . . . to denote universal constants and use D1, D2, . . . , ... k, Lg, Lh, Kx, ¯f ′, f , ℓz, Λmin, Λmax from Conditions 7-11. An explicit (though not opti- that may only depend on the constants to denote constants mal) dependence of {Dj, j = 1, 2, . . .} on the aforementioned constants can be tracked down. However, since it does not provide much more insight, we will often not present the explicit forms of {Dj, j = 1, 2, . . .}, and this will greatly help streamline the proofs. The constants {Cj, Dj, j = 1, 2, . . .} may vary from line to line. 3.7.1 Proof of Theorem 8 The proof has similar ideas as in the partial linear regression problem. Hence, we will not repeat arguments in some parts of the proof. For a given function f (x, z) = x′β + g(z) with β ∈ Rp, T V (g(k)) < ∞, we adopt the same functional τδ0,R(f ) = λ∥β∥1 + γT V (g(k)) 20δ0R + ∥x′β + g(z)∥, δ0 ∈ (0, 1), R > 0. (3.8) Define the event (cid:26) T1 := max 1≤i≤n−1 (z(i+1) − z(i)) ≤ (cid:27) . 22 log n ℓzn According to Lemma 16 in [70], there exists ¯g ∈ Hn such that on T1, the following holds T V (¯g(k)) ≤ ak · T V (g(k) 0 ), ∥¯g − g0∥∞ ≤ 22bk log n ℓzn · T V (g(k) 0 ), (3.9) where ak, bk > 0 are constants only dependent on k. Denote ∆ := 22bkLg log n ℓzn and (cid:40) B := β : ∥β∥2 ≤ Λ−1/2 min R, ∥β∥1 ≤ 20δ0R2 λ , ∥x′β∥ ≤ (cid:16) 2 + (cid:114) Λmax Λmin (cid:41) , (cid:17) R (3.10) (cid:40) G := g : ∥g∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + ∆, ∥g∥∞ ≤ Cz,k (cid:16) 20δ0R γ + 1 + (cid:114)Λmax Λmin (cid:17) R + ∆, T V (g(k)) ≤ 20δ0R2 γ + (1 + ak)Lg (cid:41) , (3.11) 63 where Cz,k > 0 is a constant that only depends on k, ℓz and it can be found in Lemma 11. Further define the following event (cid:12) (cid:12)Gn[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) ≤ δ0 (cid:26) T2(δ0, R) := sup β−β0∈B,g−g0∈G 3.7.1.1 Key lemmas Lemma 8. Assume that 20Kxδ0R2 λ + Cz,k (cid:16) 20δ0R γ + 1 + (cid:114)Λmax Λmin (cid:17) R + ∆ ≤ 3f 2 ¯f ′ , ∆ + 2γakLg + 8 f λ2sΛ−1 min ≤ 8δ0R2, ∆ ≤ 1 20 R, δ0 ≤ f 200 · 242 . Then on T1 ∩ T2(δ0, R), it holds that τδ0,R(x′((cid:98)β − β0) + (cid:98)g(z) − ¯g(z)) ≤ R, where ¯g ∈ Hn is the one introduced in (3.9). Proof. Consider the convex combination (cid:101)β = t(cid:98)β + (1 − t)β0, (cid:101)g = t(cid:98)g + (1 − t)¯g, t ∈ [0, 1]. 
Accordingly, define √ nR2 (cid:27) . (3.12) (3.13) (3.14) (3.15) (3.16) (cid:98)f (x, z) = x′ (cid:98)β + (cid:98)g(z), ¯f (x, z) = x′β0 + ¯g(z), (cid:101)f (x, z) = t (cid:98)f + (1 − t) ¯f = x′ (cid:101)β + (cid:101)g(z). We choose t = R R+τδ0,R( (cid:98)f − ¯f ) so that τδ0,R( (cid:101)f − ¯f ) = R · τδ0,R( (cid:98)f − ¯f ) R + τδ0,R( (cid:98)f − ¯f ) ≤ R. Combining (3.9) with Lemma 11, it is straightforward to verify that on T1, (cid:101)β − β0 ∈ B, (cid:101)g − g0 ∈ G. 64 (3.17) (3.18) Moreover, (3.17) implies that to prove (3.16) it is sufficient to show τδ0,R( (cid:101)f − ¯f ) ≤ R 2 . We start with the basic inequality due to the convex problem (3.6), Enρτ (yi − x′ ≤ Enρτ (yi − x′ i (cid:101)β − (cid:101)g(zi)) + λ∥(cid:101)β∥1 + γT V ((cid:101)g(k)) iβ0 − ¯g(zi)) + λ∥β0∥1 + γT V (¯g(k)) ≤ Enρτ (yi − x′ iβ0 − g0(zi)) + En|g0(zi) − ¯g(zi)| + λ∥β0∥1 + γT V (¯g(k)) ≤ Enρτ (yi − x′ iβ0 − g0(zi)) + ∆ + λ∥β0∥1 + γT V (¯g(k)), (3.19) where the second inequality uses the Lipschitz property of check loss and the third inequality is due to (3.9). Based on (3.18), we can re-organize the inequality (3.19) to obtain ≤ Eρτ (y − x′ (cid:101)β − (cid:101)g(z)) − Eρτ (y − x′β0 − g0(z)) (cid:12) (cid:12)n−1/2Gn[ρτ (yi − x′ sup β−β0∈B,g−g0∈G ∆ + λ∥β0∥1 + γT V (¯g(k)) − λ∥(cid:101)β∥1 − γT V ((cid:101)g(k)) iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12)+ ≤ δ0R2 + ∆ + 2λ∥(cid:101)βS − β0 S∥1 − λ∥(cid:101)β − β0∥1 + 2γT V (¯g(k)) − γT V ((cid:101)g(k) − ¯g(k)). (3.20) Here in the last step, we have used (3.12) and the triangle inequality for ∥ · ∥1 and T V (·). We further develop lower bounds for (3.20). Using the identity of Knight, we obtain Eρτ (y − x′ (cid:101)β − (cid:101)g(z)) − Eρτ (y − x′β0 − g0(z)) = E = E f 2 (cid:32) ≥ ≥ (cid:90) x′((cid:101)β−β0)+(cid:101)g(z)−g0(z) (cid:0)Fy|x,z(x′β0 + g0(z) + t) − Fy|x,z(x′β0 + g0(z))(cid:1)dt 0 (cid:90) x′((cid:101)β−β0)+(cid:101)g(z)−g0(z) (cid:16) fy|x,z(x′β0 + g0(z))t + y|x,z(x′β0 + g0(z) + (cid:101)t)t2(cid:17) f ′ dt f or (cid:101)t ∈ [0, t] 1 2 0 ∥ (cid:101)f − f0∥2 − f 2 − ¯f ′ 6 · ¯f ′ 6 (cid:16) 20Kxδ0R2 λ E|x′((cid:101)β − β0) + (cid:101)g(z) − g0(z)|3, (cid:16)20δ0R γ + Cz,k + 1 + f0 = x′β0 + g0(z), (cid:114) Λmax Λmin (cid:33) (cid:17) (cid:17) R + ∆ · ∥ (cid:101)f − f0∥2 ≥ f 4 ∥ (cid:101)f − f0∥2. Here, the second equality applies Taylor expansion; the first inequality is due to Condition 8; the second inequality uses (3.18), and the last inequality holds by the condition (3.13). 65 Putting the above lower bound together with (3.20), we can proceed to obtain f 4 ∥ (cid:101)f − f0∥2 + λ∥(cid:101)β − β0∥1 + γT V ((cid:101)g(k) − ¯g(k)) √ s∥(cid:101)β − β0∥2 ≤ δ0R2 + ∆ + 2γT V (¯g(k)) + 2λ ≤ δ0R2 + ∆ + 2γT V (¯g(k)) + 2λ √ ≤ δ0R2 + ∆ + 2γT V (¯g(k)) + 8 f sΛ−1/2 min ∥(cid:101)x((cid:101)β − β0)∥ f 8 ∥ (cid:101)f − f0∥2, min + λ2sΛ−1 (3.21) where the first inequality is from Cauchy–Schwarz inequality, and the last inequality holds by the basic inequality ab ≤ 1 2b2 and the orthogonal decomposition ∥ (cid:101)f − f0∥2 = ∥(cid:101)x′((cid:101)β − β0)∥2 + ∥h(z)′((cid:101)β − β0) + (cid:101)g(z) − g0(z)∥2. Now (3.21) combined with the condition (3.14) yields 2a2 + 1 f 8 This implies that λ∥(cid:101)β−β0∥1+γT V ((cid:101)g(k)−¯g(k)) 20δ0R ∥ (cid:101)f − f0∥2 + λ∥(cid:101)β − β0∥1 + γT V ((cid:101)g(k) − ¯g(k)) ≤ 9δ0R2. ≤ 9 20R, and ∥ (cid:101)f − ¯f ∥ ≤ (cid:113) 72δ0 f R + ∆ ≤ 1 20R under the condition (3.15). Lemma 9. Assume the same conditions of Lemma 8 are satisfied. 
Then on T1 ∩ T2(δ0, R), it holds that ∥(cid:98)g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + ∆. Moreover, define another event: T3(δ0, R) = (cid:40) sup g∈G (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) , where G was introduced in (3.11). Then on T1 ∩ T2(δ0, R) ∩ T3(δ0, R), it holds that ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:16) 1 + (cid:112) δ0 + (cid:114)Λmax Λmin (cid:17) R + ∆. Proof. Result (3.16) in Lemma 8 together with Part (i) of Lemma 11 implies that ∥(cid:98)g(z) − ¯g(z)∥ ≤ (1 + )R. Combining this result with (3.9) gives (cid:113) Λmax Λmin ∥(cid:98)g(z) − g0(z)∥ ≤ ∥(cid:98)g(z) − ¯g(z)∥ + ∥¯g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + ∆. 66 To prove the second result, we first use (3.9) and (3.16) to obtain T V ((cid:98)g(k) − g(k) 0 ) ≤ T V ((cid:98)g(k) − ¯g(k)) + T V (¯g(k) − g(k) 0 ) ≤ 20δ0R2 γ + (ak + 1)Lg. Also, from Lemma 11 we know ∥(cid:98)g(z) − g0(z)∥∞ ≤ ∥(cid:98)g(z) − ¯g(z)∥∞ + ∥¯g(z) − g0(z)∥∞ ≤ Cz,k (cid:16) 20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R + ∆. These results show that (cid:98)g − g0 ∈ G. Hence, on T1 ∩ T2(δ0, R) ∩ T3(δ0, R), we can use the bound on ∥(cid:98)g(z) − g0(z)∥ to obtain the bound on ∥(cid:98)g(z) − g0(z)∥n, ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:112)∥(cid:98)g(z) − g0(z)∥2 + δ0R2 ≤ ∥(cid:98)g(z) − g0(z)∥ + (cid:112) δ0R. Lemma 10. Denote KG := (1 ∨ Cz,k) (cid:16) 20δ0R γ + 1 + (cid:113) Λmax Λmin (cid:17) R + ∆ + (1 + ak)Lg. (i) As long as 11 log n ℓzn < 1, then P(cid:0)T1 (cid:1) ≥ 1 − 2ℓzn−10. (ii) If ((1 + (cid:113) Λmax Λmin )R + ∆)2 ≥ D1K 2 Gn −2k−2 2k+3 and K 2k+3 G (R + ∆)2k+1 ≤ D2δ2k+2 0 R4k+4nk+1, − P(T3(δ0, R)) ≥ 1 − e D3nδ2 0 R4 KG (R+∆)2 . (iii) In addition to the conditions in Part (ii), If nδ2 0R4 ≥ D4(R + ∆)2, λ ≥ D5 (cid:112)(log p)/n, (R + ∆)(cid:112)(log p)/n ≤ D6δ0R2, n−1/2K 0 R4 − P(T2(δ0, R)) ≥ 1 − 4e D8nδ2 KG (R+∆)2 − 4p−10. 1 2k+2 G (R + ∆) 2k+1 2k+2 ≤ D7δ0R2, Proof. Proof of Part (i): This is directly taken from Lemma 5 in [77]. Proof of Part (ii). A similar result (with a slightly different G) has been derived in the partial linear regression problem. Hence, we will not repeat all the arguments here. We aim n − ∥g∥2(cid:12) (cid:12). We first calculate J0(t, G) in Theorem (cid:12) (cid:12)∥g∥2 to apply Theorem 12 to bound supg∈G 12, J0(t, G) ≤ C1t (cid:115) (cid:16) ut 2KG (cid:90) 1 0 (cid:17)− 1 k+1 du = C1K 2 1 2k+2 G (cid:124) 67 (cid:123)(cid:122) ¯C1 1 2k+2 (2k + 2) 2k + 1 2k+1 2k+2 . ·t (cid:125) We then calculate the function H(·) in Theorem 12, H(u) = sup t≥0 (cid:16)2k + 1 4k + 4 = (cid:0)ut − (J −1 0 (t, G))2(cid:1) ≤ sup t≥0 (cid:0)ut − ¯C − 4k+4 2k+1 1 4k+4 2k+1 (cid:1) t (cid:17) 4k+4 2k+3 2k + 3 2k + 1 4k+4 ¯C 2k+3 1 u 4k+4 2k+3 = D1K 2 2k+3 G 4k+4 2k+3 . u The condition supg∈G ∥g∥2 ≥ H(4KG/ √ n) required by Theorem 12 is satisfied if (cid:16)(cid:0)1 + (cid:114)Λmax Λmin (cid:1)R + ∆ (cid:17)2 ≥ D2K 2 Gn −2k−2 2k+3 . Then it holds under this condition that with probability at least 1 − e−t, (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) (cid:12) ≤ D3 · sup g∈G (cid:16) K 2k+3 2k+2 G 2k+1 2k+2 (R + ∆) n √ + KG(R + ∆) √ n √ t + K 2 Gt n (cid:17) . Choosing t = D4nδ2 0 R4 KG (R+∆)2 pletes the proof. and further assuming K 2k+3 G (R + ∆)2k+1 ≤ D5δ2k+2 0 R4k+4nk+1 com- Proof of Part (iii): Recall the definition of T2(δ0, R) in (3.12). We first apply the sym- metrization from Theorem 13. 
Adopting the notation there, we have QF := sup β−β0∈B,g−g0∈G ∥ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))∥ ≤ sup β−β0∈B,g−g0∈G ∥x′ i(β − β0) + g(zi) − g0(zi)∥ ≤ sup β−β0∈B ∥x′ i(β − β0)∥ + sup g−g0∈G ∥g(zi) − g0(zi)∥ (cid:16) ≤ 2 + (cid:114) Λmax Λmin 0 R4 where the first inequality is due to the Lipschitz property of ρτ (·). Setting t = nδ2 32Q2 F (cid:114) Λmax Λmin (cid:114)Λmax Λmin R + ∆ = R + ∆, 3 + 2 R + 1 + (cid:16) (cid:17) (cid:17) (cid:17) (cid:16) (3.22) , the bound (3.22) together with the condition nδ2 0R4 ≥ 128(cid:2)(3 + 2 t ≥ 4. Hence, we can invoke Theorem 13 to obtain (cid:113) Λmax Λmin )R + ∆(cid:3)2 shows that 1 − P(T2(δ0, R)) (cid:18) ≤ 4P sup β−β0∈B,g−g0∈G (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) ≥ (cid:19) , δ0R2 1 4 (3.23) 68 where ϵ1, . . . , ϵn is a Rademacher sequence independent from the data. To further bound the above, we will apply the concentration result in Theorem 13. Towards this end, let Y := {(yi, xi, zi)}n (cid:18) i=1 . We have E ≤ 2E sup β−β0∈B,g−g0∈G (cid:18) sup β−β0∈B,g−g0∈G (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ (cid:12) (cid:12)Enϵi[x′ i(β − β0) + g(zi) − g0(zi)](cid:12) (cid:12) (cid:12) (cid:12) (cid:12)Y (cid:19) (cid:12) (cid:12) (cid:12)Y iβ0 − g0(zi))](cid:12) (cid:12) (cid:19) (cid:18) ≤ 2E sup β−β0∈B (cid:12) (cid:12)Enϵi[x′ i(β − β0)](cid:12) (cid:12) (cid:19) (cid:12) (cid:12) (cid:12)Y (cid:18) + 2E sup g−g0∈G (cid:12)Enϵi[g(zi) − g0(zi)](cid:12) (cid:12) (cid:12) (cid:19) , (cid:12) (cid:12) (cid:12)Y (3.24) where we have used the contraction theorem (e.g., Theorem 16.2 in [72]) in the first step. To bound the first term above, starting with Hölder’s inequality, we have (cid:18) E sup β−β0∈B (cid:12) (cid:12)Enϵi[x′ i(β − β0)](cid:12) (cid:12) (cid:19) (cid:12) (cid:12) (cid:12)Y ≤ sup β−β0∈B ∥β − β0∥1 · E(cid:0)∥En(ϵixi)∥∞ (cid:12) (cid:12)Y(cid:1) ≤ 20δ0R2 λ · C1Kx (cid:114) log p n . (3.25) In the last step, we have used the standard bound for maximum of sub-Gaussian variables (e.g. Theorem 11). To bound the second term, let (cid:98)QG := supg∈G ∥g∥n and H be the metric entropy. Applying Dudley’s entropy integral gives (cid:16) E (cid:12) (cid:12) 1 n sup g∈G ≤C3 (cid:98)QG√ n (cid:90) 1 0 i=1 (cid:16) u (cid:98)QG KG n (cid:88) ϵig(zi)(cid:12) (cid:12) (cid:17) (cid:12) (cid:12) (cid:12)Y ≤ C2 (cid:98)QG√ n (cid:90) 1 0 (cid:113) H(u (cid:98)QG, G, ∥ · ∥n)du, (cid:17) −1 2k+2 du = (2k + 2)C3 √ n (2k + 1) 1 2k+2 G 2k+1 2k+2 G , (cid:98)Q K (3.26) where the second inequality is due to the entropy bound of Corollary 1 in [56]. Now write down (cid:98)QF from Theorem 13, (cid:98)Q2 F := sup β−β0∈B,g−g0∈G ≤ sup β−β0∈B,g−g0∈G 1 n 1 n n (cid:88) (cid:2)ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))(cid:3)2 i=1 n (cid:88) i=1 (cid:2)x′ i(β − β0) + g(zi) − g0(zi)(cid:3)2 ≤ sup β−β0∈B 2 n ≤ 2K 2 x 400δ2 λ2 n (cid:88) i=1 0R4 (cid:2)x′ i(β − β0)(cid:3)2 + sup g−g0∈G 2 n n (cid:88) i=1 (cid:2)g(zi) − g0(zi)(cid:3)2 + 2 (cid:98)Q2 G. 69 We can then use the second result in Theorem 13, together with (3.24)-(3.26), to obtain (cid:32) P sup β−β0∈B,g−g0∈G (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) ≥ D1 · √ (cid:18) δ0R2( log p + √ n λ √ t) + (cid:98)QG (cid:114) t n + n−1/2K 1 2k+2 G 2k+1 2k+2 (cid:98)Q G (cid:33) (cid:19)(cid:12) (cid:12) (cid:12)Y ≤ e−t, t > 0. 
(3.27) Note that on T3(δ0, R)) it holds that (cid:98)QG ≤ (cid:114) sup g∈G |∥g∥2 n − ∥g∥2| + sup g∈G ∥g∥ ≤ ( (cid:112) δ0 + 1 + (cid:112)Λmax/Λmin)R + ∆, (3.28) − and P(T3(δ0, R)) ≥ 1−e 0 R4 D2nδ2 KG (R+∆)2 from Part (ii). We can combine (3.27)-(3.28) and integrate out the conditional probability to reach (cid:32) P sup β−β0∈B,g−g0∈G (cid:18) δ0R2( ≥ D3 · (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) √ t) √ log p + √ n λ + (R + ∆) (cid:114) t n + n−1/2K 1 2k+2 G (R + ∆) 2k+1 2k+2 (cid:19)(cid:33) − ≤ e−t + e D2nδ2 0 R4 KG (R+∆)2 . It is direct to verify that choosing t = 10 log p under the given conditions finishes the proof. Lemma 11. Consider any f (x, z) = x′β + g(z) that satisfies τδ0,R(f ) ≤ R. (i) Under Condition 10 it holds that ∥β∥2 ≤ Λ−1/2 (cid:16) ∥g∥ ≤ 1 + min R, ∥β∥1 ≤ (cid:114)Λmax Λmin (cid:17) R, T V (g(k)) ≤ 20δ0R2 γ . 20δ0R2 λ , ∥x′β∥ ≤ (cid:16) 2 + (cid:114) Λmax Λmin (cid:17) R, (ii) Under additional Condition 9, it holds that ∥g∥∞ ≤ Cz,k (cid:16)20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R, where Cz,k > 0 is a constant only dependent on k and ℓz. 70 Proof. The bound on ∥β∥1 and T V (g(k)) is clear from the definition of τδ0,R(f ) in (3.8). Using the orthogonal decomposition: ∥x′β + g(z)∥2 = ∥(cid:101)x′β∥2 + ∥h(z)′β + g(z)∥2 ≤ R2, we have ∥β∥2 ≤ Λ−1/2 min ∥(cid:101)x′β∥ ≤ Λ−1/2 min R, and ∥g(z)∥ ≤ ∥h(z)′β + g(z)∥ + ∥h(z)′β∥ ≤ R + Λ1/2 max∥β∥2 ≤ (cid:16) 1 + (cid:114)Λmax Λmin (cid:17) R. Having the bound on ∥g∥, we can further obtain ∥x′β∥ ≤ ∥x′β + g(z)∥ + ∥g(z)∥ ≤ (2 + (cid:113) Λmax Λmin It remains to bound ∥g∥∞. Define g∗(z) = ∗ ) ≤ 1, ∥g∗∥ ≤ 1 , then T V (g(k) )R. g(z) ( 20δ0R γ +1+ (cid:113) Λmax Λmin )R due to the derived bounds on ∥g∥ and T V (g(k)). Hence it is sufficient to show that sup ∗ )≤1,∥g∗∥≤1 g∗:T V (g(k) ∥g∗∥∞ ≤ Cz,k. To prove the above, we first decompose g∗ = p∗ + q∗, where p∗ is a polynomial of degree k and q∗ is orthogonal to all polynomials of degree k (with respect to the L2 inner product (cid:82) 1 ∗ ) = T V (g(k) 0 p(z)q(z)dz). Note that T V (q(k) ∥q∗∥∞ ≤ ck for a constant ck > 0 only depending on k. Now write p∗(z) = (cid:80)k ∗ ) ≤ 1. Then Lemma 5 in [56] implies that j=0 ajzj. We have (cid:112) a′V0a = ∥p∗∥ ≤ ∥g∗∥ + ∥q∗∥ ≤ 1 + ck, where a = (a0, . . . , ak) and V0 = E(v(z)v(z)′) with v(z) = (1, z, z2, . . . , zk). Thus, we obtain ∥p∗∥∞ ≤ ∥a∥1 ≤ √ k + 1∥a∥2 ≤ √ k + 1(1 + ck)λ−1/2 min (V0), which implies ∥g∗∥∞ ≤ ∥p∗∥∞ + ∥q∗∥∞ ≤ min (V0) + ck. Finally, we need to show λmin(V0) is bounded below by a constant only depending on k and ℓz. Under Condition k + 1(1 + ck)λ−1/2 √ 9 we have λmin(V0) = min ∥b∥2=1 E(b′v(z))2 ≥ ℓz · min ∥b∥2=1 E(b′v((cid:101)z))2, (cid:101)z ∼ U nif (0, 1) = ℓz · E(b′ ∗v((cid:101)z))2, 71 where b∗ is the minimizer; E(b′ ∗v((cid:101)z))2 is a constant only depending on k, and it is positive because the roots of any polynomial (not identically zero) form a set of Lebesgue measure zero. 3.7.1.2 Completion of the proof We choose R2 = κ1 · (cid:16) s log p n + n− 2k+2 2k+3 (cid:17) , γ = κ2 · δ0R2, λ = κ3 · (cid:113) log p n . Under Condition 12, it is straightforward to verify that by choosing sufficiently small constant δ0, proper constant κ2 and sufficiently large constants κ1, κ3, the conditions in Lemmas 8-10 are all satisfied when n is large enough. 3.7.2 Proof of Theorem 9 Denote the following event: T4 = (cid:110)(cid:12) (cid:12)∥Xβ∥2 (cid:12) n − ∥x′β∥2(cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B (cid:111) , where the set B was introduced in (3.10). 
The proof of Theorem 8 already shows (cid:98)β −β0 ∈ B. We aim to perform a finer analysis to obtain the desirable rates specified in Theorem 9. Recall that under Condition 11, max1≤j≤p T V (h(k) According to Lemma 16 in [70], there exist ¯hj ∈ Hn for all 1 ≤ j ≤ p such that on T1, j ) ≤ Lh where h(z) = (h1(z), . . . , hp(z)) = E(x|z). T V (¯h(k) j ) ≤ akLh, ∥¯hj − hj∥∞ ≤ 22bkLh log n ℓzn := (cid:101)∆. (3.29) Define (cid:101)yi = yi − h′(zi)β0 − g0(zi), ζi = h′(zi)(β0 − (cid:98)β) + g0(zi) − (cid:98)g(zi), and ¯h(z) = (¯h1(z), . . . , ¯hp(z)), ˇg(z) = (cid:98)g(z) + ¯h′(z)((cid:98)β − β0). Note that ˇg ∈ Hn since (cid:98)g, ¯hj ∈ Hn. Hence the optimization (3.6) implies the basic inequality Enρτ (yi − x′ ≤ Enρτ (yi − x′ i (cid:98)β − (cid:98)g(zi)) + λ∥(cid:98)β∥1 + γT V ((cid:98)g(k)) iβ0 − ˇg(zi)) + λ∥β0∥1 + γT V (ˇg(k)) 72 Using the triangle inequality for T V (·) and Lipschitz property of ρτ (·), we can continue from the above to have i (cid:98)β + ζi) + λ∥(cid:98)β∥1 Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ iβ0 + ζi) + λ∥β0∥1 + En|(¯h(zi) − h(zi))′(β0 − (cid:98)β)| + γT V ((¯h(k)(z))′(β0 − (cid:98)β)) iβ0 + ζi) + λ∥β0∥1 + ( (cid:101)∆ + γakLh)∥(cid:98)β − β0∥1, (3.30) where the last step is due to (3.29). Moreover, using the identity of Knight, we also obtain Enρτ ((cid:101)yi − (cid:101)x′ (cid:90) (cid:101)x′ i((cid:98)β−β0) = En 0 = En(cid:101)x′ i((cid:98)β − β0) iβ0 + ζi) − Enρτ ((cid:101)yi − (cid:101)x′ (cid:16) i (cid:98)β + ζi) + Enρτ ((cid:101)yi − (cid:101)x′ (cid:17) 1((cid:101)yi − (cid:101)xiβ0 ≤ t) − 1((cid:101)yi − (cid:101)xiβ0 + ζi ≤ t) (cid:90) 1 dt (cid:16) 0 1((cid:101)yi − (cid:101)xiβ0 ≤ (cid:101)x′ (cid:90) 1 (cid:16) i((cid:98)β − β0)t) − 1((cid:101)yi − (cid:101)xiβ0 + ζi ≤ (cid:101)x′ (cid:17) i((cid:98)β − β0)t) dt i (cid:98)β) − Enρτ ((cid:101)yi − (cid:101)x′ iβ0) ≤ ∥(cid:98)β − β0∥1 · (cid:13) En(cid:101)xi (cid:13) (cid:13) 0 1((cid:101)yi − (cid:101)xiβ0 ≤ (cid:101)x′ i((cid:98)β − β0)t) − 1((cid:101)yi − (cid:101)xiβ0 + ζi ≤ (cid:101)x′ (cid:17) i((cid:98)β − β0)t) dt (cid:13) (cid:13) (cid:13)∞ ≤ λ 4 ∥∥(cid:98)β − β0∥1 with probability at least 1 − p−D1. Here, the second equality is by a change of variable, the first inequality is by Hölder’s inequality, and the last inequality is due to the fact that each (cid:101)xi is a centered sub-Gaussian vector and we have used the standard tail bound on maximum of sub-Gaussian variables. Combining the above with (3.30) yields i (cid:98)β) + λ∥(cid:98)β∥1 Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ iβ0) + λ∥β0∥1 + (λ/4 + (cid:101)∆ + γakLh)∥(cid:98)β − β0∥1 iβ0) + λ∥β0∥1 + λ 3 ∥(cid:98)β − β0∥1, (3.31) where the last inequality holds when n is large enough since γ = o(λ), (cid:101)∆ = o(λ). Given that (cid:101)x′β0 and the conditional density the τ -level conditional quantile of only differs from fy|x,z(·) by a shifting, we can base on (3.31) to perform standard high- (cid:101)y (conditioning on (cid:101)x) is dimensional analysis of ℓ1-penalized quantile regression. Specifically, following the essential arguments in [5] (which we skip for simplicity), we can obtain that under an additional event 73 that happens with probability at least 1 − p−D2, it holds that ∥(cid:101)x′((cid:98)β − β0)∥2 ≤ D3 s log p n , ∥(cid:98)β − β0∥2 2 ≤ D3 s log p n . We should clarify that the term log(p ∨ n) in [5] is replaced by log p here since we are not aimed for uniform results over the quantile levels. 
Also, the restricted nonlinear impact condition required in [5] is not needed in our problem, because we can bound the population quantile regression objective function as in the proof of Lemma 8 thanks to Condition 12. Now we can proceed to have ∥x′((cid:98)β − β0)∥2 = (∥(cid:101)x′((cid:98)β − β0)∥ + ∥h(z)′((cid:98)β − β0)∥)2 ≤ 2∥(cid:101)x′((cid:98)β − β0)∥2 + 2∥h(z)′((cid:98)β − β0)∥2 ≤ 2∥(cid:101)x′((cid:98)β − β0)∥2 + 2Λmax∥(cid:98)β − β0∥2 2 ≤ D4 s log p n . Further intersecting with T4, we obtain ∥X((cid:98)β − β0)∥2 n ≤ ∥x′((cid:98)β − β0)∥2 + λ∥(cid:98)β − β0∥1 ≤ ∥x′((cid:98)β − β0)∥2 + 4 √ sλ∥(cid:98)β − β0∥2 ≤ D5 s log p n , where we have used the fact that (cid:98)β − β0 belongs to a cone {ν ∈ Rp : |νSc| ≤ 3|νS|}. It remains to show T4 holds with high probability. We first have ∀β ∈ B, (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 ((x′ iβ)2 − ∥x′β∥2) (cid:12) (cid:12) (cid:12) ≤ max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (xijxik − Exjxk) (cid:12) (cid:12) · ∥β∥2 (cid:12) 1, (3.32) where in the last inequality we have used the fact |a′Aa| ≤ ∥a∥2 1 ·∥A∥max, ∀a ∈ Rp, A ∈ Rp×p. are independent, zero-mean, sub-Gaussian Condition 7 implies that {xijxik − Exjxk}n i=1 random variables with the sub-Gaussian norm ∥xijxik − Exjxk∥ψ2 ≤ C1K 2 Hoeffding’s inequality in Theorem 10 together with a simple union bound to obtain that . We then use x max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (xijxik − Exjxk) (cid:12) (cid:12) ≤ C2K 2 (cid:12) x (cid:114) log p n holds with probability at least 1 − 2p−10. Putting this result together with (3.32) shows that (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 ((x′ iβ)2 − ∥x′β∥2) (cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B, 74 (cid:113) log p n · 20δ0R2 λ ≤ λ. This condition is satisfied when n is large enough since as long as C2K 2 x (cid:113) log p n λ ∝ and R2 = o(λ). 3.7.3 Reference materials Theorem 10. (General Hoeffding’s inequality). Let x1, . . . , xn ∈ R be independent, zero- mean, sub-gaussian random variables. Then for every t ≥ 0, we have P (cid:18)(cid:12) (cid:12) (cid:12) n (cid:88) i=1 (cid:19) (cid:12) (cid:12) (cid:12) ≥ t xi ≤ 2 exp (cid:16) − ct2 i=1 ∥xi∥2 ψ2 (cid:80)n (cid:17) , where c > 0 is an absolute constant, and ∥ · ∥ψ2 inf{t > 0 : Eex2/t2 ≤ 2}. The above result is Theorem 2.6.2 in [73]. is the sub-gaussian norm defined as ∥x∥ψ2 = Theorem 11. Let x1, . . . , xp ∈ R be sub-gaussian random variables, which are not necessarily independent. Then there exists an absolute constant c > 0 such that for all p > 1, E max 1≤i≤p |xi| ≤ c(cid:112)log p max 1≤i≤p ∥xi∥ψ2. The above result can be found in Lemma 2.4 of [7]. Theorem 12. (Uniform convergence of empirical norms). Denote ∥g∥2 n = 1 n i.i.d. samples zi ∈ Z. For a class F of functions on Z, let J0(t, F) = C0t (cid:82) 1 0 sup{z1,...,zn}⊆Z (cid:112)H(ut/2, F, ∥ · ∥n)du, t > 0, where C0 > 0 is some universal constant and H is the metric entropy. Let RF := supg∈F ∥g∥, KF := supg∈F ∥g∥∞, and H(t) be the convex conjugate of i=1 g(zi)2 for (cid:80)n G(t) := (J −1 0 (t, F))2. Then, for all R2 F ≥ H(4KF / n) and all t > 0, √ (cid:32) P sup g∈F (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) (cid:12) ≥ C1 (cid:18) 2KF J0(2RF , F) + KF RF √ n √ t (cid:19)(cid:33) + K 2 F t n ≤ e−t, where C1 > 0 is a universal constant. The above is Theorem 2.2 from [21]. 75 Theorem 13. (Symmetrization and concentration). Let Y = (y1, . . . , yn) be i.i.d. samples in some sample space Y and let F be a class of real-valued functions on Y. 
Define QF = supf ∈F ∥f ∥, (cid:98)QF = supf ∈F ∥f ∥n. Then, (cid:16) P (cid:16) P (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 n 1 n sup f ∈F sup f ∈F n (cid:88) i=1 n (cid:88) i=1 (cid:12) (f (yi) − Ef (yi)) (cid:12) (cid:12) ≥ 4QF (cid:17) (cid:114) 2t n (cid:16) ≤ 4P (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) (cid:12) ≥ QF ϵif (yi) (cid:17) (cid:114) 2t n , ∀t ≥ 4, (cid:12) (cid:12) (cid:12) ≥ C2 · ϵif (yi) (cid:16) (cid:16) E (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) ϵif (yi) (cid:12) (cid:12) (cid:12) (cid:12)Y (cid:17) + (cid:98)QF (cid:112)t/n (cid:17)(cid:12) (cid:12) (cid:12)Y (cid:17) ≤ e−t, ∀t > 0, where ϵ1, . . . , ϵn is a Rademacher sequence independent of Y and C2 > 0 is a universal constant. The first result is Lemma 16.1 in [72] and the second result is implied by Theorem 16.4 in [72]. 76 CHAPTER 4 FALSE DISCOVERY RATE CONTROL VIA REGIONAL QUANTILE REGRESSION ON ULTRA-HIGH DIMENSION 4.1 Introduction The advancement of technologies and biotechnologies has enabled researchers to collect vast amounts of data. Ultra-high dimensional data, characterized by an exponentially grow- ing number of candidate covariates p relative to the number of observations n, is increasingly prevalent in diverse scientific domains such as bioinformatics, machine learning, and finance. Despite the large number of covariates, often only a small subset significantly influences the response variable. Identifying these influential covariates amidst the numerous noisy ones is a formidable task. To tackle this issue, researchers employ a technique known as the screening method, which aims to isolate and retain the useful variables while discarding the irrelevant ones. Significant progress has been made in screening ultra-high dimensional data. Starting with [17], who proposed the sure independence screening (SIS) based on the marginal cor- relation between a covariate and a response variable, subsequent research expanded this model. For instance, [18] adapted the SIS for generalized linear models, [4] developed condi- tional sure independence screening for situations with prior information on covariates, and [15] introduced nonparametric independence screening for additive models. However, these methods primarily address the conditional mean of the response, making them less robust to outliers or heteroscedastic errors. This limitation can be mitigated by employing quantile regression (QR). Introduced by [32], quantile regression models the effects of covariates on conditional quantiles, offering a robust alternative to mean-based approaches. In high-dimensional and ultra-high dimen- sional settings, various researchers have advanced this field: [76] used penalized methods, [35] applied the Bayesian information criterion for variable selection, [14] considered the adaptive robust lasso model using a two-step procedure based on the ℓ1-penalized quantile regression 77 model [5], and [40] employed quantile partial correlation for screening. However, these works have focused on local quantile regression, considering single or multiple quantile levels independently. As highlighted by [87], local quantile regression can lead to delicate issues. For instance, when determining which covariates affect low quantiles, the choice of the quantile level is crucial. An incorrect choice, such as τ = 0.1, might result in missing important covariates that significantly influence the conditional quantile. 
To address this, the regional quantile regression method [50], also known as global quantile regression [87] or quantile function regression [82], can be employed. [51] pioneered this approach for the varying coefficient model, which inspired [87] to use the adaptive robust lasso model [14] in the regional setting. In variable selection, this method models quantile regression over an interval of quantile levels, ∆ ⊂ (0, 1), rather than a single level τ . Using ∆ reduces the risk of overlooking covariates that significantly contribute to other quantiles, enhancing the stability of variable selection. Thus, the regional quantile regression method offers a more general approach than the local quantile regression, as it encompasses the latter when ∆ is a singleton set. Although some researchers have developed methods for variable selection in ultra-high dimensional data using regional quantile regression, no work has considered the control of the false discovery rate (FDR). As noted by [36], recent screening methods tend to sacrifice FDR by adopting conservative cutoffs, a problem also present in quantile regression variable selection. Therefore, there is a need for a new method in quantile regression that robustly selects variables while concurrently considering multiple quantile levels and controlling the FDR. To control the FDR in variable selection problems, [3] proposed an innovative method called knockoff variable selection. The core concept of the knockoff method involves creating artificial variables, known as knockoff variables, and then performing variable selection based on both these knockoff variables and the original variables. The primary advantage of the knockoff method is its ability to guarantee exact FDR control with finite samples. While the 78 method was initially developed for fixed design matrices X, recent extensions have adapted it for random design matrices within the model-X framework, as demonstrated by [9], [60], and [79]. However, in our study, we adopt a multi-step procedure similar to the approaches of [36] and [2] for ultra-high dimensional variable selection with a fixed design matrix. Ad- ditionally, [65] applied this two-step method directly to the log-contrast model, highlighting its versatility and effectiveness. To the best of our knowledge, we are the first to propose the knockoff method for a regional quantile regression setting. In a subsequent section, we demonstrate that adapting the knockoff method to this context is not a straightforward modification of the original approach due to the involvement of multiple quantile levels. Consequently, we present a new knockoff procedure specifically designed for regional quantile regression, which addresses the limitation of using a single quantile level and concurrently controls the FDR level. The rest of this chapter is organized as follows: In Section 4.2, we introduce our new screening method. In Section 4.3, we present our novel knockoff procedure for global quantile regression. Section 4.4 provides numerical results from various experimental setups. In Section 4.5, we demonstrate the effectiveness of our method using real data. Finally, in Section 4.6, we summarize our findings. 4.2 Proposed Screening Method In this section, we introduce our screening method. Before we proceed to the estimation, we begin with introducing the models. 
We consider the following regional quantile regression form: Qy|X(τ ) = α∗(τ ) + Xβ∗(τ ), τ ∈ ∆, (4.1) where ∆ = [τ1, τ2] ⊆ (0, 1) is a closed interval of quantile levels, y = (y1, . . . , yn)′ ∈ Rn is a response variable, Xj = (x1j, . . . , xnj)′ ∈ Rn is a column vector for covariate matrix X = (X1, . . . , Xp) ∈ Rn×p, β(τ ) = (β1(τ ), . . . , βp(τ ))′ ∈ Rp is a vector of coefficients, 79 α(τ ) ∈ R is an intercept. We denote x′ i to be a row vector for X. 4.2.1 Marginal screening via integrated regional quantile estimator We now move on to our screening procedure. Here, the covariate matrix X is ultrahigh- dimensional and contains a vast number of covariates. We assume that the dimension p of each row of X is allowed to grow with sample size n at an exponential rate, that is log p = O(na) for some constant a. Our goal is to identify a set of active covariates associated with the response by filtering out as many unrelated variables as possible, and coincidently find for which we propose a marginal quantity to rank the covariates. To denote the significant variables, we define the index set M(∆) = (cid:8)1 ≤ j ≤ p : ∃τ ∈ ∆, β∗ j (τ ) ̸= 0(cid:9) . We assume that the number of active variables is smaller than the sample size: |M(∆)| ≤ n. Intuitively, we can expect that if the corresponding variable Xj is important, then the integration of the absolute value of its coefficients within the region should be large, and it should not be close to 0. Then naturally, we can utilize this property as a measure of importance for each covariate within the quantile region. For the estimation, we consider a nonparametric regression approach using B-spline ap- proximation to the quantile coefficients and the intercept within the quantile region. That is, let βj(τ ) = B(τ )T bj, αj(τ ) = B(τ )T aj, where B(τ ) is a B-Spline function for τ which is explained to be later. Without loss of generality, we assume each covariate xj is standardized to have E(xj) = 0, V ar(xj) = 1. Then we proceed to a marginal quantile screening ((cid:98)aj,(cid:98)bj) ∈ arg min a,b L (cid:88) n (cid:88) ℓ=1 i=1 (cid:16) yi − xijB(τℓ)T b − B(τℓ)T a (cid:17) , ρτℓ j = 1, . . . , p, where ρτ (x) = x(τ − I(x ≤ 0)) is a check loss function for a quantile level τ , {τl} is a partition of the interval ∆, and using them as knots we have constructed N = L + k normalized B-spline basis functions B(τ ) = (B1(τ ), . . . , BN (τ )) of order k + 1 and L is the number of discrete quantile levels. For the measure of the importance of each predictor, we 80 (cid:82) use the integrated coefficient estimator ∆ | (cid:98)βj(τ )|2dτ , where |∆| denotes the cardinality. Here, (cid:98)βj(τ ) = B(τ )T(cid:98)bj, so the key integral to evaluate is (cid:82) ∆ B(τ )B(τ )T dτ , which can be even computed or approximated before obtaining (cid:98)bj for each predictor. Then our index set, derived from our model, can be written as follows 1 |∆| (cid:26) (cid:99)M(∆) = 1 ≤ j ≤ p : 1 |∆| (cid:90) ∆ | (cid:98)βj(τ )|2dτ ≥ ν0 (cid:27) , where ν0 > 0 is a cutoff threshold. We show the theoretical result, the sure screening property, for our model. Before we prove the theorem, we first introduce some regularity conditions and some population quantities: For each j ∈ M(∆) (a∗ j , b∗ j ) ∈ arg min a,b∈RN 1 L L (cid:88) ℓ=1 (cid:16) Eρτℓ y − xjB(τℓ)T b − B(τℓ)T a (cid:17) (fj(τ ), gj(τ )) ∈ arg min f,g∈R Eρτ (y − xjf − g). Conditions (4.2) (4.3) 1. (Bounded covariates) The covariates are bounded: maxj∈M(∆) |xj| ≤ Kx. 2. 
(Lipschitz condition) The functions {fj(τ), gj(τ)}, j ∈ M(∆), belong to the class of functions whose kth derivative satisfies a Lipschitz condition of order c: |h^(k)(s) − h^(k)(t)| ≤ c0 |s − t|^c for some positive constant c0, where k is a nonnegative integer and c ∈ (0, 1] satisfies d = k + c > 0.5.
3. (Bounded response density) For each j ∈ M(∆), the conditional density f_{y|xj}(t), evaluated at t = xj fj(τ) + gj(τ) and at t = xj B(τ)^T b∗j + B(τ)^T a∗j, is bounded uniformly in (y, xj, τ): 0 < f ≤ f_{y|xj}(t) ≤ f̄ < ∞. Moreover, its derivative is uniformly bounded: sup_t |f′_{y|xj}(t)| ≤ f̄′.
4. (Identifiability of the true model) The nonlinear impact coefficient
q := (3f / (2f̄′)) · inf_{a,b, j∈M(∆)} (E(a xj + b)^2)^{3/2} / E|a xj + b|^3
is bounded away from zero.
5. (Strong marginal regional signal) min_{j∈M(∆)} (1/|∆|) ∫_∆ |fj(τ)| dτ ≥ κ n^{−γ} for some constants κ, γ > 0.
6. γ < 1/2, N^{−d} n^γ = o(1), N^{3/2} n^{−1/2} √(log n) = o(1), and max(N^2 log N, n) = o(L).
Conditions 1, 2, and 3 are common in the quantile regression literature that uses B-spline bases [25, 63, 82]. Condition 4 is similar to conditions in [87, 5]; it guarantees identifiability of the true model. Condition 5 is common in the screening literature [25]. Condition 6 describes how fast the number of basis functions may grow with the sample size. Under these conditions, we have the following sure screening property.
Theorem 14. Suppose τ1, . . . , τL are i.i.d. Unif(∆) and assume Conditions 1–6. Choosing ν0 = (κ/4) n^{−γ}, the sure screening property holds:
P(M(∆) ⊆ M̂(∆)) → 1, as n → ∞.
We defer the proof of Theorem 14 to Section 4.7.1. Theorem 14 ensures that, given a suitable number of observations, all important variables are retained. However, because we aim to retain every important variable, some false discoveries are inevitably included among the selected variables and should be removed. We now move on to controlling the FDR for these selected variables.
4.3 FDR Control via Regional QR-Knockoff
4.3.1 A short review of the model-X knockoff
Various works related to the knockoff method have recently been proposed. We focus on the model-X approach, which allows the knockoff method to be extended to high-dimensional models. For detailed explanations of the model-X approach and knockoff methods, see [9] and the references therein. We introduce one definition from [9].
Definition 1. A variable xj is said to belong to the null set H0 (we also say xj is null) if and only if y is independent of xj conditionally on the other variables in X, for j = 1, . . . , p.
We now introduce the knockoffs. Let y be a response vector and X be a design matrix. We say that X̃ is the model-X knockoff matrix for X if it has the following two properties:
1. (X, X̃) =_d (X, X̃)_swap(S);
2. X̃ ⊥⊥ y | X,
where S ⊂ H0 and swap(S) means interchanging the positions of the knockoff and original variables corresponding to the index set S. Because of properties 1 and 2, we have (X, X̃) | y =_d (X, X̃)_swap(S) | y; that is, the Xj with j ∈ S are conditionally independent of y. This result allows us to proceed with the knockoff method. However, this fact is generally not true in the regional quantile regression setting; we explain the reason in the next section.
4.3.2 Problem with the model-X knockoff for the regional quantile regression
In the standard linear model setting, we say that a variable xj is null if βj = 0, for j ∈ {1, . . . , p}; this is satisfied when we assume Proposition 2.2 of [9].
That is,
xj is null ⇔ βj = 0 ⇔ xj is conditionally independent of y.
For quantile regression, the same holds at a single quantile level, since
βj(τ) = 0 for all τ ∈ (0, 1) ⇔ xj is conditionally independent of y.
Therefore, the original knockoff method can be adapted directly to that case using the original response y. However, the method cannot be adapted directly to the regional case, because the absence of a signal within a region does not imply conditional independence. In particular,
βj(τ) = 0 for all τ ∈ ∆ does not guarantee that xj is conditionally independent of y;
that is, the left-hand side does not imply the right-hand side.
4.3.3 Model-X knockoff for the regional quantile regression
Because of the problem described in the previous section, we cannot directly apply the knockoff method to our model. To solve this problem, we propose the following proposition.
Proposition 1. Let ỹ ∈ R^n be constructed as
ỹi = Q_{yi|xi}(τ1)   if yi ≤ Q_{yi|xi}(τ1),
ỹi = yi              if Q_{yi|xi}(τ1) ≤ yi ≤ Q_{yi|xi}(τ2),
ỹi = Q_{yi|xi}(τ2)   if yi ≥ Q_{yi|xi}(τ2),
where xi collects the variables selected by the screening method, for i = 1, . . . , n. Then Xj is conditionally independent of ỹ if and only if βj(τ) = 0 for all τ ∈ ∆ = [τ1, τ2].
The proof of Proposition 1 is provided in Section 4.7.2.
Remark 1. In practice, the true conditional quantiles used as cutoffs in Proposition 1 are unknown, so we use estimated conditional quantiles instead. The conditional quantiles can be estimated by various methods; we use the non-crossing model of [6] and [37], which we call the non-crossing quantile lasso, because it guarantees non-crossing conditional quantile estimates for high-dimensional data across multiple quantile levels. This property ensures that Q̂_{yi|xi}(τ1) ≤ Q̂_{yi|xi}(τ2) for i = 1, . . . , n whenever τ1 ≤ τ2. We provide more details in the simulation studies of Section 4.4.
Proposition 1 restores the conditional-independence characterization on the region ∆. As discussed in Definition 1, this is the key property that needs to hold, and with it in place we can incorporate the model-X knockoff method.
There are various ways to construct knockoff variables. If prior information on the distribution of the design matrix is available, the exact construction can be used. In most cases, however, such prior knowledge is unavailable, so we use the second-order approximation: we construct X̃ so that (X, X̃) and (X, X̃)_swap(S) match in their first and second moments rather than in their full distributions. Clearly, E(X) = E(X̃). For the covariance structure, we write
Cov(X, X̃) = ( Σ             Σ − diag(s)
               Σ − diag(s)   Σ ),
where s is a p-dimensional vector chosen so that this joint covariance matrix is positive definite, and Cov(X) = Σ. When n > 2p, [3] introduced two methods to select the vector s:
1. Equi-correlated knockoffs: take sj = 2λmin(Σ) ∧ 1 for all j = 1, . . . , p, where ∧ denotes the minimum operator.
2. Semidefinite programme (SDP) knockoffs: take s to minimize Σ_j (1 − sj) subject to sj ≥ 0 for all j = 1, . . . , p and diag(s) ⪯ 2Σ.
For some special cases, such as high-dimensional or non-Gaussian designs, this second-order approximation may not work well; refinements that address these issues are discussed in [9] and [60].
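To make the second-order construction concrete, the following is a minimal sketch of sampling Gaussian second-order knockoffs with the equi-correlated choice of s. It is an illustration only: the function names are ours, the mean μ and covariance Σ are treated as known (in practice they would be estimated), and a small ridge term is added for numerical stability; this is not the reference implementation of [9].

```python
import numpy as np

def equicorrelated_s(Sigma):
    # Equi-correlated choice from the text: s_j = 2 * lambda_min(Sigma) ∧ 1.
    lam_min = np.linalg.eigvalsh(Sigma).min()
    return np.full(Sigma.shape[0], min(2.0 * lam_min, 1.0))

def sample_gaussian_knockoffs(X, mu, Sigma, s, seed=None):
    """Sample X_tilde row by row from the Gaussian conditional distribution
    implied by the joint covariance [[Sigma, Sigma - S], [Sigma - S, Sigma]]:
      mean = X - (X - mu) Sigma^{-1} S,  cov = 2S - S Sigma^{-1} S,  S = diag(s)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)              # Sigma^{-1} diag(s)
    cond_mean = X - (X - mu) @ Sigma_inv_S
    cond_cov = 2.0 * S - S @ Sigma_inv_S
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # stabilizing ridge
    return cond_mean + rng.standard_normal((n, p)) @ L.T
```

The conditional mean and covariance above are the usual Gaussian conditioning formulas applied to the joint covariance displayed earlier; with a valid s (for example, the equi-correlated choice) the resulting pair (X, X̃) matches the required first and second moments.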
A detailed treatment of these refinements is beyond the scope of this chapter, so we do not discuss them here.
4.3.4 FDR control for the regional quantile regression
To find important variables, we compute statistics Wj for j = 1, . . . , p. The statistic Wj compares the jth variable in X with its knockoff in X̃ and measures how important the variable is. We calculate Wj as
Wj(∆) = Σ_{l=1}^{L} wl ( |β̂j(τl)| − |β̃j(τl)| ),
where L denotes the number of quantile levels τl within ∆ and wl is the weight assigned to level τl. Here β̂j(τl) denotes the non-crossing quantile lasso estimator for Xj at level τl, and β̃j(τl) denotes the corresponding estimator for the knockoff X̃j at level τl. We omit the argument (∆) afterward whenever it is clear from the context. A value of Wj far above 0 suggests that the jth variable is active, whereas a value close to 0 suggests that it is inactive. For the null variables, we have the following lemma.
Lemma 12. The signs of the null Wj's are i.i.d. coin flips conditionally on (|W1|, . . . , |Wp|).
Lemma 12 is Lemma 3.3 in [9]; therefore, we omit the proof. With Lemma 12, for fixed t > 0, we get
# {j : Wj ≤ −t} ≥ # {null j : Wj ≤ −t} =_d # {null j : Wj ≥ t} .
The false discovery proportion (FDP) is
FDP(t) = # {null j : Wj ≥ t} / # {j : Wj ≥ t} ,
and we can estimate it conservatively by
FDP̂(t) = # {j : Wj ≤ −t} / # {j : Wj ≥ t} .
To control the FDR at level q, we set the threshold based on knockoff+ from [3]:
τ+ = min { t : (1 + # {j : Wj ≤ −t}) / # {j : Wj ≥ t} ≤ q } .
Setting Â = {j : Wj ≥ τ+}, we control the usual FDR; that is,
FDR = E [ |{j ∈ Â ∩ H0}| / (|Â| ∨ 1) ] ≤ q,
where ∨ denotes the maximum operator. As a result, if Proposition 1 and Lemma 12 hold, then by Theorem 3.4 in [9] we attain FDR control on the quantile region ∆.
4.4 Simulation Results
4.4.1 Screening performance
In this subsection, we use simulated examples to assess the finite-sample performance of the proposed method and compare it with related methods. The original sure independence screening paper uses the sample correlation between the response and each covariate; in our setting, it is natural to base this marginal measure on the quantile correlation instead. [40] introduced the sample estimate of the quantile correlation:
qcor̂_τ{y, x} = (1 / √((τ − τ^2) σ̂x^2)) · (1/n) Σ_{i=1}^{n} ψτ(yi − Q̂τ,y)(xi − x̄),
where ψτ(x) = τ − I(x < 0), x̄ = (1/n) Σ_{i=1}^{n} xi, and σ̂x^2 = (1/n) Σ_{i=1}^{n} (xi − x̄)^2. Also, Q̂τ,y = inf {y : Fn(y) ≥ τ} is the sample τ-th quantile of y1, . . . , yn, and Fn(y) = (1/n) Σ_{i=1}^{n} I(yi ≤ y) is the empirical distribution function. We denote this quantile correlation screening (QCOR, Method 1) as one of our screening methods. The other screening methods are well-known methods from the literature: model-free screening using the projection correlation (PCOR, Method 2) from [36] and quantile-adaptive screening (QaSIS, Method 3) from [25]. These are all model-free and flexible screening methods that are adaptable to various scenarios.
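Returning to Method 1, a minimal sketch of the quantile-correlation statistic defined above is given below; the NumPy implementation and function names are our own illustration of the formula, and the sample quantile is computed with NumPy's default definition rather than the inf-based definition stated above.

```python
import numpy as np

def quantile_correlation(y, x, tau):
    """Sample quantile correlation of [40]:
    qcor_tau{y, x} = (1/n) sum_i psi_tau(y_i - Q_hat) * (x_i - xbar)
                     / sqrt((tau - tau^2) * sigma_hat_x^2),
    with psi_tau(u) = tau - 1{u < 0} and Q_hat the sample tau-quantile of y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    q_hat = np.quantile(y, tau)
    psi = tau - (y - q_hat < 0).astype(float)
    num = np.mean(psi * (x - x.mean()))
    return num / np.sqrt((tau - tau**2) * x.var())

def qcor_screen(y, X, tau, top_k):
    # Rank covariates by the absolute quantile correlation and keep the top_k.
    scores = np.array([abs(quantile_correlation(y, X[:, j], tau))
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]
```

The same marginal-ranking idea underlies the comparisons below, with QCOR computed at a single level τ.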
We denote our method to be RQRS – ∆ (Method 4) for a model with a regional quantile level and denote RQRS – τ (Method 5) for a model with a singleton quantile level. For the simulation studies, we show multiple examples with various settings. For every simulation, we repeat every procedure 100 times, and within each replication, we rank the features in descending order by the above four screening methods and record the minimum model size (MMS) that contains all active features and True Positives. We report the 87 quantiles of these results for repetitions with cutoff percentages: 5%, 35%, 65%, 95%, the best to the worst. Throughout all examples, we generate design matrix X with N (0, Σ), where each i, jth component of Σ is defined by Σij = σ|i−j| for i, j = 1, . . . , p and σ = 0.5. The level for the targeting quantile region is set to ∆ = [0.40, 0.60], and we set τ = 0.5 for the QCOR and the singleton RCRS model. For the tuning parameters of our model, we use the quadratic B-spline function for the approximations and set L = 5. Example 1 (Linear Models) In this example, we consider linear models generated from the following linear model y = Xβ + ϵ where β = (15, 0p−5)′. Here 1s denotes the vector of s ones and 0s denotes the vector of s zeros. We set n = 200 and p = 1000. We vary the setting by changing the error distribution. We consider the following distributions: Example 1.1 : ϵi ∼ N (0, 1), i = 1, . . . , n Example 1.2 : ϵi ∼ Cauchy(0, 1), i = 1, . . . , n The results for Example 1.1 and Example 1.2 are presented in Table 4.1. It shows that every method performed well with linear models, even though we include a heavy tailed model. Example 2 (Nonlinear Models) As we mentioned before, we use the same settings for n, p, X, and ∆ as in Example 1. For these models, we consider ϵi ∼ N (0, 1). For We set n = 200 and p = 1000. Consider the 88 5% 35% 65% 95% Method 1 Example 1.1 (QCOR) Example 1.2 Method 2 Example 1.1 (PCOR) Example 1.2 Method 3 Example 1.1 (QaSIS) Example 1.2 Method 4 Example 1.1 (RQRS – ∆) Example 1.2 Method 5 Example 1.1 (RQRS – τ ) Example 1.2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 5 5 5 5 Table 4.1 MMS result table for Example 1 with 100 repetitions following nonlinear data-generating models: Example 2.1 : yi = 5xi1 + 2 sin(πxi2/2) + 2xi3I(xi3 > 0) + 2 exp(5xi4) + ϵi, i = 1, . . . , n Example 2.2 : yi = 3xi1 + 3x3 i2 + 3x−1 i3 + 5I(xi4 > 0) + ϵi, i = 1, . . . , n Here, I(·) denotes the indicator function. The results for Example 2.1 and Example 2.2 are presented in Table 4.2. Among the methods, the performance of nonlinear models was moderately good except for Method 3 (QaSIS) in Example 2.1. Even though this model is nonparametric, we see that it does not work on some complex nonlinear models. Method 4 also worked slightly less than the other methods, but it was not distinguishably worse. Example 3 (Heteroskedastic Error Model) We consider a heteroskedastic error here, where the error is multiplied by a covariate. We 89 5% 35% 65% 95% Method 1 Example 2.1 (QCOR) Example 2.2 Method 2 Example 2.1 (PCOR) Example 2.2 4 4 4 4 4 4 4 4 Method 3 Example 2.1 9.95 20 (QaSIS) Example 2.2 Method 4 Example 2.1 (RQRS – ∆) Example 2.2 Method 5 Example 2.1 (RQRS – τ ) Example 2.2 4 4 4 4 4 4 5 4 5 4 5 4 4 4 40 4 5.05 5 5 4 74.05 7 5.35 14.05 4 5 4 5 6.05 5.05 Table 4.2 MMS result table for Example 2 with 100 repetitions generate (cid:101)X ∼ N500(0, Σ) where Σ = σ|i−j|, σ = 0.5, X501 ∼ Ber(0.5). Further let X = [ (cid:101)X X501] ∈ Rn×501. 
Then we generate the responses with yi = xi1 − xi3 + ϵi where ϵi ∼ (1+xi501)ξi and ξi ∼ N (0, 1). We set n = 200, just like the former examples. Note that the methods should also select xi501 as an important covariate since it is incorporated with the error term. However, for this example, Method 3 is not working because of its limitation on the design matrix. For QCOR and the singleton RQRS, we let τ = 0.6 to see the difference. The result is presented in Table 4.3. As we can see in the result, PCOR presents the worst performance here. QCOR method is quantile adaptive, but it still performs worse than that of our singleton method. For this example, the regional method works more stably than the other models. 90 5% 35% 65% 95% Method 1 (QCOR – τ = 0.6) 18.9 123.65 328 446.5 Method 2 (PCOR) 100.5 314 438.05 492.1 Method 4 (RQRS – ∆) Method 5 (RQRS – τ = 0.6) 3 3 18.60 8 78 116 248.35 424 Table 4.3 MMS result table for Example 3 with 100 repetitions 4.4.2 FDR control performance In this section, we use simulated examples to numerically assess the FDR control of the proposed RQRS-Knockoff procedure. In practice, if crossing happens while we estimate the quantiles in Proposition 1, we cannot censor the response. Therefore during this simulation, we put the response to be original yi, if crossings happen. For FDR control performance, we vary simulation settings by changing models, the num- ber of variables, and observations and see how the statistical powers and FDR change. It is assumed that the dimension of data is successfully reduced by the screening method. We consider a multivariate linear regression model: yi = x′ iβ + ϵi, i = 1, . . . , n where xi ∼ M V N (µ, Σ), µ = 0p, Σ = Ip and ϵi ∼ N (0, 1). Here, β = (2 ∗ 15, −0.5 ∗ 15, 0, . . . , 0)′ ∈ Rp, and 1s denotes a vector of s ones. The target FDR is 0.2, and consider the quantile region to be (0.3, 0.7). The data is split into two disjoint groups. We define the number of observations for the first group as n1 and for the second group as n2. Using n1 observations, conditional quantiles are estimated on the edges of the region. We denote these two quantiles to be (cid:98)Qstart and (cid:98)Qend. Then n2 responses are censored by the estimated conditional quantiles. These are estimated by using the quantile lasso and non-crossing quantile lasso. By using n2 censored response as a new response variable, adopt the model-X 91 knockoff procedure. During the knockoff procedure, the quantile lasso method is also used for the estimation, and the tuning parameters are derived from cross-validation. We repeat 100 times to generate the results. We record FDR, Power, Mean Squared Errors (MSE) for both (cid:98)Qstart and (cid:98)Qend, and the average percentage of crossing that happened among repetition. We refer Oracle to be the one using the true conditional quantiles, QLR to be quantile lasso regression, and nX-QLR to be non-crossing quantile lasso regression. Models n1 Oracle 150 QLR 150 nX-QLR 150 n2 50 50 50 p FDR Power (cid:98)Qstart (cid:98)Qend Crossings (%) 100 0.162 0.828 - - - 100 0.186 0.719 0.590 0.560 6.7% 100 0.172 0.790 0.432 0.423 Oracle 100 100 100 0.184 1 - - QLR 100 100 100 0.255 0.958 1.003 0.986 12.4% nX-QLR 100 100 100 0.305 0.976 0.780 0.754 Oracle 300 100 100 0.161 QLR 300 100 100 0.182 1 1 - - 0.255 0.247 1.1% 0% - 0% - nX-QLR 300 100 100 0.175 0.999 0.195 0.192 0% Table 4.4 Result for FDR control of various settings with 100 repetitions The results are presented in Table 4.4. 
When n1 is large and n2 is small, every result for nX-QLR is better than the corresponding result for QLR in terms of FDR and the MSEs of both quantile estimates; for both methods the FDR is still controlled at the target level. When n1 is small and n2 is larger, the power increases for both methods, but the crossing rates and the FDR results deteriorate. When n1 and n2 both grow larger, all results improve, and the MSEs remain lower for nX-QLR.
4.5 Real Data Analysis
4.5.1 Obesity risk factors via body mass index and other dietary related variables for Non-Hispanic Black women
The body mass index (BMI) is a widely used indicator of several health risk factors. It is calculated as
BMI = weight (kg) / height² (m²).
Being either underweight or overweight can lead to weight-related health issues. Studies have shown that underweight individuals are at risk of malnutrition and compromised immune function [12], while overweight and obese individuals are more likely to develop chronic health conditions compared to those with a normal BMI [47, 19]. However, the relationship between risk factors and the distribution of BMI is complex. The lower and upper parts of the BMI distribution may be influenced by different covariates, and the effect size of each covariate may vary across quantiles of the BMI distribution. Classical mean-based regression models may not be sufficient to capture this relationship. We therefore focus on quantile regions for both the underweight and the overweight ranges and examine how much the selected important variables differ under our approach.
For the data analysis, we use the National Health and Nutrition Examination Survey (NHANES) dataset [10], specifically the data collected from 2011 to 2014. The NHANES dataset provides comprehensive information on demographics, socioeconomic status, and dietary habits. Since we are interested in variables relevant to obesity in the non-Hispanic Black (NHB) women population, we select dietary and health-related variables of potential importance for BMI.
4.5.2 Real data analysis results
There are a total of 93 variables and 383 observations after removing all observations with missing values. We set the response variable to be the BMI of this population. The aim of this study is to find important variables related to obesity in a high-dimensional setting, so we intentionally add 1000 fake variables to the dataset and examine whether our method still works well. We expect that, if the FDR control works properly, most of these fake variables will not be selected after the final step. To apply our procedure, we split the dataset into three disjoint groups (G1, G2, G3) and carry out each step on the corresponding group. The detailed steps are as follows:
• Step 1 (Screening): Using n1 = 153 observations (G1), we screen variables with our method (RQRS) from Section 4.2.1.
• Step 2 (Censoring): Using n2 = 130 observations (G2), we obtain the estimators β̂(∆1) and β̂(∆2) with the non-crossing quantile lasso. From these coefficients we obtain Q̂∆1(y|x′) and Q̂∆2(y|x′) and construct ỹ, where ∆ = [∆1, ∆2].
• Step 3 (Selection by Knockoff): Using n3 = 100 observations (G3), we perform FDR control with the regional quantile regression knockoff. The pre-specified FDR level is 0.2.
To compare the effects for the lower and higher BMI groups, we use two specific ∆ intervals, [0.15, 0.45] and [0.55, 0.85].
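As a concrete illustration of Steps 2 and 3, the sketch below censors the response at estimated conditional quantiles as in Proposition 1 and then applies the knockoff+ threshold of Section 4.3.4 to precomputed statistics W. It is a simplified illustration under our own naming conventions: the conditional quantile estimates are assumed to come from a non-crossing quantile lasso fit, and W is assumed to already contain the regional statistics Wj(∆) computed from the original and knockoff designs.

```python
import numpy as np

def censor_response(y, q_low, q_high):
    """Construct y_tilde as in Proposition 1: clip y_i to the estimated
    conditional quantiles Q_hat(tau_1 | x_i) and Q_hat(tau_2 | x_i).
    If crossing occurs (q_low > q_high), keep the original response."""
    y = np.asarray(y, dtype=float)
    q_low = np.asarray(q_low, dtype=float)
    q_high = np.asarray(q_high, dtype=float)
    y_tilde = np.minimum(np.maximum(y, q_low), q_high)
    crossed = q_low > q_high
    y_tilde[crossed] = y[crossed]
    return y_tilde

def knockoff_plus_selection(W, q=0.2):
    """Knockoff+ rule: tau_plus is the smallest t with
    (1 + #{j : W_j <= -t}) / #{j : W_j >= t} <= q; select {j : W_j >= tau_plus}."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        n_neg = np.sum(W <= -t)
        n_pos = np.sum(W >= t)
        if n_pos > 0 and (1 + n_neg) / n_pos <= q:
            return np.where(W >= t)[0], t
    return np.array([], dtype=int), np.inf
```

In the analysis above, censor_response would be applied with the quantiles estimated on G2 at the endpoints ∆1 and ∆2, and knockoff_plus_selection to the statistics computed on G3 with the pre-specified level q = 0.2.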
The results are in Table 4.5 and Table 4.6.
Screened variables for ∆ = [0.15, 0.45]: Beta-cryptoxanthin, Octadecatetraenoic, Arm length, Arm circumference, Waist circumference, Abdominal, Creatinine, HDL Cholesterol, Triglyceride, Serum copper, Glycohemoglobin, Fasting glucose, 42, 96, 157, 162, 227, 292, 372, 503, 515, 516, 530, 582, 664
Screened variables for ∆ = [0.55, 0.85]: Beta-cryptoxanthin, Docosenoic, Arm length, Arm circumference, Waist circumference, Abdominal, Creatinine, HDL Cholesterol, Triglyceride, Serum copper, Glycohemoglobin, Fasting glucose, 29, 137, 162, 207, 258, 317, 502, 536, 582, 598, 657, 691, 738, 800, 834, 909, 929, 963
Table 4.5 Screened variables selected by RQRS for the two quantile regions, where the response variable is the BMI of NHB women
FDR controlled variables for ∆ = [0.15, 0.45]: Arm length, Arm circumference, Waist circumference, Abdominal, Triglyceride, Fasting glucose
FDR controlled variables for ∆ = [0.55, 0.85]: Arm length, Arm circumference, Waist circumference, Abdominal, Triglyceride
Table 4.6 FDR-controlled variables selected by RQRS-Knockoff for the two quantile regions, where the response variable is the BMI of NHB women
In Table 4.5, the numbers stand for the indices of the fake variables. This shows that screening alone may admit many false discoveries. The problem could be reduced by using a more conservative cutoff, but we instead proceed with our FDR-controlling procedure. As Table 4.6 shows, all of the fake variables chosen by the screening procedure are removed after the FDR-controlling step. The selected variables related to body measurements are sensible, since they are all closely related to BMI. Triglyceride, which is selected for both regions, is a type of fat found in the blood and is an important obesity-related measure; several studies, such as [13, 64], use both BMI and triglyceride as measures of obesity, which supports its relevance to BMI. Our study also reveals that the effect of fasting glucose differs across the BMI groups, since it is selected only in the lower BMI group. This result aligns
Our findings indicate that the regional quantile regression knockoff method offers a signif- icant improvement over existing approaches, providing a more accurate and stable variable selection process. This advancement opens new avenues for research and application in various scientific domains where ultra-high dimensional data is prevalent. Additionally, our work lays the groundwork for integrating post-selection inference tech- niques, which can provide valid statistical inference after variable selection, further enhancing the utility of our approach. Moreover, sample splitting [11], an alternative to the knockoff method, could be adapted to control FDR, offering a simpler yet effective strategy for robust variable selection. In conclusion, the proposed regional quantile regression knockoff method represents a 96 substantial step forward in high-dimensional variable selection, offering a robust, reliable, and comprehensive solution that balances the need for accuracy and FDR control. Future research could explore the incorporation of post-selection inference techniques and the adaptation of sample-splitting methods, potentially broadening the applicability and impact of our approach in high-dimensional data analysis. 4.7 Proofs 4.7.1 Proof of Theorem 14 We summarize the notations used throughout the proof here for convenience. For a = (a1, . . . , ap) ∈ Rp, denote ∥a∥q = ((cid:80)p i=1 |ai|q) 1 q for q ∈ [1, ∞) and ∥a∥∞ = max1≤i≤p |ai|. Given a square matrix A = (aij) ∈ Rp×p, λmax(A) and λmin(A) represent its largest and smallest eigenvalues respectively. For a general matrix A = (aij) ∈ Rp×q, ∥A∥2 denotes its spectral norm; ∥A∥max = maxij |aij|, ∥A∥F = (cid:113)(cid:80) i,j a2 ij . For a, b ∈ R, a ∧ b = min(a, b), a ∨ b = max(a, b). For a set A, 1A(·) is the usual indicator function. Moreover, an ≲ bn (an ≳ bn) means there exists some constant C > 0 such that an ≤ Cbn (an ≥ Cbn) for all n; thus an ≲ bn (an ≳ bn) is equivalent to an = O(bn) (an = Ω(bn)); an ≍ bn if and only if an ≲ bn and bn ≳ an; an ≫ bn means bn = o(an). Theorem 14 is a direct consequence of Lemmas 13 and 14. Hence, in the rest of the proof, we focus on these two lemmas. Throughout the proof, we use C1, C2, . . . to denote universal constants and use D1, D2, . . . , ... to denote constants that may only depend on the constants Kx, ¯f , ¯f , ¯f ′, κ, q from Conditions 1-5. An explicit (though not optimal) dependence of {Dj, j = 1, 2, . . .} on the aforementioned constants can be tracked down. However, since it does not provide much more insight, we will often not present the explicit forms of {Dj, j = 1, 2, . . .}, and this will greatly help streamline the proofs. The constants {Cj, Dj, j = 1, 2, . . .} may vary from lines to lines. Lemma 13. Assume cn → ∞ and N 3/2n−1/2cn = o(1). With probability at least 1 − 97 n(e−D1c2 n + N 2e− D2L N 2 ), the following holds: ∥(cid:98)bj − b∗ j ∥2 ≤ D3N n−1/2cn, ∀j ∈ M(∆). Proof. For notational simplicity, we will drop the subscript j throughout the proof. 
Let θ = (a, b) and introduce the following notations: L (cid:88) n (cid:88) (cid:0)yi − xiB(τℓ)T b − B(τℓ)T a(cid:1) ρτℓ i=1 ℓ=1 L (cid:88) Eρτℓ (cid:0)y − xB(τℓ)T b − B(τℓ)T a(cid:1) ℓn(θ) := 1 nL ℓ(θ) := vi(θ) := 1 L 1 L ℓ=1 L (cid:88) ℓ=1 (cid:104) Eρτℓ (cid:0)yi − xiB(τℓ)T b − B(τℓ)T a(cid:1) − Eρτℓ (cid:0)yi − xiB(τℓ)T b∗ − B(τℓ)T a∗(cid:1)(cid:105) A standard argument based on convexity shows that (cid:16) P ∥θ − θ∗∥2 ≥ δ (cid:32) (cid:17) ≤ P inf ∥θ−θ∗∥2=δ ℓ(θ) − ℓ(θ∗) ≤ sup ∥θ−θ∗∥2≤δ (cid:12) (vi(θ) − Evi(θ)) (cid:12) (cid:12) (cid:33) (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (4.4) We first lower bound inf ∥θ−θ∗∥2=δ ℓ(θ)−ℓ(θ∗). Note that θ∗ ∈ arg minθ ℓ(θ) according to (4.2). This optimality together with the identity of Knight yields that for θ satisfying ∥θ −θ∗∥2 = δ, ℓ(θ) − ℓ(θ∗) = 1 L L (cid:88) E (cid:90) xBT (τℓ)(b∗−b)+B(τℓ)(a∗−a) (cid:16) ℓ=1 0 Fy|x(xB(τℓ)T b∗ + B(τℓ)T a∗ + t) − Fy|x(xB(τℓ)T b∗ + B(τℓ)T a∗) (cid:17) dt xBT (τℓ)(b∗ − b) + B(τℓ)(a∗ − a) (cid:17)2 L (cid:88) (cid:16) E ℓ=1 − ¯f ′ 6L L (cid:88) ℓ=1 (cid:12)xBT (τℓ)(b∗ − b) + B(τℓ)(a∗ − a)(cid:12) E(cid:12) (cid:12) 3 (cid:17) δN 1/2(1 + Kx) · 1 L (cid:17) δN 1/2(1 + Kx) λmin L (cid:88) (BT (τℓ)(b∗ − b))2 + (B(τℓ)(a∗ − a))2(cid:17) (cid:16) ℓ=1 (cid:16) 1 L L (cid:88) ℓ=1 (cid:17) B(τℓ)BT (τℓ) δ2 98 (4.5) ≥ ≥ ≥ f 2L (cid:16)f 2 (cid:16)f 2 − − ¯f ′ 6 ¯f ′ 6 We now turn to upper bounding G := sup∥θ−θ∗∥2≤δ conditioning on {τℓ}, the vi(θ)’s are i.i.d. random functions, and 1 n (cid:12) (cid:12) (cid:12) (cid:80)n (cid:12) i=1(vi(θ) − Evi(θ)) (cid:12) (cid:12) . Note that sup ∥θ−θ∗∥2≤δ |vi(θ)| ≤ (1 + Kx)δ (cid:118) (cid:117) (cid:117) (cid:116)N λmax (cid:16) 1 L L (cid:88) ℓ=1 B(τℓ)BT (τℓ) (cid:17) := (1 + Kx)δCN τ Hence, we can apply bounded difference concentration inequality to obtain (cid:16) P G ≥ E(G(cid:12) (cid:12){τℓ}) + t(cid:12) (cid:12){τℓ} (cid:17) ≤ exp (cid:16)−D1nt2 δ2C 2 N τ (cid:17) , ∀t > 0. (4.6) Moreover, we apply symmetrization and contraction to derive that conditional on {τℓ}, E(G) ≤ 2E (cid:16) sup ∥θ−θ∗∥2≤δ (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:17) (cid:12) (cid:12) ϵivi(θ) (cid:12) ≤ 2 L L (cid:88) (cid:16) E ℓ=1 sup ∥θ−θ∗∥2≤δ (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:2)Eρτℓ ϵi (cid:0)yi − xiB(τℓ)T b − B(τℓ)T a(cid:1) − Eρτℓ (cid:0)yi − xiB(τℓ)T b∗ − B(τℓ)T a∗(cid:1)(cid:3)(cid:12) (cid:12) (cid:12) (cid:17) ≤ ≤ 4 L 4 L ℓ=1 L (cid:88) ℓ=1 L (cid:88) (cid:16) E sup ∥θ−θ∗∥2≤δ (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:0)xi(B(τℓ)T (b − b∗) + B(τℓ)T (a − a∗))(cid:1)(cid:12) (cid:12) (cid:12) (cid:17) ϵi (cid:32)(cid:118) (cid:117) (cid:117) (cid:12) E (cid:116) (cid:12) 1 n n (cid:88) i=1 ϵixi (cid:12) (cid:12) 2 + (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:33) (cid:12) (cid:12) ϵi 2 · ∥B(τℓ)∥2δ ≤ 4δCN τ · E (cid:16)(cid:12) (cid:12) 1 n n (cid:88) i=1 ϵixi (cid:12) + (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:17) (cid:12) (cid:12) ϵi ≤ D2CN τ δn−1/2, (4.7) where in the fourth and fifth inequalities we have used Cauchy–Schwarz inequality and Jensen’s inequality. From Lemma 16 we obtain P(CN τ ≤ D3) ≥ 1 − 2N 2e− D4L N 2 This result combined with (4.6) and (4.7) shows that (cid:16) (cid:17) G ≤ D5δn−1/2 + t P ≥ 1 − e −D6nt2 δ2 − 2N 2e− D4L N 2 . (4.8) Finally, putting together (4.4), (4.5) and (4.8) and applying a union bound completes the proof. 99 Lemma 14. Assume N −d = o(n−γ). 
Then, with probability at least 1 − 4 L · (cid:80) j∈M(∆) (cid:82) ((cid:82) ∆ |fj (τ )|2dτ ∆ |fj (τ )|dτ )2 , it holds that 1 L L (cid:88) (B(τℓ)T b∗ j )2 ≥ ℓ=1 κ2 4 n−2γ, ∀j ∈ M(∆). Proof. It is known (e.g., [59]) that there exist a0, b0 ∈ RN and a constant C1 > 0 such that |fj(τ ) − B(τ )T b0 j | ≤ C1N −d, sup τ sup τ |gj(τ ) − B(τ )T a0 j | ≤ C1N −d. (4.9) According to the definitions (4.3) and (4.2), we have 1 L L (cid:88) ℓ=1 Eρτℓ(y − xjfj(τℓ) − gj(τℓ)) ≤ ≤ 1 L 1 L L (cid:88) ℓ=1 L (cid:88) ℓ=1 Eρτℓ(y − xjB(τℓ)T b∗ j − B(τℓ)T a∗ j ) Eρτℓ(y − xjB(τℓ)T b0 j − B(τℓ)T a0 j ). (4.10) Then using the identity of Knight like in (4.13), we obtain 0 ≤ ≤ 1 L 1 L L (cid:88) ℓ=1 L (cid:88) ℓ=1 (cid:2)Eρτℓ(y − xjB(τℓ)T b0 j − B(τℓ)T a0 j ) − Eρτℓ(y − xjfj(τℓ) − gj(τℓ))(cid:3) (cid:16) ¯f 2 E(xj(B(τℓ)T b0 j − fj(τℓ)) + B(τℓ)T a0 j − gj(τℓ))2 + ¯f ′ 6 ≤ D1N −2d when N is large, E|xj(B(τℓ)T b0 j − fj(τℓ)) + B(τℓ)T a0 j − gj(τℓ)|3(cid:17) where the last inequality is due to (4.9). This result combined with (4.10) and Lemma 15 shows that when N is large, f 4 (cid:16) 1 L L (cid:88) ℓ=1 |B(τℓ)T b∗ j − fj(τℓ)| (cid:17)2 ≤ 1 L L (cid:88) ℓ=1 (cid:2)Eρτℓ(y − xjB(τℓ)T b∗ j − B(τℓ)T a∗ j ) − Eρτℓ(y − xjfj(τℓ) − gj(τℓ))(cid:3) ≤D1N −2d, 100 which implies 1 L (cid:80)L ℓ=1 |B(τℓ)T b∗ j − fj(τℓ)| ≤ D2N −d. We thus have (cid:118) (cid:117) (cid:117) (cid:116) 1 L L (cid:88) ℓ=1 (B(τℓ)T b∗ j )2 ≥ ≥ ≥ = 1 L 1 L 1 L L (cid:88) ℓ=1 L (cid:88) ℓ=1 L (cid:88) ℓ=1 (cid:90) 1 |∆| ∆ |B(τℓ)T b∗ j | |fj(τℓ)| − 1 L L (cid:88) ℓ=1 |B(τℓ)T b∗ j − fj(τℓ)| |fj(τℓ)| − D2N −d |fj(τ )|dτ − D2N −d + L (cid:88) ℓ=1 1 L (cid:124) |fj(τℓ)| − 1 |∆| (cid:90) ∆ |fj(τ )|dτ (4.11) (cid:123)(cid:122) :=ZL j (cid:125) Now, using Chebyshev’s inequality together with a union bound gives us P (cid:16)(cid:12) (cid:12)Z L j (cid:12) (cid:12) ≥ 1 2|∆| (cid:90) ∆ |fj(τ )|dτ, ∃j ∈ M(∆) (cid:17) ≤ 4 L · (cid:88) j∈M(∆) (cid:82) ((cid:82) ∆ |fj(τ )|2dτ ∆ |fj(τ )|dτ )2 (4.12) Combining (4.11), (4.12) and Condition 5 completes the proof. Lemma 15. Under Conditions 3-4, it holds that Eρτ (y − xjf (τ ) − g(τ )) − Eρτ (y − xjfj(τ ) − gj(τ ))   if E|f (τ ) − fj(τ )| ≤ q 2 4 (E|f (τ ) − fj(τ )|)2 f ≥  f 4 (qE|f (τ ) − fj(τ )| − q2 4 ) if E|f (τ ) − fj(τ )| > q 2 Here, the expectation E(·) is allowed to be taken additionally with respect to τ that is inde- pendent from (y, xj). Proof. The proof is largely motivated by the proof of Lemma 4 in [5]. We use E−τ to denote the expectation taken only with respect to (y, xj). Using the identity of Knight, we first 101 obtain E−τ ρτ (y − xjf (τ ) − g(τ )) − E−τ ρτ (y − xjfj(τ ) − gj(τ )) = E−τ (cid:16)(cid:2)xj(fj(τ ) − f (τ )) + gj(τ ) − g(τ )(cid:3) · (cid:2)τ − 1(y ≤ xjfj(τ ) + gj(τ ))(cid:3)(cid:17) (cid:90) xj (−fj (τ )+f (τ ))−gj (τ )+g(τ ) + E−τ 0 ≥ f 2 E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2 − (cid:0)Fy|xj (xjfj(τ ) + gj(τ ) + t) − Fy|xj (xjfj(τ ) + gj(τ ))(cid:1)dt ¯f ′ 6 E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|3, (4.13) where in the last step we have used Taylor expansion and Condition 3 to bound the second term, and the first term equals zero due to the definition of fj(τ ), gj(τ ) in (4.3). When E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2 ≤ q2, it implies that E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|3 ≤ 6f 4 ¯f ′ · E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2. Plugging this result into (4.13) yields E−τ ρτ (y − xjf (τ ) − g(τ )) − E−τ ρτ (y − xjfj(τ ) − gj(τ )) f 4 · E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2. 
Lemma 15. Under Conditions 3–4, it holds that
\[
E\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge
\begin{cases}
\dfrac{f}{4}\big(E|f(\tau) - f_j(\tau)|\big)^2 & \text{if } E|f(\tau) - f_j(\tau)| \le \dfrac{q}{2},\\[6pt]
\dfrac{f}{4}\Big(q\,E|f(\tau) - f_j(\tau)| - \dfrac{q^2}{4}\Big) & \text{if } E|f(\tau) - f_j(\tau)| > \dfrac{q}{2}.
\end{cases}
\]
Here, the expectation $E(\cdot)$ is allowed to be taken additionally with respect to $\tau$, which is independent of $(y, x_j)$.

Proof. The proof is largely motivated by the proof of Lemma 4 in [5]. We use $E_{-\tau}$ to denote the expectation taken only with respect to $(y, x_j)$. Using the identity of Knight, we first obtain
\begin{align}
&E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \nonumber\\
&\quad= E_{-\tau}\Big(\big[x_j(f_j(\tau) - f(\tau)) + g_j(\tau) - g(\tau)\big]\cdot\big[\tau - \mathbf{1}\big(y \le x_j f_j(\tau) + g_j(\tau)\big)\big]\Big) \nonumber\\
&\qquad+ E_{-\tau}\int_{0}^{x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)}\Big(F_{y|x_j}\big(x_j f_j(\tau) + g_j(\tau) + t\big) - F_{y|x_j}\big(x_j f_j(\tau) + g_j(\tau)\big)\Big)\,dt \nonumber\\
&\quad\ge \frac{f}{2}\,E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2 - \frac{\bar{f}'}{6}\,E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^3, \tag{4.13}
\end{align}
where in the last step we have used a Taylor expansion and Condition 3 to bound the second term, and the first term equals zero due to the definition of $f_j(\tau), g_j(\tau)$ in (4.3).

When $E_{-\tau}|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)|^2 \le q^2$, it implies that
\[
E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^3 \le \frac{6f}{4\bar{f}'}\cdot E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2.
\]
Plugging this result into (4.13) yields
\[
E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge \frac{f}{4}\cdot E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2. \tag{4.14}
\]
When $E_{-\tau}|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)|^2 > q^2$, define $\bar{f}(\tau) = t f(\tau) + (1-t) f_j(\tau)$ and $\bar{g}(\tau) = t g(\tau) + (1-t) g_j(\tau)$, with
\[
t = \frac{q}{\sqrt{E_{-\tau}|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)|^2}} \in [0, 1].
\]
Using the convexity of the check loss, we have
\begin{align}
&E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \nonumber\\
&\quad\ge \frac{1}{t}\cdot\Big(E_{-\tau}\rho_\tau\big(y - x_j \bar{f}(\tau) - \bar{g}(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big)\Big) \nonumber\\
&\quad\ge \frac{f q}{4}\sqrt{E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2}, \tag{4.15}
\end{align}
where the second inequality holds by invoking (4.14). Denote $h_1(t) := \frac{f}{4}\min(t^2, tq)$ and
\[
h_2(t) := \begin{cases} \dfrac{f}{4}\, t^2 & \text{if } t \le \dfrac{q}{2},\\[6pt] \dfrac{f}{4}\Big(qt - \dfrac{q^2}{4}\Big) & \text{if } t > \dfrac{q}{2}. \end{cases}
\]
It is straightforward to verify that (4.14) and (4.15) together imply
\[
E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge h_1\big(|f(\tau) - f_j(\tau)|\big) \ge h_2\big(|f(\tau) - f_j(\tau)|\big).
\]
Moreover, $h_2(t)$ is convex over $t \in [0, \infty)$. Hence, we take the expectation with respect to $\tau$ in the above inequality and apply Jensen's inequality to conclude
\[
E\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge h_2\big(E|f(\tau) - f_j(\tau)|\big).
\]

Lemma 16. The B-spline basis vector $B(\tau) = (B_1(\tau), \ldots, B_N(\tau))$ satisfies the following properties:

(i) There exist two constants $b_1, b_2$ such that
\[
b_1 N^{-1} \le \lambda_{\min}\big(E B(\tau)B(\tau)^T\big) \le \lambda_{\max}\big(E B(\tau)B(\tau)^T\big) \le b_2 N^{-1},
\]
where $\tau \sim \mathrm{Unif}(\Delta)$.

(ii) Suppose $\tau_1, \ldots, \tau_L \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(\Delta)$. Define $H := \frac{1}{L}\sum_{\ell=1}^{L} B(\tau_\ell)B(\tau_\ell)^T - E B(\tau)B(\tau)^T$. There exists a constant $b_3 > 0$ such that
\[
P\big(\|H\| \ge t\big) \le 2N^2 \exp\Big(-\frac{t^2 L}{2N(b_3 + t/3)}\Big), \qquad \forall t > 0.
\]
Proof. These results are directly taken from Appendix A.1 in [25].

4.7.2 Proof of Proposition 1

Proof. We consider the simple single-observation case where $x' = (x_1, x_2)$. By assumption, $x$ consists of the variables retained by the screening method. We write $F_x$ for the cumulative distribution function of a random variable $x$.

First, $\beta_1(\tau) = 0$ for all $\tau \in \Delta$ holds if and only if $Q_{y|x_2}(\tau) = Q_{y|x_1,x_2}(\tau)$ for all $\tau \in \Delta$. Then, by the definition of the conditional quantile, $Q_{y|x}(\tau) = \inf\{a : F_{y|x}(a) \ge \tau\}$, we have
\[
F_{y|x_1,x_2}\big(Q_{y|x_1,x_2}(\tau)\big) = F_{y|x_2}\big(Q_{y|x_2}(\tau)\big) = F_{y|x_2}\big(Q_{y|x_1,x_2}(\tau)\big) \tag{4.16}
\]
for $\tau \in \Delta$. Since $y = \tilde{y}$ when $Q_{y|x_1,x_2}(\tau_1) \le y \le Q_{y|x_1,x_2}(\tau_2)$, it follows that for any $a$ with $Q_{y|x_1,x_2}(\tau_1) \le a \le Q_{y|x_1,x_2}(\tau_2)$ we have $\tau_1 \le F_{\tilde{y}|x_1,x_2}(a) = F_{y|x_1,x_2}(a) \le \tau_2$. Therefore, this and (4.16) imply the conditional independence of $\tilde{y}$ and $x_1$ given $x_2$.

Conversely, since $\tilde{y}$ is conditionally independent of $x_1$ given $x_2$, we have $F_{\tilde{y}|x_2}(a) = F_{\tilde{y}|x_1,x_2}(a)$ for $Q_{\tilde{y}|x_1,x_2}(\tau_1) \le a \le Q_{\tilde{y}|x_1,x_2}(\tau_2)$. Since the conditional quantile function and the conditional cumulative distribution function are one-to-one, this implies that $F^{-1}_{\tilde{y}|x_2}(a) = F^{-1}_{\tilde{y}|x_1,x_2}(a)$, which holds if and only if $\beta_1(\tau) = 0$ for all $\tau \in \Delta$. This finishes the proof for this case; the same argument extends directly to $X_j$ and $X_{-j}$, which completes the proof.

CHAPTER 5

FUTURE STUDIES

In the future, several avenues can be explored to expand and enhance the methodologies developed in this thesis. In Chapters 2 and 3 we consider only univariate trend filtering models. However, these models can be generalized to additive models in which the number of additive components is finite, and [56] show that additive trend filtering can be extended further to settings where the number of additive components diverges to infinity. Our work can therefore be expanded to such settings, potentially offering improved performance for more complex data structures.
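To make this direction concrete, one natural objective for an additive partial linear trend filtering model is sketched below; this assumes the additive extension follows the formulation of [56], and the penalty orders, tuning parameters, and identifiability constraints would still need to be specified:
\[
\min_{\beta,\ \theta_1,\ldots,\theta_K}\ \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - x_i^{T}\beta - \sum_{k=1}^{K}\theta_{k,i}\Big)^{2} \;+\; \lambda_0\|\beta\|_1 \;+\; \sum_{k=1}^{K}\lambda_k\big\|D^{(r+1)}_{k}\theta_k\big\|_1,
\]
where $\theta_k \in \mathbb{R}^n$ collects the evaluations of the $k$-th nonparametric component at the observed values of its covariate (in increasing order of that covariate), $D^{(r+1)}_{k}$ is the associated discrete difference operator of order $r+1$, and $K$ is the number of additive components; the quantile analogue replaces the squared loss by the check loss $\rho_\tau$.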
Both the partial linear and quantile trend filtering models can benefit from the adoption of other penalization methods, such as SCAD (Smoothly Clipped Absolute Deviation) and MCP (Minimax Concave Penalty). These penalties can improve variable selection performance, making our approach more robust and efficient in identifying significant predictors.

Currently, there is no publicly available R package for partial linear quantile trend filtering or quantile trend filtering. [22] propose an ADMM algorithm for quantile regression that is fast and applicable to high-dimensional data, and ADMM algorithms have already been deployed successfully for trend filtering [53]. By employing an ADMM algorithm for our models, we expect to achieve faster and more efficient computations; a minimal sketch of such an iteration is given at the end of this chapter. Making this implementation publicly available would allow broader use and validation of our methods.

For the partial linear models and the FDR control via regional quantile screening method, deploying a post-selection inference procedure after feature selection could further strengthen our methods. This would pair inference with our estimators, providing more reliable and interpretable results.

Instead of using the model-X knockoff for controlling the FDR, we can explore the use of the sample-splitting method [11] with mirror statistics. This approach may offer additional flexibility and robustness in FDR control, especially in ultra-high dimensional settings.

By addressing these future research directions, we can further enhance the capabilities and applications of the methodologies developed in this thesis, providing even more robust and efficient tools for high-dimensional heterogeneous data analysis.
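As referenced above, the following is a minimal, self-contained R sketch of an ADMM iteration for univariate quantile trend filtering. It only illustrates the splitting idea and is not the algorithm of [22] or [53]; the step size, penalty level, iteration count, and the simulated data in the usage example are placeholder choices.

```r
## Minimal ADMM sketch for quantile trend filtering:
##   minimize over theta:  sum_i rho_tau(y_i - theta_i) + lambda * || D theta ||_1,
## where D is the discrete difference operator of order k + 1.
## Splitting: r = y - theta (residual block) and z = D theta (difference block).
quantile_tf_admm <- function(y, tau = 0.5, k = 1, lambda = 1, rho = 1, iters = 200) {
  n <- length(y)
  D <- diff(diag(n), differences = k + 1)   # (n - k - 1) x n difference matrix
  A <- diag(n) + crossprod(D)               # fixed matrix for the theta-update
  A_chol <- chol(A)                         # factor once, reuse every iteration

  theta <- y                                # primal variable (fitted values)
  r <- rep(0, n); z <- rep(0, nrow(D))      # auxiliary blocks
  u <- rep(0, n); w <- rep(0, nrow(D))      # scaled dual variables

  # proximal operator of t * rho_tau: shifted soft-thresholding of the residuals
  prox_check <- function(v, t, tau) {
    ifelse(v > t * tau, v - t * tau,
           ifelse(v < -t * (1 - tau), v + t * (1 - tau), 0))
  }
  soft <- function(v, t) sign(v) * pmax(abs(v) - t, 0)  # soft-thresholding

  for (it in seq_len(iters)) {
    # theta-update: solve (I + D'D) theta = (y - r + u) + D'(z - w)
    rhs   <- (y - r + u) + drop(crossprod(D, z - w))
    theta <- drop(backsolve(A_chol, forwardsolve(t(A_chol), rhs)))
    # r-update: proximal step on the check loss
    r <- prox_check(y - theta + u, 1 / rho, tau)
    # z-update: soft-threshold the penalized differences
    z <- soft(drop(D %*% theta) + w, lambda / rho)
    # dual updates
    u <- u + (y - theta - r)
    w <- w + (drop(D %*% theta) - z)
  }
  theta
}

## Hypothetical usage on simulated piecewise-linear data
set.seed(1)
y   <- c(seq(0, 2, length.out = 50), seq(2, -1, length.out = 50)) + rnorm(100, sd = 0.3)
fit <- quantile_tf_admm(y, tau = 0.5, k = 1, lambda = 5)
```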
BIBLIOGRAPHY

[1] Taylor Arnold, Veeranjaneyulu Sadhanala, and Ryan Tibshirani. glmgen: Fast algorithms for generalized lasso problems. R package version 0.0.3. 2014.
[2] Rina Foygel Barber and Emmanuel J. Candès. “A knockoff filter for high-dimensional selective inference”. In: The Annals of Statistics 47.5 (2019), pp. 2504–2537.
[3] Rina Foygel Barber and Emmanuel J. Candès. “Controlling the false discovery rate via knockoffs”. In: The Annals of Statistics 43.5 (2015), pp. 2055–2085.
[4] Emre Barut, Jianqing Fan, and Anneleen Verhasselt. “Conditional Sure Independence Screening”. In: Journal of the American Statistical Association 111.515 (2016), pp. 1266–1277.
[5] Alexandre Belloni and Victor Chernozhukov. “ℓ1-penalized quantile regression in high-dimensional sparse models”. In: The Annals of Statistics 39.1 (2011), pp. 82–130.
[6] Howard D. Bondell, Brian J. Reich, and Huixia Wang. “Noncrossing quantile regression curve estimation”. In: Biometrika 97.4 (Aug. 2010), pp. 825–838.
[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Feb. 2013.
[8] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[9] Emmanuel Candès et al. “Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.3 (2018), pp. 551–577.
[10] Centers for Disease Control and Prevention (CDC) and National Center for Health Statistics (NCHS). “National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention”. In: (2011–2018).
[11] Chenguang Dai et al. “False discovery rate control via data splitting”. In: Journal of the American Statistical Association 118.544 (2023), pp. 2503–2520.
[12] S. Eli et al. “Weight loss and BMI criteria in GLIM’s definition of malnutrition is associated with postoperative complications following abdominal resections – Results from a National Quality Registry”. In: Clinical Nutrition 39.5 (2020), pp. 1593–1599.
[13] Leay-Kiaw Er et al. “Triglyceride glucose-body mass index is a simple and clinically useful surrogate marker for insulin resistance in nondiabetic individuals”. In: PLoS ONE 11.3 (2016), e0149731.
[14] Jianqing Fan, Yingying Fan, and Emre Barut. “Adaptive robust variable selection”. In: The Annals of Statistics 42.1 (2014), pp. 324–351.
[15] Jianqing Fan, Yang Feng, and Rui Song. “Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models”. In: Journal of the American Statistical Association 106.494 (2011), pp. 544–557.
[16] Jianqing Fan and Runze Li. “Variable selection via nonconcave penalized likelihood and its oracle properties”. In: Journal of the American Statistical Association 96.456 (2001), pp. 1348–1360.
[17] Jianqing Fan and Jinchi Lv. “Sure independence screening for ultrahigh dimensional feature space”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008), pp. 849–911.
[18] Jianqing Fan and Rui Song. “Sure independence screening in generalized linear models with NP-dimensionality”. In: The Annals of Statistics 38.6 (2010), pp. 3567–3604.
[19] K. M. Flegal et al. “Association of all-cause mortality with overweight and obesity using standard body mass index categories: a systematic review and meta-analysis”. In: JAMA 309.1 (2013), pp. 71–82.
[20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent”. In: Journal of Statistical Software 33.1 (2010), pp. 1–22.
[21] Sara van de Geer. “On the uniform convergence of empirical norms and inner products, with application to causal inference”. In: Electronic Journal of Statistics 8 (2014), pp. 543–574.
[22] Yuwen Gu et al. “ADMM for high-dimensional sparse penalized quantile regression”. In: Technometrics 60.3 (2018), pp. 319–331.
[23] Wolfgang Härdle, Hua Liang, and Jiti Gao. Partially linear models. Springer Science & Business Media, 2000.
[24] Xuming He and Peide Shi. “Bivariate Tensor-Product B-Splines in a Partly Linear Model”. In: Journal of Multivariate Analysis 58.2 (1996), pp. 162–181.
[25] Xuming He, Lan Wang, and Hyokyoung Grace Hong. “Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data”. In: The Annals of Statistics 41.1 (2013), pp. 342–369.
[26] Ruifeng Hu et al. “ANCO-GeneDB: annotations and comprehensive analysis of candidate genes for alcohol, nicotine, cocaine and opioid dependence”. In: Database 2018 (Nov. 2018), bay121.
[27] Peter J Huber. “Robust Estimation of a Location Parameter”. In: The Annals of Mathematical Statistics 35.1 (1964), pp. 73–101.
[28] David R Hunter and Kenneth Lange. “A tutorial on MM algorithms”. In: The American Statistician 58.1 (2004), pp. 30–37.
[29] Liewen Jiang, Howard D Bondell, and Huixia Judy Wang. “Interquantile shrinkage and variable selection in quantile regression”. In: Computational Statistics & Data Analysis 69 (2014), pp. 208–219.
[30] Mi-Hyang Jung et al. “Complex interaction of fasting glucose, body mass index, age and sex on all-cause mortality: a cohort study in 15 million Korean adults”. In: Diabetologia 63.8 (2020), pp. 1616–1625.
[31] Seung-Jean Kim et al. “ℓ1 trend filtering”. In: SIAM Review 51.2 (2009), pp. 339–360.
[32] Roger Koenker and Gilbert Bassett. “Regression quantiles”. In: Econometrica: Journal of the Econometric Society (1978), pp. 33–50.
[33] Maria Teresa Landi et al. “Environment And Genetics in Lung cancer Etiology (EAGLE) study: An integrative population-based case-control study of lung cancer”. In: BMC Public Health 8.1 (2008), pp. 1–10.
[34] Kenneth Lange and Janet S Sinsheimer. “Normal/independent distributions and their applications in robust regression”. In: Journal of Computational and Graphical Statistics 2.2 (1993), pp. 175–198.
[35] Eun Ryung Lee, Hohsuk Noh, and Byeong U. Park. “Model Selection via Bayesian Information Criterion for Quantile Regression Models”. In: Journal of the American Statistical Association 109.505 (2014), pp. 216–229.
[36] Wanjun Liu et al. “Model-Free Feature Screening and FDR Control With Knockoff Features”. In: Journal of the American Statistical Association 117.537 (2022), pp. 428–443.
[37] Yufeng Liu and Yichao Wu. “Simultaneous multiple non-crossing quantile regression estimation using kernel constraints”. In: Journal of Nonparametric Statistics 23.2 (2011), pp. 415–437.
[38] Chi Ma and Jian Huang. “Asymptotic properties of lasso in high-dimensional partially linear models”. In: Science China Mathematics 59 (2016), pp. 769–788.
[39] Haiqiang Ma et al. “Quantile regression for functional partially linear model in ultra-high dimensions”. In: Computational Statistics & Data Analysis 129 (2019), pp. 135–147.
[40] Shujie Ma, Runze Li, and Chih-Ling Tsai. “Variable Screening via Quantile Partial Correlation”. In: Journal of the American Statistical Association 112.518 (2017), pp. 650–663.
[41] Oscar Hernan Madrid Padilla and Sabyasachi Chatterjee. “Risk bounds for quantile trend filtering”. In: Biometrika 109.3 (2022), pp. 751–768.
[42] Adam Maidman and Lan Wang. “New semiparametric method for predicting high-cost patients”. In: Biometrics 74.3 (2018), pp. 1104–1111.
[43] Adam Maidman et al. “Quantile partially linear additive model for data with dropouts and an application to modeling cognitive decline”. In: Statistics in Medicine 42.16 (2023), pp. 2729–2745.
[44] Enno Mammen and Sara Van De Geer. “Locally adaptive regression splines”. In: The Annals of Statistics 25.1 (1997), pp. 387–413.
[45] Nitish Kumar Mishra et al. “Identification of prognostic markers in cholangiocarcinoma using altered DNA methylation and gene expression profiles”. In: Frontiers in Genetics 11 (2020), p. 522125.
[46] Patric Müller and Sara Van de Geer. “The partial linear model in high dimensions”. In: Scandinavian Journal of Statistics 42.2 (2015), pp. 580–608.
[47] A. Must et al. “The disease burden associated with overweight and obesity”. In: JAMA 282.16 (1999), pp. 1523–1529.
[48] Doug Nychka et al. “A nonparametric regression approach to syringe grading for quality improvement”. In: Journal of the American Statistical Association 90.432 (1995), pp. 1171–1178.
[49] Carlos Misael Madrid Padilla, Oscar Hernan Madrid Padilla, and Daren Wang. “Temporal-spatial model via Trend Filtering”. In: arXiv preprint arXiv:2308.16172 (2023).
[50] Seyoung Park and Xuming He. “Hypothesis testing for regional quantiles”. In: Journal of Statistical Planning and Inference 191 (2017), pp. 13–24.
[51] Limin Peng, Jinfeng Xu, and Nancy Kutner. “Shrinkage estimation of varying covariate effects based on quantile regression”. In: Statistics and Computing 24 (2014), pp. 853–869.
[52] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2022.
[53] Aaditya Ramdas and Ryan J Tibshirani. “Fast and flexible ADMM algorithms for trend filtering”. In: Journal of Computational and Graphical Statistics 25.3 (2016), pp. 839–858.
[54] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. “Minimax rates of estimation for high-dimensional linear regression over ℓq-balls”. In: IEEE Transactions on Information Theory 57.10 (2011), pp. 6976–6994.
[55] Leonid I Rudin, Stanley Osher, and Emad Fatemi. “Nonlinear total variation based noise removal algorithms”. In: Physica D: Nonlinear Phenomena 60.1-4 (1992), pp. 259–268.
[56] Veeranjaneyulu Sadhanala and Ryan J Tibshirani. “Additive models with trend filtering”. In: The Annals of Statistics 47.6 (2019), pp. 3032–3068.
[57] EJ Schlossmacher. “An iterative technique for absolute deviations curve fitting”. In: Journal of the American Statistical Association 68.344 (1973), pp. 857–859.
[58] Sabine K Schnabel and Paul HC Eilers. “Simultaneous estimation of quantile curves using quantile sheets”. In: AStA Advances in Statistical Analysis 97.1 (2013), pp. 77–87.
[59] Larry Schumaker. Spline functions: basic theory. Cambridge University Press, 2007.
[60] M Sesia, C Sabatti, and E J Candès. “Gene hunting with hidden Markov model knockoffs”. In: Biometrika 106.1 (Aug. 2018), pp. 1–18.
[61] Ben Sherwood. “Variable selection for additive partial linear quantile regression with missing covariates”. In: Journal of Multivariate Analysis 152 (2016), pp. 206–223.
[62] Ben Sherwood, Adam Maidman, and Shaobo Li. rqPen: Penalized Quantile Regression. R package version 3.0. 2022.
[63] Ben Sherwood and Lan Wang. “Partially linear additive quantile regression in ultra-high dimension”. In: The Annals of Statistics 44.1 (2016), pp. 288–317.
[64] Bei Song et al. “Triglyceride glucose-body mass index and risk of incident type 2 diabetes mellitus in Japanese people with normal glycemic level: A population-based longitudinal cohort study”. In: Frontiers in Endocrinology 13 (2022), p. 907973.
[65] Arun Srinivasan, Lingzhou Xue, and Xiang Zhan. “Compositional knockoff filter for high-dimensional regression analysis of microbiome data”. In: Biometrics 77.3 (2021), pp. 984–995.
[66] Gabriele Steidl, Stephan Didas, and Julia Neumann. “Splines in higher order TV regularization”. In: International Journal of Computer Vision 70 (2006), pp. 241–255.
[67] Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 58.1 (1996), pp. 267–288.
[68] Robert Tibshirani et al. “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 67.1 (2005), pp. 91–108.
[69] Ryan J Tibshirani. “Adaptive piecewise polynomial estimation via trend filtering”. In: The Annals of Statistics 42.1 (2014), pp. 285–323.
[70] Ryan J Tibshirani. “Divided differences, falling factorials, and discrete splines: Another look at trend filtering and related problems”. In: Foundations and Trends® in Machine Learning 15.6 (2022), pp. 694–846.
[71] Paul Tseng. “Convergence of a block coordinate descent method for nondifferentiable minimization”. In: Journal of Optimization Theory and Applications 109 (2001), pp. 475–494.
[72] Sara Van de Geer. Estimation and testing under sparsity. Springer, 2016.
[73] Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Vol. 47. Cambridge University Press, 2018.
[74] Nicolas Verzelen. “Minimax risks for sparse regressions: Ultra-high dimensional phenomenons”. In: Electronic Journal of Statistics 6 (2012), pp. 38–90.
[75] Grace Wahba. Spline models for observational data. Society for Industrial and Applied Mathematics, 1990.
[76] Lan Wang, Yichao Wu, and Runze Li. “Quantile Regression for Analyzing Heterogeneity in Ultra-High Dimension”. In: Journal of the American Statistical Association 107.497 (2012), pp. 214–222.
[77] Yu-Xiang Wang, Alex Smola, and Ryan Tibshirani. “The falling factorial basis and its statistical applications”. In: International Conference on Machine Learning. PMLR, 2014, pp. 730–738.
[78] Yu-Xiang Wang et al. “Trend Filtering on Graphs”. In: Journal of Machine Learning Research 17.105 (2016), pp. 1–41.
[79] Fang Xie and Johannes Lederer. “Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data”. In: Entropy 23.2 (2021).
[80] Huiliang Xie and Jian Huang. “SCAD-penalized regression in high-dimensional partially linear models”. In: (2009).
[81] Fei Ye and Cun-Hui Zhang. “Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls”. In: The Journal of Machine Learning Research 11 (2010), pp. 3519–3540.
[82] Takuma Yoshida. “Quantile function regression and variable selection for sparse models”. In: Canadian Journal of Statistics 49.4 (2021), pp. 1196–1221.
[83] Kyusang Yu, Enno Mammen, and Byeong Uk Park. “Semi-parametric regression: Efficiency gains from modeling the nonparametric part”. In: Bernoulli 17.2 (2011), pp. 736–748.
[84] Zhuqing Yu, Michael Levine, and Guang Cheng. “Minimax optimal estimation in partially linear additive models under high dimension”. In: Bernoulli 25.2 (2019), pp. 1289–1325.
[85] Cun Hui Zhang. “Nearly unbiased variable selection under minimax concave penalty”. In: The Annals of Statistics 38.2 (2010), pp. 894–942.
[86] Tongwu Zhang et al. “APOBEC shapes tumor evolution and age at onset of lung cancer in smokers”. In: bioRxiv (2024).
[87] Qi Zheng, Limin Peng, and Xuming He. “Globally adaptive quantile regression with ultra-high dimensional data”. In: The Annals of Statistics 43.5 (2015), pp. 2225–2258.
[88] Qixian Zhong and Jane-Ling Wang. “Neural networks for partially linear quantile regression”. In: Journal of Business & Economic Statistics 42.2 (2024), pp. 603–614.
[89] Ying Zhu. “Nonasymptotic Analysis of Semiparametric Regression Models with High-Dimensional Parametric Coefficients”. In: The Annals of Statistics 45.5 (2017), pp. 2274–2298.
[90] Hui Zou and Trevor Hastie. “Regularization and variable selection via the elastic net”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 67.2 (2005), pp. 301–320.